CN108416434B - Circuit structure for accelerating convolutional layer and full-connection layer of neural network - Google Patents

Circuit structure for accelerating convolutional layer and full-connection layer of neural network

Info

Publication number: CN108416434B
Authority: CN (China)
Prior art keywords: matrix, layer, weight, data, module
Prior art / priority date: 2018-02-07
Legal status: Active (granted)
Application number: CN201810120895.0A
Other languages: Chinese (zh)
Other versions: CN108416434A
Inventors: 韩军 (Han Jun), 蔡宇杰 (Cai Yujie), 曾晓洋 (Zeng Xiaoyang)
Original and current assignee: Fudan University
Application filed by Fudan University on 2018-02-07
Publication of CN108416434A: 2018-08-17
Publication of CN108416434B (grant): 2021-06-04

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention belongs to the technical field of integrated circuit design, and particularly relates to a circuit structure that can accelerate a convolutional layer and a fully-connected layer simultaneously. The circuit structure of the invention comprises five parts: a feature/weight prefetch module for data reading, a local cache for improving the data reuse rate, a matrix operation unit implementing matrix multiplication, a temporary data accumulation module for accumulating intermediate output results, and an output control module for data write-back. The circuit uses a dedicated mapping method to map the operations of both the convolutional layer and the fully-connected layer onto a matrix operation unit of fixed size, and it adjusts the memory arrangement of the features and weights, which greatly improves the memory access efficiency of the circuit. Meanwhile, the scheduling of the circuit modules adopts a pipeline mechanism, so that all hardware units are in a working state in every clock cycle, improving hardware utilization and the working efficiency of the circuit.

Description

Circuit structure for accelerating convolutional layer and full-connection layer of neural network
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to a circuit structure for accelerating a convolution layer and a full connection layer of a neural network.
Background
In the 1960s, Hubel et al. proposed the concept of the receptive field through their study of the visual cortical cells of cats. In the 1980s, Fukushima proposed the neocognitron on the basis of the receptive-field concept; it can be regarded as the first implemented convolutional neural network. The neocognitron decomposes a visual pattern into a number of sub-patterns (features), which are then processed in hierarchically connected feature planes. It attempts to model the visual system, so that recognition succeeds even when an object is displaced or slightly deformed.
The convolutional neural network is a variant of the multilayer perceptron, developed from the early research of the biologists Hubel and Wiesel on the visual cortex of cats. The cells of the visual cortex form a complex architecture: each cell is highly sensitive to a sub-region of the visual input space, called its receptive field, and the receptive fields tile the entire field of view. These cells can be divided into two basic types, simple cells and complex cells. Simple cells respond maximally to edge-like stimulation patterns within their receptive field; complex cells have larger receptive fields and are locally invariant to the exact position of the stimulus. A convolutional neural network consists of convolutional layers, downsampling layers, and fully-connected layers. Each layer has multiple feature maps; each feature map extracts one feature of the input through a convolution filter and contains multiple neurons.
Because of the huge amount of computation, convolutional neural networks are at present difficult to run locally on mobile terminals and are mostly deployed through cloud computing. More than ninety percent of the computation of a convolutional neural network lies in its convolutional layers and fully-connected layers, yet a separate accelerating circuit is usually designed for each of these two operations, introducing extra chip area.
The invention provides a circuit structure that can accelerate convolutional layers and fully-connected layers simultaneously: by reordering the features and weights of each layer of the neural network, both kinds of operation can be mapped onto the same matrix operation unit (an array of multipliers and adders). This improves the multiplexing efficiency of the hardware, reduces chip area, and lets the circuit achieve a higher operation throughput per unit area.
Disclosure of Invention
The object of the invention is to provide, for accelerating the operations of the convolutional layer and the fully-connected layer of a neural network, a circuit structure that can accelerate both layers simultaneously, so as to improve hardware multiplexing efficiency and reduce chip area.
The circuit structure for accelerating the convolutional layer and fully-connected layer of a neural network provided by the invention maps both the convolutional layer and the fully-connected layer onto the same matrix operation unit by unrolling their operations, and reduces the memory-access performance loss caused by the discontinuity of the unrolled feature and weight read addresses by reordering the features and weights of each layer of the neural network.
The circuit structure provided by the invention comprises a feature/weight prefetch module, a local cache, a matrix operation unit, a temporary data accumulation module and an output control module; wherein:
the feature/weight prefetch module fetches new feature and weight data from an external memory (DRAM) into the local cache, replacing old data that is no longer used. All features other than the first-layer features of the neural network, as well as the weights, are already stored in the rearranged pattern; the first-layer features are likewise rearranged according to that pattern, but by software, so the feature/weight prefetch module does not need to implement the rearrangement function itself;
the local cache buffers the input data required by the matrix operation unit. For both the convolutional layer and the fully-connected layer, the operation exhibits a large amount of data reuse, so reusable data is kept in the local cache, reducing the number of accesses to the external memory;
the matrix operation unit is an array of multipliers and adders that implements matrix operations. After the features and weights are rearranged, the operations of the convolutional layer and the fully-connected layer are mapped into a series of matrix operations, which are realized by invoking the matrix operation unit multiple times;
the temporary data accumulation module accumulates the data sent by the matrix operation unit. After multiple accumulations, the accumulated result (the input feature of the next network layer) is sent to the output control module;
and the output control module is responsible for sequentially writing the accumulated results back to the external memory according to the same rearrangement mode.
When convolutional-layer operations are mapped to matrix operations, the input features must be flattened into a series of row vectors and the convolution kernels expanded into a two-dimensional matrix. Under the traditional memory space allocation, the addresses the feature/weight prefetch module must read therefore become discontinuous, reducing memory-access efficiency. Rearranging the features and weights guarantees the continuity of the addresses read by the feature/weight prefetch module and greatly improves the access efficiency of the circuit. The features and weights are rearranged according to the following pattern:
As in Fig. 4, an input feature of size Cin*H*W is cut into H*W stripes, each of length Cin. The data in the H*W stripes are written into memory at sequential addresses: starting from the low address, the data in stripe 0 occupy memory locations 0 to Cin-1, the data in stripe 1 occupy locations Cin to 2*Cin-1, and so on, until the data in the last stripe (stripe H*W-1) occupy locations (H*W-1)*Cin to H*W*Cin-1. In other words, the expansion order of the features in memory is Cin => W => H (the traditional memory space allocation order is W => H => Cin).
The convolution kernel comprises Cout sub-weight matrices of size Cin*H*W; arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout. That is, the expansion order of the weights in memory is Cin => W => H => Cout (the traditional order is W => H => Cin => Cout).
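For illustration, the rearrangement can be performed in software with a few lines of NumPy (this sketch and its function names are our own assumptions, not part of the patent; the patent envisages software doing this for the first-layer features and, at initialization, for the weights):

```python
# A minimal sketch of the feature/weight rearrangement, assuming NumPy.
# "Traditional" layout expands W fastest, then H, then Cin: a C-order array of
# shape (Cin, H, W). The rearranged layout expands Cin fastest, then W, then H,
# so each of the H*W stripes of length Cin becomes contiguous in memory.
import numpy as np

def rearrange_features(feat_chw: np.ndarray) -> np.ndarray:
    """(Cin, H, W) traditional layout -> flat buffer of H*W contiguous Cin-stripes."""
    c, h, w = feat_chw.shape
    return feat_chw.transpose(1, 2, 0).reshape(h * w * c)   # order Cin => W => H

def rearrange_weights(w_trad: np.ndarray) -> np.ndarray:
    """(Cout, Cin, H, W) traditional layout -> Cin => W => H => Cout order."""
    co, ci, h, w = w_trad.shape
    return w_trad.transpose(0, 2, 3, 1).reshape(co * h * w * ci)

feat = np.arange(3 * 4 * 5).reshape(3, 4, 5)   # Cin=3, H=4, W=5
buf = rearrange_features(feat)
# Stripe 0 holds all Cin values of pixel (h=0, w=0) at addresses 0 .. Cin-1:
assert np.array_equal(buf[0:3], feat[:, 0, 0])
```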
In the invention, the scheduling of the feature/weight prefetch module, the local cache, the matrix operation unit, the temporary data accumulation module and the output control module adopts a pipeline mechanism, so that all hardware units are in a working state in every clock cycle; this improves the utilization of the hardware units, reduces chip area, and improves the working efficiency of the circuit.
The beneficial effects of the invention are as follows: the convolutional layer and the fully-connected layer share the same arithmetic circuit, so the hardware is fully multiplexed and the design suits a variety of convolutional neural network structures. Meanwhile, the output control module writes the output of each layer back to the external memory in the expected arrangement order, so the features of all layers except the first are already arranged as required and no cost is incurred in rearranging data. The weights of a convolutional neural network do not change during the inference phase, so the weights need to be rearranged only once, at system initialization.
Drawings
Fig. 1 is a basic block diagram of the circuit.
FIG. 2 is a diagram illustrating the conversion of a fully-connected layer operation into a convolutional layer operation.
FIG. 3 is a diagram illustrating mapping of convolutional layer operations to matrix operations.
Fig. 4 is a schematic diagram of the memory arrangement of features and weights.
FIG. 5 is a schematic diagram of a decomposition of an arbitrary-scale matrix operation into multiple fixed-size matrix operations.
Detailed Description
In the present invention, a basic block diagram of a circuit capable of accelerating both the convolutional layer and the fully-connected layer is shown in Fig. 1. The design works as follows. The features of each layer and the corresponding weights are stored in an external memory (DRAM). First, the feature/weight prefetch module reads the features and weights about to take part in the operation from the external memory and places them in the local cache, where the new data replaces old data that is no longer used. Then, the control circuit fetches the features and weights from the local cache in operation order and sends them to the matrix operation unit; after the features and weights have been rearranged, the operations of the convolutional layer and the fully-connected layer map onto a series of matrix operations. The output of the matrix operation unit is written into the temporary data accumulation module; after a number of matrix operations, the accumulated result is part of the output features of the current layer. The output control module is responsible for writing these partial output features back to the external memory in a specific arrangement order. After all operations of the current layer are complete, the circuit can begin the next network layer.
The operation of the convolution layer and the full connection layer is mapped into a series of matrix operations, and the specific flow is described as follows:
First, the operation of the fully-connected layer is converted into a convolutional-layer operation, as shown in Fig. 2. Let the input feature be a cube of shape Cin*H*W, meaning the input has Cin channels, each of size H*W. For a fully-connected layer, the usual operation is to flatten the input into a row vector of length Cin*H*W and multiply it by a weight matrix of height Cin*H*W and width Cout. The result of the matrix multiplication is a row vector of length Cout, which is the feature the current layer passes to the next network layer. To convert the fully-connected operation into a convolutional one, the weight matrix of height Cin*H*W and width Cout is split into Cout sub-weight matrices, denoted K0, K1, K2, ..., Kn (n = Cout-1). Each sub-weight matrix is a cube of shape Cin*H*W. Each sub-weight matrix is convolved with the input features; because their shapes are identical (both Cin*H*W), each convolution yields a single scalar, equal to the inner product of the feature matrix and the sub-weight matrix. The Cout sub-weight matrices together yield Cout scalars, and concatenating these Cout scalars into a vector gives the output of the current network layer (the fully-connected layer). By this method, a fully-connected layer is converted into a convolution whose input features and convolution kernels have size Cin*H*W and whose number of output channels is Cout.
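The equivalence is easy to verify numerically. The sketch below (NumPy; the variable names are our own, hypothetical choices) computes a fully-connected layer both in its usual flattened form and in the convolutional form described above:

```python
# A minimal sketch, assuming NumPy: a fully-connected layer over a Cin*H*W input
# equals Cout "convolutions" whose kernels have exactly the input's shape, each
# degenerating to a single inner product.
import numpy as np

Cin, H, W, Cout = 3, 4, 4, 8
x = np.random.rand(Cin, H, W)             # input feature cube
Wfc = np.random.rand(Cin * H * W, Cout)   # weight matrix: height Cin*H*W, width Cout

# Usual fully-connected computation: flatten the input and multiply.
fc_out = x.reshape(1, -1) @ Wfc           # row vector of length Cout

# Convolutional form: split Wfc into Cout sub-weight cubes K0..Kn (n = Cout-1),
# each of shape (Cin, H, W); convolving a same-shaped kernel with the input
# yields one scalar per output channel, the inner product of the two cubes.
sub_kernels = [Wfc[:, i].reshape(Cin, H, W) for i in range(Cout)]
conv_out = np.array([np.sum(x * k) for k in sub_kernels])

assert np.allclose(fc_out.ravel(), conv_out)
```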
Next, the convolutional-layer operations are mapped into matrix operations, as shown in Fig. 3. Let the input feature size be Cin*H*W and the size of each convolution kernel (weight) be Cin*K*K, with Cout kernels in total, corresponding to Cout output channels. To obtain the first pixel of each output channel, the required Cin*K*K input features are flattened into a row vector, and the Cout convolution kernels are unfolded into a matrix of height Cout and width Cin*K*K. Multiplying the feature row vector by the weight matrix yields a row vector of length Cout, each element of which is the first pixel of one output channel. Computing all the pixels requires H*W such matrix operations. By this method, a convolutional-layer operation is converted into H*W matrix operations, where the matrix has height Cout and width Cin*K*K. This matrix is relatively large, and its size varies from one convolutional layer to another, which is unsuitable for hardware implementation; such matrix operations must therefore be decomposed into multiple matrix operations of fixed size.
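The mapping can be sketched as follows (NumPy; hypothetical names; stride 1 and no padding are assumed here for simplicity, so the sketch produces (H-K+1)*(W-K+1) output pixels rather than the H*W of the description above):

```python
# A minimal im2col-style sketch, assuming NumPy: each output pixel comes from
# flattening one Cin*K*K input patch into a row vector and multiplying it by
# the unfolded weight matrix (height Cout, width Cin*K*K).
import numpy as np

Cin, H, W, K, Cout = 3, 6, 6, 3, 4
x = np.random.rand(Cin, H, W)
kernels = np.random.rand(Cout, Cin, K, K)

Wmat = kernels.reshape(Cout, Cin * K * K)   # unfolded weight matrix
Ho, Wo = H - K + 1, W - K + 1
out = np.zeros((Cout, Ho, Wo))
for i in range(Ho):                         # one matrix operation per output pixel
    for j in range(Wo):
        patch = x[:, i:i + K, j:j + K].reshape(1, -1)    # feature row vector
        out[:, i, j] = (patch @ Wmat.T).ravel()          # length-Cout result
```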
Finally, the matrix operation is decomposed into a plurality of matrix operations of fixed size.
FIG. 5 illustrates how a matrix operation unit of fixed size HF*WF realizes an H*W matrix operation. Realizing the H*W matrix operation requires ceil(H/HF)*ceil(W/WF) invocations of the HF*WF unit, where ceil denotes rounding up. The data used in the first operation is a sub-matrix of the original matrix, located at rows 0 to WF-1 and columns 0 to HF-1. The output of the first operation is a vector of length WF, which is sent to the temporary data accumulation module as temporary data. The data used in the second operation is again a sub-matrix of the original matrix, located at rows 0 to WF-1 and columns HF to 2HF-1; this is the iteration in the column direction, and its output is again a vector of length WF. After ceil(H/HF) iterations the column-direction iteration finishes, having produced ceil(H/HF) vectors of length WF in total. The sum of these vectors gives the first WF results of the H*W matrix operation. The remaining W-WF results are computed in the same way. Thus a matrix operation of arbitrary size can be decomposed into multiple fixed-size matrix operations.
For example, a 100*32 matrix operation is realized with a matrix operation unit of size 64*16 as follows. Realizing the 100*32 matrix operation requires ceil(100/64)*ceil(32/16) = 4 invocations of the 64*16 matrix operation unit. The data used in the first operation is a sub-matrix of the original matrix, located at rows 0 to 15 and columns 0 to 63, as shown by the red (inner) box in Fig. 5(a). The output of the first operation is a vector of length 16, which is sent to the temporary data accumulation module as temporary data. The data used in the second operation is again a sub-matrix of the original matrix, located at rows 0 to 15 and columns 64 to 99. Since this operation uses only 100-64 = 36 columns of the matrix operation unit, the remaining 28 columns are padded with zeros. The output of the second operation is again a vector of length 16, and its sum with the result of the first operation gives the first 16 results of this 100*32 matrix operation. The remaining 16 results are computed in the same way; thus a matrix operation of arbitrary scale can be decomposed into multiple fixed-size matrix operations.
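The decomposition, including the zero-padding of partial tiles and the accumulation of the length-WF partial results, can be sketched as follows (NumPy; hypothetical names; the fixed-size matrix operation unit is modeled as a plain vector-matrix product):

```python
# A minimal sketch, assuming NumPy: an H x W vector-matrix operation carried out
# in ceil(H/HF)*ceil(W/WF) calls to a fixed HF x WF unit, accumulating the
# length-WF partial results as the temporary data accumulation module would.
import numpy as np
from math import ceil

def tiled_vec_mat(vec, mat, HF, WF):
    """vec: length H; mat: H x W; emulates repeated fixed-size HF x WF calls."""
    Hdim, Wdim = mat.shape
    out = np.zeros(Wdim)
    for wb in range(ceil(Wdim / WF)):          # one band of WF output results
        ws = wb * WF
        w_n = min(WF, Wdim - ws)
        acc = np.zeros(WF)                     # temporary data accumulator
        for hb in range(ceil(Hdim / HF)):      # column-direction iteration
            hs = hb * HF
            h_n = min(HF, Hdim - hs)
            v_tile = np.zeros(HF)              # zero-pad partial tiles
            m_tile = np.zeros((HF, WF))        # (e.g. 36 used + 28 padded columns)
            v_tile[:h_n] = vec[hs:hs + h_n]
            m_tile[:h_n, :w_n] = mat[hs:hs + h_n, ws:ws + w_n]
            acc += v_tile @ m_tile             # one fixed-size matrix-unit call
        out[ws:ws + w_n] = acc[:w_n]
    return out

# The worked example above: a 100 x 32 operation on a 64 x 16 unit takes 2*2 = 4 calls.
vec, mat = np.random.rand(100), np.random.rand(100, 32)
assert np.allclose(tiled_vec_mat(vec, mat, HF=64, WF=16), vec @ mat)
```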
The output of the fixed-size matrix operation unit is stored in the temporary data accumulation module. When the accumulation is finished, the accumulation module sends the accumulated result (the input feature of the next network layer) to the output control module, which is responsible for writing the result back to the external memory in the required arrangement order; this completes the operation of the current layer (which may be a convolutional layer or a fully-connected layer).
When convolutional-layer operations are mapped to matrix operations, the input features must be flattened into a series of row vectors and the convolution kernels expanded into a matrix. With the traditional memory space allocation, the addresses the feature/weight prefetch module must read become discontinuous, and the access bandwidth of the external memory becomes the bottleneck of the whole system. To guarantee the continuity of the addresses read by the feature/weight prefetch module, the memory arrangement of the features and weights must be adjusted.
As in Fig. 4, an input feature of size Cin*H*W is cut into H*W stripes, each of length Cin. The data in the H*W stripes are written into memory at sequential addresses: starting from the low address, the data in stripe 0 occupy memory locations 0 to Cin-1, the data in stripe 1 occupy locations Cin to 2*Cin-1, and so on, until the data in the last stripe (stripe H*W-1) occupy locations (H*W-1)*Cin to H*W*Cin-1. In other words, the expansion order of the features in memory is Cin => W => H (the traditional memory space allocation order is W => H => Cin).
The convolution kernel comprises Cout sub-weight matrices of size Cin*H*W; arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout. That is, the expansion order of the weights in memory is Cin => W => H => Cout (the traditional order is W => H => Cin => Cout).
When the operation of each layer finishes, the output control module writes the layer's output back to the external memory in the expected arrangement order, so the features of all layers except the first are already arranged as required and no cost is incurred in rearranging data. Since the weights of a convolutional neural network do not change during the inference phase, the weights need to be rearranged only once, at system initialization. The cost of adjusting the arrangement of features and weights in memory is therefore relatively small.

Claims (4)

1. A circuit structure for accelerating a convolutional layer and a fully-connected layer of a neural network, characterized in that the convolutional layer and the fully-connected layer are both mapped onto the same matrix operation unit by unrolling their operations; the memory-access performance loss caused by the discontinuity of the unrolled feature and weight read addresses is reduced by reordering the features and weights of each layer of the neural network; the circuit structure comprises a feature/weight prefetch module, a local cache, a matrix operation unit, a temporary data accumulation module and an output control module; wherein:
the feature/weight prefetch module is used for fetching new feature and weight data from an external memory into the local cache, replacing old data that is no longer used; except for the first-layer features of the neural network, all other features and weights are already rearranged according to a certain pattern, and the first-layer features are likewise rearranged according to that pattern, so that the feature/weight prefetch module does not need to implement the rearrangement function;
the local cache is used for caching the input data required by the matrix operation unit;
the matrix operation unit is used for realizing matrix operations; after the features and the weights are rearranged, the operations of the convolutional layer and the fully-connected layer are mapped into a series of matrix operations, which are realized by invoking the matrix operation unit multiple times;
the temporary data accumulation module is used for accumulating the data sent by the matrix operation unit; after multiple accumulations, the accumulated result, namely the input feature of the next network layer, is sent to the output control module;
the output control module is responsible for sequentially writing the accumulated results back to the external memory according to the rearrangement mode;
the features and the weights are rearranged according to a certain mode, and the specific process is as follows:
let a size CinCutting the input characteristic of H W into H W strips, wherein the length of each strip is Cin(ii) a Writing the data in the H x W strips into a memory in a sequential address mode; from low addressFirst, the data in the 0 th stripe is stored in 0 to Cin-1 data corresponding to the memory space, the data in the 1 st stripe being stored in CinTo 2Cin-1 data in the corresponding memory space, and so on, and the data in the last stripe is stored in (H × W-1) × CinTo Cin*H*W*Cin-1 data in a corresponding memory space;
let the convolution kernel contain Cout sub-weight matrices of size Cin*H*W; arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout.
2. The circuit structure for accelerating the convolutional layer and the fully-connected layer of the neural network as claimed in claim 1, wherein the feature/weight pre-fetching module, the local cache, the matrix operation unit, the temporary data accumulation module and the output control module are scheduled by a pipeline mechanism, so that all hardware units are in a working state every clock cycle.
3. The circuit structure for accelerating convolutional layers and fully-connected layers of a neural network as claimed in claim 1, wherein the operation of the convolutional layers and fully-connected layers is mapped to a series of matrix operations, and the specific flow is as follows:
firstly, converting the operation of the fully-connected layer into a convolutional-layer operation; let the input feature be a cube of shape Cin*H*W, meaning the input has Cin channels, each of size H*W; for a fully-connected layer, the usual operation is to flatten the input into a row vector of length Cin*H*W and multiply it by a weight matrix of height Cin*H*W and width Cout; to convert the fully-connected operation into a convolutional-layer operation, the weight matrix of height Cin*H*W and width Cout is split into Cout sub-weight matrices, denoted K0, K1, K2, ..., Kn, with n = Cout-1; each sub-weight matrix is a cube of shape Cin*H*W; each sub-weight matrix is convolved with the input features, and since their shapes are identical (all Cin*H*W), the result of each convolution is a scalar whose value equals the inner product of the feature matrix and the sub-weight matrix; the Cout sub-weight matrices together yield Cout scalars; connecting these Cout scalars into a vector gives the output of the current network layer, namely the fully-connected layer; thus the fully-connected layer is converted into a convolution whose input features and convolution kernels have size Cin*H*W and whose number of output channels is Cout;
secondly, mapping the operation of the convolutional layer into matrix operations; the input feature size is Cin*H*W, the size of each convolution kernel, i.e. weight, is Cin*K*K, and there are Cout convolution kernels in total, corresponding to Cout output channels; to obtain the first pixel of each output channel, the required Cin*K*K input features are flattened into a row vector, and the Cout convolution kernels are unfolded into a matrix of height Cout and width Cin*K*K; multiplying the feature row vector by the weight matrix yields a row vector of length Cout, each element of which is the first pixel of one output channel; computing all the pixels requires H*W matrix operations; thus the convolutional-layer operation is converted into H*W matrix operations, where the matrix has height Cout and width Cin*K*K;
Finally, such a matrix operation is decomposed into a plurality of fixed-size matrix operations.
4. The circuit structure for accelerating convolutional layers and fully-connected layers of a neural network of claim 3, wherein the process of decomposing the matrix operation into a plurality of fixed-size matrix operations is:
let the matrix to be operated on be of size H*W and the fixed-size matrix used for the operation be of size HF*WF; then ceil(H/HF)*ceil(W/WF) invocations of the HF*WF matrix operation unit are required, where ceil denotes rounding up; the data used in the first operation is a sub-matrix of the original matrix, located at rows 0 to WF-1 and columns 0 to HF-1; the output of the first operation is a vector of length WF, which is sent to the temporary data accumulation module as temporary data; the data used in the second operation is again a sub-matrix of the original matrix, located at rows 0 to WF-1 and columns HF to 2HF-1, representing the iteration in the column direction; the output of the second operation is again a vector of length WF; after ceil(H/HF) iterations, the column-direction iteration finishes, producing ceil(H/HF) vectors of length WF in total; the sum of these vectors is the first WF results of the H*W matrix operation; by analogy, the remaining W-WF results are computed.
CN201810120895.0A 2018-02-07 2018-02-07 Circuit structure for accelerating convolutional layer and full-connection layer of neural network Active CN108416434B (en)

Priority Applications (1)

Application Number: CN201810120895.0A | Priority Date: 2018-02-07 | Filing Date: 2018-02-07 | Title: Circuit structure for accelerating convolutional layer and full-connection layer of neural network

Applications Claiming Priority (1)

Application Number: CN201810120895.0A | Priority Date: 2018-02-07 | Filing Date: 2018-02-07 | Title: Circuit structure for accelerating convolutional layer and full-connection layer of neural network

Publications (2)

CN108416434A (en), published 2018-08-17
CN108416434B (en), published 2021-06-04

Family

ID=63126912

Family Applications (1)

Application Number: CN201810120895.0A | Status: Active | Publication: CN108416434B (en) | Priority/Filing Date: 2018-02-07 | Title: Circuit structure for accelerating convolutional layer and full-connection layer of neural network

Country Status (1)

Country: CN | Publication: CN108416434B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308194B (en) * 2018-09-29 2021-08-10 北京字节跳动网络技术有限公司 Method and apparatus for storing data
CN109375952B (en) * 2018-09-29 2021-01-26 北京字节跳动网络技术有限公司 Method and apparatus for storing data
WO2020062252A1 (en) * 2018-09-30 2020-04-02 华为技术有限公司 Operational accelerator and compression method
CN111045958B (en) * 2018-10-11 2022-09-16 展讯通信(上海)有限公司 Acceleration engine and processor
CN109886398A (en) * 2019-01-03 2019-06-14 曾集伟 Neural network matrix multiplying method and Related product
CN109816108A (en) * 2019-02-15 2019-05-28 领目科技(上海)有限公司 Deep learning accelerator, device and method
CN109948787B (en) * 2019-02-26 2021-01-08 山东师范大学 Arithmetic device, chip and method for neural network convolution layer
CN110032538B (en) * 2019-03-06 2020-10-02 上海熠知电子科技有限公司 Data reading system and method
CN109993283B (en) * 2019-04-12 2023-02-28 南京吉相传感成像技术研究院有限公司 Deep convolution generation type countermeasure network acceleration method based on photoelectric calculation array
CN110222819B (en) * 2019-05-13 2021-04-20 西安交通大学 Multilayer data partition combined calculation method for convolutional neural network acceleration
CN111950718B (en) * 2019-05-16 2021-12-07 北京知存科技有限公司 Method for realizing progressive CNN operation by using storage and computation integrated chip
CN112784973A (en) * 2019-11-04 2021-05-11 北京希姆计算科技有限公司 Convolution operation circuit, device and method
CN113222136A (en) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 Convolution operation method and chip
CN111340224B (en) * 2020-02-27 2023-11-21 浙江芯劢微电子股份有限公司 Accelerated design method of CNN (computer network) suitable for low-resource embedded chip
WO2022013722A1 (en) * 2020-07-14 2022-01-20 United Microelectronics Centre (Hong Kong) Limited Processor, logic chip and method for binarized convolution neural network
CN112418419B (en) * 2020-11-20 2022-10-11 复旦大学 Data output circuit structure processed by neural network and scheduled according to priority
CN112614175A (en) * 2020-12-21 2021-04-06 苏州拓驰信息技术有限公司 Injection parameter determination method for hole sealing agent injector based on characteristic decorrelation
CN113592075B (en) * 2021-07-28 2024-03-08 浙江芯昇电子技术有限公司 Convolution operation device, method and chip
CN115906948A (en) * 2023-03-09 2023-04-04 浙江芯昇电子技术有限公司 Full-connection-layer hardware acceleration device and method

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5414623A (en) * 1992-05-08 1995-05-09 Iowa State University Research Foundation Optoelectronic system for implementation of iterative computer tomography algorithms
CN1688119A (en) * 2005-04-01 2005-10-26 清华大学 Method for testing DS. CDMA system multi-user developed based on weighting
US20130282634A1 (en) * 2011-03-31 2013-10-24 Microsoft Corporation Deep convex network with joint use of nonlinear random projection, restricted boltzmann machine and batch-based parallelizable optimization
US20150170020A1 (en) * 2013-12-13 2015-06-18 Amazon Technologies, Inc. Reducing dynamic range of low-rank decomposition matrices
CN103902762A (en) * 2014-03-11 2014-07-02 复旦大学 Circuit structure for conducting least square equation solving according to positive definite symmetric matrices
CN104679895A (en) * 2015-03-18 2015-06-03 成都影泰科技有限公司 Medical image data storing method
CN107454966A (en) * 2015-05-21 2017-12-08 谷歌公司 Weight is prefetched for neural network processor
US20170046602A1 (en) * 2015-08-14 2017-02-16 International Business Machines Corporation Learning temporal patterns from electronic health records
CN106503797A (en) * 2015-10-08 2017-03-15 上海兆芯集成电路有限公司 The data for being received from neural memorizer are arranged the neutral net unit and collective with neural memorizer the neural pe array for being shifted
US20170221176A1 (en) * 2016-01-29 2017-08-03 Fotonation Limited Convolutional neural network
CN106855853A (en) * 2016-12-28 2017-06-16 成都数联铭品科技有限公司 Entity relation extraction system based on deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Min Wang et al., "Fast Decoding and Hardware Design for Binary-Input Compressive Sensing," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 2, no. 3, Sept. 2012. *
Chen Chen et al., "A Convolver Design Optimized for Area and Power Consumption" (一种面积与功耗优化的卷积器设计), Computer Engineering (计算机工程), Nov. 2010. *

Also Published As

CN108416434A (en), published 2018-08-17

Similar Documents

Publication Publication Date Title
CN108416434B (en) Circuit structure for accelerating convolutional layer and full-connection layer of neural network
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
Shen et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
CN108205701B (en) System and method for executing convolution calculation
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108763612B (en) Circuit for accelerating operation of pooling layer of neural network
KR102523263B1 (en) Systems and methods for hardware-based pooling
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
US20210019594A1 (en) Convolutional neural network accelerating device and method
KR20170023708A (en) Convolutional neural network computing apparatus
CN108170640B (en) Neural network operation device and operation method using same
CN110580519B (en) Convolution operation device and method thereof
CN112703511B (en) Operation accelerator and data processing method
US20240119114A1 (en) Matrix Multiplier and Matrix Multiplier Control Method
CN114461978B (en) Data processing method and device, electronic equipment and readable storage medium
CN109446478B (en) Complex covariance matrix calculation system based on iteration and reconfigurable mode
CN112215345A (en) Convolutional neural network operation method and device based on Tenscorore
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN112836823B (en) Convolutional neural network back propagation mapping method based on cyclic recombination and blocking
CN111667052A (en) Standard and nonstandard volume consistency transformation method for special neural network accelerator
Sakr et al. Memory-efficient CMSIS-NN with replacement strategy
US20030187898A1 (en) Parallel processing method of an eigenvalue problem for a shared-memory type scalar parallel computer

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant