CN108416434B - Circuit structure for accelerating convolutional layer and full-connection layer of neural network - Google Patents
- Publication number
- CN108416434B CN108416434B CN201810120895.0A CN201810120895A CN108416434B CN 108416434 B CN108416434 B CN 108416434B CN 201810120895 A CN201810120895 A CN 201810120895A CN 108416434 B CN108416434 B CN 108416434B
- Authority
- CN
- China
- Prior art keywords
- matrix
- layer
- weight
- data
- module
- Prior art date
- Legal status: Active (as listed; not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention belongs to the technical field of integrated circuit design, and particularly relates to a circuit structure capable of accelerating a convolutional layer and a fully-connected layer simultaneously. The circuit structure comprises five parts: a feature/weight prefetch module for data reading, a local cache for improving the data reuse rate, a matrix operation unit for realizing matrix multiplication, a temporary data accumulation module for accumulating temporary output results, and an output control module for data write-back. The circuit uses a dedicated mapping method to map the operations of the convolutional layer and the fully-connected layer onto a matrix operation unit of fixed size. The circuit adjusts the memory arrangement of the features and weights, greatly improving the memory access efficiency of the circuit. Meanwhile, the scheduling of the circuit modules adopts a pipeline mechanism, so that all hardware units are in a working state in every clock cycle, which improves the utilization rate of the hardware units and the operating efficiency of the circuit.
Description
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to a circuit structure for accelerating a convolution layer and a full connection layer of a neural network.
Background
In the 1960s, Hubel et al. proposed the concept of the receptive field through their study of the visual cortical cells of cats. In the 1980s, Fukushima proposed the neocognitron on the basis of the receptive-field concept; it can be regarded as the first implemented network of the convolutional-neural-network type. The neocognitron decomposes a visual pattern into a number of sub-patterns (features), which then enter hierarchically connected feature planes. It attempts to model the visual system, so that recognition can be completed even when the object is displaced or slightly deformed.
Convolutional neural networks are a variant of the multi-layer perceptron, developed from the early research of the biologists Hubel and Wiesel on the visual cortex of cats. The cells of the visual cortex form a complex architecture. Each cell is very sensitive to a sub-region of the visual input space, called its receptive field, and the receptive fields tile the entire field of view. These cells can be divided into two basic types: simple cells and complex cells. Simple cells respond maximally to edge-like stimulus patterns within the receptive field. Complex cells have larger receptive fields and are locally invariant to the exact position of a stimulus. The convolutional neural network structure includes the convolutional layer, the downsampling layer, and the fully-connected layer. Each layer has multiple feature maps; each feature map extracts one feature of the input through a convolution filter and contains multiple neurons.
Because of its huge amount of computation, a convolutional neural network is currently difficult to run locally on a mobile terminal and is mostly realized through cloud computing. More than ninety percent of the operations of a convolutional neural network fall in the convolutional layers and the fully-connected layers, and a separate accelerating circuit is usually designed for each of these two kinds of operation, which introduces extra chip area.
The invention provides a circuit structure capable of accelerating convolutional layers and fully-connected layers simultaneously: by reordering the features and weights of each layer of the neural network, both kinds of operation can be mapped onto the same matrix operation unit (an array of multipliers and adders). This improves the multiplexing efficiency of the hardware, reduces the chip area, and lets the circuit obtain a higher operation throughput per unit area.
Disclosure of Invention
The invention aims to provide a circuit structure capable of accelerating a convolution layer and a full connection layer simultaneously aiming at the operation acceleration of the convolution layer and the full connection layer of a neural network so as to improve the hardware multiplexing efficiency and reduce the chip area.
The circuit structure for accelerating the convolution layer and the full-connection layer of the neural network provided by the invention can map the convolution layer and the full-connection layer to the same matrix operation unit by a method of expanding operation; and by a method of reordering the characteristics and the weights of each layer of the neural network, the access performance loss caused by the discontinuity of the read addresses of the characteristics and the weights after expansion is reduced.
The circuit structure provided by the invention comprises a characteristic/weight prefetching module, a local cache, a matrix operation unit, a temporary data accumulation module and an output control module; wherein:
the feature/weight prefetch module fetches new feature and weight data from an external memory (DRAM) and places them into the local cache, replacing old data that is no longer used. Except for the first-layer features of the neural network, all features and weights are already rearranged in the required pattern; the rearrangement of the first-layer features is performed by software. The feature/weight prefetch module therefore does not need to implement the rearrangement function;
the local cache is used for caching input data required by the matrix operation unit. Whether the convolution layer or the full connection layer is adopted, a large amount of data multiplexing exists in the operation, so that the data which can be multiplexed is stored in the local cache, and the access amount to an external memory is reduced;
the matrix operation unit is an array of multipliers and adders used to realize matrix operations. After the features and weights are rearranged, the operations of the convolutional layer and the fully-connected layer are mapped into a series of matrix operations, which are realized by invoking the matrix operation unit multiple times;
the temporary data accumulation module accumulates the data sent by the matrix operation unit. After multiple accumulations, the accumulated result (the input feature of the next network layer) is sent to the output control module;
and the output control module is responsible for sequentially writing the accumulated results back to the external memory according to the same rearrangement mode.
In mapping convolutional-layer operations to matrix operations, it is necessary to pull the input features into a series of row vectors and expand the convolution kernels into a two-dimensional matrix. With the traditional memory space allocation, the addresses the feature/weight prefetch module needs to read would therefore not be contiguous, reducing memory access efficiency. Rearranging the features and weights guarantees the continuity of the addresses read by the feature/weight prefetch module and greatly improves the access efficiency of the circuit. The features and weights are rearranged in a certain pattern, as follows:
As in fig. 4, an input feature of size C_in*H*W is cut into H*W strips, each of length C_in, and the data in the H*W strips are written into memory at sequential addresses. Starting from the low address, the data in strip 0 occupy addresses 0 to C_in-1 of the memory space, the data in strip 1 occupy addresses C_in to 2*C_in-1, and so on; the data in the last strip (strip H*W-1) occupy addresses (H*W-1)*C_in to H*W*C_in-1. In other words, the expansion order of the features in memory is C_in => W => H (versus W => H => C_in in the traditional memory space allocation).
The convolution kernel comprises C_out sub-weight matrices of size C_in*H*W; arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout. That is, the expansion order of the weights in memory is C_in => W => H => C_out (versus W => H => C_in => C_out in the traditional memory space allocation).
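The rearrangement described above amounts to switching from a planar (channel-major) layout to a channel-interleaved layout. A minimal NumPy sketch under that reading (the function names and NumPy itself are my own illustration, not part of the patent):

```python
import numpy as np

def rearrange_features(feat_chw):
    """Convert a (C_in, H, W) feature map from the traditional planar
    layout (W fastest, then H, then C_in) to the layout described in
    the patent (C_in fastest, then W, then H)."""
    # After the transpose, flattening yields H*W strips of length C_in
    # at consecutive addresses, one strip per spatial position.
    return np.ascontiguousarray(feat_chw.transpose(1, 2, 0))

def rearrange_weights(weights_ochw):
    """Apply the same per-kernel rearrangement to a (C_out, C_in, H, W)
    weight tensor: C_in fastest, then W, then H, then C_out."""
    return np.ascontiguousarray(weights_ochw.transpose(0, 2, 3, 1))

# Demonstration: strip i of the rearranged feature holds the C_in
# values of spatial position i (row-major over H, then W).
feat = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # C_in=2, H=3, W=4
flat = rearrange_features(feat).reshape(-1)
# Position (h=0, w=1) is strip 1: addresses C_in*1 .. C_in*2-1.
assert list(flat[2:4]) == [feat[0, 0, 1], feat[1, 0, 1]]
```

With this layout, the prefetch module reads each strip as one contiguous burst, which is what restores the address continuity the patent requires.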
In the invention, the scheduling of the characteristic/weight pre-fetching module, the local cache, the matrix operation unit, the temporary data accumulation module and the output control module adopts a pipeline mechanism, so that all hardware units in each clock cycle are in a working state, the utilization rate of the hardware units is improved, the chip area is reduced, and the working efficiency of the circuit is improved.
The invention has the following beneficial effects: the convolutional layer and the fully-connected layer share the same arithmetic circuit, so the hardware is fully multiplexed and the circuit suits a variety of convolutional-neural-network structures. The output control module writes the outputs of each layer back to the external memory in the expected arrangement order, so the features of all layers except the first are already arranged as required and no extra cost is incurred for rearranging data. The weights of a convolutional neural network are unchanged during the inference phase, i.e. the weights need to be rearranged only once, at system initialization.
Drawings
Fig. 1 is a basic block diagram of the circuit.
FIG. 2 is a diagram illustrating conversion of full link layer operation into convolutional layer operation.
FIG. 3 is a diagram illustrating mapping of convolutional layer operations to matrix operations.
Fig. 4 is a schematic diagram of the memory arrangement of features and weights.
FIG. 5 is a schematic diagram of a decomposition of an arbitrary-scale matrix operation into multiple fixed-size matrix operations.
Detailed Description
In the present invention, a basic block diagram of a circuit capable of accelerating both the convolutional layer and the fully-connected layer is shown in fig. 1. The design works as follows. The features of each layer, together with the corresponding weights, reside in an external memory (DRAM). First, the feature/weight prefetch module reads the features and weights about to participate in the operation out of the external memory and puts them into the local cache; the new data replaces old, no-longer-used data in the cache. Then, the control circuit fetches the features and weights from the local cache in operation order and sends them to the matrix operation unit. Because the features and weights have been rearranged, the operations of the convolutional layer and the fully-connected layer are mapped into a series of matrix operations, and the output of the matrix operation unit is written into the temporary data accumulation module. After a number of matrix operations, the accumulated result forms part of the output features of the current layer. The output control module writes these partial output features back to the external memory in a specific arrangement order. After all operations of the current layer are completed, the circuit can begin operating on the next network layer.
The operation of the convolution layer and the full connection layer is mapped into a series of matrix operations, and the specific flow is described as follows:
First, the operation of the fully-connected layer is converted into a convolutional-layer operation, as shown in fig. 2. Let the input feature be a cube of shape C_in*H*W, meaning the input has C_in channels, each of size H*W. For a fully-connected layer, the usual operation is to rearrange the input into a row vector of length C_in*H*W and multiply it by a weight matrix of height C_in*H*W and width C_out. The result of the matrix multiplication is a row vector of length C_out, which is the feature the current layer passes to the next network layer. To convert the fully-connected operation into a convolutional one, the weight matrix of height C_in*H*W and width C_out is split into C_out sub-weight matrices, denoted K0, K1, K2, ..., Kn (n = C_out-1). Each sub-weight matrix is a cube of shape C_in*H*W. Each sub-weight matrix is convolved with the input features; because their shapes are identical (both C_in*H*W), each convolution yields a scalar whose value equals the inner product of the feature matrix and the weight matrix. The C_out sub-weight matrices thus yield C_out scalars, which are concatenated into a vector to obtain the output of the current network layer (the fully-connected layer). By this method, a fully-connected layer is converted into a convolution operation whose input feature and kernel are both of size C_in*H*W and whose number of output channels is C_out.
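The equivalence in this step can be checked numerically: splitting the fully-connected weight matrix column by column into input-shaped kernels and taking inner products reproduces the matrix product. A small NumPy sketch (variable names are my own, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, h, w, c_out = 3, 2, 2, 5
x = rng.random((c_in, h, w))                 # input feature cube
wt = rng.random((c_in * h * w, c_out))       # fully-connected weight matrix

# Fully-connected view: flatten the input and multiply.
fc_out = x.reshape(1, -1) @ wt               # row vector of length c_out

# Convolutional view: column k of wt, reshaped like x, is sub-kernel Kk;
# "convolving" two identically shaped cubes is one inner product (a scalar).
conv_out = np.array([np.sum(x * wt[:, k].reshape(c_in, h, w))
                     for k in range(c_out)])

assert np.allclose(fc_out.ravel(), conv_out)
```

Both views agree element-wise because flattening the input and reshaping each weight column use the same row-major order.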
Next, the convolutional-layer operation is mapped to matrix operations, as shown in FIG. 3. The input feature is of size C_in*H*W, each convolution kernel (weight) is of size C_in*K*K, and there are C_out kernels, corresponding to C_out output channels. To obtain the first pixel of each output channel, the required C_in*K*K input features are drawn into a row vector, and the C_out kernels are unfolded into a matrix of height C_out and width C_in*K*K. Multiplying the feature row vector by this weight matrix yields a row vector of length C_out, each element of which is the first pixel of one output channel. Computing all pixels requires H*W such matrix operations. By this method, the convolutional-layer operation is converted into H*W matrix operations in which the matrix has height C_out and width C_in*K*K. Such a matrix is relatively large, and its size varies from convolutional layer to convolutional layer, which is unsuitable for hardware implementation; it is therefore necessary to decompose these matrix operations into multiple matrix operations of fixed size.
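This mapping is the classic im2col idea: one output pixel across all channels is one vector-matrix product. A minimal sketch for a single output position, under my own variable names (not the patent's):

```python
import numpy as np

rng = np.random.default_rng(1)
c_in, h, w, k, c_out = 3, 5, 5, 3, 4
x = rng.random((c_in, h, w))                 # input feature, C_in*H*W
kernels = rng.random((c_out, c_in, k, k))    # C_out kernels, each C_in*K*K

# Row vector for output position (i, j): the C_in*K*K patch, flattened.
i = j = 1                                    # any position with a full patch
patch = x[:, i:i + k, j:j + k].reshape(1, -1)

# Weight matrix of height C_out and width C_in*K*K: one flattened
# kernel per row.
wmat = kernels.reshape(c_out, -1)

# One matrix operation yields the length-C_out row vector of pixels.
pixel = patch @ wmat.T
for o in range(c_out):
    assert np.isclose(pixel[0, o],
                      np.sum(x[:, i:i + k, j:j + k] * kernels[o]))
```

Repeating this for every (i, j) gives the H*W matrix operations the text describes.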
Finally, the matrix operation is decomposed into a plurality of matrix operations of fixed size.
FIG. 5 illustrates how a fixed-size H_F*W_F matrix operation unit realizes an H*W matrix operation. To realize the H*W operation, the H_F*W_F unit must be invoked ceil(H/H_F)*ceil(W/W_F) times, where ceil denotes rounding up. The data used in the first invocation is a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns 0 to H_F-1. The output of the first invocation is a vector of length W_F, which is sent as temporary data to the temporary data accumulation module. The data used in the second invocation is again a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns H_F to 2H_F-1; this realizes iteration along the column direction, and its output is again a vector of length W_F. After ceil(H/H_F) such iterations, the column-direction iteration is finished, producing ceil(H/H_F) vectors of length W_F. The sum of these vectors gives the first W_F results of the H*W matrix operation. The remaining W-W_F results are computed in the same way. Thus, a matrix operation of arbitrary size can be decomposed into multiple fixed-size matrix operations.
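The tiling scheme above can be sketched in software: a fixed w_f-by-h_f "unit" is invoked on zero-padded tiles, and partial results accumulate exactly as in the temporary data accumulation module. This is my own NumPy illustration of the scheme, not the patent's circuit:

```python
import numpy as np

def tiled_matvec(A, v, h_f=64, w_f=16):
    """Compute A @ v using only a fixed w_f x h_f matrix operation unit:
    each invocation consumes a zero-padded w_f x h_f sub-matrix and a
    length-h_f vector slice, producing a length-w_f partial result."""
    rows, cols = A.shape                    # W x H in the patent's notation
    out = np.zeros(rows)
    for w0 in range(0, rows, w_f):          # one output chunk of W_F results
        acc = np.zeros(w_f)                 # temporary data accumulator
        for h0 in range(0, cols, h_f):      # iterate along the column (H) direction
            r = min(w_f, rows - w0)
            c = min(h_f, cols - h0)
            sub = np.zeros((w_f, h_f))      # pad partial edge tiles with zeros
            sub[:r, :c] = A[w0:w0 + r, h0:h0 + c]
            vec = np.zeros(h_f)
            vec[:c] = v[h0:h0 + c]
            acc += sub @ vec                # one fixed-size unit invocation
        out[w0:w0 + min(w_f, rows - w0)] = acc[:min(w_f, rows - w0)]
    return out

rng = np.random.default_rng(2)
A = rng.random((32, 100))                   # the 100 x 32 example below: W=32 rows
v = rng.random(100)
assert np.allclose(tiled_matvec(A, v), A @ v)
```

With the default tile size, the 32-by-100 case takes ceil(100/64)*ceil(32/16) = 4 unit invocations, matching the worked example that follows.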
For example, a 100*32 matrix operation is implemented with a matrix operation unit of size 64*16 as follows. To realize the 100*32 operation, ceil(100/64)*ceil(32/16) = 4 invocations of the 64*16 unit are needed. The data used in the first invocation is the sub-matrix occupying rows 0 to 15 and columns 0 to 63 of the original matrix, as shown by the red box (i.e., the inner frame) in fig. 5(a). The output of the first invocation is a vector of length 16, which is sent as temporary data to the temporary data accumulation module. The second invocation uses the sub-matrix in rows 0 to 15 and columns 64 to 99; since it uses only 100-64 = 36 columns of the matrix operation unit, the remaining 28 columns are padded with zeros. Its output is again a vector of length 16, and the sum of this vector and the first invocation's result gives the first 16 results of the 100*32 matrix operation. The remaining 16 results are computed in the same way. Thus, one arbitrary-size matrix operation is decomposed into multiple fixed-size matrix operations.
The output result of the matrix operation unit with fixed size is stored in the temporary data accumulation module. After the accumulation is finished, the accumulation module sends the accumulated result (the input characteristic of the next layer of network) to the output control module, and the output control module is responsible for writing the accumulated result back to the external memory according to a certain arrangement sequence, so that the operation of the current layer (which can be a convolution layer or a full connection layer) is completed.
In mapping convolutional-layer operations to matrix operations, it is necessary to pull the input features into a series of row vectors and expand the convolution kernels into a matrix. With the traditional memory space allocation, the access bandwidth of the external memory would become the bottleneck of the whole system, because the addresses the feature/weight prefetch module needs to read would become discontinuous. To guarantee the continuity of the addresses of the data read by the feature/weight prefetch module, the memory arrangement of the features and weights must be adjusted.
As in fig. 4, an input feature of size C_in*H*W is cut into H*W strips, each of length C_in, and the data in the H*W strips are written into memory at sequential addresses. Starting from the low address, the data in strip 0 occupy addresses 0 to C_in-1 of the memory space, the data in strip 1 occupy addresses C_in to 2*C_in-1, and so on; the data in the last strip (strip H*W-1) occupy addresses (H*W-1)*C_in to H*W*C_in-1. In other words, the expansion order of the features in memory is C_in => W => H (versus W => H => C_in in the traditional memory space allocation).
The convolution kernel comprises C_out sub-weight matrices of size C_in*H*W; arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout. That is, the expansion order of the weights in memory is C_in => W => H => C_out (versus W => H => C_in => C_out in the traditional memory space allocation).
When the operation of each layer is finished, the output control module writes the layer's output back to the external memory in the expected arrangement order. The features of all layers except the first are therefore already arranged as required, and no extra cost is incurred for rearranging data. The weights of a convolutional neural network are unchanged during the inference phase, i.e. they need to be rearranged only once, at system initialization. The cost of adjusting the arrangement of the features and weights in memory is thus relatively small.
Claims (4)
1. A circuit structure for accelerating a convolution layer and a full connection layer of a neural network is characterized in that the convolution layer and the full connection layer are both mapped to the same matrix operation unit in a mode of expanding operation; the access performance loss caused by the discontinuity of the expanded feature and weight reading addresses is reduced by reordering the features and weights of each layer of the neural network; the circuit structure comprises a characteristic/weight prefetching module, a local cache, a matrix operation unit, a temporary data accumulation module and an output control module; wherein:
the feature/weight prefetch module is used for fetching new feature and weight data from an external memory into a local cache, replacing old data that is no longer used; except for the first-layer features of the neural network, all other features and weights are already rearranged in the required pattern, and the rearrangement of the first-layer features is likewise performed in that pattern, so that the feature/weight prefetch module does not need to implement the rearrangement function;
the local cache is used for caching input data required by the matrix arithmetic unit;
the matrix operation unit is used for realizing the operation of a matrix; after the features and the weights are rearranged, the operation of the convolution layer and the full connection layer is mapped into a series of matrix operations, and the matrix operations are realized by calling a matrix operation module for multiple times;
the temporary data accumulation module is used for accumulating the data sent by the matrix operation module; after multiple times of accumulation, the accumulated result, namely the input characteristic of the next layer of network, is sent to an output control module;
the output control module is responsible for sequentially writing the accumulated results back to the external memory according to the rearrangement mode;
the features and the weights are rearranged according to a certain mode, and the specific process is as follows:
an input feature of size C_in*H*W is cut into H*W strips, each of length C_in; the data in the H*W strips are written into memory at sequential addresses; starting from the low address, the data in strip 0 occupy addresses 0 to C_in-1 of the memory space, the data in strip 1 occupy addresses C_in to 2*C_in-1, and so on; the data in the last strip occupy addresses (H*W-1)*C_in to H*W*C_in-1 of the memory space;
the convolution kernel contains C_out sub-weight matrices of size C_in*H*W, and arranging each sub-weight matrix in the same form as the input features completes the readjustment of the weight memory layout.
2. The circuit structure for accelerating the convolutional layer and the fully-connected layer of the neural network as claimed in claim 1, wherein the feature/weight pre-fetching module, the local cache, the matrix operation unit, the temporary data accumulation module and the output control module are scheduled by a pipeline mechanism, so that all hardware units are in a working state every clock cycle.
3. The circuit structure for accelerating convolutional layers and fully-connected layers of a neural network as claimed in claim 1, wherein the operation of the convolutional layers and fully-connected layers is mapped to a series of matrix operations, and the specific flow is as follows:
firstly, converting the operation of a fully-connected layer into the operation of a convolutional layer; let the input feature be a cube of shape C_in*H*W, meaning the input has C_in channels, each of size H*W; for a fully-connected layer, the usual operation is to rearrange the input into a row vector of length C_in*H*W and multiply it by a weight matrix of height C_in*H*W and width C_out; to convert the fully-connected operation into a convolutional operation, the weight matrix of height C_in*H*W and width C_out is split into C_out sub-weight matrices, denoted K0, K1, K2, ..., Kn, n = C_out-1; each sub-weight matrix is a cube of shape C_in*H*W; each sub-weight matrix is convolved with the input features, the shapes of the sub-weight matrices being completely identical to that of the input, namely C_in*H*W; the result of each convolution is a scalar whose value equals the inner product of the feature matrix and the weight matrix; the C_out sub-weight matrices together yield C_out scalars; connecting these C_out scalars into a vector gives the output of the current network layer, namely the fully-connected layer; thus, a fully-connected layer is converted into a convolution operation whose input feature and kernel are both of size C_in*H*W and whose number of output channels is C_out;
secondly, mapping the operation of the convolutional layer into matrix operations; the input feature is of size C_in*H*W, each convolution kernel, i.e. weight, is of size C_in*K*K, and there are C_out kernels, corresponding to C_out output channels; to obtain the first pixel of each output channel, the required C_in*K*K input features are drawn into a row vector, and the C_out kernels are unfolded into a matrix of height C_out and width C_in*K*K; multiplying the feature row vector by the weight matrix yields a row vector of length C_out, each element of which is the first pixel of one output channel; computing all pixels requires H*W matrix operations; thus, the convolutional-layer operation is converted into H*W matrix operations, in which the matrix has height C_out and width C_in*K*K;
Finally, such a matrix operation is decomposed into a plurality of fixed-size matrix operations.
4. The circuit structure for accelerating convolutional layers and fully-connected layers of a neural network of claim 3, wherein the process of decomposing the matrix operation into a plurality of fixed-size matrix operations is:
let the matrix to be operated on be H*W and the fixed-size matrix used for the operation be H_F*W_F; ceil(H/H_F)*ceil(W/W_F) invocations of the H_F*W_F unit are therefore needed, where ceil denotes rounding up; the data used in the first invocation is a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns 0 to H_F-1; the output of the first invocation is a vector of length W_F, which is output as temporary data to the temporary data accumulation module; the data used in the second invocation is again a sub-matrix of the original matrix, occupying rows 0 to W_F-1 and columns H_F to 2H_F-1, realizing iteration along the column direction; the output of the second invocation is again a vector of length W_F; after ceil(H/H_F) iterations, the column-direction iteration is finished, producing ceil(H/H_F) vectors of length W_F; the sum of these vectors is the first W_F results of the H*W matrix operation; the remaining W-W_F results are calculated by analogy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810120895.0A CN108416434B (en) | 2018-02-07 | 2018-02-07 | Circuit structure for accelerating convolutional layer and full-connection layer of neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108416434A CN108416434A (en) | 2018-08-17 |
CN108416434B true CN108416434B (en) | 2021-06-04 |
Family
ID=63126912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810120895.0A Active CN108416434B (en) | 2018-02-07 | 2018-02-07 | Circuit structure for accelerating convolutional layer and full-connection layer of neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108416434B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308194B (en) * | 2018-09-29 | 2021-08-10 | 北京字节跳动网络技术有限公司 | Method and apparatus for storing data |
CN109375952B (en) * | 2018-09-29 | 2021-01-26 | 北京字节跳动网络技术有限公司 | Method and apparatus for storing data |
WO2020062252A1 (en) * | 2018-09-30 | 2020-04-02 | 华为技术有限公司 | Operational accelerator and compression method |
CN111045958B (en) * | 2018-10-11 | 2022-09-16 | 展讯通信(上海)有限公司 | Acceleration engine and processor |
CN109886398A (en) * | 2019-01-03 | 2019-06-14 | 曾集伟 | Neural network matrix multiplying method and Related product |
CN109816108A (en) * | 2019-02-15 | 2019-05-28 | 领目科技(上海)有限公司 | Deep learning accelerator, device and method |
CN109948787B (en) * | 2019-02-26 | 2021-01-08 | 山东师范大学 | Arithmetic device, chip and method for neural network convolution layer |
CN110032538B (en) * | 2019-03-06 | 2020-10-02 | 上海熠知电子科技有限公司 | Data reading system and method |
CN109993283B (en) * | 2019-04-12 | 2023-02-28 | 南京吉相传感成像技术研究院有限公司 | Deep convolution generation type countermeasure network acceleration method based on photoelectric calculation array |
CN110222819B (en) * | 2019-05-13 | 2021-04-20 | 西安交通大学 | Multilayer data partition combined calculation method for convolutional neural network acceleration |
CN111950718B (en) * | 2019-05-16 | 2021-12-07 | 北京知存科技有限公司 | Method for realizing progressive CNN operation by using storage and computation integrated chip |
CN112784973A (en) * | 2019-11-04 | 2021-05-11 | 北京希姆计算科技有限公司 | Convolution operation circuit, device and method |
CN113222136A (en) * | 2020-01-21 | 2021-08-06 | 北京希姆计算科技有限公司 | Convolution operation method and chip |
CN111340224B (en) * | 2020-02-27 | 2023-11-21 | 浙江芯劢微电子股份有限公司 | Accelerated design method of CNN (computer network) suitable for low-resource embedded chip |
WO2022013722A1 (en) * | 2020-07-14 | 2022-01-20 | United Microelectronics Centre (Hong Kong) Limited | Processor, logic chip and method for binarized convolution neural network |
CN112418419B (en) * | 2020-11-20 | 2022-10-11 | 复旦大学 | Data output circuit structure processed by neural network and scheduled according to priority |
CN112614175A (en) * | 2020-12-21 | 2021-04-06 | 苏州拓驰信息技术有限公司 | Injection parameter determination method for hole sealing agent injector based on characteristic decorrelation |
CN113592075B (en) * | 2021-07-28 | 2024-03-08 | 浙江芯昇电子技术有限公司 | Convolution operation device, method and chip |
CN115906948A (en) * | 2023-03-09 | 2023-04-04 | 浙江芯昇电子技术有限公司 | Full-connection-layer hardware acceleration device and method |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5414623A (en) * | 1992-05-08 | 1995-05-09 | Iowa State University Research Foundation | Optoelectronic system for implementation of iterative computer tomography algorithms |
CN1688119A (en) * | 2005-04-01 | 2005-10-26 | 清华大学 | Weighting-based multi-user detection method for DS-CDMA systems |
US20130282634A1 (en) * | 2011-03-31 | 2013-10-24 | Microsoft Corporation | Deep convex network with joint use of nonlinear random projection, restricted boltzmann machine and batch-based parallelizable optimization |
CN103902762A (en) * | 2014-03-11 | 2014-07-02 | 复旦大学 | Circuit structure for conducting least square equation solving according to positive definite symmetric matrices |
CN104679895A (en) * | 2015-03-18 | 2015-06-03 | 成都影泰科技有限公司 | Medical image data storing method |
US20150170020A1 (en) * | 2013-12-13 | 2015-06-18 | Amazon Technologies, Inc. | Reducing dynamic range of low-rank decomposition matrices |
US20170046602A1 (en) * | 2015-08-14 | 2017-02-16 | International Business Machines Corporation | Learning temporal patterns from electronic health records |
CN106503797A (en) * | 2015-10-08 | 2017-03-15 | 上海兆芯集成电路有限公司 | Neural network unit with neural memory and array of neural processing units that collectively shift rows of data received from the neural memory |
CN106855853A (en) * | 2016-12-28 | 2017-06-16 | 成都数联铭品科技有限公司 | Entity relation extraction system based on deep neural network |
US20170221176A1 (en) * | 2016-01-29 | 2017-08-03 | Fotonation Limited | Convolutional neural network |
CN107454966A (en) * | 2015-05-21 | 2017-12-08 | 谷歌公司 | Prefetching weights for use in a neural network processor |
- 2018-02-07 — Application CN201810120895.0A filed in China (CN); granted as patent CN108416434B, status Active
Non-Patent Citations (2)
Title |
---|
Min Wang et al., "Fast Decoding and Hardware Design for Binary-Input Compressive Sensing", IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 2, no. 3, Sep. 30, 2012, full text *
Chen Chen et al., "An Area- and Power-Optimized Convolver Design" (一种面积与功耗优化的卷积器设计), Computer Engineering (计算机工程), Nov. 30, 2010, full text *
Also Published As
Publication number | Publication date |
---|---|
CN108416434A (en) | 2018-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108416434B (en) | Circuit structure for accelerating convolutional layer and full-connection layer of neural network | |
CN111242289B (en) | Convolutional neural network acceleration system and method with expandable scale | |
CN111178519B (en) | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method | |
Shen et al. | Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA | |
JP7329533B2 (en) | Method and accelerator apparatus for accelerating operations | |
CN108205701B (en) | System and method for executing convolution calculation | |
CN108108809B (en) | Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof | |
CN108763612B (en) | Circuit for accelerating operation of pooling layer of neural network | |
KR102523263B1 (en) | Systems and methods for hardware-based pooling | |
CN109409512B (en) | Flexibly configurable neural network computing unit, computing array and construction method thereof | |
US20210019594A1 (en) | Convolutional neural network accelerating device and method | |
KR20170023708A (en) | Convolutional neural network computing apparatus | |
CN108170640B (en) | Neural network operation device and operation method using same | |
CN110580519B (en) | Convolution operation device and method thereof | |
CN112703511B (en) | Operation accelerator and data processing method | |
US20240119114A1 (en) | Matrix Multiplier and Matrix Multiplier Control Method | |
CN114461978B (en) | Data processing method and device, electronic equipment and readable storage medium | |
CN109446478B (en) | Complex covariance matrix calculation system based on iteration and reconfigurable mode | |
CN112215345A (en) | Convolutional neural network operation method and device based on Tensorcore | |
CN109993293B (en) | Deep learning accelerator suitable for heap hourglass network | |
CN114003201A (en) | Matrix transformation method and device and convolutional neural network accelerator | |
CN112836823B (en) | Convolutional neural network back propagation mapping method based on cyclic recombination and blocking | |
CN111667052A (en) | Standard and nonstandard volume consistency transformation method for special neural network accelerator | |
Sakr et al. | Memory-efficient CMSIS-NN with replacement strategy | |
US20030187898A1 (en) | Parallel processing method of an eigenvalue problem for a shared-memory type scalar parallel computer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||