CN108960414B - Method for realizing single broadcast multiple operations based on deep learning accelerator - Google Patents

Method for realizing single broadcast multiple operations based on deep learning accelerator

Info

Publication number
CN108960414B
CN108960414B (application CN201810804165.2A)
Authority
CN
China
Prior art keywords
value
intermediate value
calculation
register
input feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810804165.2A
Other languages
Chinese (zh)
Other versions
CN108960414A (en)
Inventor
陈书明
杨超
李斌
陈海燕
扈啸
张军阳
陈伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201810804165.2A
Publication of CN108960414A
Application granted
Publication of CN108960414B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for realizing single-broadcast multiple operations based on a deep learning accelerator, which comprises the following steps: configuring, for a given multiplier in the multiplier array of the accelerator, a plurality of intermediate value registers for storing intermediate results; and, during deep learning calculation, whenever an input feature value needs to be multiplied by a corresponding weight value, storing the product of the input feature value and the weight value in the corresponding intermediate value register for use in the next calculation, until the calculation of the output feature value is completed. The method is simple to implement and offers low cost, high data utilization, and low energy consumption, and it realizes single-broadcast multiple operations on a deep learning accelerator.

Description

Method for realizing single broadcast multiple operations based on deep learning accelerator
Technical Field
The invention relates to the technical field of deep learning accelerators, and in particular to a method for realizing single-broadcast multiple operations based on a deep learning accelerator.
Background
Deep neural networks (DNNs) are the foundation of artificial intelligence applications, including autonomous driving, cancer detection, computer vision, speech recognition, robotics, complex games, and more. DNNs achieve very high accuracy on artificial intelligence tasks and can even exceed human accuracy. This outstanding performance stems from their ability to extract high-level features from raw sensory data by statistical learning, obtaining an effective representation of the input space from large amounts of data; the price is the high complexity of deep learning. The number of layers in a neural network is large, currently ranging from 5 to as many as 1000, and these many layers greatly increase the required energy consumption, storage space, and computational complexity, while the embedded platforms that execute the DNN inference process face strict limits on energy consumption, computation, and storage cost. Applications such as speech recognition also impose strong latency requirements when DNN inference is performed in the cloud. How to make DNN processing efficient, improving efficiency and throughput without compromising accuracy or increasing hardware cost, has therefore become key to deploying DNNs widely in artificial intelligence systems.
The superior accuracy of DNNs comes at the cost of high computational complexity. To improve computational efficiency, a compute engine such as a GPU is usually employed, exploiting parallelism across data. However, current deep learning accelerators typically fetch a required input feature value once, discard the data after the calculation completes, and fetch the value again the next time it is needed. In deep learning algorithms, the reuse rate of input feature values in convolution operations is very high, and a single data fetch in accelerator hardware is expensive; fetching the data anew for every use and discarding it after each calculation therefore wastes a large amount of energy, fails to make full use of the data already in the calculation process, and keeps the energy consumption and cost of deep learning computation high.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above technical problems in the prior art, the invention provides a method for realizing single-broadcast multiple operations based on a deep learning accelerator that is simple to implement and offers low cost, high data utilization, and low energy consumption.
In order to solve the above technical problems, the technical solution provided by the invention is as follows:
a method for realizing single broadcast multiple operations based on a deep learning accelerator comprises the following steps: configuring a plurality of intermediate value registers for storing intermediate results for a given multiplier in a multiplier array of an accelerator; in the calculation process of executing the deep learning, when the input characteristic value needs to be multiplied by the corresponding weight value, the calculation result of the input characteristic value and the corresponding weight value is stored into the corresponding intermediate value register for use in the next calculation until the calculation of the current output characteristic value is completed.
As a further improvement of the invention: when convolution calculation is executed, after multiplication operation is carried out on input characteristic values and corresponding weight values each time, operation results are stored in the intermediate value registers, addition operation is carried out on the multiplication operation results and the last calculation results stored in the corresponding intermediate value registers so as to complete one multiplication and addition operation, and the addition operation results are stored back to the corresponding intermediate value registers so as to be used for the next addition operation until calculation of output characteristic values is completed.
As a further improvement of the invention: the method also comprises the steps of initializing each intermediate result register to be 0 in advance, and restoring the corresponding intermediate result register to be 0 again after completing the calculation and output of a complete output characteristic.
As a further improvement of the invention: specifically, a set of intermediate value registers is configured for each multiplier to be involved in the calculation.
As a further improvement of the invention: the multipliers that compute the respective input eigenvalues share a set of intermediate value registers.
As a further improvement of the present invention, the specific steps when performing the convolution calculation are:
S1, configuring intermediate value registers R0 to Rn for each multiplier in the multiplier array, and initializing the value of each intermediate value register to 0;
S2, inputting the first input feature value X0, multiplying it by the corresponding weight value, and storing the product in intermediate value register R0;
S3, inputting the i-th input feature value Xi, multiplying it by each of the corresponding weight values, adding the products to the values stored in intermediate value registers R0 to Ri respectively, and storing each sum back in the corresponding register, where i = 1, 2, ..., n;
S4, executing step S3 in a loop until the multiply-add operations for the input feature values X0 to Xn are completed, at which point the initialization of the convolution operation is finished;
S5, completing the calculation of the first output feature value, outputting the value of intermediate value register R0, and restoring the value of R0 to 0;
S6, inputting the next input feature value Xn+1, multiplying it by the corresponding weight values, storing one of the products in intermediate value register R0, adding the remaining products to the values in intermediate value registers R1 to Rn respectively, and storing each sum back in the corresponding register;
S7, completing the calculation of an output feature value, outputting the value of the corresponding intermediate value register, and restoring the value of that register to 0;
S8, taking n as n + 1 and returning to execute steps S6 and S7 until all the output feature values have been calculated.
Compared with the prior art, the invention has the advantages that:
1. In the method for realizing single-broadcast multiple operations based on a deep learning accelerator of the invention, intermediate value registers are configured for designated multipliers, and during calculation the intermediate results produced by an input feature value are stored in these registers, so that when an intermediate result is needed later it can be read directly from its register without fetching the input feature value again. Multiple operations can therefore be carried out after a single fetch of an input feature value, realizing single-broadcast multiple operations: one transmission of an input feature value supports multiple operations. This mode reduces accesses to external input feature values, avoids repeated input of the same feature value, raises the utilization of input feature values, allows data to be streamed from memory at low cost, and achieves efficient data reuse, while simultaneously reducing control complexity and the energy consumption required by calculation, effectively improving computational efficiency.
2. In the method for realizing single-broadcast multiple operations based on a deep learning accelerator, during convolution calculation each input feature value is multiplied by its corresponding weight values, the products are stored in the intermediate value registers, and once initialization is complete each product is added to the result already held in the corresponding register. After initialization, a single transmission of an input feature value thus completes multiple operations, effectively reducing the number of times input feature values must be transmitted and achieving efficient use of the data.
3. The method for realizing single-broadcast multiple operations based on a deep learning accelerator further allows the multipliers that compute corresponding input feature values to share one group of intermediate value registers: when an output feature value is calculated, the products of several multipliers are transmitted in turn and then added, so the total number of intermediate value registers in the multiplier array is reduced, shrinking the hardware design area.
Drawings
Fig. 1 is a schematic implementation flow diagram of a method for implementing single broadcast multiple operations based on a deep learning accelerator according to the embodiment.
Fig. 2 is a schematic diagram of the structural principle of the multiplier array in the deep learning accelerator employed in the present embodiment.
Fig. 3 is a schematic diagram of a single multiplier in an embodiment of the present invention.
FIG. 4 is a schematic diagram of an embodiment of the present invention in which a plurality of multipliers share a set of intermediate value registers.
Fig. 5 is a schematic diagram illustrating the implementation principle of performing convolution operation in the embodiment of the present invention (with a convolution kernel size of 5 × 5).
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments, without thereby limiting the scope of protection of the invention.
As shown in Fig. 1, the method for implementing single-broadcast multiple operations based on a deep learning accelerator in this embodiment comprises: configuring, for a given multiplier in the multiplier array of the accelerator, a plurality of intermediate value registers for storing intermediate results; and, during deep learning calculation, whenever an input feature value needs to be multiplied by a corresponding weight value, storing the product in the corresponding intermediate value register for use in the next calculation, until the calculation of the current output feature value is completed. In the convolution calculations of deep learning, one input feature value must be multiplied by several weight values; in this embodiment, the input feature value is multiplied by the corresponding weight values in turn, and the intermediate results it produces are placed in the intermediate value registers for use in the subsequent calculation process.
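As a concrete illustration of the idea, the following minimal Python sketch (ours, not code from the patent) shows the single-broadcast pattern: one input feature value is fetched once and drives several multiply-add operations against local intermediate value registers. All names are illustrative.

    # Minimal sketch of single-broadcast multiple operations: one fetch of an
    # input feature value x feeds several multiply-add operations, with the
    # partial sums held in local intermediate value registers.

    def broadcast_once(x, weights, registers):
        """Multiply the single fetched value x by each weight and accumulate
        each product into the matching intermediate value register."""
        for k, w in enumerate(weights):
            registers[k] += x * w          # multiply-add, no re-fetch of x
        return registers

    regs = [0.0, 0.0, 0.0]                 # intermediate value registers, init 0
    broadcast_once(2.0, [0.5, 1.0, 1.5], regs)
    print(regs)                            # [1.0, 2.0, 3.0]: three operations, one fetch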
The convolution algorithm in deep learning requires one input feature value to be multiplied by several weight values. Exploiting this characteristic, the embodiment configures intermediate value registers for designated multipliers and stores the intermediate results produced by each input feature value in those registers during calculation. When an intermediate result is needed later, it is read directly from its register, and the operation of fetching the input feature value does not have to be executed again; multiple operations can therefore follow a single fetch, realizing single-broadcast multiple operations in which one transmission of an input feature value supports multiple operations. This mode reduces accesses to external input feature values, avoids repeatedly inputting the same value, raises the utilization of input feature values, lets data streams be called from memory at low cost, and achieves efficient data reuse, while also reducing control complexity and the energy consumption required by calculation, effectively improving calculation efficiency.
In this embodiment, when convolution calculation is performed, each time an input feature value is multiplied by a corresponding weight value the product is added to the previous calculation result stored in the corresponding intermediate value register, completing one multiply-add operation, and the sum is stored back in that register for the next addition, until the calculation of an output feature value is completed.
In this embodiment, specifically, the hardware structure configures a plurality of intermediate value registers for each multiplier in the multiplier array. The number of registers can be determined from the size of the convolution kernel: convolving the kernel with the input feature map produces a number of intermediate values, and the registers are configured according to that number. If the kernel size is a × b (a rows and b columns) and single-broadcast multiple operations are implemented row by row in the manner described above, then each row has b elements and at least b intermediate value registers are needed. A software control program is then configured so that, during operation, each input feature value that arrives is multiplied in turn by the b required weight values, realizing the multiplication of one input feature value by several weight values; the products are stored in the corresponding intermediate value registers for subsequent calculation, and each product is added to the value already in its register, so that the intermediate results of the input feature value's operations are kept in the registers until an output feature value has been completely calculated and is output from its register. Only one fetch of an input feature value is needed for the whole sequence of operations, achieving efficient use of the input feature value and reducing energy consumption. The number of intermediate value registers can be set according to actual requirements.
The multiplier array of the deep learning accelerator adopted in this embodiment is shown in Fig. 2. The array has M rows and N columns; the arrows on the left indicate the flow of input feature values, the arrows at the top indicate the flow of weight values, and the bottom indicates the flow of output feature values after the operation is completed. When the array computes, each of the M row inputs is broadcast to the N multipliers of its row; after receiving the input feature value, each multiplier multiplies it by the weight value entering from above. The M multipliers of a column correspond to the inputs of M input feature maps, and in convolution the corresponding partial multiply-add results must be summed, so after the multiplications complete, the multiply-add results within a column are added and output at the bottom. After the multiply-add, further operations such as pooling, activation, and normalization can be performed, or the results can be buffered as intermediate values.
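As a rough model of this dataflow (a sketch under our own naming, not the patent's hardware description), the row broadcast and per-column summation can be expressed as follows.

    import numpy as np

    # Sketch of the M x N multiplier array dataflow: each row's input feature
    # value is broadcast to the N multipliers of that row, each multiplier
    # scales it by the weight entering its column from above, and each
    # column's M products are summed and leave at the bottom edge.

    M, N = 4, 3
    row_inputs = np.array([1.0, 2.0, 3.0, 4.0])   # one input feature value per row
    weights = np.arange(M * N, dtype=float).reshape(M, N)

    products = row_inputs[:, None] * weights      # row-wise broadcast: M*N multiplies
    column_out = products.sum(axis=0)             # per-column summation at the bottom

    expected = np.array([sum(row_inputs[m] * weights[m, n] for m in range(M))
                         for n in range(N)])
    assert np.allclose(column_out, expected)
    print(column_out)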
In the specific embodiment shown in Fig. 3, a single multiplier of the array is provided with five intermediate value registers R0 to R4 for storing intermediate multiply-add results; the inputs of the multiplier are an input feature value and a weight value. During operation, the multiplier fetches the input feature value and the corresponding weight value and multiplies them, the product is added to the value in the corresponding intermediate value register, the sum is put back in that register, and when the accumulation is complete the value in the corresponding register is output.
In this embodiment, the method further includes initializing each intermediate value register to 0 in advance; after the calculation of a complete output feature value has been completed and the value output, the corresponding intermediate value register is restored to 0 so that it can be used for the next calculation.
In deep learning convolution, one output feature value is usually the sum of the products of several input feature maps with their corresponding weight values: groups of input feature values are multiplied by the corresponding weights and the products are summed to give the final output feature value. By letting several multipliers in the array share one group of intermediate value registers, the total number of intermediate value registers in the array can be reduced, shrinking the hardware design area. As shown in Fig. 4, in this embodiment two multiply units share a group of intermediate value registers R0 to R4; each multiply unit fetches an input feature value and the corresponding weight value, the products are passed to the add unit, the add unit performs the addition together with the value of the selected intermediate value register, and registers R0 to R4 store the intermediate multiply-add results.
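A sketch of this sharing arrangement, under our own illustrative structure following Fig. 4, might look like this: two multiply units feed one add unit, which accumulates into a single shared register group.

    # Two multiply units share one group of intermediate value registers
    # R0-R4: each unit forms its own input-by-weight product, the add unit
    # sums both products with the chosen register's stored value, and the
    # result is written back, so one register group serves both units.

    class SharedRegisterGroup:
        def __init__(self, size=5):
            self.regs = [0.0] * size       # shared intermediate value registers

        def accumulate(self, idx, product_a, product_b):
            """Add unit: combine both multipliers' products with the partial
            sum held in register idx and store the result back."""
            self.regs[idx] += product_a + product_b
            return self.regs[idx]

    group = SharedRegisterGroup()
    # Two input feature maps contributing to the same output feature value:
    pa = 3.0 * 0.5                         # multiply unit A: input 3.0, weight 0.5
    pb = 4.0 * 0.25                        # multiply unit B: input 4.0, weight 0.25
    print(group.accumulate(0, pa, pb))     # R0 now holds the shared partial sum 2.5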
In this embodiment, the specific steps when performing convolution calculation are as follows:
S1, configuring intermediate value registers R0 to Rn for each multiplier in the multiplier array, and initializing the value of each intermediate value register to 0;
S2, inputting the first input feature value X0, multiplying it by the corresponding weight value, and storing the product in intermediate value register R0;
S3, inputting the i-th input feature value Xi, multiplying it by each of the corresponding weight values, adding the products to the values stored in intermediate value registers R0 to Ri respectively, and storing each sum back in the corresponding register, where i = 1, 2, ..., n;
S4, executing step S3 in a loop until the multiply-add operations for the input feature values X0 to Xn are completed, at which point the initialization of the convolution operation is finished;
S5, completing the calculation of the first output feature value, outputting the value of intermediate value register R0, and restoring the value of R0 to 0;
S6, inputting the next input feature value Xn+1, multiplying it by the corresponding weight values, storing one of the products in intermediate value register R0, adding the remaining products to the values in intermediate value registers R1 to Rn respectively, and storing each sum back in the corresponding register;
S7, completing the calculation of an output feature value, outputting the value of the corresponding intermediate value register, and restoring the value of that register to 0;
S8, taking n as n + 1 and returning to execute steps S6 and S7 until all the output feature values have been calculated.
Through the above process, one transmission of an input feature value after initialization completes multiple operations, effectively reducing the number of times input feature values must be transmitted and achieving efficient use of the data.
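The steps S1 to S8 can be checked with a short, self-contained simulation. The sketch below is our own reading of the scheme (not code from the patent), assuming a 1-D sliding window over one kernel row in which register Rj accumulates the output y_j = sum_k w[k]·x[j+k]; the function and variable names are illustrative, and the result is verified against a directly computed convolution.

    def single_broadcast_conv(x, w):
        n = len(w) - 1                      # S1: registers R0..Rn, one per weight
        regs = [0.0] * (n + 1)
        outputs = []
        last = len(x) - len(w)              # index of the final output value
        for i, xi in enumerate(x):          # each input feature value fetched once
            for k, wk in enumerate(w):      # S2/S3/S6: broadcast xi to all weights
                j = i - k                   # product wk*xi belongs to output y_j
                if 0 <= j <= last:
                    regs[j % (n + 1)] += wk * xi   # multiply-add into register
            j_done = i - n                  # y_{i-n} is complete once x_i arrives
            if 0 <= j_done <= last:
                r = j_done % (n + 1)
                outputs.append(regs[r])     # S5/S7: output the finished value
                regs[r] = 0.0               # restore the register to 0 for reuse
        return outputs

    x = [float(v) for v in range(10)]       # X0..X9, as in the 5x5 row example
    w = [1.0, 2.0, 3.0, 4.0, 5.0]           # one row of the convolution kernel
    ref = [sum(w[k] * x[j + k] for k in range(5)) for j in range(len(x) - 4)]
    assert single_broadcast_conv(x, w) == ref
    print(single_broadcast_conv(x, w))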
Taking a convolution kernel of size 5 × 5 as an example, the multiply-add operations and intermediate value storage for one row of the kernel are shown in Fig. 5, where X0 to X9 denote input feature values and R0 to R4 denote intermediate value registers with initial values of 0. Cycles C0 to C14 correspond to Fig. 5(a), the start of the operation, and the execution cycles are as follows:
in cycle C0, the value X0 is input and multiplied by the corresponding weight value, and the product is placed in R0;
in cycles C1 to C2, the value X1 is input and multiplied by the weight values, the products are added to the values in R0 and R1, and the sums are placed back in R0 and R1 respectively;
in the same manner as C1 to C2, multiply-add operations proceed cycle by cycle until C14, completing the initialization of the convolution operation;
in the cycles that follow, one intermediate value register at a time outputs its value and is then set to 0.
Fig. 5(b) corresponds to the cycles immediately after Fig. 5(a): the value previously accumulated in R0 has been fully calculated, output, and the register reset to 0, so R0 is then used to store the product of X5 and the corresponding weight value. During cycles C15 to C19, R1 completes a full output feature value, which can then be output.
Fig. 5(c) corresponds to the cycles immediately after Fig. 5(b): the input feature value X6 requires five multiply-add operations, and when the multiply-add corresponding to R2 completes, R2 holds a finished output feature value.
Each subsequent input feature value likewise undergoes five multiply-add operations in the same manner; whenever a register completes its accumulation, its output feature value is output and the register is set to 0, until finally all the output feature values have been calculated.
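Under the same assumptions as the sketch above, the cycle pattern of Fig. 5 can be reproduced by listing which intermediate value register each multiply-add touches. The trace below is illustrative only (one multiply-add per cycle, as in the C0 to C14 description), not a description of the actual hardware schedule.

    # Trace of per-cycle register usage for one 5-wide kernel row:
    # X0 needs 1 multiply-add, X1 needs 2, ..., X4 needs 5 (C0..C14 = 15
    # initialization cycles), and from X5 on each input needs 5, with one
    # completed output feature value emitted per input.

    n = 4                                   # intermediate value registers R0..R4
    cycle = 0
    for i in range(7):                      # inputs X0..X6
        regs_touched = [j % (n + 1) for j in range(max(0, i - n), i + 1)]
        for r in regs_touched:
            print(f"C{cycle:>2}: X{i} * w -> R{r}")
            cycle += 1
        if i >= n:                          # output y_{i-n} is now complete
            print(f"      output feature value from R{(i - n) % (n + 1)}, reset to 0")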
The foregoing describes only the preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the invention has been disclosed above with reference to preferred embodiments, it is not limited thereto. Any simple modification, equivalent change, or variation made to the above embodiments in accordance with the technical spirit of the invention, without departing from the content of the technical solution of the invention, falls within the protection scope of the technical solution of the invention.

Claims (4)

1. A method for realizing single-broadcast multiple operations based on a deep learning accelerator, characterized by comprising the following steps: configuring, for a given multiplier in the multiplier array of the accelerator, a plurality of intermediate value registers for storing intermediate results; during deep learning calculation, whenever an input feature value needs to be multiplied by a corresponding weight value, storing the product of the input feature value and the weight value in the corresponding intermediate value register for use in the next calculation, until the calculation of the output feature value is completed;
wherein, when convolution calculation is executed, each time an input feature value is multiplied by a corresponding weight value, the product is added to the previous calculation result stored in the corresponding intermediate value register to complete one multiply-add operation, and the sum is stored back in that intermediate value register for the next addition, until the calculation of the output feature value is completed;
the specific steps when performing the convolution calculation are:
S1, configuring intermediate value registers R0 to Rn for each multiplier in the multiplier array, and initializing the value of each intermediate value register to 0;
S2, inputting the first input feature value X0, multiplying it by the corresponding weight value, and storing the product in intermediate value register R0;
S3, inputting the i-th input feature value Xi, multiplying it by each of the corresponding weight values, adding the products to the values stored in intermediate value registers R0 to Ri respectively, and storing each sum back in the corresponding register, where i = 1, 2, ..., n;
S4, executing step S3 in a loop until the multiply-add operations for the input feature values X0 to Xn are completed, at which point the initialization of the convolution operation is finished;
S5, completing the calculation of the first output feature value, outputting the value of intermediate value register R0, and restoring the value of R0 to 0;
S6, inputting the next input feature value Xn+1, multiplying it by the corresponding weight values, storing one of the products in intermediate value register R0, adding the remaining products to the values in intermediate value registers R1 to Rn respectively, and storing each sum back in the corresponding register;
S7, completing the calculation of an output feature value, outputting the value of the corresponding intermediate value register, and restoring the value of that register to 0;
S8, taking n as n + 1 and returning to execute steps S6 and S7 until all the output feature values have been calculated.
2. The method for realizing single-broadcast multiple operations based on a deep learning accelerator as claimed in claim 1, further comprising initializing each intermediate value register to 0 in advance, and restoring the corresponding intermediate value register to 0 after the calculation of a complete output feature value has been completed and the value output.
3. The method for realizing single-broadcast multiple operations based on a deep learning accelerator as claimed in claim 1 or 2, wherein a set of intermediate value registers is configured for each multiplier that is to take part in the calculation.
4. The method for realizing single-broadcast multiple operations based on a deep learning accelerator as claimed in claim 1 or 2, wherein the multipliers that compute the corresponding input feature values share a set of intermediate value registers.
CN201810804165.2A 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator Active CN108960414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810804165.2A CN108960414B (en) 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810804165.2A CN108960414B (en) 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator

Publications (2)

Publication Number Publication Date
CN108960414A 2018-12-07
CN108960414B 2022-06-07

Family

ID=64497953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810804165.2A Active CN108960414B (en) 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator

Country Status (1)

Country Link
CN (1) CN108960414B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200082613A (en) * 2018-12-31 2020-07-08 에스케이하이닉스 주식회사 Processing system
CN111695683B (en) * 2019-03-15 2023-09-01 华邦电子股份有限公司 Memory chip capable of executing artificial intelligent operation and operation method thereof
CN110458277B (en) * 2019-04-17 2021-11-16 上海酷芯微电子有限公司 Configurable precision convolution hardware architecture suitable for deep learning hardware accelerator
CN113033798B (en) * 2019-12-24 2023-11-24 北京灵汐科技有限公司 Device and method for reducing precision loss
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
CN106844294B (en) * 2016-12-29 2019-05-03 华为机器有限公司 Convolution algorithm chip and communication equipment
CN108205702B (en) * 2017-12-29 2020-12-01 中国人民解放军国防科技大学 Parallel processing method for multi-input multi-output matrix convolution

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4489393A (en) * 1981-12-02 1984-12-18 Trw Inc. Monolithic discrete-time digital convolution circuit
CN106066783A (en) * 2016-06-02 2016-11-02 华为技术有限公司 The neutral net forward direction arithmetic hardware structure quantified based on power weight
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN107578095A (en) * 2017-09-01 2018-01-12 中国科学院计算技术研究所 Neural computing device and the processor comprising the computing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Parallel computing method for two-dimensional matrix convolution; Zhang Junyang et al.; Journal of Zhejiang University (Engineering Science); 2018-03-31; Vol. 53, No. 3; abstract and pp. 519-520 *
Computation optimization of convolutional neural networks exploiting parameter sparsity and its FPGA accelerator design; Liu Qinrang et al.; Journal of Electronics & Information Technology; 2018-04-09; pp. 1368-1374 *

Also Published As

Publication number Publication date
CN108960414A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108960414B (en) Method for realizing single broadcast multiple operations based on deep learning accelerator
US11574195B2 (en) Operation method
EP3373210B1 (en) Transposing neural network matrices in hardware
CN107862374B (en) Neural network processing system and processing method based on assembly line
EP3539059B1 (en) Performing kernel striding in hardware
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN107844826B (en) Neural network processing unit and processing system comprising same
CN105892989B (en) Neural network accelerator and operational method thereof
US20180174036A1 (en) Hardware Accelerator for Compressed LSTM
CN109543830B (en) Splitting accumulator for convolutional neural network accelerator
JP6826181B2 (en) Computing device and calculation method
US8694451B2 (en) Neural network system
US20210241071A1 (en) Architecture of a computer for calculating a convolution layer in a convolutional neural network
Huynh Deep neural network accelerator based on FPGA
CN109144469B (en) Pipeline structure neural network matrix operation architecture and method
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN109919312B (en) Operation method and device of convolutional neural network and DPU
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN111652365B (en) Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
Elsokkary et al. CNN acceleration through flexible distribution of computations between a hardwired processor and an FPGA
CN110765413A (en) Matrix summation structure and neural network computing platform
Zernov et al. Associative methods of fuzzy operations implementation
Zholondkovskiy et al. LSTM-type Neural Network Implementation on a Processor Based on Neuromatrix and RISC Cores for Resource-Limited Applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant