WO2022001301A1 - Neural network operation method and related device - Google Patents

Neural network operation method and related device

Info

Publication number
WO2022001301A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
convolution
target
neural network
adders
Prior art date
Application number
PCT/CN2021/088443
Other languages
English (en)
French (fr)
Inventor
李炜
曹庆新
Original Assignee
深圳云天励飞技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳云天励飞技术股份有限公司
Publication of WO2022001301A1 publication Critical patent/WO2022001301A1/zh

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48: Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/57: Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations

Definitions

  • the present application relates to the technical field of neural networks, and in particular, to a neural network computing method and related equipment.
  • With the pursuit of ever-faster neural network operations in computer technology, acceleration algorithms have been proposed to accelerate the operation of convolutional neural networks.
  • For a convolutional neural network operation whose convolution kernel is 3×3, using an acceleration algorithm is the most efficient way to speed up the operation.
  • However, at present only specific dedicated hardware can implement the acceleration algorithm to accelerate the operation of a convolutional neural network with a 3×3 convolution kernel; existing hardware cannot be reused for this purpose, resulting in a waste of hardware resources.
  • the embodiment of the present application discloses a neural network computing method and related equipment, which can make full use of existing hardware resources to realize the acceleration of the convolutional neural network computing by the acceleration algorithm.
  • a first aspect of the embodiments of the present application discloses a neural network computing method, which is applied to a neural network computing unit, where the neural network computing unit is used for convolutional neural network operations, and the convolutional neural network includes a target convolution kernel;
  • the neural network computing unit includes A first adders, A multipliers and B second adders, wherein A is an exponent of 4 and A is greater than or equal to 8, and B is half of A, and the method includes:
  • the A first adders, the A multipliers, and the B second adders perform a one-dimensional acceleration algorithm operation according to the A first target data input to the target convolution kernel, to obtain the convolution operation result of the first operation cycle of the target convolution kernel;
  • the A first target data are shifted to the right, and 2 new target data are moved in while 2 first target data are moved out, to obtain A second target data;
  • the A first adders, the A multipliers and the B second adders perform the one-dimensional acceleration algorithm operation according to the A second target data, to obtain the convolution operation result of the second operation cycle of the target convolution kernel;
  • the convolution operation result of the target convolution kernel is obtained according to the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle.
  • a second aspect of the embodiments of the present application discloses a neural network operation method, which is applied to a two-dimensional convolutional neural network operation, where the two-dimensional convolutional neural network includes a first convolution kernel, a second convolution kernel and a third convolution kernel, and the method includes:
  • for the first convolution kernel, the neural network operation method is performed to obtain a convolution operation result of the first convolution kernel;
  • for the second convolution kernel, the neural network operation method is executed to obtain the target convolution operation result of the second convolution kernel, and the convolution operation result of the second convolution kernel is obtained according to the convolution operation result of the first convolution kernel and the target convolution operation result of the second convolution kernel;
  • for the third convolution kernel, the neural network operation method is executed to obtain the target convolution operation result of the third convolution kernel, and the convolution operation result of the two-dimensional convolutional neural network is obtained according to the convolution operation result of the second convolution kernel and the target convolution operation result of the third convolution kernel.
  • a third aspect of the embodiments of the present application discloses a neural network computing device, which is applied to a neural network computing unit, where the neural network computing unit is used for convolutional neural network operations, and the convolutional neural network includes a target convolution kernel;
  • the neural network computing unit includes A first adders, A multipliers and B second adders, wherein A is an exponent of 4 and A is greater than or equal to 8, and B is half of A, and the neural network computing device includes a processing unit for:
  • performing, through the A first adders, the A multipliers and the B second adders, a one-dimensional acceleration algorithm operation according to the A first target data input to the target convolution kernel, to obtain the convolution operation result of the first operation cycle of the target convolution kernel;
  • a fourth aspect of the embodiments of the present application discloses a neural network computing device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured Executed by the processor, the program includes instructions for executing steps in the method according to any one of the first aspect of the embodiments of the present application.
  • a fifth aspect of an embodiment of the present application discloses a chip, comprising: a processor for calling and running a computer program from a memory, so that a device installed with the chip executes the method according to any one of the first aspect of the embodiments of the present application.
  • a sixth aspect of the embodiments of the present application discloses a computer-readable storage medium, which stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method according to any one of the first aspect of the embodiments of the present application.
  • a seventh aspect of the embodiments of the present application discloses a computer program, and the computer program causes a computer to execute the method according to any one of the first aspects of the embodiments of the present application.
  • in the present application, a one-dimensional acceleration algorithm operation is performed on the A first target data input to the convolution kernel operation through the A first adders, A multipliers and B second adders of the existing convolution operation hardware, to obtain the convolution operation result of the first operation cycle; the A first target data are then shifted to the right, 2 first target data being moved out while 2 new target data are moved in, to obtain A second target data; a one-dimensional acceleration algorithm operation is then performed on the A second target data through the A first adders, A multipliers and B second adders, to obtain the convolution operation result of the second operation cycle;
  • the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle are accumulated to obtain the operation result of the convolution kernel operation; thus, the existing hardware resources can be fully utilized to realize the acceleration of the convolutional neural network operation by the acceleration algorithm.
  • FIG. 1 is a schematic diagram of a convolution operation process provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a convolution operation hardware provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a neural network computing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another convolution operation hardware structure provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another convolution operation hardware structure provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a neural network computing device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a neural network computing device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a convolution operation process.
  • Take as an example a convolutional neural network whose convolution kernel to be calculated is 3×3, where one neural network computing unit has 8 multiply-accumulate units (MAC).
  • Since each neural network computing unit has only 8 multiply-accumulate units, the entire input feature map is cut vertically, with every 8 pixels forming an image block.
  • One multiply-accumulate unit can thus calculate the convolution operation of one image block, and the convolution operation of the entire input feature is performed jointly by multiple multiply-accumulate units.
  • Cycle 1: Read the 8 data in the first row of the input channel, that is, data 0, data 1, data 2, data 3, data 4, data 5, data 6 and data 7, and multiply them respectively by the weight w0 of the convolution kernel to obtain the partial sums of the 8 data in the first row, that is, partial sum 1.
  • Cycle 2: The first row of data is shifted to the right, and the shifted-in data is the shifted-out data or padding data of the adjacent neural network computing unit; that is, data 0 of the first row is shifted out while data 8 is shifted in on the left, obtaining the 8 data of the second row, namely data 1, data 2, data 3, data 4, data 5, data 6, data 7 and data 8; the 8 data of the second row are then multiplied respectively by the weight w1 of the convolution kernel, and the result of multiplying each of the 8 data of the second row by w1 is accumulated with the corresponding partial sum 1 of cycle 1 to obtain partial sum 2.
  • Cycle 3: The first row of data continues to move to the right, and the moved-in data is the shifted-out data or padding data of the adjacent neural network computing unit; that is, data 1 of the second row is moved out while data 9 is shifted in on the left, obtaining the 8 data of the third row, namely data 2, data 3, data 4, data 5, data 6, data 7, data 8 and data 9; the 8 data of the third row are then multiplied respectively by the weight w2 of the convolution kernel, and the result of multiplying each of the 8 data of the third row by w2 is accumulated with the corresponding partial sum 2 of cycle 2 to obtain partial sum 3, which is the partial sum after the first row of the convolution kernel has been calculated.
  • Cycle 4 to cycle 6: Read the 8 data in the second row of the input channel and repeat the entire calculation process of cycle 1 to cycle 3, obtaining partial sum 4, partial sum 5 and partial sum 6 respectively; the convolution kernel weights of cycle 4, cycle 5 and cycle 6 are w3, w4 and w5 respectively; when calculating partial sum 4 of cycle 4, partial sum 3 of cycle 3 needs to be added; partial sum 6 is the partial sum after the second row of the convolution kernel has been calculated.
  • Cycle 7 to cycle 9: Read the 8 data in the third row of the input channel and repeat the entire calculation process of cycle 1 to cycle 3, obtaining partial sum 7, partial sum 8 and partial sum 9 respectively; the convolution kernel weights of cycle 7, cycle 8 and cycle 9 are w6, w7 and w8 respectively; when calculating partial sum 7 of cycle 7, partial sum 6 of cycle 6 needs to be added; partial sum 9 is the convolution value after the entire convolution kernel has been calculated.
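  • The cycle-by-cycle process above can be summarized in a minimal sketch (illustrative only, not the patent's hardware): it assumes one 8-pixel block, a 3×3 kernel flattened as w0..w8 with one weight per cycle, and that the pixels shifted in during cycles 2-3, 5-6 and 8-9 are already appended to each input row as neighbour/padding data; the function and variable names are made up for the example.

```python
def baseline_3x3_row_block(rows, weights, block=8):
    # rows:    3 input rows, each holding block + 2 pixels so the two right-shift
    #          cycles can "shift in" data from the neighbouring unit / padding.
    # weights: the 9 kernel weights w0..w8, one weight per cycle.
    partial = [0.0] * block               # partial sum carried across all 9 cycles
    cycle = 0
    for r in range(3):                    # kernel rows -> cycles 1-3, 4-6, 7-9
        for s in range(3):                # right-shift position within the row
            window = rows[r][s:s + block]             # the 8 data of this cycle
            partial = [p + d * weights[cycle] for p, d in zip(partial, window)]
            cycle += 1
    return partial                        # partial sum 9 = the 8 convolution outputs

# Hypothetical usage with made-up data:
rows = [list(range(10)), list(range(10, 20)), list(range(20, 30))]
print(baseline_3x3_row_block(rows, list(range(1, 10))))
```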
  • FIG. 2 is a schematic structural diagram of a convolution operation hardware provided by an embodiment of the present application.
  • FIG. 2 is a calculation unit for convolution operation, which can be used for the convolution calculation shown in FIG. 1 .
  • In one calculation unit there are 8 multiply-accumulate units, and each multiply-accumulate unit has one multiplier and one adder, so one calculation unit includes 8 multipliers and 8 adders.
  • During operation, each operation cycle provides the same weight to every multiply-accumulate unit, the weight differs from cycle to cycle, and 9 cycles are used to realize one convolution operation with a 3×3 convolution kernel.
  • The 8 data of cycle 1 are shifted to the right to obtain the 8 data of cycle 2, that is, data 1, data 2, data 3, data 4, data 5, data 6, data 7 and data 8; the 8 data of cycle 2 are multiplied respectively by the weight w1 of the convolution kernel through the multipliers, each of the 8 data of cycle 2 yielding a corresponding multiplication result, and partial sum 1 of cycle 1 at the corresponding position is returned to the adder;
  • the multiplication result corresponding to each of the 8 data of cycle 2 is added to partial sum 1 of cycle 1 at the corresponding position through the adder, obtaining partial sum 2. The calculation process of cycle 2 is repeated to obtain partial sum 3 of cycle 3, thereby obtaining the operation result of the first row of the convolution kernel.
  • FIG. 3 shows a neural network operation method provided by an embodiment of the present application, which is applied to a neural network computing unit.
  • The neural network computing unit is used for convolutional neural network operations, the convolutional neural network includes a target convolution kernel, and the neural network computing unit includes A first adders, A multipliers and B second adders, where A is an exponent of 4 and A is greater than or equal to 8, and B is half of A;
  • the method includes but is not limited to the following steps:
  • Step S301: Perform a one-dimensional acceleration algorithm operation according to the A first target data input to the target convolution kernel through the A first adders, the A multipliers and the B second adders, to obtain the convolution operation result of the first operation cycle of the target convolution kernel.
  • the convolutional neural network is a convolutional neural network with a convolution kernel of 3×3, and the target convolution kernel is one of the rows of the convolution kernel.
  • A is 8; that is, a one-dimensional acceleration algorithm operation is performed through the 8 first adders, the 8 multipliers and the 4 second adders according to the 8 first target data input to the target convolution kernel, to obtain the convolution operation result of the first operation cycle of the target convolution kernel.
  • the one-dimensional acceleration algorithm may be a one-dimensional winograd algorithm, and the one-dimensional winograd algorithm is briefly introduced below.
  • Take the one-dimensional winograd convolution acceleration algorithm with a 1×3 convolution kernel as an example, and assume that the input features are [d0, d1, d2, d3], the convolution kernel is [w0, w1, w2], and the convolution calculation result consists of 2 values [r0, r1].
  • r0 can be expressed as m0+m1+m2,
  • and r1 can be expressed as m1-m2-m3; here, m0, m1, m2 and m3 can be calculated in advance, or can be calculated dynamically during the weight reading process, according to the following formulas: m0 = (d0-d2)×g0, where g0 = w0; m1 = (d1+d2)×g1, where g1 = (w0+w1+w2)/2; m2 = (d2-d1)×g2, where g2 = (w0-w1+w2)/2; m3 = (d1-d3)×g3, where g3 = w2.
  • The one-dimensional winograd algorithm thus uses 4 multipliers and 4 adders to calculate 2 results when computing a convolution with a 1×3 kernel,
  • whereas an ordinary convolution calculation requires 6 multipliers and 4 adders to calculate 2 results at the same time. Therefore, for a fixed amount of multiplier resources, the winograd algorithm can accelerate the convolution calculation.
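  • As a minimal, illustrative sketch of the arithmetic described above (not a description of the patent's hardware), the one-dimensional winograd F(2,3) computation with 4 multiplications can be checked against a direct 1×3 convolution that needs 6 multiplications; all names and test values below are made up for the example.

```python
def winograd_f23(d, w):
    # d = [d0, d1, d2, d3] input data, w = [w0, w1, w2] kernel weights.
    # Produces the 2 outputs [r0, r1] with only 4 multiplications.
    d0, d1, d2, d3 = d
    w0, w1, w2 = w
    # transformed weights g0..g3 (precomputed or computed while loading weights)
    g0, g1, g2, g3 = w0, (w0 + w1 + w2) / 2, (w0 - w1 + w2) / 2, w2
    # first addition stage followed by the 4 multiplications
    m0 = (d0 - d2) * g0
    m1 = (d1 + d2) * g1
    m2 = (d2 - d1) * g2
    m3 = (d1 - d3) * g3
    # second addition stage
    return [m0 + m1 + m2, m1 - m2 - m3]

def direct_conv3(d, w):
    # ordinary 1x3 convolution producing the same 2 outputs (6 multiplications)
    return [d[0] * w[0] + d[1] * w[1] + d[2] * w[2],
            d[1] * w[0] + d[2] * w[1] + d[3] * w[2]]

d, w = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
assert winograd_f23(d, w) == direct_conv3(d, w)
```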
  • Step S302 Shift the A pieces of first target data to the right, move out two pieces of first target data and move in two pieces of new target data to obtain A pieces of second target data.
  • the new target data that is moved in is the moved out data or filling data of adjacent neural network computing units.
  • Step S303 Execute the one-dimensional acceleration algorithm operation according to the A second target data through the A first adders, the A multipliers, and the B second adders to obtain the target The convolution operation result of the second operation cycle of the convolution kernel.
  • Specifically, the one-dimensional acceleration algorithm operation is executed according to the 8 second target data input to the target convolution kernel through the 8 first adders, the 8 multipliers and the 4 second adders, to obtain the convolution operation result of the second operation cycle of the target convolution kernel.
  • Step S304 Obtain a convolution operation result of the target convolution kernel according to the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle.
  • Since one execution of the one-dimensional acceleration algorithm yields 2 operation results, that is, the operation results of 2 of the A target data, and A is an exponent of 4 with A greater than or equal to 8, at least 2 cycles are required to obtain 4 operation results.
  • When A is 2 times 4 or more, the one-dimensional acceleration algorithm can be executed multiple times within one operation cycle: for example, when A is 8, the 8 first target data are divided into 2 groups of 4, the one-dimensional acceleration algorithm is executed once per group in the first cycle to give 4 operation results, the second cycle gives another 4 operation results, and the two cycles together give the operation results of the 8 input data, i.e., the operation result of the convolution kernel.
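  • A sketch of this per-cycle composition for A = 8 (illustrative only, reusing the winograd_f23 helper from the sketch above; it assumes the 2 data shifted in per cycle are already available from the neighbouring unit or padding, so the block receives 10 values d0..d9 in total):

```python
def row_kernel_outputs(data10, w_row):
    # data10: the 8 pixels of this block plus the 2 pixels shifted in from the
    #         neighbouring computing unit / padding (10 values d0..d9).
    # w_row:  the 3 weights of one row of the 3x3 convolution kernel.
    # Requires winograd_f23 from the previous sketch.
    out = [0.0] * 8
    # cycle 1: two groups of 4 processed in parallel -> results r0, r1 and r4, r5
    out[0:2] = winograd_f23(data10[0:4], w_row)
    out[4:6] = winograd_f23(data10[4:8], w_row)
    # cycle 2, after shifting out 2 data and shifting in 2 new data -> r2, r3 and r6, r7
    out[2:4] = winograd_f23(data10[2:6], w_row)
    out[6:8] = winograd_f23(data10[6:10], w_row)
    return out                            # r0..r7 for this row of the kernel
```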
  • It can be seen that the neural network operation method uses the A first adders, A multipliers and B second adders in the existing convolution operation hardware to perform a one-dimensional acceleration algorithm operation on the A first target data input to the convolution kernel operation, obtaining the convolution operation result of the first operation cycle; the A first target data are then shifted to the right, 2 first target data being moved out while 2 new target data are moved in, to obtain A second target data; a one-dimensional acceleration algorithm operation is then performed on the A second target data through the A first adders, A multipliers and B second adders, to obtain the convolution operation result of the second operation cycle.
  • the new target data is target data or filling data moved out by a neural network computing unit adjacent to the neural network computing unit.
  • In some possible examples, the performing, through the A first adders, the A multipliers and the B second adders, of the one-dimensional acceleration algorithm operation according to the A first target data input to the target convolution kernel includes: performing the first addition operation of the one-dimensional acceleration algorithm through the A first adders according to the A first target data, to obtain A first operation data; performing the multiplication operation of the one-dimensional acceleration algorithm through the A multipliers according to the A first operation data and the weights of the one-dimensional acceleration algorithm, to obtain A second operation data; and performing the second addition operation of the one-dimensional acceleration algorithm through the B second adders according to the A second operation data, to obtain the convolution operation result of the first operation cycle of the target convolution kernel.
  • For example, assume that A is 4 and the 4 first target data of the input feature are [d0, d1, d2, d3]. The first addition operation of the one-dimensional acceleration algorithm is first executed through the 4 first adders according to the 4 first target data,
  • that is, d0-d2, d1+d2, d2-d1 and d1-d3 are computed, obtaining 4 first operation data; the multiplication operation of the one-dimensional acceleration algorithm is then executed through the 4 multipliers according to the 4 first operation data and the weights of the one-dimensional acceleration algorithm, that is, m0=(d0-d2)×g0, m1=(d1+d2)×g1, m2=(d2-d1)×g2 and m3=(d1-d3)×g3, obtaining the 4 second operation data m0, m1, m2 and m3; the second addition operation of the one-dimensional acceleration algorithm is then executed through the 2 second adders according to the 4 second operation data, that is, m0+m1+m2 and m1-m2-m3, obtaining the two operation results [r0, r1].
  • When A is 8, the 8 first target data of the input feature are [d0, d1, d2, d3, d4, d5, d6, d7], which are divided into the two groups [d0, d1, d2, d3] and [d4, d5, d6, d7]; there are then 8 first adders, 8 multipliers and 4 second adders, the two groups execute the one-dimensional acceleration algorithm at the same time, and two groups of results [r0, r1] are finally obtained at the same time, that is, 4 operation results.
  • the method before the A multipliers perform the multiplication operation of the one-dimensional acceleration algorithm according to the A first operation data and the weight of the one-dimensional acceleration algorithm to obtain A second operation data , the method further includes: determining the weight of the one-dimensional acceleration algorithm according to the weight of the target convolution kernel.
  • In some possible examples, the performing of the first addition operation of the one-dimensional acceleration algorithm through the A first adders according to the A first target data to obtain A first operation data includes: performing the following operations on each of C first target data groups to obtain C first operation data groups, each of the C first operation data groups including 4 first operation data, where C is a quarter of A, the C first target data groups are obtained by grouping the A first target data into groups of 4, the C first target data groups are in one-to-one correspondence with C first adder groups, the C first adder groups are obtained by grouping the A first adders into groups of 4, and the operation performed by each of the C first adder groups is the same: the first addition operation of the one-dimensional acceleration algorithm is performed through the first adder group corresponding to the currently processed first target data group, to obtain the first operation data group corresponding to the currently processed first target data group.
  • For example, when A is 8, the first addition operation of the one-dimensional acceleration algorithm is performed through 4 of the 8 first adders according to 4 of the 8 first target data input to the target convolution kernel, to obtain the 4 first operation data of the first group; the first addition operation of the one-dimensional acceleration algorithm is performed through the other 4 of the 8 first adders according to the other 4 of the 8 first target data input to the target convolution kernel, to obtain the 4 first operation data of the second group.
  • Specifically, the 8 first target data of the input feature are [d0, d1, d2, d3, d4, d5, d6, d7], which are divided into two groups [d0, d1, d2, d3] and [d4, d5, d6, d7]; for the first group [d0, d1, d2, d3], d0-d2, d1+d2, d2-d1 and d1-d3 are executed through 4 of the 8 first adders, obtaining the 4 first operation data of the first group; for the second group [d4, d5, d6, d7], d4-d6, d5+d6, d6-d5 and d5-d7 are executed through the other 4 of the 8 first adders, obtaining the 4 first operation data of the second group.
  • When A is 16, the 16 first target data of the input feature are [d0, d1, d2, ..., d15], which are divided into 4 groups, namely [d0, d1, d2, d3], [d4, d5, d6, d7], [d8, d9, d10, d11] and [d12, d13, d14, d15]; for the first group [d0, d1, d2, d3], d0-d2, d1+d2, d2-d1 and d1-d3 are executed through the 4 first adders of first adder group 1 of the 16 first adders, obtaining the 4 first operation data of first operation data group 1; for the second group [d4, d5, d6, d7], d4-d6, d5+d6, d6-d5 and d5-d7 are executed through the 4 first adders of first adder group 2 of the 16 first adders, obtaining the 4 first operation data of first operation data group 2; the remaining two groups are processed in the same way.
  • In some possible examples, the performing of the multiplication operation of the one-dimensional acceleration algorithm through the A multipliers according to the A first operation data and the weights of the one-dimensional acceleration algorithm to obtain A second operation data includes: performing the following operations on each of the C first operation data groups and the 4 weights of the one-dimensional acceleration algorithm to obtain C second operation data groups, each of the C second operation data groups including 4 second operation data, where the 4 first operation data of each of the C first operation data groups are in one-to-one correspondence with the 4 weights of the one-dimensional acceleration algorithm, the C first operation data groups are in one-to-one correspondence with C multiplier groups, and the C multiplier groups are obtained by grouping the A multipliers into groups of 4: the multiplication operation of the one-dimensional acceleration algorithm is performed through the multiplier group corresponding to the currently processed first operation data group, to obtain the second operation data group corresponding to the currently processed first operation data group.
  • For example, when A is 8, the multiplication operation of the one-dimensional acceleration algorithm is performed through 4 of the 8 multipliers according to the 4 first operation data of the first group and the 4 weights of the one-dimensional acceleration algorithm, to obtain the 4 second operation data of the first group, the 4 first operation data of the first group being in one-to-one correspondence with the 4 weights of the acceleration algorithm; the multiplication operation of the one-dimensional acceleration algorithm is performed through the other 4 of the 8 multipliers according to the 4 first operation data of the second group and the 4 weights of the one-dimensional acceleration algorithm, to obtain the 4 second operation data of the second group,
  • the 4 first operation data of the second group being in one-to-one correspondence with the 4 weights of the acceleration algorithm.
  • In some possible examples, the performing of the second addition operation of the one-dimensional acceleration algorithm through the B second adders according to the A second operation data to obtain the convolution operation result of the first operation cycle of the target convolution kernel includes: performing the following operations on each of the C second operation data groups to obtain C convolution operation result groups, each of the C convolution operation result groups including 2 convolution operation results, where the C second operation data groups are in one-to-one correspondence with C second adder groups, the C second adder groups are obtained by grouping the B second adders into groups of 2, and the operation performed by each of the C second adder groups is the same: the second addition operation of the one-dimensional acceleration algorithm is performed through the second adder group corresponding to the currently processed second operation data group, to obtain the convolution operation result group corresponding to the currently processed second operation data group.
  • For example, when A is 8, the second addition operation of the one-dimensional acceleration algorithm is performed through 2 of the 4 second adders according to the 4 second operation data of the first group, to obtain the 2 convolution operation results of the first group; the second addition operation of the one-dimensional acceleration algorithm is performed through the other 2 of the 4 second adders according to the 4 second operation data of the second group, to obtain the 2 convolution operation results of the second group; the convolution operation result of the first operation cycle of the target convolution kernel is obtained according to the 2 convolution operation results of the first group and the 2 convolution operation results of the second group.
  • Specifically, in cycle 1, for the 4 second operation data m0, m1, m2 and m3 of the first group, m0+m1+m2 and m1-m2-m3 are executed through 2 of the 4 second adders, obtaining the 2 convolution operation results r0 and r1 of the first group; for the 4 second operation data m4, m5, m6 and m7 of the second group, m4+m5+m6 and m5-m6-m7 are executed through the other 2 of the 4 second adders, obtaining the 2 convolution operation results r4 and r5 of the second group.
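  • Putting the three stages together, one operation cycle for a general A can be sketched as follows (illustrative only; it treats A as a multiple of 4, matching the A = 8 and A = 16 examples, and takes the transformed weights g0..g3 of the one-dimensional acceleration algorithm as an input):

```python
def one_cycle(targets, g):
    # targets: A target data, processed in C = A/4 groups of 4.
    # g:       the 4 weights [g0, g1, g2, g3] of the one-dimensional acceleration algorithm.
    # Returns 2 convolution results per group, i.e. A/2 results per cycle.
    results = []
    for c in range(len(targets) // 4):
        d0, d1, d2, d3 = targets[4 * c: 4 * c + 4]
        a = [d0 - d2, d1 + d2, d2 - d1, d1 - d3]        # first addition stage (4 first adders)
        m = [a[i] * g[i] for i in range(4)]             # multiplication stage (4 multipliers)
        results += [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]  # second addition stage (2 second adders)
    return results
```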
  • the technical solution provided by the application is described above from the operation process of a convolution kernel.
  • The embodiments of the present application also provide a neural network operation method, which is applied to the operation of a two-dimensional convolutional neural network.
  • The two-dimensional convolutional neural network includes a first convolution kernel, a second convolution kernel and a third convolution kernel, and the method includes: for the first convolution kernel, executing the neural network operation method shown in FIG. 3 to obtain the convolution operation result of the first convolution kernel; for the second convolution kernel, executing the neural network operation method shown in FIG. 3 to obtain the target convolution operation result of the second convolution kernel, and obtaining the convolution operation result of the second convolution kernel according to the convolution operation result of the first convolution kernel and the target convolution operation result of the second convolution kernel; for the third convolution kernel, executing the neural network operation method shown in FIG. 3 to obtain the target convolution operation result of the third convolution kernel, and obtaining the convolution operation result of the two-dimensional convolutional neural network according to the convolution operation result of the second convolution kernel and the target convolution operation result of the third convolution kernel.
  • For example, the 8 first target data are [d0, d1, d2, d3, d4, d5, d6, d7], divided into two groups [d0, d1, d2, d3] and [d4, d5, d6, d7]; for the first group [d0, d1, d2, d3], d0-d2, d1+d2, d2-d1 and d1-d3 are executed through 4 of the 8 first adders, obtaining the 4 first operation data of the first group, denoted as (d0-d2), (d1+d2), (d2-d1) and (d1-d3); for the second group [d4, d5, d6, d7], d4-d6, d5+d6, d6-d5 and d5-d7 are executed through the other 4 of the 8 first adders, obtaining the 4 first operation data of the second group, denoted as (d4-d6), (d5+d6), (d6-d5) and (d5-d7).
  • The first target data are shifted to the right, and the 8 second target data obtained are [d2, d3, d4, d5, d6, d7, d8, d9], which are divided into two groups [d2, d3, d4, d5] and [d6, d7, d8, d9]; for the first group [d2, d3, d4, d5], d2-d4, d3+d4, d4-d3 and d3-d5 are executed through 4 of the 8 first adders, obtaining the 4 first operation data of the first group, denoted as (d2-d4), (d3+d4), (d4-d3) and (d3-d5); for the second group [d6, d7, d8, d9], d6-d8, d7+d8, d8-d7 and d7-d9 are executed through the other 4 of the 8 first adders, obtaining the 4 first operation data of the second group, denoted as (d6-d8), (d7+d8), (d8-d7) and (d7-d9).
  • The first target data of the second convolution kernel are shifted to the right to obtain the 8 second target data of the second convolution kernel, and the calculation process of the second operation cycle is repeated to obtain the target convolution operation results r2, r3, r6 and r7 of the fourth operation cycle;
  • the target convolution operation results r2, r3, r6 and r7 of the fourth operation cycle are added to the operation results r2, r3, r6 and r7 of the first-row convolution kernel, respectively, to obtain the operation results r2, r3, r6 and r7 of the second-row convolution kernel.
  • The first target data of the third convolution kernel are shifted to the right to obtain the 8 second target data of the third convolution kernel, and the calculation process of the second operation cycle is repeated to obtain the target convolution operation results r2, r3, r6 and r7 of the sixth operation cycle;
  • the target convolution operation results r2, r3, r6 and r7 of the sixth operation cycle are added to the operation results r2, r3, r6 and r7 of the second-row convolution kernel, respectively, to obtain the operation results r2, r3, r6 and r7 of the third-row convolution kernel.
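  • A compact sketch of this two-dimensional procedure, reusing row_kernel_outputs from the earlier sketch (illustrative only; the kernel is given as its three rows of weights, and each row's target results are accumulated onto the results of the preceding rows, as described above):

```python
def conv3x3_block(rows, kernel):
    # rows:   3 input rows, each with 10 pixels (8 of the block + 2 shifted in).
    # kernel: [[w0, w1, w2], [w3, w4, w5], [w6, w7, w8]] -> first/second/third convolution kernel.
    acc = [0.0] * 8
    for r in range(3):
        row_result = row_kernel_outputs(rows[r], kernel[r])   # target results of this row kernel
        acc = [a + b for a, b in zip(acc, row_result)]        # accumulate onto previous rows' results
    return acc                            # convolution result of the 2D 3x3 kernel for 8 pixels
```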
  • FIG. 4 and FIG. 5 illustrate the embodiments of the present application by taking the bit width of the input data as 8 and 16, respectively, but the bit width of the input data in the embodiments of the present application is not limited to this; it may also be 32, 64, 18, etc., and the specific calculation rule is the same as that in FIG. 4 or FIG. 5, which will not be repeated in this application.
  • FIG. 6 is a schematic structural diagram of a neural network computing device 600 .
  • The neural network computing device 600 is applied to a neural network computing unit, the neural network computing unit is used for convolutional neural network operations, the convolutional neural network includes a target convolution kernel, and the neural network computing unit includes A first adders, A multipliers and B second adders, where A is an exponent of 4 and A is greater than or equal to 8, and B is half of A.
  • The neural network operation device 600 specifically includes: a processing unit 602 and a communication unit 603.
  • the processing unit 602 is used to control and manage the actions of the neural network computing unit.
  • the processing unit 602 is used to support the neural network computing unit to perform the steps in the above method embodiments and other processes for the techniques described herein.
  • the communication unit 603 is used to support the communication between the neural network computing unit and other devices.
  • the neural network computing device 600 may further include a storage unit 601 for storing program codes and data of the neural network computing unit.
  • The processing unit 602 may be a processor or a controller, such as a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
  • the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
  • the communication unit 603 may be a communication interface, a transceiver, a transceiver circuit, etc., and the storage unit 601 may be a memory.
  • the processing unit 602 is configured to perform any step in the above method embodiments, and when performing data transmission such as sending, the communication unit 603 can be selectively invoked to complete corresponding operations. A detailed description will be given below.
  • The processing unit 602 is configured to: perform a one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to the A first target data input to the target convolution kernel, to obtain the convolution operation result of the first operation cycle of the target convolution kernel; shift the A first target data to the right, moving out 2 first target data while moving in 2 new target data, to obtain A second target data; perform the one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to the A second target data, to obtain the convolution operation result of the second operation cycle of the target convolution kernel; and obtain the convolution operation result of the target convolution kernel according to the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle.
  • the new target data is target data or filling data moved out by a neural network computing unit adjacent to the neural network computing unit.
  • The processing unit 602 is specifically configured to: perform the first addition operation of the one-dimensional acceleration algorithm through the A first adders according to the A first target data, to obtain A first operation data; perform the multiplication operation of the one-dimensional acceleration algorithm through the A multipliers according to the A first operation data and the weights of the one-dimensional acceleration algorithm, to obtain A second operation data; and perform the second addition operation of the one-dimensional acceleration algorithm through the B second adders according to the A second operation data, to obtain the convolution operation result of the first operation cycle of the target convolution kernel.
  • the processing unit 602 is further configured to: determine the weight of the one-dimensional acceleration algorithm according to the weight of the target convolution kernel.
  • The processing unit 602 is specifically configured to: perform the following operations on each of the C first target data groups to obtain C first operation data groups, where each of the C first operation data groups includes 4 first operation data, C is a quarter of A, the C first target data groups are obtained by grouping the A first target data into groups of 4, the C first target data groups are in one-to-one correspondence with C first adder groups, the C first adder groups are obtained by grouping the A first adders into groups of 4, and the operation performed by each of the C first adder groups is the same: the first addition operation of the one-dimensional acceleration algorithm is performed through the first adder group corresponding to the currently processed first target data group, to obtain the first operation data group corresponding to the currently processed first target data group.
  • In terms of the A multipliers performing the multiplication operation of the one-dimensional acceleration algorithm according to the A first operation data and the weights of the one-dimensional acceleration algorithm to obtain A second operation data,
  • the processing unit 602 is specifically configured to: perform the following operations on each of the C first operation data groups and the 4 weights of the one-dimensional acceleration algorithm to obtain C second operation data groups, where each of the C second operation data groups includes 4 second operation data, the 4 first operation data of each of the C first operation data groups are in one-to-one correspondence with the 4 weights of the one-dimensional acceleration algorithm, and the C first operation data groups are in one-to-one correspondence with C multiplier groups obtained by grouping the A multipliers into groups of 4:
  • the multiplication operation of the one-dimensional acceleration algorithm is performed through the multiplier group corresponding to the currently processed first operation data group, to obtain the second operation data group corresponding to the currently processed first operation data group.
  • The processing unit 602 is specifically configured to: perform the following operations on each of the C second operation data groups to obtain C convolution operation result groups, where each of the C convolution operation result groups includes 2 convolution operation results, the C second operation data groups are in one-to-one correspondence with C second adder groups, the C second adder groups are obtained by grouping the B second adders into groups of 2, and the operation performed by each of the C second adder groups is the same: the second addition operation of the one-dimensional acceleration algorithm is performed through the second adder group corresponding to the currently processed second operation data group, to obtain the convolution operation result group corresponding to the currently processed second operation data group.
  • In this way, a one-dimensional acceleration algorithm operation is first performed on the A first target data input to the convolution kernel operation through the A first adders, A multipliers and B second adders, to obtain the convolution operation result of the first operation cycle; the A first target data are then shifted to the right, 2 first target data being moved out while 2 new target data are moved in, to obtain A second target data; a one-dimensional acceleration algorithm operation is then performed on the A second target data through the A first adders, A multipliers and B second adders, to obtain the convolution operation result of the second operation cycle; the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle are then accumulated to obtain the operation result of the convolution kernel operation; thus, the existing hardware resources can be fully utilized to realize the acceleration of the convolutional neural network operation by the acceleration algorithm.
  • FIG. 7 is a schematic structural diagram of a neural network computing device 710 provided by an embodiment of the present application.
  • The neural network computing device 710 includes a communication interface 711, a processor 712, a memory 713 and at least one communication bus 714 for connecting the communication interface 711, the processor 712 and the memory 713.
  • The memory 713 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM); the memory 713 is used for storing related instructions and data.
  • the communication interface 711 is used to receive and transmit data.
  • The processor 712 may be one or more central processing units (CPUs); in the case where the processor 712 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
  • The neural network operation device 710 is used for convolutional neural network operations, the convolutional neural network includes a target convolution kernel, and the neural network operation device 710 includes A first adders, A multipliers and B second adders, where A is an exponent of 4 and A is greater than or equal to 8, and B is half of A. The processor 712 in the neural network computing device 710 is configured to read the one or more program codes stored in the memory 713 and perform the following operations: performing a one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to the A first target data input to the target convolution kernel, to obtain the convolution operation result of the first operation cycle of the target convolution kernel; shifting the A first target data to the right, moving out 2 first target data while moving in 2 new target data, to obtain A second target data; performing the one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to the A second target data, to obtain the convolution operation result of the second operation cycle of the target convolution kernel; and obtaining the convolution operation result of the target convolution kernel according to the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle.
  • In the neural network operation device 710 described in FIG. 7, the A first adders, A multipliers and B second adders in the existing convolution operation hardware are used to perform a one-dimensional acceleration algorithm operation on the A first target data input to the convolution kernel operation, obtaining the convolution operation result of the first operation cycle; the A first target data are then shifted to the right, 2 first target data being moved out while 2 new target data are moved in, to obtain A second target data; a one-dimensional acceleration algorithm operation is then performed on the A second target data through the A first adders, A multipliers and B second adders, to obtain the convolution operation result of the second operation cycle; the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle are then accumulated to obtain the operation result of the convolution kernel operation; thus, the existing hardware resources can be fully utilized to realize the acceleration of the convolutional neural network operation by the acceleration algorithm.
  • FIG. 8 is a hardware structure of a chip provided by an embodiment of the present application, and the chip includes a neural network processor 800 .
  • the chip can be set in the execution device 710 as shown in FIG. 7 , and the neural network operation methods in the above method embodiments can all be implemented in the chip as shown in FIG. 8 .
  • the neural network processor 800 is mounted on the host CPU (host CPU) as a coprocessor, and the host CPU assigns tasks.
  • the core part of the neural network processor 800 is the operation circuit 803, and the controller 804 controls the operation circuit 803 to extract the data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit 803 includes multiple processing units or computing units (process engine, PE).
  • the arithmetic circuit 803 is a two-dimensional systolic array.
  • the arithmetic circuit 803 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 803 is a general-purpose matrix processor.
  • the operation circuit 803 fetches the data corresponding to the matrix B from the weight memory 802 and buffers it on each PE in the operation circuit 803 .
  • the operation circuit 803 fetches the data of the matrix A from the input memory 801 and performs the matrix operation on the matrix B, and stores the partial result or the final result of the matrix in the accumulator (MAC) 808 .
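  • As a rough, purely illustrative sketch of this dataflow (the actual operation circuit is a systolic array, and nothing here is claimed about its internal scheduling), the buffered-weight matrix multiplication with partial-result accumulation could be pictured as:

```python
def matmul_accumulate(A, B, acc=None):
    # A:   rows of input data fetched from the input memory (n x k).
    # B:   rows of weight data buffered on the processing elements (k x m).
    # acc: optional partial result kept in the accumulator from a previous pass.
    n, k, m = len(A), len(B), len(B[0])
    if acc is None:
        acc = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for t in range(k):
                acc[i][j] += A[i][t] * B[t][j]    # one multiply-accumulate per element
    return acc
```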
  • the vector calculation unit 807 can further process the output of the operation circuit 803, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and the like.
  • The vector computing unit 807 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector computation unit 807 can store the processed output vectors to the unified buffer 806 .
  • the vector calculation unit 807 may apply a nonlinear function to the output of the arithmetic circuit 803, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 807 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 803, eg, for use in subsequent layers in a neural network.
  • Unified memory 806 is used to store input data and output data.
  • A direct memory access controller (DMAC) 805 transfers the input data in the external memory to the input memory 801 and/or the unified memory 806, stores the weight data in the external memory into the weight memory 802, and stores the data in the unified memory 806 into the external memory.
  • a bus interface unit (bus interface unit, BIU) 810 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 809 through the bus.
  • The instruction fetch memory (instruction fetch buffer) 809 connected to the controller 804 is used to store the instructions used by the controller 804;
  • the controller 804 is used for invoking the instructions cached in the instruction fetch memory 809 to control the working process of the operation accelerator.
  • The unified memory 806, the input memory 801, the weight memory 802 and the instruction fetch memory 809 are all on-chip memories, and the external memory is memory outside the NPU; the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM) or a high bandwidth memory (HBM).
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is run on a computer, the method processes shown in the above method embodiments are implemented.
  • the embodiments of the present application further provide a computer program product, when the computer program product runs on a computer, the method flow shown in the above method embodiments is realized.
  • The processors mentioned in the embodiments of the present application may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory mentioned in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
  • Volatile memory can be a random access memory (RAM), which is used as an external cache, for example a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), or a direct Rambus RAM (DR RAM).
  • When the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) may be integrated in the processor. It should be noted that the memory described herein is intended to include, but not be limited to, these and any other suitable types of memory.
  • It should be understood that the size of the sequence numbers of the above-mentioned processes does not imply the order of execution; the execution order of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

A neural network operation method and related device. By using A first adders, A multipliers and B second adders in existing convolution operation hardware, a one-dimensional acceleration algorithm operation is performed on A first target data input to a convolution kernel operation, to obtain a convolution operation result of a first operation cycle; the A first target data are shifted to the right, 2 first target data being shifted out while 2 new target data are shifted in, to obtain A second target data; the one-dimensional acceleration algorithm operation is performed on the A second target data through the A first adders, the A multipliers and the B second adders, to obtain a convolution operation result of a second operation cycle; the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle are accumulated to obtain a result of the convolution kernel operation. Existing hardware resources can thus be fully utilized to realize the acceleration of convolutional neural network operations by the acceleration algorithm.

Description

Neural network operation method and related device

Technical Field

The present application relates to the technical field of neural networks, and in particular, to a neural network operation method and related device.

This application claims priority to Chinese patent application No. 202010602307.4, entitled "Neural network operation method and related device" and filed with the Chinese Patent Office on June 28, 2020, the entire contents of which are incorporated herein by reference.
 
Background

With the pursuit of ever-faster neural network operations in computer technology, acceleration algorithms have been proposed to accelerate convolutional neural network operations. For a convolutional neural network operation whose convolution kernel is 3×3, using an acceleration algorithm is the most efficient way to speed up the operation. However, at present only specific dedicated hardware can implement the acceleration algorithm to accelerate a convolutional neural network operation with a 3×3 convolution kernel; existing hardware cannot be reused to implement the acceleration algorithm for such operations, which results in a waste of hardware resources.
Technical Solution

The embodiments of the present application disclose a neural network operation method and related device, which can make full use of existing hardware resources so that an acceleration algorithm accelerates convolutional neural network operations.

A first aspect of the embodiments of the present application discloses a neural network operation method, applied to a neural network computing unit, where the neural network computing unit is used for convolutional neural network operations, the convolutional neural network includes a target convolution kernel, and the neural network computing unit includes A first adders, A multipliers and B second adders, where A is an exponent of 4, A is greater than or equal to 8, and B is half of A. The method includes:

performing a one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to A first target data input to the target convolution kernel, to obtain a convolution operation result of a first operation cycle of the target convolution kernel;

shifting the A first target data to the right, 2 first target data being shifted out while 2 new target data are shifted in, to obtain A second target data;

performing the one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to the A second target data, to obtain a convolution operation result of a second operation cycle of the target convolution kernel;

obtaining a convolution operation result of the target convolution kernel according to the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle.
A second aspect of the embodiments of the present application discloses a neural network operation method, applied to a two-dimensional convolutional neural network operation, where the two-dimensional convolutional neural network includes a first convolution kernel, a second convolution kernel and a third convolution kernel. The method includes:

for the first convolution kernel, executing the neural network operation method according to the first aspect of the embodiments of the present application, to obtain a convolution operation result of the first convolution kernel;

for the second convolution kernel, executing the neural network operation method according to the first aspect of the embodiments of the present application, to obtain a target convolution operation result of the second convolution kernel, and obtaining a convolution operation result of the second convolution kernel according to the convolution operation result of the first convolution kernel and the target convolution operation result of the second convolution kernel;

for the third convolution kernel, executing the neural network operation method according to the first aspect of the embodiments of the present application, to obtain a target convolution operation result of the third convolution kernel, and obtaining a convolution operation result of the two-dimensional convolutional neural network according to the convolution operation result of the second convolution kernel and the target convolution operation result of the third convolution kernel.
A third aspect of the embodiments of the present application discloses a neural network operation apparatus, applied to a neural network computing unit, where the neural network computing unit is used for convolutional neural network operations, the convolutional neural network includes a target convolution kernel, and the neural network computing unit includes A first adders, A multipliers and B second adders, where A is an exponent of 4, A is greater than or equal to 8, and B is half of A. The neural network operation apparatus includes a processing unit configured to:

perform a one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to A first target data input to the target convolution kernel, to obtain a convolution operation result of a first operation cycle of the target convolution kernel;

shift the A first target data to the right, 2 first target data being shifted out while 2 new target data are shifted in, to obtain A second target data;

perform the one-dimensional acceleration algorithm operation through the A first adders, the A multipliers and the B second adders according to the A second target data, to obtain a convolution operation result of a second operation cycle of the target convolution kernel; and

obtain a convolution operation result of the target convolution kernel according to the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle.
A fourth aspect of the embodiments of the present application discloses a neural network operation device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the steps in the method according to any one of the first aspect of the embodiments of the present application.

A fifth aspect of the embodiments of the present application discloses a chip, including a processor for calling and running a computer program from a memory, so that a device equipped with the chip executes the method according to any one of the first aspect of the embodiments of the present application.

A sixth aspect of the embodiments of the present application discloses a computer-readable storage medium storing a computer program for electronic data exchange, where the computer program causes a computer to execute the method according to any one of the first aspect of the embodiments of the present application.

A seventh aspect of the embodiments of the present application discloses a computer program, where the computer program causes a computer to execute the method according to any one of the first aspect of the embodiments of the present application.

It can be seen that, in the present application, a one-dimensional acceleration algorithm operation is performed on A first target data input to a convolution kernel operation by using A first adders, A multipliers and B second adders in existing convolution operation hardware, to obtain a convolution operation result of a first operation cycle; the A first target data are then shifted to the right, 2 first target data being shifted out while 2 new target data are shifted in, to obtain A second target data; the one-dimensional acceleration algorithm operation is then performed on the A second target data through the A first adders, the A multipliers and the B second adders, to obtain a convolution operation result of a second operation cycle; the convolution operation result of the first operation cycle and the convolution operation result of the second operation cycle are then accumulated to obtain an operation result of the convolution kernel operation. Existing hardware resources can thus be fully utilized to realize the acceleration of convolutional neural network operations by the acceleration algorithm.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of a convolution operation process provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of convolution operation hardware provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a neural network operation method provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of another convolution operation hardware provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of yet another convolution operation hardware provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a neural network operation apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a neural network operation device provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
Best Mode for Carrying Out the Invention

Type the description of the best mode for carrying out the invention here.

Embodiments of the Invention
Referring to FIG. 1, which is a schematic diagram of a convolution operation process. Taking as an example a convolutional neural network whose convolution kernel to be calculated is 3×3, one neural network computing unit (PE) has 8 multiply-accumulate units (MAC), and the entire neural network processor computing array is composed of multiple neural network computing units. The whole calculation process is then described as follows:

First, the input channel is mapped onto the neural network computing units. Since each neural network computing unit has only 8 multiply-accumulate units, the entire input feature map is cut vertically, with every 8 pixels forming one image block. In this way, one multiply-accumulate unit can calculate the convolution operation of one image block, and the convolution operation of the entire input feature is completed jointly by multiple multiply-accumulate units.

Then, for each neural network computing unit, the convolution operation of 8 pixels of one row of one output channel is completed every 9 cycles, specifically as follows:
周期1:读入输入通道的第一行的8个数据,也即数据0、数据1、数据2、数据3、数据4、数据5、数据6和数据7,它们分别与卷积核的权重w0相乘,分别得到第一行的8个数据的部分和(partial sum),也即部分和1。
周期2:第一行数据右移,移入数据为相邻神经网络计算单元的移出数据或填充(padding)数据,也即将第一行的数据0移出,同时在左边移入数据8,得到第二行的8个数据,也即数据1、数据2、数据3、数据4、数据5、数据6、数据7和数据8;然后将第二行的8个数据分别与卷积核的权重w1相乘,并将第二行的8个数据中的每个数据与w1相乘的结果与对应的周期1的部分和1进行累加,得到部分和2。
周期3:第一行数据继续右移,移入数据为相邻神经网络计算单元的移出数据或填充(padding)数据,也即将第二行的数据1移出,同时在左边移入数据9,得到第三行的8个数据,也即数据2、数据3、数据4、数据5、数据6、数据7、数据8和数据9;然后将第三行的8个数据分别与卷积核的权重w2相乘,并将第三行的8个数据中的每个数据与w2相乘的结果与对应的周期2的部分和2进行累加,得到部分和3,这个部分和3就是第一行卷积核计算完之后的部分和。
周期4至周期6:读入输入通道的第二行的8个数据,重复周期1至周期3的整个计算过程,分别得到部分和4、部分和5和部分和6;其中,周期4、周期5和周期6的卷积核的权重分别为w3、w4和w5;在计算周期4的部分和4时,需要加上周期3的部分和3;其中,部分和6为第二行卷积核计算完后的部分和。
周期7至周期9:读入输入通道的第三行的8个数据,重复周期1至周期3的整个计算过程,分别得到部分和7、部分和8和部分和9;其中,周期7、周期8和周期9的卷积核的权重分别为w6、w7和w8;在计算周期7的部分和7时,需要加上周期6的部分和6;其中,部分和9就是整个卷积核计算完毕之后的卷积值。
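为便于理解上述9个周期的乘累加过程,下面给出一段Python示意代码(仅为帮助理解的示意性草图,并非本申请方案的实现;假设输入每行数据足够长、不考虑padding及相邻神经网络计算单元间的数据交换,函数名conv3x3_row_9_cycles及示例数值均为示意假设):

def conv3x3_row_9_cycles(input_rows, w):
    # input_rows为输入特征的3行数据,w为按行展开的9个卷积核权重w0~w8
    num_outputs = 8                      # 每个神经网络计算单元有8个乘累加单元
    partial = [0.0] * num_outputs        # 对应"部分和1~部分和9"的累加结果
    cycle = 0
    for r in range(3):                   # 周期1~3、4~6、7~9分别对应卷积核的三行
        for k in range(3):               # 每个周期数据右移k个位置后与一个权重相乘并累加
            for p in range(num_outputs):
                partial[p] += input_rows[r][p + k] * w[cycle]
            cycle += 1
    return partial                       # 部分和9即为一行8个像素点的卷积值

# 用法示意:3行输入、每行10个数据(8个输出像素点需要额外2个移入数据)
rows = [[float(c + 10 * r) for c in range(10)] for r in range(3)]
weights = [0.1 * (i + 1) for i in range(9)]   # w0~w8
print(conv3x3_row_9_cycles(rows, weights))

该草图与图1中的计算顺序一致:外层循环对应卷积核的三行,内层的三个周期对应数据右移与不同权重的乘累加。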
请参阅图2,图2是本申请实施例提供的一种卷积运算硬件结构示意图。图2是用于卷积运算的一个计算单元,其可用于图1所示的卷积计算。如图2所示,在一个计算单元中,有8个乘累加单元,每个乘累加单元中有一个乘法器和一个加法器,故一个计算单元中包含8个乘法器和8个加法器。其中,在运算过程中,每个运算周期为每个乘累加单元提供同一个权重(weight),每个周期的权重不同,利用9个周期实现一个卷积核为3×3的卷积运算。
举例来说,读入输入通道的第一行卷积核的8个数据,也即周期1的8个数据,也即数据0、数据1、数据2、数据3、数据4、数据5、数据6和数据7,通过乘法器将它们分别与卷积核的权重w0相乘,由于周期1不存在与之相加的前一部分和,周期1的8个数据与卷积核的权重w0相乘直接得到部分和1。然后,周期1的8个数据右移,得到周期2的8个数据,也即数据1、数据2、数据3、数据4、数据5、数据6、数据7和数据8;通过乘法器将周期2的8个数据分别与卷积核的权重w1相乘,周期2的8个数据中每个数据分别对应得到一个乘法结果,将对应位置的周期1的部分和1返回到加法器,通过加法器将周期2的8个数据中每个数据对应的乘法结果与对应位置的周期1的部分和1相加,得到部分和2。再重复周期2的计算过程,得到周期3的部分和3,从而得到第一行卷积核的运算结果。继续读入输入通道的第二行卷积核的8个数据,重复周期1至周期3,得到周期4至周期6的卷积运算结果,从而得到第二行卷积核的运算结果,其中,在计算周期4的部分和4时,需要加上周期3的部分和3。继续读入输入通道的第三行卷积核的8个数据,重复周期1至周期3,得到周期7至周期9的卷积运算结果,从而得到整个卷积核的运算结果,其中,在计算周期7的部分和7时,需要加上周期6的部分和6。
请参阅图3,图3是本申请实施例提供的一种神经网络运算方法,应用于神经网络计算单元,所述神经网络计算单元用于卷积神经网络运算,所述卷积神经网络包括目标卷积核,所述神经网络计算单元包括A个第一加法器、A个乘法器和B个第二加法器,其中,A为4的指数且A大于等于8,B为A的二分之一,该方法包括但不限于如下步骤:
步骤S301:通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果。
具体地,所述卷积神经网络为卷积核为3×3的卷积神经网络,所述目标卷积核为所述卷积核中的其中一行,假设A为8,也即通过所述8个第一加法器、所述8个乘法器和所述4个第二加法器根据输入所述目标卷积核的8个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果。
其中,所述一维加速算法可以为一维winograd算法,下面简单介绍一下一维winograd算法。以卷积核为1×3的一维winograd卷积加速算法为例,假设输入特征为[d0,d1,d2,d3],卷积核kernel为[w0,w1,w2],卷积计算结果为2个值[r0,r1]。那么,r0可以表示为m0+m1+m2,r1可以表示为m1-m2-m3;其中,m0、m1、m2和m3可以提前计算好,也可以在weight读取过程中动态计算,其计算公式如下:
m0=(d0-d2)×g0,其中,g0=w0;
m1=(d1+d2)×g1,其中,g1=(w0+w1+w2)/2;
m2=(d2-d1)×g2,其中,g2=(w0-w1+w2)/2;
m3=(d1-d3)×g3,其中,g3=w2。
由此可知,一维的winograd算法在计算卷积核为1×3的卷积运算过程中,利用4个乘法器和4个加法器即可计算出2个结果。而普通的卷积计算,却需要6个乘法器和4个加法器才能够同时计算出2个结果。因此,在乘法器资源一定的情况下,winograd算法可以加速卷积计算。
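下面用一段Python代码示意上述一维winograd算法(即F(2,3))的计算过程,并与直接卷积的结果对比验证(仅为示意性草图,并非本申请方案的实现;输入数值为任意举例,函数名winograd_f23为示意假设):

def winograd_f23(d, w):
    d0, d1, d2, d3 = d
    w0, w1, w2 = w
    # 权重变换:可以提前计算好,也可以在weight读取过程中动态计算
    g0 = w0
    g1 = (w0 + w1 + w2) / 2
    g2 = (w0 - w1 + w2) / 2
    g3 = w2
    # 4次乘法,对应4个乘法器
    m0 = (d0 - d2) * g0
    m1 = (d1 + d2) * g1
    m2 = (d2 - d1) * g2
    m3 = (d1 - d3) * g3
    # 输出变换,得到2个卷积结果
    return m0 + m1 + m2, m1 - m2 - m3

d = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0, 2.0]
r0, r1 = winograd_f23(d, w)
# 与直接卷积对照:r0=d0*w0+d1*w1+d2*w2,r1=d1*w0+d2*w1+d3*w2
assert abs(r0 - (d[0]*w[0] + d[1]*w[1] + d[2]*w[2])) < 1e-9
assert abs(r1 - (d[1]*w[0] + d[2]*w[1] + d[3]*w[2])) < 1e-9
print(r0, r1)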
步骤S302:将所述A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据。
具体地,假设A为8,也即将所述8个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到8个第二目标数据。
其中,移入的新的目标数据为相邻神经网络计算单元的移出数据或填充数据。
步骤S303:通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据所述A个第二目标数据执行所述一维加速算法运算,得到所述目标卷积核的第二运算周期的卷积运算结果。
具体地,也即通过所述8个第一加法器、所述8个乘法器和所述4个第二加法器根据输入所述目标卷积核的8个第二目标数据执行一维加速算法运算,得到所述目标卷积核的第二运算周期的卷积运算结果。
步骤S304:根据所述第一运算周期的卷积运算结果和所述第二运算周期的卷积运算结果得到所述目标卷积核的卷积运算结果。
具体地,由于一维加速算法一次运算可以得到2个运算结果,也即A个目标数据中的其中2个的运算结果,而A为4的指数且A大于等于8,则至少需要2个周期才能得到4个运算结果。而当A为4的2倍及以上时,可以在一个运算周期中叠加运算一维加速算法。例如,A为8时,读取输入通道的8个第一目标数据,第一周期中运算2次一维加速算法,也即8个第一目标数据分为2组,每4个第一目标数据运算一次一维加速算法,分别得到2个运算结果,则8个第一目标数据一个运算周期一共可以得到4个运算结果;再运算第二周期,又得到4个运算结果,将第一运算周期的4个运算结果和第二运算周期的4个运算结果组合得到该输入的8个第一目标数据的运算结果,也即得到该卷积核的运算结果。
可以看出,本申请实施例提供的神经网络运算方法,通过利用现有的卷积运算硬件中的A个第一加法器、A个乘法器和B个第二加法器对输入卷积核运算的A个第一目标数据执行一维加速算法运算,得到第一运算周期的卷积运算结果;然后将A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据;再通过A个第一加法器、A个乘法器和B个第二加法器对A个第二目标数据执行一维加速算法运算,得到第二运算周期的卷积运算结果;再将第一运算周期的卷积运算结果和第二运算周期的卷积运算结果累加,得到卷积核运算的运算结果;从而能够充分利用现有硬件资源实现加速算法对卷积神经网络运算的加速。
在一些可能的示例中,所述新的目标数据为与所述神经网络计算单元相邻的神经网络计算单元移出的目标数据或填充数据。
在一些可能的示例中,所述通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果,包括:通过所述A个第一加法器根据所述A个第一目标数据执行所述一维加速算法的第一加法运算,得到A个第一运算数据;通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行所述一维加速算法的乘法运算,得到A个第二运算数据;通过所述B个第二加法器根据所述A个第二运算数据执行所述一维加速算法的第二加法运算,得到所述目标卷积核的第一运算周期的卷积运算结果。
举例来说,假设A为4,输入特征的4个第一目标数据为[d0,d1,d2,d3],先通过4个第一加法器根据4个第一目标数据执行一维加速算法的第一加法运算,也即执行d0-d2,d1+d2,d2-d1,d1-d3,得到4个第一运算数据;然后通过4个乘法器根据4个第一运算数据和一维加速算法的权重执行一维加速算法的乘法运算,也即执行m0=(d0-d2)×g0,m1=(d1+d2)×g1,m2=(d2-d1)×g2,m3=(d1-d3)×g3,得到4个第二运算数据m0、m1、m2和m3;再通过2个第二加法器根据4个第二运算数据执行一维加速算法的第二加法运算,也即执行m0+m1+m2,m1-m2-m3,得到两个运算结果[r0、r1]。
假设A为8,输入特征的8个第一目标数据为[d0,d1,d2,d3,d4,d5,d6,d7],将其分为两组[d0,d1,d2,d3]和[d4,d5,d6,d7],则有8个第一加法器、8个乘法器和4个第二加法器,两组同时执行一维加速算法,最终同时得到两组[r0、r1],也即4个运算结果。
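以A为8为例,可用如下Python草图按"第一加法器→乘法器→第二加法器"三级结构示意一个运算周期内两组数据并行执行一维加速算法的过程(仅为示意性草图,并非本申请方案的实现;函数名one_cycle_a8及权重数值均为示意假设):

def one_cycle_a8(d, g):
    # 第一加法运算:8个第一加法器,两组各4个
    s = []
    for base in (0, 4):                        # 两组:[d0..d3]和[d4..d7]
        d0, d1, d2, d3 = d[base:base + 4]
        s += [d0 - d2, d1 + d2, d2 - d1, d1 - d3]
    # 乘法运算:8个乘法器,每组4个第一运算数据与4个权重g0~g3一一对应
    m = [s[i] * g[i % 4] for i in range(8)]
    # 第二加法运算:4个第二加法器,每组得到2个卷积运算结果
    r = []
    for base in (0, 4):
        m0, m1, m2, m3 = m[base:base + 4]
        r += [m0 + m1 + m2, m1 - m2 - m3]
    return r                                    # 对应一个运算周期得到的r0、r1、r4、r5

g = [0.5, 0.75, 1.75, 2.0]                      # 示意性的一维加速算法权重g0~g3
print(one_cycle_a8([float(i) for i in range(8)], g))

当A为16时,只需将分组从2组扩展为4组(即C=A/4),各组执行的运算操作完全相同。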
在一些可能的示例中,在通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行一维加速算法的乘法运算,得到A个第二运算数据之前,所述方法还包括:根据所述目标卷积核的权重确定所述一维加速算法的权重。
举例来说,假设卷积神经网络为卷积核为3×3的卷积神经网络,有第一行卷积核的权重w0、w1、w2,则一维加速算法的权重g0=w0,g1=(w0+w1+w2)/2,g2=(w0-w1+w2)/2,g3=w2;有第二行卷积核的权重w3、w4、w5,则一维加速算法的权重g0=w3,g1=(w3+w4+w5)/2,g2=(w3-w4+w5)/2,g3=w5;有第三行卷积核的权重w6、w7、w8,则一维加速算法的权重g0=w6,g1=(w6+w7+w8)/2,g2=(w6-w7+w8)/2,g3=w8。
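例如,可用如下Python草图示意由卷积核每一行的权重计算一维加速算法权重的过程(仅为示意性草图,函数名accel_weights及权重数值均为示意假设):

def accel_weights(wa, wb, wc):
    # 由卷积核某一行的3个权重得到一维加速算法的4个权重g0~g3
    return [wa, (wa + wb + wc) / 2, (wa - wb + wc) / 2, wc]

# 卷积核为3x3时,三行权重分别得到三组加速算法权重
kernel = [[1.0, 2.0, 3.0],    # w0、w1、w2
          [4.0, 5.0, 6.0],    # w3、w4、w5
          [7.0, 8.0, 9.0]]    # w6、w7、w8
print([accel_weights(*row) for row in kernel])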
在一些可能的示例中,所述通过所述A个第一加法器根据所述A个第一目标数据执行所述一维加速算法的第一加法运算,得到A个第一运算数据,包括:针对C个第一目标数据组中的每一组执行如下操作,得到C个第一运算数据组,所述C个第一运算数据组中的每一组包括4个第一运算数据,所述C为A的四分之一,所述C个第一目标数据组为所述A个第一目标数据按照每4个为一组分组得到,所述C个第一目标数据组与C个第一加法器组一一对应,所述C个第一加法器组为所述A个第一加法器按照每4个为一组分组得到,所述C个第一加法器组中的每一组执行的运算操作相同:通过当前处理的第一目标数据组对应的第一加法器组执行所述一维加速算法的第一加法运算,得到当前处理的第一目标数据组对应的第一运算数据组。
举例来说,请一并参阅图4,假设A为8,通过所述8个第一加法器中的4个第一加法器根据输入所述目标卷积核的8个第一目标数据中的4个第一目标数据执行所述一维加速算法的第一加法运算,得到第一组的4个第一运算数据;通过所述8个第一加法器中的另外4个第一加法器根据输入所述目标卷积核的8个第一目标数据中的另外4个第一目标数据进行所述一维加速算法的第一加法运算,得到第二组的4个第一运算数据。具体地,在周期一中,输入特征的8个第一目标数据为[d0,d1,d2,d3,d4,d5,d6,d7],将其分为两组[d0,d1,d2,d3]和[d4,d5,d6,d7];对于第一组[d0,d1,d2,d3],通过8个第一加法器中的4个第一加法器执行d0-d2,d1+d2,d2-d1,d1-d3,得到第一组的4个第一运算数据;对于第二组[d4,d5,d6,d7],通过8个第一加法器中的另外4个第一加法器执行d4-d6,d5+d6,d6-d5,d5-d7,得到第二组的4个第一运算数据。
又举例来说,请一并参阅图5,假设A为16,在周期一中,输入特征的16个第一目标数据为[d0,d1,d2,......,d15],将其分为4组,分别为[d0,d1,d2,d3]、[d4,d5,d6,d7]、[d8,d9,d10,d11]和[d12,d13,d14,d15];对于第一组[d0,d1,d2,d3],通过16个第一加法器中的第一加法器组1的4个第一加法器执行d0-d2,d1+d2,d2-d1,d1-d3,得到第一运算数据组1的4个第一运算数据;对于第二组[d4,d5,d6,d7],通过16个第一加法器中的第一加法器组2的4个第一加法器执行d4-d6,d5+d6,d6-d5,d5-d7,得到第一运算数据组2的4个第一运算数据;对于第三组[d8,d9,d10,d11],通过16个第一加法器中的第一加法器组3的4个第一加法器执行d8-d10,d9+d10,d10-d9,d9-d11,得到第一运算数据组3的4个第一运算数据;对于第四组[d12,d13,d14,d15],通过16个第一加法器中的第一加法器组4的4个第一加法器执行d12-d14,d13+d14,d14-d13,d13-d15,得到第一运算数据组4的4个第一运算数据。
在一些可能的示例中,所述通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行一维加速算法的乘法运算,得到A个第二运算数据,包括:针对所述C个第一运算数据组中的每一组和所述一维加速算法的4个权重执行如下操作,得到C个第二运算数据组,所述C个第二运算数据组中的每一组包括4个第二运算数据,所述C个第一运算数据组中的每一组的4个第一运算数据与所述一维加速算法的4个权重一一对应,所述C个第一运算数据组与C个乘法器组一一对应,所述C个乘法器组为所述A个乘法器按照每4个为一组分组得到:通过当前处理的第一运算数据组对应的乘法器组执行所述一维加速算法的乘法运算,得到当前处理的第一运算数据组对应的第二运算数据组。
举例来说,请继续参阅图4,通过所述8个乘法器中的4个乘法器根据所述第一组的4个第一运算数据和所述一维加速算法的4个权重执行一维加速算法的乘法运算,得到第一组的4个第二运算数据,所述第一组的4个第一运算数据与所述加速算法的4个权重一一对应;通过所述8个乘法器中的另外4个乘法器根据所述第二组的4个第一运算数据和所述一维加速算法的4个权重执行一维加速算法的乘法运算,得到第二组的4个第二运算数据,所述第二组的4个第一运算数据与所述加速算法的4个权重一一对应。具体地,在周期一中,对于第一组的4个第一运算数据,通过8个乘法器中的4个乘法器执行m0=(d0-d2)×g0,m1=(d1+d2)×g1,m2=(d2-d1)×g2,m3=(d1-d3)×g3,得到第一组的4个第二运算数据m0、m1、m2和m3;对于第二组的4个第一运算数据,通过8个乘法器中的另外4个乘法器执行m4=(d4-d6)×g0,m5=(d5+d6)×g1,m6=(d6-d5)×g2,m7=(d5-d7)×g3,得到第二组的4个第二运算数据m4、m5、m6和m7。
举例来说,请继续参阅图5,在周期一中,对于第一运算数据组1,通过16个乘法器中的乘法器组1的4个乘法器执行m0=(d0-d2)×g0,m1=(d1+d2)×g1,m2=(d2-d1)×g2,m3=(d1-d3)×g3,得到第二运算数据组1的4个第二运算数据m0、m1、m2和m3;对于第一运算数据组2,通过16个乘法器中的乘法器组2的4个乘法器执行m4=(d4-d6)×g0,m5=(d5+d6)×g1,m6=(d6-d5)×g2,m7=(d5-d7)×g3,得到第二运算数据组2的4个第二运算数据m4、m5、m6和m7;对于第一运算数据组3,通过16个乘法器中的乘法器组3的4个乘法器执行m8=(d8-d10)×g0,m9=(d9+d10)×g1,m10=(d10-d9)×g2,m11=(d9-d11)×g3,得到第二运算数据组3的4个第二运算数据m8、m9、m10和m11;对于第一运算数据组4,通过16个乘法器中的乘法器组4的4个乘法器执行m12=(d12-d14)×g0,m13=(d13+d14)×g1,m14=(d14-d13)×g2,m15=(d13-d15)×g3,得到第二运算数据组4的4个第二运算数据m12、m13、m14和m15。
在一些可能的示例中,所述通过所述B个第二加法器根据所述A个第二运算数据执行所述一维加速算法的第二加法运算,得到所述目标卷积核的第一运算周期的卷积运算结果,包括:针对所述C个第二运算数据组中的每一组执行如下操作,得到C个卷积运算结果组,所述C个卷积运算结果组中的每一组包括2个卷积运算结果,所述C个第二运算数据组与C个第二加法器组一一对应,所述C个第二加法器组为所述B个第二加法器按照每2个为一组分组得到,所述C个第二加法器组中的每一组执行的运算操作相同:通过当前处理的第二运算数据组对应的第二加法器组执行所述一维加速算法的第二加法运算,得到当前处理的第二运算数据组对应的卷积运算结果组。
举例来说,请继续参阅图4,通过所述4个第二加法器中的2个第二加法器根据所述第一组的4个第二运算数据执行所述一维加速算法的第二加法运算,得到第一组的2个卷积运算结果;通过所述4个第二加法器中的另外2个第二加法器根据所述第二组的4个第二运算数据执行所述一维加速算法的第二加法运算,得到第二组的2个卷积运算结果;根据所述第一组的2个卷积运算结果和所述第二组的2个卷积运算结果得到所述目标卷积核的第一运算周期的卷积运算结果。具体地,在周期一中,对于第一组的4个第二运算数据m0、m1、m2和m3,通过4个第二加法器中的2个第二加法器执行m0+m1+m2,m1-m2-m3,得到第一组的2个卷积运算结果r0和r1;对于第二组的4个第二运算数据m4、m5、m6和m7,通过4个第二加法器中的另外2个第二加法器执行m4+m5+m6,m5-m6-m7,得到第二组的2个卷积运算结果r4和r5。
举例来说,请继续参阅图5,在周期一中,对于第二运算数据组1的m0、m1、m2和m3,通过8个第二加法器中的第二加法器组1的2个第二加法器执行m0+m1+m2,m1-m2-m3,得到卷积运算结果组1的2个卷积运算结果r0和r1;对于第二运算数据组2的m4、m5、m6和m7,通过8个第二加法器中的第二加法器组2的2个第二加法器执行m4+m5+m6,m5-m6-m7,得到卷积运算结果组2的2个卷积运算结果r4和r5;对于第二运算数据组3的m8、m9、m10和m11,通过8个第二加法器中的第二加法器组3的2个第二加法器执行m8+m9+m10,m9-m10-m11,得到卷积运算结果组3的2个卷积运算结果r8和r9;对于第二运算数据组4的m12、m13、m14和m15,通过8个第二加法器中的第二加法器组4的2个第二加法器执行m12+m13+m14,m13-m14-m15,得到卷积运算结果组4的2个卷积运算结果r12和r13。
上面从一个卷积核的运算过程对本申请提供的技术方案进行描述,本申请实施例还提供一种神经网络运算方法,应用于二维卷积神经网络运算,所述二维卷积神经网络包括第一卷积核、第二卷积核和第三卷积核,所述方法包括:针对所述第一卷积核,执行如图3所述的神经网络运算方法,得到所述第一卷积核的卷积运算结果;针对所述第二卷积核,执行如图3所述的神经网络运算方法,得到所述第二卷积核的目标卷积运算结果,并根据所述第一卷积核的卷积运算结果和所述第二卷积核的目标卷积运算结果得到所述第二卷积核的卷积运算结果;针对所述第三卷积核,执行如图3所述的神经网络运算方法,得到所述第三卷积核的目标卷积运算结果,并根据所述第二卷积核的卷积运算结果和所述第三卷积核的目标卷积运算结果得到所述二维卷积神经网络的卷积运算结果。
下面以卷积核为3×3的卷积神经网络的整个计算过程为例来进一步描述本申请提供的技术方案。
请继续参阅图4,对于第一行卷积核,有第一行卷积核的权重w0、w1、w2,则一维加速算法的权重g0=w0,g1=(w0+w1+w2)/2,g2=(w0-w1+w2)/2,g3=w2;其卷积计算过程如下:
(1)第一运算周期
读取第一行卷积核的输入特征,也即8个第一目标数据为[d0,d1,d2,d3,d4,d5,d6,d7],将其分为两组[d0,d1,d2,d3]和[d4,d5,d6,d7];对于第一组[d0,d1,d2,d3],通过8个第一加法器中的4个第一加法器执行d0-d2,d1+d2,d2-d1,d1-d3,得到第一组的4个第一运算数据,记为(d0-d2)、(d1+d2)、(d2-d1)和(d1-d3);对于第二组[d4,d5,d6,d7],通过8个第一加法器中的另外4个第一加法器执行d4-d6,d5+d6,d6-d5,d5-d7,得到第二组的4个第一运算数据,记为(d4-d6)、(d5+d6)、(d6-d5)和(d5-d7)。
对于第一组的4个第一运算数据,通过8个乘法器中的4个乘法器执行m0=(d0-d2)×g0,m1=(d1+d2)×g1,m2=(d2-d1)×g2,m3=(d1-d3)×g3,得到第一组的4个第二运算数据m0、m1、m2和m3;对于第二组的4个第一运算数据,通过8个乘法器中的另外4个乘法器执行m4=(d4-d6)×g0,m5=(d5+d6)×g1,m6=(d6-d5)×g2,m7=(d5-d7)×g3,得到第二组的4个第二运算数据m4、m5、m6和m7。
对于第一组的4个第二运算数据m0、m1、m2和m3,通过4个第二加法器中的2个第二加法器执行m0+m1+m2,m1-m2-m3,得到第一组的2个卷积运算结果r0和r1;对于第二组的4个第二运算数据m4、m5、m6和m7,通过4个第二加法器中的另外2个第二加法器执行m4+m5+m6,m5-m6-m7,得到第二组的2个卷积运算结果r4和r5。
(2)第二运算周期
第一目标数据右移,得到8个第二目标数据为[d2,d3,d4,d5,d6,d7,d8,d9],将其分为两组[d2,d3,d4,d5]和[d6,d7,d8,d9];对于第一组[d2,d3,d4,d5],通过8个第一加法器中的4个第一加法器执行d2-d4,d3+d4,d4-d3,d3-d5,得到第一组的4个第一运算数据,记为(d2-d4)、(d3+d4)、(d4-d3)和(d3-d5);对于第二组[d6,d7,d8,d9],通过8个第一加法器中的另外4个第一加法器执行d6-d8,d7+d8,d8-d7,d7-d9,得到第二组的4个第一运算数据,记为(d6-d8)、(d7+d8)、(d8-d7)和(d7-d9)。
对于第一组的4个第一运算数据,通过8个乘法器中的4个乘法器执行m0=(d2-d4)×g0,m1=(d3+d4)×g1,m2=(d4-d3)×g2,m3=(d3-d5)×g3,得到第一组的4个第二运算数据m0、m1、m2和m3;对于第二组的4个第一运算数据,通过8个乘法器中的另外4个乘法器执行m4=(d6-d8)×g0,m5=(d7+d8)×g1,m6=(d8-d7)×g2,m7=(d7-d9)×g3,得到第二组的4个第二运算数据m4、m5、m6和m7。
对于第一组的4个第二运算数据m0、m1、m2和m3,通过4个第二加法器中的2个第二加法器执行m0+m1+m2,m1-m2-m3,得到第一组的2个卷积运算结果r2和r3;对于第二组的4个第二运算数据m4、m5、m6和m7,通过4个第二加法器中的另外2个第二加法器执行m4+m5+m6,m5-m6-m7,得到第二组的2个卷积运算结果r6和r7。
至此,得到第一行卷积核的运算结果r0、r1、r2、r3、r4、r5、r6和r7。
对于第二行卷积核,有第二行卷积核的权重w3、w4、w5,则一维加速算法的权重g0=w3,g1=(w3+w4+w5)/2,g2=(w3-w4+w5)/2,g3=w5;其卷积计算过程如下:
(3)第三运算周期
读取第二行卷积核的输入特征,重复第一运算周期的计算过程,得到第三运算周期的目标卷积运算结果r0、r1、r4和r5,将第三运算周期的目标卷积运算结果r0、r1、r4和r5分别与第一行卷积核的运算结果r0、r1、r4和r5对应相加,得到第二行卷积核的运算结果r0、r1、r4和r5。
(4)第四运算周期
第二卷积核的第一目标数据右移,得到第二卷积核的8个第二目标数据,重复第二运算周期的计算过程,得到第四运算周期的目标卷积运算结果r2、r3、r6和r7,将第四运算周期的目标卷积运算结果r2、r3、r6和r7分别与第一行卷积核的运算结果r2、r3、r6和r7对应相加,得到第二行卷积核的运算结果r2、r3、r6和r7。
至此,得到第二行卷积核的运算结果r0、r1、r2、r3、r4、r5、r6和r7。
对于第三行卷积核,有第三行卷积核的权重w6、w7、w8,则一维加速算法的权重g0=w6,g1=(w6+w7+w8)/2,g2=(w6-w7+w8)/2,g3=w8;其卷积计算过程如下:
(5)第五运算周期
读取第三行卷积核的输入特征,重复第一运算周期的计算过程,得到第五运算周期的目标卷积运算结果r0、r1、r4和r5,将第五运算周期的目标卷积运算结果r0、r1、r4和r5分别与第二行卷积核的运算结果r0、r1、r4和r5对应相加,得到第三行卷积核的运算结果r0、r1、r4和r5。
(6)第六运算周期
第三卷积核的第一目标数据右移,得到第三卷积核的8个第二目标数据,重复第二运算周期的计算过程,得到第六运算周期的目标卷积运算结果r2、r3、r6和r7,将第六运算周期的目标卷积运算结果r2、r3、r6和r7分别与第二行卷积核的运算结果r2、r3、r6和r7对应相加,得到第三行卷积核的运算结果r2、r3、r6和r7。
至此,得到卷积核为3×3的卷积神经网络的整个运算结果r0、r1、r2、r3、r4、r5、r6和r7。
可以看到,原先需要9个周期才能够完成的卷积运算,现在6个周期就可以完成,极大地复用了原先的硬件资源,同时提高了运算速度。
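上述6个运算周期的完整流程可用如下Python草图示意(仅为示意性草图,并非本申请方案的实现;输入为3行、每行10个数据,忽略padding及相邻神经网络计算单元间的数据交换,函数名与示例数值均为示意假设):

def winograd_pair(d4, g):
    # 一组4个目标数据经第一加法、乘法、第二加法运算,得到2个卷积运算结果
    d0, d1, d2, d3 = d4
    m = [(d0 - d2) * g[0], (d1 + d2) * g[1], (d2 - d1) * g[2], (d1 - d3) * g[3]]
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]

def conv3x3_six_cycles(rows, kernel):
    # rows为3行输入(每行10个数据),kernel为3x3卷积核;每行卷积核占用2个运算周期
    acc = [0.0] * 8
    for row, (wa, wb, wc) in zip(rows, kernel):
        g = [wa, (wa + wb + wc) / 2, (wa - wb + wc) / 2, wc]
        # 第一个运算周期:第一目标数据[d0..d7],得到r0、r1、r4、r5并与上一行结果累加
        acc[0:2] = [a + b for a, b in zip(acc[0:2], winograd_pair(row[0:4], g))]
        acc[4:6] = [a + b for a, b in zip(acc[4:6], winograd_pair(row[4:8], g))]
        # 第二个运算周期:右移2个数据得到[d2..d9],得到r2、r3、r6、r7并累加
        acc[2:4] = [a + b for a, b in zip(acc[2:4], winograd_pair(row[2:6], g))]
        acc[6:8] = [a + b for a, b in zip(acc[6:8], winograd_pair(row[6:10], g))]
    return acc

rows = [[float(10 * r + c) for c in range(10)] for r in range(3)]
kernel = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
out = conv3x3_six_cycles(rows, kernel)
# 对照:out[p]应等于sum(rows[r][p+k]*kernel[r][k] for r in 0..2 for k in 0..2)
print(out)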
需要指出的是,图4和图5分别以输入数据的个数为8和16来举例对本申请实施例进行说明,但本申请实施例对输入数据的个数不仅限于此,输入数据的个数还可以为32、64等,其具体计算规律与图4或图5相同,本申请不再赘述。
请参阅图6,图6示出了一种神经网络运算装置600的结构示意图。该神经网络运算装置600应用于神经网络计算单元,所述神经网络计算单元用于卷积神经网络运算,所述卷积神经网络包括目标卷积核,所述神经网络计算单元包括A个第一加法器、A个乘法器和B个第二加法器,其中,A为4的指数且A大于等于8,B为A的二分之一,该神经网络运算装置600具体包括:处理单元602和通信单元603。处理单元602用于对神经网络计算单元的动作进行控制管理,例如,处理单元602用于支持神经网络计算单元执行上述方法实施例中的步骤和用于本文所描述的技术的其它过程。通信单元603用于支持神经网络计算单元与其他设备的通信。该神经网络运算装置600还可以包括存储单元601,用于存储神经网络计算单元的程序代码和数据。
其中,处理单元602可以是处理器或控制器,例如可以是中央处理器(Central Processing Unit,CPU),通用处理器,数字信号处理器(Digital Signal Processor,DSP),专用集成电路(Application-Specific Integrated Circuit,ASIC),现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。所述处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。通信单元603可以是通信接口、收发器、收发电路等,存储单元601可以是存储器。
具体实现时,所述处理单元602用于执行如上述方法实施例中的任一步骤,且在执行诸如发送等数据传输时,可选择的调用所述通信单元603来完成相应操作。下面进行详细说明。
所述处理单元602用于:通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果;将所述A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据;通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据所述A个第二目标数据执行所述一维加速算法运算,得到所述目标卷积核的第二运算周期的卷积运算结果;根据所述第一运算周期的卷积运算结果和所述第二运算周期的卷积运算结果得到所述目标卷积核的卷积运算结果。
在一些可能的示例中,所述新的目标数据为与所述神经网络计算单元相邻的神经网络计算单元移出的目标数据或填充数据。
在一些可能的示例中,在通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果方面,所述处理单元602具体用于:通过所述A个第一加法器根据所述A个第一目标数据执行所述一维加速算法的第一加法运算,得到A个第一运算数据;通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行所述一维加速算法的乘法运算,得到A个第二运算数据;通过所述B个第二加法器根据所述A个第二运算数据执行所述一维加速算法的第二加法运算,得到所述目标卷积核的第一运算周期的卷积运算结果。
在一些可能的示例中,在通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行一维加速算法的乘法运算,得到A个第二运算数据之前,所述处理单元602还用于:根据所述目标卷积核的权重确定所述一维加速算法的权重。
在一些可能的示例中,在通过所述A个第一加法器根据所述A个第一目标数据执行所述一维加速算法的第一加法运算,得到A个第一运算数据方面,所述处理单元602具体用于:针对C个第一目标数据组中的每一组执行如下操作,得到C个第一运算数据组,所述C个第一运算数据组中的每一组包括4个第一运算数据,所述C为A的四分之一,所述C个第一目标数据组为所述A个第一目标数据按照每4个为一组分组得到,所述C个第一目标数据组与C个第一加法器组一一对应,所述C个第一加法器组为所述A个第一加法器按照每4个为一组分组得到,所述C个第一加法器组中的每一组执行的运算操作相同:通过当前处理的第一目标数据组对应的第一加法器组执行所述一维加速算法的第一加法运算,得到当前处理的第一目标数据组对应的第一运算数据组。
在一些可能的示例中,在通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行一维加速算法的乘法运算,得到A个第二运算数据方面,所述处理单元602具体用于:针对所述C个第一运算数据组中的每一组和所述一维加速算法的4个权重执行如下操作,得到C个第二运算数据组,所述C个第二运算数据组中的每一组包括4个第二运算数据,所述C个第一运算数据组中的每一组的4个第一运算数据与所述一维加速算法的4个权重一一对应,所述C个第一运算数据组与C个乘法器组一一对应,所述C个乘法器组为所述A个乘法器按照每4个为一组分组得到:通过当前处理的第一运算数据组对应的乘法器组执行所述一维加速算法的乘法运算,得到当前处理的第一运算数据组对应的第二运算数据组。
在一些可能的示例中,在通过所述B个第二加法器根据所述A个第二运算数据执行所述一维加速算法的第二加法运算,得到所述目标卷积核的第一运算周期的卷积运算结果方面,所述处理单元602具体用于:针对所述C个第二运算数据组中的每一组执行如下操作,得到C个卷积运算结果组,所述C个卷积运算结果组中的每一组包括2个卷积运算结果,所述C个第二运算数据组与C个第二加法器组一一对应,所述C个第二加法器组为所述B个第二加法器按照每2个为一组分组得到,所述C个第二加法器组中的每一组执行的运算操作相同:通过当前处理的第二运算数据组对应的第二加法器组执行所述一维加速算法的第二加法运算,得到当前处理的第二运算数据组对应的卷积运算结果组。
可以理解的是,由于方法实施例与装置实施例为相同技术构思的不同呈现形式,因此,本申请中方法实施例部分的内容应同步适配于装置实施例部分,此处不再赘述。
在图6所描述的神经网络运算装置600中,先通过A个第一加法器、A个乘法器和B个第二加法器对输入卷积核运算的A个第一目标数据执行一维加速算法运算,得到第一运算周期的卷积运算结果;然后将A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据;再通过A个第一加法器、A个乘法器和B个第二加法器对A个第二目标数据执行一维加速算法运算,得到第二运算周期的卷积运算结果;再将第一运算周期的卷积运算结果和第二运算周期的卷积运算结果累加,得到卷积核运算的运算结果;从而能够充分利用现有硬件资源实现加速算法对卷积神经网络运算的加速。
请参阅图7,图7是本申请实施例提供的一种神经网络运算设备710的结构示意图,如图7所示,所述神经网络运算设备710包括通信接口711、处理器712、存储器713和至少一个用于连接所述通信接口711、所述处理器712、所述存储器713的通信总线714。
存储器713包括但不限于随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM),该存储器713用于存储相关指令及数据。
通信接口711用于接收和发送数据。
处理器712可以是一个或多个中央处理器(central processing unit,CPU),在处理器712是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。
该神经网络运算设备710用于卷积神经网络运算,所述卷积神经网络包括目标卷积核,该神经网络运算设备710包括A个第一加法器、A个乘法器和B个第二加法器,其中,A为4的指数且A大于等于8,B为A的二分之一,该神经网络运算设备710中的处理器712用于读取所述存储器713中存储的一个或多个程序代码,执行以下操作:通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果;将所述A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据;通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据所述A个第二目标数据执行所述一维加速算法运算,得到所述目标卷积核的第二运算周期的卷积运算结果;根据所述第一运算周期的卷积运算结果和所述第二运算周期的卷积运算结果得到所述目标卷积核的卷积运算结果。
需要说明的是,各个操作的实现还可以对应参照上述方法实施例中相应的描述。
图7所描述的神经网络运算设备710,通过利用现有的卷积运算硬件中的A个第一加法器、A个乘法器和B个第二加法器对输入卷积核运算的A个第一目标数据执行一维加速算法运算,得到第一运算周期的卷积运算结果;然后将A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据;再通过A个第一加法器、A个乘法器和B个第二加法器对A个第二目标数据执行一维加速算法运算,得到第二运算周期的卷积运算结果;再将第一运算周期的卷积运算结果和第二运算周期的卷积运算结果累加,得到卷积核运算的运算结果;从而能够充分利用现有硬件资源实现加速算法对卷积神经网络运算的加速。
请参阅图8,图8为本申请实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器800。该芯片可以被设置在如图7所示的神经网络运算设备710中,上述方法实施例中的神经网络运算方法均可在如图8所示的芯片中得以实现。
神经网络处理器800作为协处理器挂载到主CPU(host CPU)上,由主CPU分配任务。神经网络处理器800的核心部分为运算电路803,控制器804控制运算电路803提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现方式中,运算电路803内部包括多个处理单元或计算单元(process engine,PE)。
在一些实现方式中,运算电路803是二维脉动阵列。运算电路803还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。
在一些实现方式中,运算电路803是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路803从权重存储器802中取矩阵B相应的数据,并缓存在运算电路803中每一个PE上。运算电路803从输入存储器801中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator,MAC)808中。
向量计算单元807可以对运算电路803的输出做进一步处理,如向量乘、向量加、指数运算、对数运算、大小比较等。例如,向量计算单元807可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现方式中,向量计算单元807能将经处理的输出的向量存储到统一存储器806。例如,向量计算单元807可以将非线性函数应用到运算电路803的输出,例如累加值的向量,用以生成激活值。
在一些实现方式中,向量计算单元807生成归一化的值、合并值,或二者均有。
在一些实现方式中,处理过的输出的向量能够用作运算电路803的激活输入,例如用于神经网络中的后续层。
统一存储器806用于存放输入数据以及输出数据。
存储单元访问控制器(direct memory access controller,DMAC)805用于将外部存储器中的输入数据搬运到输入存储器801和/或统一存储器806,将外部存储器中的权重数据存入权重存储器802,以及将统一存储器806中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)810,用于通过总线实现主CPU、DMAC和取指存储器809之间的交互。
与控制器804连接的取指存储器(instruction fetch buffer)809,用于存储控制器804使用的指令;
控制器804,用于调用取指存储器809中缓存的指令,以控制该运算加速器的工作过程。
一般地,统一存储器806,输入存储器801,权重存储器802以及取指存储器809均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,上述方法实施例中所示的方法流程得以实现。
本申请实施例还提供一种计算机程序产品,当所述计算机程序产品在计算机上运行时,上述方法实施例中所示的方法流程得以实现。
应理解,本申请实施例中提及的处理器可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中提及的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
需要说明的是,当处理器为通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件时,存储器(存储模块)集成在处理器中。
应注意,本文描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
 

Claims (13)

  1. 一种神经网络运算方法,其特征在于,应用于神经网络计算单元,所述神经网络计算单元用于卷积神经网络运算,所述卷积神经网络包括目标卷积核,所述神经网络计算单元包括A个第一加法器、A个乘法器和B个第二加法器,其中,A为4的指数且A大于等于8,B为A的二分之一,所述方法包括:
    通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果;
    将所述A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据;
    通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据所述A个第二目标数据执行所述一维加速算法运算,得到所述目标卷积核的第二运算周期的卷积运算结果;
    根据所述第一运算周期的卷积运算结果和所述第二运算周期的卷积运算结果得到所述目标卷积核的卷积运算结果。
  2. 根据权利要求1所述的方法,其特征在于,所述新的目标数据为与所述神经网络计算单元相邻的神经网络计算单元移出的目标数据或填充数据。
  3. 根据权利要求1或2所述的方法,其特征在于,所述通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果,包括:
    通过所述A个第一加法器根据所述A个第一目标数据执行所述一维加速算法的第一加法运算,得到A个第一运算数据;
    通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行所述一维加速算法的乘法运算,得到A个第二运算数据;
    通过所述B个第二加法器根据所述A个第二运算数据执行所述一维加速算法的第二加法运算,得到所述目标卷积核的第一运算周期的卷积运算结果。
  4. 根据权利要求3所述的方法,其特征在于,在通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行一维加速算法的乘法运算,得到A个第二运算数据之前,所述方法还包括:
    根据所述目标卷积核的权重确定所述一维加速算法的权重。
  5. 根据权利要求3所述的方法,其特征在于,所述通过所述A个第一加法器根据所述A个第一目标数据执行所述一维加速算法的第一加法运算,得到A个第一运算数据,包括:
    针对C个第一目标数据组中的每一组执行如下操作,得到C个第一运算数据组,所述C个第一运算数据组中的每一组包括4个第一运算数据,所述C为A的四分之一,所述C个第一目标数据组为所述A个第一目标数据按照每4个为一组分组得到,所述C个第一目标数据组与C个第一加法器组一一对应,所述C个第一加法器组为所述A个第一加法器按照每4个为一组分组得到,所述C个第一加法器组中的每一组执行的运算操作相同:
    通过当前处理的第一目标数据组对应的第一加法器组执行所述一维加速算法的第一加法运算,得到当前处理的第一目标数据组对应的第一运算数据组。
  6. 根据权利要求5所述的方法,其特征在于,所述通过所述A个乘法器根据所述A个第一运算数据和所述一维加速算法的权重执行一维加速算法的乘法运算,得到A个第二运算数据,包括:
    针对所述C个第一运算数据组中的每一组和所述一维加速算法的4个权重执行如下操作,得到C个第二运算数据组,所述C个第二运算数据组中的每一组包括4个第二运算数据,所述C个第一运算数据组中的每一组的4个第一运算数据与所述一维加速算法的4个权重一一对应,所述C个第一运算数据组与C个乘法器组一一对应,所述C个乘法器组为所述A个乘法器按照每4个为一组分组得到:
    通过当前处理的第一运算数据组对应的乘法器组执行所述一维加速算法的乘法运算,得到当前处理的第一运算数据组对应的第二运算数据组。
  7. 根据权利要求6所述的方法,其特征在于,所述通过所述B个第二加法器根据所述A个第二运算数据执行所述一维加速算法的第二加法运算,得到所述目标卷积核的第一运算周期的卷积运算结果,包括:
    针对所述C个第二运算数据组中的每一组执行如下操作,得到C个卷积运算结果组,所述C个卷积运算结果组中的每一组包括2个卷积运算结果,所述C个第二运算数据组与C个第二加法器组一一对应,所述C个第二加法器组为所述B个第二加法器按照每2个为一组分组得到,所述C个第二加法器组中的每一组执行的运算操作相同:
    通过当前处理的第二运算数据组对应的第二加法器组执行所述一维加速算法的第二加法运算,得到当前处理的第二运算数据组对应的卷积运算结果组。
  8. 一种神经网络运算方法,其特征在于,应用于二维卷积神经网络运算,所述二维卷积神经网络包括第一卷积核、第二卷积核和第三卷积核,所述方法包括:
    针对所述第一卷积核,执行如权利要求1所述的神经网络运算方法,得到所述第一卷积核的卷积运算结果;
    针对所述第二卷积核,执行如权利要求1所述的神经网络运算方法,得到所述第二卷积核的目标卷积运算结果,并根据所述第一卷积核的卷积运算结果和所述第二卷积核的目标卷积运算结果得到所述第二卷积核的卷积运算结果;
    针对所述第三卷积核,执行如权利要求1所述的神经网络运算方法,得到所述第三卷积核的目标卷积运算结果,并根据所述第二卷积核的卷积运算结果和所述第三卷积核的目标卷积运算结果得到所述二维卷积神经网络的卷积运算结果。
  9. 一种神经网络运算装置,其特征在于,应用于神经网络计算单元,所述神经网络计算单元用于卷积神经网络运算,所述卷积神经网络包括目标卷积核,所述神经网络计算单元包括A个第一加法器、A个乘法器和B个第二加法器,其中,A为4的指数且A大于等于8,B为A的二分之一,所述神经网络运算装置包括处理单元,所述处理单元用于:
    通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据输入所述目标卷积核的A个第一目标数据执行一维加速算法运算,得到所述目标卷积核的第一运算周期的卷积运算结果;
    以及将所述A个第一目标数据进行右移,移出2个第一目标数据的同时移入2个新的目标数据,得到A个第二目标数据;
    以及通过所述A个第一加法器、所述A个乘法器和所述B个第二加法器根据所述A个第二目标数据执行所述一维加速算法运算,得到所述目标卷积核的第二运算周期的卷积运算结果;
    以及根据所述第一运算周期的卷积运算结果和所述第二运算周期的卷积运算结果得到所述目标卷积核的卷积运算结果。
  10. 一种神经网络运算设备,其特征在于,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置由所述处理器执行,所述程序包括用于执行如权利要求1-7任一项所述的方法中的步骤的指令。
  11. 一种芯片,其特征在于,包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有所述芯片的设备执行如权利要求1-7中任一项所述的方法。
  12. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如权利要求1-7中任一项所述的方法。
  13. 一种计算机程序,所述计算机程序使得计算机执行如权利要求1-7中任一项所述的方法。
PCT/CN2021/088443 2020-06-28 2021-04-20 神经网络运算方法及相关设备 WO2022001301A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010602307.4A CN111814957B (zh) 2020-06-28 2020-06-28 神经网络运算方法及相关设备
CN202010602307.4 2020-06-28

Publications (1)

Publication Number Publication Date
WO2022001301A1 true WO2022001301A1 (zh) 2022-01-06

Family

ID=72856388

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088443 WO2022001301A1 (zh) 2020-06-28 2021-04-20 神经网络运算方法及相关设备

Country Status (2)

Country Link
CN (1) CN111814957B (zh)
WO (1) WO2022001301A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118094069A (zh) * 2024-04-18 2024-05-28 北京壁仞科技开发有限公司 逐通道卷积装置

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814957B (zh) * 2020-06-28 2024-04-02 深圳云天励飞技术股份有限公司 神经网络运算方法及相关设备
CN114697275B (zh) * 2020-12-30 2023-05-12 深圳云天励飞技术股份有限公司 数据处理方法和装置
TWI797985B (zh) * 2021-02-10 2023-04-01 國立成功大學 卷積運算的執行方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875012A (zh) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 一种基于fpga的深度卷积神经网络的流水化加速系统
CN106940815A (zh) * 2017-02-13 2017-07-11 西安交通大学 一种可编程卷积神经网络协处理器ip核
US20190228307A1 (en) * 2018-01-23 2019-07-25 Samsung Electronics Co., Ltd. Method and apparatus with data processing
US20200019847A1 (en) * 2019-09-25 2020-01-16 Intel Corporation Processor array for processing sparse binary neural networks
CN111814957A (zh) * 2020-06-28 2020-10-23 深圳云天励飞技术有限公司 神经网络运算方法及相关设备

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001298349A (ja) * 2000-04-17 2001-10-26 Matsushita Electric Ind Co Ltd オーバーサンプリングディジタルフィルタ回路
WO2018108126A1 (zh) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 神经网络卷积运算装置及方法
CN109117455A (zh) * 2017-06-26 2019-01-01 上海寒武纪信息科技有限公司 计算装置及方法
WO2019168084A1 (ja) * 2018-03-02 2019-09-06 日本電気株式会社 推論装置、畳み込み演算実行方法及びプログラム
CN109117187A (zh) * 2018-08-27 2019-01-01 郑州云海信息技术有限公司 卷积神经网络加速方法及相关设备
CN111144545B (zh) * 2018-11-02 2022-02-22 深圳云天励飞技术股份有限公司 用于实现卷积运算的处理元件、装置和方法
CN111260020B (zh) * 2018-11-30 2024-04-16 深圳市海思半导体有限公司 卷积神经网络计算的方法和装置
CN110580519B (zh) * 2019-08-19 2022-03-22 中国科学院计算技术研究所 一种卷积运算装置及其方法

Also Published As

Publication number Publication date
CN111814957B (zh) 2024-04-02
CN111814957A (zh) 2020-10-23

Similar Documents

Publication Publication Date Title
WO2022001301A1 (zh) 神经网络运算方法及相关设备
CN112214727B (zh) 运算加速器
US10445638B1 (en) Restructuring a multi-dimensional array
CN107729989B (zh) 一种用于执行人工神经网络正向运算的装置及方法
CN111062472B (zh) 一种基于结构化剪枝的稀疏神经网络加速器及其加速方法
CN104899182B (zh) 一种支持可变分块的矩阵乘加速方法
US11775430B1 (en) Memory access for multiple circuit components
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
TW201937416A (zh) 排程神經網路處理
JP7387017B2 (ja) アドレス生成方法及びユニット、深層学習処理器、チップ、電子機器並びにコンピュータプログラム
WO2022016926A1 (zh) 神经网络计算装置和数据读取、数据存储方法及相关设备
US20220253668A1 (en) Data processing method and device, storage medium and electronic device
CN110531954B (zh) 乘法器、数据处理方法、芯片及电子设备
CN116888591A (zh) 一种矩阵乘法器、矩阵计算方法及相关设备
WO2021081854A1 (zh) 一种卷积运算电路和卷积运算方法
US10997277B1 (en) Multinomial distribution on an integrated circuit
CN114510217A (zh) 处理数据的方法、装置和设备
CN112925727A (zh) Tensor高速缓存及访问结构及其方法
CN111832714A (zh) 运算方法及装置
TWI798591B (zh) 卷積神經網路運算方法及裝置
US11983128B1 (en) Multidimensional and multiblock tensorized direct memory access descriptors
WO2023115529A1 (zh) 芯片内的数据处理方法及芯片
CN111291864B (zh) 运算处理模组、神经网络处理器、电子设备及数据处理方法
CN118093452B (zh) 一种内存架构映射方法、设备、存储介质及程序产品
CN117313803B (zh) 基于risc-v向量处理器架构的滑动窗口2d卷积计算方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21831831

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21831831

Country of ref document: EP

Kind code of ref document: A1