CN114528526B - Convolution data processing method and device, convolution operation accelerator and storage medium - Google Patents


Info

Publication number
CN114528526B
CN114528526B (application CN202210434400.8A)
Authority
CN
China
Prior art keywords
data
processed
convolution
data group
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210434400.8A
Other languages
Chinese (zh)
Other versions
CN114528526A
Inventor
梁猷强
张斌
刘钊含
沈小勇
吕江波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Simou Intelligent Technology Co ltd
Shenzhen Smartmore Technology Co Ltd
Original Assignee
Beijing Simou Intelligent Technology Co ltd
Shenzhen Smartmore Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Simou Intelligent Technology Co ltd and Shenzhen Smartmore Technology Co Ltd
Priority to CN202210434400.8A
Publication of CN114528526A
Application granted
Publication of CN114528526B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Abstract

The present application relates to a convolution data processing method, apparatus, convolution operation accelerator, storage medium and computer program product. The method comprises the following steps: receiving data to be processed input into an integrated circuit chip; carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed; and sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result. The method can reduce the circuit scale of the integrated circuit chip.

Description

Convolution data processing method and device, convolution operation accelerator and storage medium
Technical Field
The present application relates to the field of electronic circuit technology, and in particular, to a convolution data processing method and apparatus, a convolution accelerator, a storage medium, and a computer program product.
Background
In the field of data processing technology, convolution operations are commonly performed on an FPGA (Field Programmable Gate Array) chip; for example, convolution calculations over image data are carried out by the FPGA chip.
In the conventional technology, the calculation units of an FPGA chip implement the convolution operation, and flip-flops beat (delay) the data to keep the timing of the convolution data inside the FPGA chip correct; however, this approach consumes a large number of flip-flop resources, which increases the circuit scale.
Disclosure of Invention
In view of the above, to solve the technical problems described, it is necessary to provide a convolution data processing method and apparatus, a convolution operation accelerator, a computer-readable storage medium, and a computer program product capable of reducing the circuit scale.
In a first aspect, the present application provides a method of convolutional data processing. The method comprises the following steps:
receiving data to be processed input into an integrated circuit chip;
carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed;
and sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result.
In one embodiment, the delaying the timing of the data to be processed to obtain delayed timing information of the data to be processed includes:
and carrying out delay processing on the time sequence of the data to be processed through a front-end circuit to obtain delay time sequence information of the data to be processed.
In one embodiment, the front end circuit comprises a delay device; the delay means is constituted by at least one register.
In one embodiment, delaying the timing of the data to be processed to obtain delayed timing information of the data to be processed includes:
determining a data group corresponding to the data to be processed;
when the data to be processed comprises at least two data groups, sending a first data group in the data to be processed to a convolution calculation unit, and performing delay processing on the data groups except the first data group in the data to be processed to obtain delay time sequence information of the data groups except the first data group in the data to be processed.
In one embodiment, the delaying the data group except the first data group in the data to be processed to obtain the delay timing information of the data group except the first data group in the data to be processed includes:
delaying the time sequence of the data groups except the first data group in the data to be processed to obtain first delayed time sequence information of each data group except the first data group in the data to be processed;
sending a second data group in the data to be processed to the convolution calculation unit according to the first delay time sequence information;
performing delay processing on first delay time sequence information of each data group except the first data group and the second data group in the data to be processed to obtain second delay time sequence information of each data group except the first data group and the second data group in the data to be processed;
and sending a third data group of the data to be processed to the convolution calculating unit according to the second delay time sequence information until a last data group in the data to be processed is sent to the convolution calculating unit.
In one embodiment, according to the delay timing information of the data to be processed, sending the data to be processed to a convolution calculation unit for convolution operation to obtain a corresponding convolution operation result, including:
acquiring the weight corresponding to the data to be processed;
and respectively inputting the data to be processed and the weight corresponding to the data to be processed into the convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed, so as to obtain the convolution processing result output by the convolution calculation unit.
In one embodiment, the inputting the data to be processed and the weight corresponding to the data to be processed into the convolution calculation unit for convolution operation to obtain the convolution processing result output by the convolution calculation unit includes:
according to each convolution calculation unit, performing multiplication and addition operation on each data group in the data to be processed and the weight corresponding to each data group respectively to obtain a multiplication and addition operation result output by each convolution calculation unit; the convolution units carry out multiplication and addition operation in a parallel mode;
and adding the multiplication and addition operation results output by the convolution calculation units to obtain the convolution processing result.
In one embodiment, according to each convolution calculation unit, performing a multiply-add operation on each data group in the data to be processed and a weight corresponding to each data group, respectively, to obtain a multiply-add operation result output by each convolution calculation unit, includes:
in each convolution calculation unit, performing multiplication and addition operation on a first data group and the weight corresponding to the first data group to obtain a first multiplication and addition operation result;
triggering the convolution calculation unit to carry out multiply-add operation on a second data group, the weight corresponding to the second data group and the first multiply-add operation result according to first delay time sequence information to obtain a second multiply-add operation result;
and triggering the convolution calculation unit to carry out multiply-add operation on a third data group, the weight corresponding to the third data group and the second multiply-add operation result according to second delay time sequence information to obtain a third multiply-add operation result until the multiply-add operation result is obtained.
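The cascaded multiply-add steps above can be modeled with a short Python sketch. This is a behavioral illustration, not the patent's hardware, and the function name is hypothetical: each stage adds the dot product of one data group and its weights to the partial result passed on by the previous stage.

```python
def cascaded_mac(groups, weights):
    """Accumulate group by group, as the cascaded convolution stages would.

    groups[k] and weights[k] are the k-th data group and its weights;
    each stage adds its multiply-add result to the previous stage's sum.
    """
    acc = 0
    for data, w in zip(groups, weights):
        acc += sum(d * c for d, c in zip(data, w))  # one multiply-add stage
    return acc

# 3x3 example: three data groups of three values each
groups  = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
weights = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
print(cascaded_mac(groups, weights))  # (1+3) + 5 + (7+9) = 25
```

In hardware, the delay time sequence information determines on which clock cycle each stage's inputs arrive, so that every partial sum meets the next data group at the right time.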
In a second aspect, the present application further provides a convolution data processing apparatus. The device comprises:
the data receiving module is used for receiving data to be processed input into the integrated circuit chip;
the front-end circuit module is used for carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed;
and the convolution operation module is used for sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result.
In a third aspect, the present application further provides a convolution operation accelerator. The convolution operation accelerator comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the following steps when executing the computer program:
receiving data to be processed input into an integrated circuit chip;
carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed;
and sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
receiving data to be processed input into an integrated circuit chip;
carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed;
and sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
receiving data to be processed input into an integrated circuit chip;
carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed;
and sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result.
According to the convolution data processing method, the convolution operation accelerator, the storage medium and the computer program product, the data to be processed input into the integrated circuit chip is received, and its timing is delayed to obtain the delay time sequence information of the data to be processed; the data to be processed is then sent, according to that delay time sequence information, to a convolution calculation unit for convolution operation to obtain the corresponding convolution operation result. In this way, the delay processing is applied to the data before it reaches the convolution calculation units; compared with delaying the data inside each convolution calculation unit, fewer resources are spent on the delay, which effectively reduces the circuit scale of the integrated circuit chip.
Drawings
FIG. 1 is a schematic flow chart diagram of a convolution data processing method in one embodiment;
FIG. 2 is a logical block diagram of a convolution data processing method in one embodiment;
FIG. 3 is a flowchart illustrating the step of obtaining delay timing information of data to be processed according to one embodiment;
FIG. 4 is a block diagram of the logic for delay processing by the front end in one embodiment;
FIG. 5 is a diagram of the logic structure for performing convolution operations with a convolution kernel in one embodiment;
FIG. 6 is a flow chart illustrating a convolution data processing method according to another embodiment;
FIG. 7 is a block diagram showing the structure of a convolution data processing apparatus according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a convolution data processing method is provided, which may be applied to convolution operation of a deep neural network by a convolution operation accelerator, which may be an integrated circuit chip, but may also include other chips capable of performing convolution operation. In this embodiment, the method includes the steps of:
step S101, receiving data to be processed input into the integrated circuit chip.
The integrated circuit chip may include an FPGA (Field Programmable Gate Array) chip.
The data to be processed refers to data that needs to be subjected to convolution operation, and may be image data, voice data, or point cloud data used for convolution operation, for example. The data to be processed may be of an n × n structure, where n may be 1, 2, 3, 4, 5, etc., and is not particularly limited herein.
Specifically, the data to be processed that requires convolution operation is input to the convolution operation accelerator as a data stream, and the convolution operation accelerator receives it.
And step S102, carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed.
Here, the delay time sequence information refers to the number of clock cycles by which the data is delayed.
Specifically, fig. 2 shows a logical structure diagram of the convolution data processing method. As shown in fig. 2, a front-end circuit is arranged in the convolution operation accelerator; the delay device 21 in the front-end circuit delays the timing of each datum in the data to be processed to obtain the delay time sequence information of each datum, and this delay processing ensures that the timing inside the convolution operation accelerator is correct.
And step S103, sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed, and obtaining a corresponding convolution operation result.
The convolution calculation unit refers to a calculation unit in the convolution operation accelerator that carries out the convolution operation. The convolution calculation unit may be a DSP (Digital Signal Processing) slice in the underlying fabric of the FPGA chip, for example a DSP48 IP core.
Specifically, the convolution operation accelerator sends the data to be processed to the convolution calculation unit according to the delay time sequence information of the data to be processed, and the convolution operation accelerator performs convolution operation on the received data through the plurality of cascaded convolution calculation units to obtain a convolution operation result corresponding to the data to be processed.
In practical applications, a convolutional neural network often needs to reuse convolution computing units in parallel while convolving the data to be processed. As shown in fig. 2, the data to be processed is input on a single channel, and the front-end circuit delays it to obtain its delay time sequence information; according to that information, the data is then sent to the convolution kernels 22 for convolution operation, giving the convolution operation result output by each convolution kernel 22. One convolution kernel corresponds to one output channel, so a multi-channel output is obtained. It should be noted that a convolution kernel is formed by cascading a plurality of convolution computing units; in the conventional technology every convolution computing unit must delay the data to be processed itself, so that when multi-channel output is produced, the many convolution kernels of the conventional technology consume a large amount of resources on delay handling.
In the convolution data processing method, delay processing is carried out on the time sequence of the data to be processed by receiving the data to be processed input into the integrated circuit chip, so as to obtain delay time sequence information of the data to be processed; and sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result. By adopting the method, the data to be processed input into the integrated circuit chip is firstly delayed, and then the convolution calculation unit is utilized for convolution processing, compared with the method for delaying the data to be processed in the convolution calculation unit, the method consumes less resources for processing delay, and effectively reduces the circuit scale of the integrated circuit chip.
In an embodiment, in the step S102, performing delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed specifically includes the following steps: and carrying out delay processing on the time sequence of the data to be processed through the front-end circuit to obtain delay time sequence information of the data to be processed.
Specifically, as shown in fig. 2, the convolution accelerator beats the signal of a data group in the data to be processed through the front-end circuit to obtain the delay time sequence information of the data to be processed. Signal beating controls the timing of a data group in the data to be processed: beating a data group one beat delays it by one clock cycle, beating it two beats delays it by two clock cycles, and beating it R beats delays it by R clock cycles, where R is a positive integer.
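Signal beating can be illustrated with a minimal Python sketch. This is a software model of a register chain, not the patent's RTL, and the function name is hypothetical: each register stage passes its stored value onward on every clock edge, so R stages delay the stream by R cycles.

```python
def beat(samples, beats):
    """Delay a stream of samples by `beats` clock cycles.

    Models a chain of `beats` registers: on each clock edge every
    register hands its stored value to the next stage. The first
    `beats` outputs are the registers' initial (empty, None) contents.
    """
    pipeline = [None] * beats          # register chain, initially empty
    out = []
    for s in samples:
        out.append(pipeline[0] if beats else s)
        if beats:
            pipeline = pipeline[1:] + [s]
    return out

print(beat([10, 20, 30, 40], 2))  # [None, None, 10, 20]
```

With `beats=2` the first sample only emerges two cycles later, exactly the behavior a two-beat delay device gives a data group.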
In this embodiment, the front-end circuit uniformly delays the data to be processed, which reduces the register resources consumed during multi-channel output, effectively reduces the circuit area, and further reduces the power consumption of the integrated circuit chip.
In one embodiment, the front end circuit includes a delay device; the delay means is formed by at least one register.
Specifically, as shown in fig. 2, the delay means 21 may be a register, and may be constituted by at least one register. A flip-flop is the basic building block of a register; the convolution operation accelerator uses flip-flops in the front-end circuit to beat the data groups in the data to be processed and thereby change their output timing.
In an embodiment, as shown in fig. 3, in the step S102, performing delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed specifically includes the following steps:
step S301, determining a data group corresponding to the data to be processed.
Step S302, when the data to be processed comprises at least two data groups, sending a first data group in the data to be processed to a convolution calculation unit, and performing delay processing on data groups except the first data group in the data to be processed to obtain delay time sequence information of the data groups except the first data group in the data to be processed.
Wherein, the data group refers to a data set formed by data in the data to be processed. The data group may be a column of data in the data to be processed, a row of data in the data to be processed, or a data group formed by other forms of data.
Specifically, the convolution operation accelerator acquires a data group corresponding to the data to be processed according to a preset data group determination mode; when the data to be processed comprises a data group, sending the data to be processed to a convolution calculation unit; when the data to be processed comprises at least two data groups, a first group of data in the data to be processed is directly sent to the convolution calculation unit, namely the first group of data in the data to be processed does not need to be subjected to delay processing, and then each group of data group except the first data group in the data to be processed is subjected to delay processing to obtain delay time sequence information of the data group except the first data group in the data to be processed.
In this embodiment, the convolution operation accelerator determines a data group corresponding to the data to be processed, sends a first data group in the data to be processed to the convolution calculation unit, and performs delay processing on data groups except the first data group in the data to be processed to obtain delay time sequence information of the data groups except the first data group in the data to be processed, so that reasonable delay of the time sequence of the data to be processed is realized, and the time sequence of the convolution operation accelerator is ensured to be correct.
In an embodiment, in the step S302, performing delay processing on the data group to be processed, except for the first data group, to obtain the delay time sequence information of the data group to be processed, except for the first data group, specifically includes the following steps: delaying the time sequence of the data groups except the first data group in the data to be processed to obtain first delay time sequence information of each data group except the first data group in the data to be processed; sending a second data group in the data to be processed to a convolution calculation unit according to the first delay time sequence information; delaying the first delay time sequence information of each data group except the first data group and the second data group in the data to be processed to obtain second delay time sequence information of each data group except the first data group and the second data group in the data to be processed; and sending a third data group of the data to be processed to the convolution calculating unit according to the second delay time sequence information until the last data group in the data to be processed is sent to the convolution calculating unit.
The delay processing refers to delaying the timing by one clock cycle.
Specifically, the convolution operation accelerator firstly directly sends a first group of data in the data to be processed to the convolution calculation unit by using the delay device, and then delays the time sequence of the data group except the first data group in the data to be processed by one clock cycle by using the delay device to obtain first delay time sequence information of each data group except the first data group in the data to be processed, namely the first delay time sequence information indicates that one clock cycle is delayed; sending a second data group in the data to be processed to a convolution calculation unit according to the first delay time sequence information; delaying the first delay time sequence information of each data group except the first data group and the second data group in the data to be processed by a clock cycle again by using a delaying device to obtain second delay time sequence information of each data group except the first data group and the second data group in the data to be processed, wherein the second delay time sequence information represents two clock cycles of delay; and sending a third data group of the data to be processed to the convolution calculating unit according to the second delay time sequence information until the last data group in the data to be processed is sent to the convolution calculating unit.
The delay device performs one fewer delay pass than there are data groups, and each pass covers the total amount of data minus the data groups already sent. Taking 3 × 3 data to be processed, grouped three values at a time, as an example: the first group of 3 values is sent directly to the convolution calculation unit; the delay device delays the remaining 6 values of the second and third groups; the second group of 3 values is then sent; the delay device again delays the remaining 3 values of the third group; finally the third group of 3 values is sent. In total there are two delay passes, the first covering 6 values and the second 3 values, i.e. 9 delayed values altogether.
In practical application, the method comprises the following steps: (1) directly sending a first group of data in the data to be processed to a convolution calculation unit; (2) carrying out delay processing on the time sequences of the data groups except the sent data group in the data to be processed; (3) after delay processing, sending a first group of data in the remaining data group in the data to be processed to a convolution calculation unit; and (4) circulating the steps (2) to (3) until the last data group in the data to be processed is sent to the convolution calculation unit.
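Steps (1) to (4) can be modeled with a short Python sketch. The names are illustrative and this is a software model of the hardware schedule, not the circuit itself: each round sends the first remaining group and adds one more clock cycle of delay to everything still pending.

```python
def schedule(groups):
    """Return (group, delay_in_cycles) pairs following steps (1)-(4):
    the first group is sent with no delay; every remaining group picks
    up one additional clock cycle of delay per round before it is sent."""
    sent = []
    delay = 0
    pending = list(groups)
    while pending:
        sent.append((pending.pop(0), delay))  # send first remaining group
        delay += 1                            # delay the rest one more cycle
    return sent

cols = [["D11", "D12", "D13"], ["D21", "D22", "D23"], ["D31", "D32", "D33"]]
for grp, d in schedule(cols):
    print(grp, "sent after", d, "cycles")
```

Running this shows the staircase pattern of fig. 4: the k-th data group reaches the convolution calculation unit k-1 clock cycles after the first.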
Fig. 4 is a logical structure diagram of the delay processing performed by the front-end circuit. As shown in fig. 4, the data to be processed has an n × n structure, and its data groups are determined by columns. The first group of data (D11, D12, D13, D14, …, D1n) is sent directly to the convolution calculation unit; the timing of the n-1 data groups headed by D21, D31, D41, …, Dn1 is delayed by 1 clock cycle, and the delayed second data group (D21, D22, D23, D24, …, D2n) is sent to the convolution calculation unit; the timing of the n-2 data groups headed by D31, D41, …, Dn1 is then delayed by another clock cycle, i.e. by 2 clock cycles in total, and the delayed third data group (D31, D32, D33, D34, …, D3n) is sent to the convolution calculation unit; the fourth data group (D41, D42, D43, D44, …, D4n), the fifth data group, the sixth data group, and so on are delayed and transmitted in the same manner until the nth data group (Dn1, Dn2, Dn3, Dn4, …, Dnn) is sent to the convolution calculation unit.
Further, assume the delay device consumes 8 bits per delayed data value, and take as an example the convolution of 5 × 5 data to be processed with 1 input channel and 32 output channels. Because the delay is realized in the front-end circuit, the number of output channels does not affect the cost, and the method consumes [(25-5) + (25-10) + (25-15) + (25-20)] × 8 = 400 resource bits in total. In the conventional technology the delay is placed in every convolution calculation unit of a convolution kernel, and one output channel corresponds to one convolution kernel, so each additional output channel adds another full set of delay resources; with 32 output channels the conventional technology therefore consumes at least 400 × 32 = 12800 resource bits. The present approach thus greatly reduces the resources required for delay processing.
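The resource arithmetic above can be checked in a few lines of Python (the 8-bit cost per delayed value is the example's assumption, not a fixed property of the hardware):

```python
# Check of the 5x5, 1-input / 32-output-channel example.
n, bits, out_ch = 5, 8, 32
total = n * n  # 25 data values in all

# After k groups of n values have been sent, total - k*n values
# remain in the delay stage; sum over the n-1 delay passes.
front_end = sum((total - k * n) for k in range(1, n)) * bits
print(front_end)           # (20 + 15 + 10 + 5) * 8 = 400
print(front_end * out_ch)  # conventional per-kernel delays: 12800
```

The front-end cost is independent of the output-channel count, while the conventional per-kernel scheme scales linearly with it.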
In this embodiment, the data groups except the first data group in the data to be processed are delayed by the front-end circuit, so as to sequentially obtain the delay timing information of the data groups except the first data group in the data to be processed, and then each data group in the data to be processed is sent to the convolution calculating unit based on the delay timing information, so as to perform the subsequent convolution operation step. By adopting the method, the delay processing is uniformly carried out on the data to be processed by arranging the front-end circuit, so that the resource for delay processing which needs to be consumed during multi-channel output can be greatly reduced, the circuit area is effectively reduced, and the power consumption of an integrated circuit chip is further reduced.
In an embodiment, in the step S103, according to the delay timing information of the data to be processed, the data to be processed is sent to the convolution calculating unit for convolution operation, so as to obtain the corresponding convolution operation result, which specifically includes the following steps: acquiring the weight corresponding to the data to be processed; and respectively inputting the data to be processed and the weight corresponding to the data to be processed into a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed, so as to obtain a convolution processing result output by the convolution calculation unit.
The convolution kernel includes an internal memory (RAM); the internal memory includes a weight RAM and a data RAM. The weight RAM stores the weights corresponding to the data to be processed, and the data RAM stores each data group of the data to be processed.
Specifically, the convolution operation accelerator queries a convolution kernel to obtain a weight corresponding to the data to be processed in the convolution kernel, and triggers the convolution calculation unit to perform convolution operation on the data to be processed and the weight corresponding to the data to be processed according to the delay time sequence information of the data to be processed to obtain an output convolution processing result.
Further, when single-channel input and multi-channel output are performed, as shown in fig. 2, the convolution operation accelerator inputs the data to be processed into a plurality of parallel convolution kernels respectively to perform convolution operation, where the data to be processed and the delay time sequence information of the data to be processed are the same for each convolution kernel, and each convolution kernel holds its own weights for the data to be processed.
In this embodiment, the convolution operation accelerator obtains the weight corresponding to the data to be processed, and according to the delay timing information of the data to be processed, the data to be processed and the weight corresponding to the data to be processed are respectively input to the convolution calculation unit for convolution operation, so as to obtain the convolution processing result output by the convolution calculation unit, thereby implementing the convolution operation of the data to be processed.
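A minimal sketch of this single-input, multi-output arrangement follows; it assumes the front-end delay is applied once and each parallel convolution kernel holds its own weights, and `multi_channel_convolve` is an illustrative name rather than anything defined in the patent.

```python
# Illustrative model: the front end delays the data stream once, then the
# same delayed stream is fanned out to one convolution kernel per output
# channel, each kernel applying its own weights.

def multi_channel_convolve(data_groups, weights_per_channel):
    delayed = data_groups                        # delay applied once by the front end
    outputs = []
    for kernel_weights in weights_per_channel:   # one kernel per output channel
        acc = 0
        for dgroup, wgroup in zip(delayed, kernel_weights):
            acc += sum(d * w for d, w in zip(dgroup, wgroup))
        outputs.append(acc)
    return outputs

data = [[1, 2], [3, 4]]
w0 = [[1, 0], [0, 1]]   # weights for output channel 0
w1 = [[1, 1], [1, 1]]   # weights for output channel 1
print(multi_channel_convolve(data, [w0, w1]))  # [5, 10]
```

Because the delay is applied before the fan-out, adding output channels adds kernels but no extra delay logic, which is the resource saving claimed above.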
In one embodiment, the data to be processed and the weight corresponding to the data to be processed are input to the convolution calculation unit for convolution operation, so as to obtain a convolution processing result output by the convolution calculation unit, and the method specifically includes the following steps: according to each convolution calculation unit, performing multiplication and addition operation on each data group in the data to be processed and the weight corresponding to each data group respectively to obtain a multiplication and addition operation result output by each convolution calculation unit; each convolution unit carries out multiplication and addition operation in a parallel mode; and adding the multiplication and addition operation results output by the convolution calculation units to obtain a convolution processing result.
The multiply-add operation is implemented by a convolution calculation unit; for example, one DSP48 IP core can implement one multiply-add operation, and the convolution operation on data to be processed that includes at least two data groups is implemented by cascading a plurality of convolution calculation units.
Specifically, the convolution operation accelerator uses each convolution calculation unit to perform a multiply-add operation on each data group in the data to be processed and the weight corresponding to that data group, obtaining a multiply-add operation result output by each convolution calculation unit, where the convolution calculation units perform their multiply-add operations in parallel. The accelerator then determines the cascade layer corresponding to each convolution calculation unit according to the cascade relation among the convolution calculation units, and accumulates the multiply-add operation results output by the convolution calculation units in each cascade layer to obtain the accumulation result output by that cascade layer.
Fig. 5 shows a logical structure diagram of the convolution operation performed by a convolution kernel. As shown in fig. 5, the convolution calculation units responsible for calculating D11, D12, and D13 form one cascade layer, the units responsible for calculating D21, D22, and D23 form another cascade layer, and the units responsible for calculating D31, D32, and D33 form a third; the accumulated result of each cascade layer is calculated first, and the accumulated results of all cascade layers are then added to obtain the final convolution processing result.
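The cascade-layer structure of Fig. 5 can be modeled as below; this is a sketch only, `cascade_convolve` is an illustrative name, and each inner sum stands in for one cascade layer's accumulated multiply-add result.

```python
# Illustrative model of Fig. 5: each cascade layer accumulates the products
# of its own row of data and weights (in parallel in hardware), and the
# per-layer results are then added into the final convolution result.

def cascade_convolve(data_rows, weight_rows):
    layer_results = [
        sum(d * w for d, w in zip(drow, wrow))   # one cascade layer's result
        for drow, wrow in zip(data_rows, weight_rows)
    ]
    return sum(layer_results)                    # add the cascade layers

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(cascade_convolve(data, weights))  # 1 + 5 + 9 = 15
```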
In this embodiment, the convolution operation accelerator performs multiply-add operation on each data group in the data to be processed and the weight corresponding to each data group in a parallel manner through each convolution unit, and then adds the multiply-add operation results output by each convolution calculation unit to obtain a convolution processing result, so that parallel operation of each data group in the data to be processed is realized, and compared with performing serial operation on each data group, the convolution operation accelerator can improve the processing efficiency of convolution operation.
In one embodiment, according to each convolution calculation unit, performing a multiply-add operation on each data group in the data to be processed and the weight corresponding to each data group, respectively, to obtain a multiply-add operation result output by each convolution calculation unit, specifically including the following steps: in each convolution calculation unit, performing multiplication and addition operation on the first data group and the weight corresponding to the first data group to obtain a first multiplication and addition operation result; triggering a convolution calculation unit to carry out multiply-add operation on the second data group, the weight corresponding to the second data group and the first multiply-add operation result according to the first delay time sequence information to obtain a second multiply-add operation result; and triggering the convolution calculation unit to carry out multiply-add operation on the third data group, the weight corresponding to the third data group and the second multiply-add operation result according to the second delay time sequence information to obtain a third multiply-add operation result until the multiply-add operation result is obtained.
It should be noted that the first data group, the second data group, the third data group, the first delay timing information, and the second delay timing information in the present embodiment have the same meanings as those of the first data group, the second data group, the third data group, the first delay timing information, and the second delay timing information in the above embodiment, and therefore, in the present embodiment, the multiplication and addition operation can be performed by directly using the first data group, the second data group, the third data group, the first delay timing information, and the second delay timing information obtained in the above embodiment.
Specifically, taking data to be processed including three data sets as an example, the convolution accelerator performs multiplication on weights corresponding to the first data set and the first data set in each convolution calculation unit to obtain a first multiplication result, and then obtains a default padding value from the data RAM and performs addition operation on the default padding value and the first multiplication result to obtain a first multiplication and addition result; triggering a convolution calculation unit to carry out multiplication operation on the second data group and the weight corresponding to the second data group according to the first delay time sequence information to obtain a second multiplication operation result, and carrying out addition operation on the second multiplication operation result and the first multiplication and addition operation result to obtain a second multiplication and addition operation result; and triggering the convolution calculation unit to perform multiplication operation on the third data group and the weight corresponding to the third data group according to the second delay time sequence information to obtain a third multiplication operation result, performing addition operation on the third multiplication operation result and the second multiplication and addition operation result to obtain a third multiplication and addition operation result, and taking the third multiplication and addition operation result as the multiplication and addition operation result. Wherein the default padding value may comprise 0.
Further, when the data group to be processed includes three or more data groups, the following steps may be repeatedly performed from the fourth data group: triggering a convolution calculation unit to carry out multiplication operation on the weights corresponding to the next data group and the next data group according to the next time sequence information to obtain a current multiplication operation result, and carrying out addition operation on the current multiplication operation result and the last multiplication and addition operation result to obtain a next multiplication and addition operation result; and when the last data group is executed, taking the multiplication and addition operation result of the current time as the multiplication and addition operation result.
In practical application, as shown in fig. 5, the convolution calculation unit includes a multiplier and an adder. The multiplier multiplies the data D11 by the weight W11 to obtain D11×W11; since D11 belongs to the first data group and is not delayed, the convolution calculation unit obtains the default value 0 from the RAM of the convolution kernel and adds it to the product of D11 and W11, obtaining D11×W11+0. The unit then receives the data D12, delayed by one clock cycle, obtains the weight W12 corresponding to D12 from the RAM of the convolution kernel, multiplies D12 by W12 to obtain D12×W12, and accumulates it with D11×W11+0 to obtain D11×W11+D12×W12. It then receives the data D13, delayed by two clock cycles, obtains the weight W13 corresponding to D13 from the RAM of the convolution kernel, multiplies D13 by W13, and accumulates the product to obtain D11×W11+D12×W12+D13×W13. Finally, the multiply-add results of all the cascade layers are added in turn to obtain the convolution operation result D11×W11+D12×W12+D13×W13+D21×W21+…+D33×W33.
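The multiply-accumulate chain just described can be sketched as follows, under the assumptions that the default padding value 0 seeds the accumulator and that each later product arrives one clock cycle after the previous one; `mac_chain` is an illustrative name.

```python
# Illustrative model of one cascade layer's multiply-accumulate chain:
# the first product is added to the default padding value (0) fetched from
# the data RAM, and every later, delayed product is added to the running sum.

def mac_chain(data_group, weights, padding=0):
    acc = padding                 # default padding value from the data RAM
    for d, w in zip(data_group, weights):
        acc = acc + d * w         # one multiply-add per clock cycle
    return acc

# D11*W11 + D12*W12 + D13*W13 with example values:
print(mac_chain([2, 3, 4], [5, 6, 7]))  # 2*5 + 3*6 + 4*7 = 56
```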
In this embodiment, when the convolution operation accelerator completes the multiply-add operation of the current data group and the weight corresponding to the current data group, the convolution operation accelerator performs the multiply-add operation of the weight corresponding to the next data group and the next data group according to the timing information, and controls the sequence of the data groups transmitted to the convolution calculation unit by using the timing information, thereby ensuring the normal operation of the convolution operation.
In one embodiment, as shown in fig. 6, another convolution data processing method is provided, which is described by taking the example that the method is applied to a convolution operation accelerator, and includes the following steps:
Step S601, receive the data to be processed input into the integrated circuit chip, and determine the data groups corresponding to the data to be processed.
Step S602, when the data to be processed includes at least two data groups, send a first data group in the data to be processed to the convolution calculating unit.
Step S603, perform delay processing on the time sequence of the data groups except the first data group in the data to be processed, to obtain first delay time sequence information of each data group except the first data group in the data to be processed.
Step S604, according to the first delay timing information, sending a second data group in the data to be processed to the convolution calculating unit.
Step S605, perform delay processing on the first delay timing information of each data group except the first data group and the second data group in the data to be processed, to obtain second delay timing information of each data group except the first data group and the second data group in the data to be processed.
Step S606, according to the second delay timing information, sending the third data group of the data to be processed to the convolution calculating unit until the last data group of the data to be processed is sent to the convolution calculating unit.
Step S607, the weight corresponding to the data to be processed is obtained.
Step S608, according to the delay timing information of the data to be processed, in each convolution calculation unit, performing a multiply-add operation on the first data group and the weight corresponding to the first data group, respectively, to obtain a first multiply-add operation result.
And step S609, triggering the convolution calculation unit to carry out multiply-add operation on the second data group, the weight corresponding to the second data group and the first multiply-add operation result according to the first delay time sequence information to obtain a second multiply-add operation result.
And step S610, according to the second delay time sequence information, triggering the convolution calculation unit to carry out multiply-add operation on the third data group, the weight corresponding to the third data group and the second multiply-add operation result, and obtaining a third multiply-add operation result until the multiply-add operation result is obtained.
In step S611, the multiplication and addition operation results output by the convolution calculation units are added to obtain a convolution processing result.
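Taken together, steps S601-S611 can be summarized in an illustrative Python model for one output channel; all names are hypothetical, and the `delay` index only marks when each data group would be released to its convolution calculation unit.

```python
# End-to-end sketch of steps S601-S611 (one output channel): column groups
# are released with increasing delay, each convolution calculation unit runs
# a multiply-accumulate over its group, and the per-unit results are summed.

def convolve_flow(data_groups, weight_groups):
    unit_results = []
    for delay, (dgroup, wgroup) in enumerate(zip(data_groups, weight_groups)):
        # 'delay' clock cycles model steps S602-S606 (front-end delays);
        # the accumulation below models steps S607-S610.
        acc = 0
        for d, w in zip(dgroup, wgroup):
            acc += d * w
        unit_results.append(acc)
    return sum(unit_results)      # step S611: add the multiply-add results

data = [[1, 1], [2, 2], [3, 3]]
weights = [[1, 1], [1, 1], [1, 1]]
print(convolve_flow(data, weights))  # 2 + 4 + 6 = 12
```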
The convolution data processing method has the following beneficial effects:
(1) the delay processing is uniformly carried out on the data to be processed by arranging the front-end circuit, so that resources for delay processing, which need to be consumed in multi-channel output, can be greatly reduced, the circuit area is effectively reduced, and the power consumption of an integrated circuit chip is further reduced;
(2) and the parallel operation of each data group in the data to be processed is realized, and compared with the serial operation of each data group, the processing efficiency of the convolution operation can be improved.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is also not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a convolution data processing apparatus for implementing the above-mentioned convolution data processing method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the convolution data processing apparatus provided below may refer to the limitations on the convolution data processing method in the foregoing, and details are not described here again.
In one embodiment, as shown in FIG. 7, there is provided a convolutional data processing apparatus 700, comprising: a data acquisition module 701, a front-end circuit module 702 and a convolution operation module 703, wherein:
a data obtaining module 701, configured to receive data to be processed input into an integrated circuit chip.
The front-end circuit module 702 is configured to perform delay processing on a timing sequence of the data to be processed, so as to obtain delay timing sequence information of the data to be processed.
And the convolution operation module 703 is configured to send the data to be processed to the convolution calculation unit for convolution operation according to the delay timing information of the data to be processed, so as to obtain a corresponding convolution operation result.
In one embodiment, the front-end circuit module 702 is further configured to determine a data group corresponding to the data to be processed; when the data to be processed comprises at least two data groups, sending a first data group in the data to be processed to the convolution calculation unit, and performing delay processing on data groups except the first data group in the data to be processed to obtain delay time sequence information of the data groups except the first data group in the data to be processed.
In one embodiment, the convolutional data processing apparatus 700 further includes a delay sending module, configured to perform delay processing on the timing of the data groups, except the first data group, in the data to be processed, so as to obtain first delay timing information of each data group, except the first data group, in the data to be processed; sending a second data group in the data to be processed to a convolution calculation unit according to the first delay time sequence information; delaying the first delay time sequence information of each data group except the first data group and the second data group in the data to be processed to obtain second delay time sequence information of each data group except the first data group and the second data group in the data to be processed; and sending a third data group of the data to be processed to the convolution calculating unit according to the second delay time sequence information until the last data group in the data to be processed is sent to the convolution calculating unit.
In one embodiment, the convolution operation module 703 is further configured to obtain a weight corresponding to the data to be processed; and respectively inputting the data to be processed and the weight corresponding to the data to be processed into a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed, so as to obtain a convolution processing result output by the convolution calculation unit.
In an embodiment, the convolution data processing apparatus 700 further includes a convolution calculation module, configured to perform, according to each convolution calculation unit, a multiply-add operation on each data group in the data to be processed and the weight corresponding to each data group, respectively, so as to obtain a multiply-add operation result output by each convolution calculation unit; each convolution unit carries out multiplication and addition operation in a parallel mode; and adding the multiplication and addition operation results output by the convolution calculation units to obtain a convolution processing result.
In one embodiment, the convolution data processing apparatus 700 further includes a multiply-add operation module, configured to perform a multiply-add operation on the first data group and the weight corresponding to the first data group in each convolution calculation unit, respectively, to obtain a first multiply-add operation result; triggering a convolution calculation unit to carry out multiply-add operation on the second data group, the weight corresponding to the second data group and the first multiply-add operation result according to the first delay time sequence information to obtain a second multiply-add operation result; and triggering the convolution calculation unit to carry out multiply-add operation on the third data group, the weight corresponding to the third data group and the second multiply-add operation result according to the second delay time sequence information to obtain a third multiply-add operation result until the multiply-add operation result is obtained.
In one embodiment, the convolution data processing apparatus 700 further includes a signal-tapping module for performing signal-tapping on the data to be processed through a register.
The various modules in the convolutional data processing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent of a processor in a convolution operation accelerator, and can also be stored in a memory in the convolution operation accelerator in a software form, so that the processor can call and execute the corresponding operations of the modules.
In one embodiment, there is also provided a convolution operation accelerator, including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It should be noted that the data referred to in the present application (including but not limited to data for analysis, stored data, presented data, etc.) are information and data that are fully authorized by each party.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases, and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., without limitation.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (8)

1. A method of convolutional data processing, said method comprising:
receiving data to be processed input into an integrated circuit chip;
carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed;
sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result;
the delaying the time sequence of the data to be processed to obtain the delayed time sequence information of the data to be processed includes:
determining a data group corresponding to the data to be processed; when the data to be processed comprises at least two data groups, sending a first data group in the data to be processed to a convolution calculation unit, and performing delay processing on the time sequence of the data groups except the first data group in the data to be processed through a front-end circuit to obtain first delay time sequence information of each data group except the first data group in the data to be processed; sending a second data group in the data to be processed to the convolution calculation unit according to the first delay time sequence information; performing delay processing on first delay time sequence information of each data group except the first data group and the second data group in the data to be processed to obtain second delay time sequence information of each data group except the first data group and the second data group in the data to be processed; and sending a third data group of the data to be processed to the convolution calculating unit according to the second delay time sequence information until a last data group in the data to be processed is sent to the convolution calculating unit.
2. The method of claim 1, wherein the front-end circuit comprises a delay device; the delay means is constituted by at least one register.
3. The method according to claim 1, wherein the sending the data to be processed to a convolution calculation unit for convolution operation according to the delay timing information of the data to be processed to obtain a corresponding convolution operation result comprises:
acquiring the weight corresponding to the data to be processed;
and respectively inputting the data to be processed and the weight corresponding to the data to be processed into the convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed, so as to obtain the convolution operation result output by the convolution calculation unit.
4. The method according to claim 3, wherein the inputting the data to be processed and the weight corresponding to the data to be processed into the convolution calculation unit for convolution operation to obtain the convolution operation result output by the convolution calculation unit comprises:
according to each convolution calculation unit, performing multiplication and addition operation on each data group in the data to be processed and the weight corresponding to each data group respectively to obtain a multiplication and addition operation result output by each convolution calculation unit; each convolution computing unit carries out multiplication and addition operation in a parallel mode;
and adding the multiplication and addition operation results output by the convolution calculation units to obtain the convolution operation result.
5. The method according to claim 4, wherein the obtaining, according to each convolution calculation unit, a multiplication and addition operation for each data group in the data to be processed and the weight corresponding to each data group to obtain a multiplication and addition operation result output by each convolution calculation unit comprises:
in each convolution calculation unit, performing multiplication and addition operation on a first data group and the weight corresponding to the first data group to obtain a first multiplication and addition operation result;
triggering the convolution calculation unit to carry out multiply-add operation on a second data group, the weight corresponding to the second data group and the first multiply-add operation result according to first delay time sequence information to obtain a second multiply-add operation result;
and triggering the convolution calculation unit to carry out multiply-add operation on a third data group, the weight corresponding to the third data group and the second multiply-add operation result according to second delay time sequence information to obtain a third multiply-add operation result until the multiply-add operation result is obtained.
6. A convolutional data processing apparatus, comprising:
the data acquisition module is used for receiving data to be processed input into the integrated circuit chip;
the front-end circuit module is used for carrying out delay processing on the time sequence of the data to be processed to obtain delay time sequence information of the data to be processed;
the convolution operation module is used for sending the data to be processed to a convolution calculation unit for convolution operation according to the delay time sequence information of the data to be processed to obtain a corresponding convolution operation result;
the front-end circuit module is also used for determining a data group corresponding to the data to be processed; when the data to be processed comprises at least two data groups, sending a first data group in the data to be processed to a convolution calculation unit, and performing delay processing on the time sequence of the data groups except the first data group in the data to be processed through a front-end circuit to obtain first delay time sequence information of each data group except the first data group in the data to be processed; sending a second data group in the data to be processed to the convolution calculation unit according to the first delay time sequence information; performing delay processing on first delay time sequence information of each data group except the first data group and the second data group in the data to be processed to obtain second delay time sequence information of each data group except the first data group and the second data group in the data to be processed; and sending a third data group of the data to be processed to the convolution calculating unit according to the second delay time sequence information until a last data group in the data to be processed is sent to the convolution calculating unit.
7. A convolution operation accelerator comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 5.
8. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
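The dispatch scheme of claim 6 can be illustrated with a minimal software sketch. This is not the patented hardware itself, only an assumed behavioral model: the function name `dispatch_groups` and the cycle-counting abstraction are illustrative choices. The first data group is sent to the convolution unit immediately; every remaining group is held back one additional cycle per pass through the front-end delay, so the unit receives exactly one group per cycle.

```python
from collections import deque

def dispatch_groups(groups):
    """Model the cascaded delay: group i reaches the convolution unit at cycle i.

    Returns a schedule of (cycle, group) pairs.
    """
    pending = deque(groups)   # groups still held by the front-end circuit
    schedule = []
    cycle = 0
    while pending:
        # The head group is released this cycle; all groups behind it are
        # effectively delayed one more cycle (they stay in `pending`).
        schedule.append((cycle, pending.popleft()))
        cycle += 1
    return schedule

# Example: four data groups of the data to be processed
print(dispatch_groups(["g0", "g1", "g2", "g3"]))
# → [(0, 'g0'), (1, 'g1'), (2, 'g2'), (3, 'g3')]
```

Under this model, the convolution calculation unit is never idle and never receives two groups in the same cycle, which is the stated point of staggering the data groups with delayed timing information.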
CN202210434400.8A 2022-04-24 2022-04-24 Convolution data processing method and device, convolution operation accelerator and storage medium Active CN114528526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210434400.8A CN114528526B (en) 2022-04-24 2022-04-24 Convolution data processing method and device, convolution operation accelerator and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210434400.8A CN114528526B (en) 2022-04-24 2022-04-24 Convolution data processing method and device, convolution operation accelerator and storage medium

Publications (2)

Publication Number Publication Date
CN114528526A CN114528526A (en) 2022-05-24
CN114528526B true CN114528526B (en) 2022-08-02

Family

ID=81628205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210434400.8A Active CN114528526B (en) 2022-04-24 2022-04-24 Convolution data processing method and device, convolution operation accelerator and storage medium

Country Status (1)

Country Link
CN (1) CN114528526B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982529B (en) * 2022-12-14 2023-09-08 北京登临科技有限公司 Convolution operation structure, convolution operation array and related equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
WO2021179289A1 (en) * 2020-03-13 2021-09-16 深圳市大疆创新科技有限公司 Operational method and apparatus of convolutional neural network, device, and storage medium
CN113705795A (en) * 2021-09-16 2021-11-26 深圳思谋信息科技有限公司 Convolution processing method and device, convolution neural network accelerator and storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN111199273B (en) * 2019-12-31 2024-03-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium
US20210241070A1 (en) * 2020-02-03 2021-08-05 Qualcomm Incorporated Hybrid convolution operation

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
WO2021179289A1 (en) * 2020-03-13 2021-09-16 深圳市大疆创新科技有限公司 Operational method and apparatus of convolutional neural network, device, and storage medium
CN113705795A (en) * 2021-09-16 2021-11-26 深圳思谋信息科技有限公司 Convolution processing method and device, convolution neural network accelerator and storage medium

Also Published As

Publication number Publication date
CN114528526A (en) 2022-05-24

Similar Documents

Publication Publication Date Title
CN109416754B (en) Accelerator for deep neural network
CN106844294B (en) Convolution algorithm chip and communication equipment
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
CN1103951C (en) Device for executing self-timing algorithm and method thereof
WO2021056390A1 (en) Synchronous training method and cluster for convolutional neural network model, and readable storage medium
CN114528526B (en) Convolution data processing method and device, convolution operation accelerator and storage medium
US10409556B2 (en) Division synthesis
CN112836813B (en) Reconfigurable pulse array system for mixed-precision neural network calculation
US8397054B2 (en) Multi-phased computational reconfiguration
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
US20070233772A1 (en) Modular multiplication acceleration circuit and method for data encryption/decryption
Pietras Hardware conversion of neural networks simulation models for neural processing accelerator implemented as FPGA-based SoC
Pham et al. High performance multicore SHA-256 accelerator using fully parallel computation and local memory
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
WO2023134507A1 (en) Stochastic calculation method, circuit, chip, and device
Liu et al. HiKonv: High throughput quantized convolution with novel bit-wise management and computation
US11429850B2 (en) Performing consecutive mac operations on a set of data using different kernels in a MAC circuit
CN116502691A (en) Deep convolutional neural network mixed precision quantization method applied to FPGA
Spagnolo et al. Designing fast convolutional engines for deep learning applications
CN104808086B (en) A kind of AD analog input cards and acquisition method with adaptation function
Wu et al. A novel low-communication energy-efficient reconfigurable CNN acceleration architecture
US20230367991A1 (en) Method and processing unit for generating an output feature map
US11842169B1 (en) Systolic multiply delayed accumulate processor architecture
CN112346703B (en) Global average pooling circuit for convolutional neural network calculation
Sudrajat et al. GEMM-Based Quantized Neural Network FPGA Accelerator Design

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant