CN116009813A - Data processing method and device and storage medium - Google Patents


Info

Publication number
CN116009813A
Authority
CN
China
Prior art keywords
data
computing unit
convolution kernel
mode
unit array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111224717.0A
Other languages
Chinese (zh)
Inventor
孙炜
祝叶华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111224717.0A priority Critical patent/CN116009813A/en
Publication of CN116009813A publication Critical patent/CN116009813A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

Embodiments of the present application provide a data processing method, a data processing apparatus, and a storage medium. A computing-unit array is deployed in the data processing apparatus, and each computing unit in the array consists of a group of multipliers and one addition register connected to that group; each multiplier multiplies one convolution kernel value by one input feature value, and the addition register accumulates and stores the group of products output by the multipliers. Each group of computing units along the horizontal direction of the array propagates input feature data in a systolic fashion; convolution kernel data are loaded into each computing unit in the array; and the array performs accumulation and output operations through the addition registers of at least one computing unit along the vertical direction.

Description

Data processing method and device and storage medium
Technical Field
The present disclosure relates to the field of deep learning, and in particular, to a data processing method and apparatus, and a storage medium.
Background
Current special-purpose accelerators for the artificial intelligence field make a deliberate trade-off between energy-efficiency ratio and flexibility when the overall architecture is defined. Convolution, the operation with the largest share of computation in artificial intelligence algorithms, is completed by a dedicated convolution acceleration unit, while the remaining operations with smaller shares are completed by digital signal processing (Digital Signal Processing, DSP) hardware, which is highly flexible but expensive. A convolution acceleration unit is typically architected around the parameters of a particular convolution layer, and the neural network is then run on that fixed design.
However, the parameters of different convolution layers in a neural network differ. When such a fixed hardware architecture performs the convolution computations of different layers, its computing units cannot be fully utilized, which leads to low resource utilization.
Disclosure of Invention
Embodiments of the present application provide a data processing method, a data processing apparatus, and a storage medium, which make full use of the computing units in a hardware architecture and improve resource utilization.
The technical scheme of the application is realized as follows:
In a first aspect, an embodiment of the present application proposes a data processing apparatus in which a computing-unit array is deployed, each computing unit in the array consisting of a group of multipliers and one addition register connected to that group; each multiplier multiplies one convolution kernel value by one input feature value, and the addition register accumulates and stores the group of products output by the multipliers;
each group of computing units along the horizontal direction of the array propagates input feature data in a systolic fashion, and convolution kernel data are loaded into each computing unit in the array;
the array performs accumulation and output operations through the addition registers of at least one computing unit along the vertical direction.
In a second aspect, an embodiment of the present application proposes a data processing method, applied to the above data processing apparatus, where the method includes:
acquiring algorithm layer parameters and a data scheduling mode of a first algorithm layer in a neural network model;
determining a loading mode of input characteristic data and convolution kernel data on a computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode;
according to the loading mode of the input characteristic data and the convolution kernel data on the computing unit array, loading the input characteristic data and the convolution kernel data into the computing unit array, and performing multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data;
and processing the multiply-accumulate data based on the data output mode to obtain output characteristic data.
In a third aspect, an embodiment of the present application proposes a data processing apparatus, the apparatus including:
the acquisition unit is used for acquiring algorithm layer parameters and a data scheduling mode of a first algorithm layer in the neural network model;
The determining unit is used for determining a loading mode of input characteristic data and convolution kernel data on the computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode;
the loading unit is used for loading the input feature data and the convolution kernel data into the computing unit array according to the loading mode of the input feature data and the convolution kernel data on the computing unit array, and performing a multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data;
and the processing unit is used for processing the multiply-accumulate data based on the data output mode to obtain output characteristic data.
In a fourth aspect, an embodiment of the present application proposes a data processing apparatus, the apparatus including: a processor, a memory, and a communication bus; the processor implements the data processing method as described above when executing the running program stored in the memory.
In a fifth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as described above.
Embodiments of the present application provide a data processing method, a data processing apparatus, and a storage medium. A computing-unit array is deployed in the data processing apparatus, and each computing unit in the array consists of a group of multipliers and one addition register connected to that group; each multiplier multiplies one convolution kernel value by one input feature value, and the addition register accumulates and stores the group of products output by the multipliers. Each group of computing units along the horizontal direction of the array propagates input feature data in a systolic fashion; convolution kernel data are loaded into each computing unit in the array; and the array performs accumulation and output operations through the addition registers of at least one computing unit along the vertical direction. With this scheme, every computing unit can perform multiply-accumulate and storage operations, so the data processing apparatus can flexibly configure different data flow directions according to different algorithm layer parameters; the computing units can therefore be fully utilized and resource utilization is improved.
Drawings
FIG. 1 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an exemplary computing unit according to an embodiment of the present application;
FIG. 3 is a first schematic diagram of an exemplary data flow in a computing unit array according to an embodiment of the present application;
FIG. 4 is a second schematic diagram of an exemplary data flow in a computing unit array according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a multiplexer according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an accumulation output unit according to an embodiment of the present application;
FIG. 7 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 8 is a first schematic diagram of an exemplary output feature data multiplexing process according to an embodiment of the present application;
FIG. 9 is a second schematic diagram of an output feature data multiplexing process according to an embodiment of the present application;
FIG. 10 is a first schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 11 is a second schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
An embodiment of the present application provides a data processing apparatus 1. As shown in fig. 1, a computing-unit array 10 is arranged in the data processing apparatus 1; each computing unit 10 in the array is composed of a set of multipliers 100 and one addition register 101 connected to that set; each multiplier 100 multiplies one convolution kernel value by one input feature value, and the addition register 101 accumulates and stores the set of products output by the set of multipliers 100;
each group of computing units 10 along the horizontal direction propagates input feature data in a systolic fashion, and convolution kernel data are loaded into each computing unit in the array;
the array of computing units 10 performs accumulation and output operations in the addition register of at least one computing unit along the vertical direction.
The data processing device provided by the embodiment of the application is deployed with a computing unit array, and the neural network model can run on the computing unit array.
In this embodiment of the present application, the neural network model may be a convolutional neural network (Convolutional Neural Networks, CNN), a deep neural network (Deep Neural Networks, DNN), a recurrent neural network (Recurrent Neural Network, RNN), or the like, selected according to the actual situation; the embodiment of the present application is not specifically limited in this respect.
In this embodiment of the present application, the hardware architecture of the data processing apparatus takes the form of a computing-unit array. By way of example, the array is formed by 16 computing units in the horizontal direction and 4 computing units in the vertical direction. It should be noted that this is only an optional embodiment; the numbers of computing units in the horizontal and vertical directions may be determined by the total number of computing units and the shape of the hardware architecture, and the embodiment of the present application is not specifically limited in this respect.
It should be noted that each computing unit in the computing-unit array provided in this embodiment includes two parts: the first part is a group of multipliers and the second part is an addition register. The group may contain two multipliers, three multipliers, and so on, selected according to the actual situation; the embodiment of the present application does not specifically limit this. A corresponding number of input feature values and convolution kernel values are loaded into the group of multipliers according to the number of multipliers, each multiplier multiplying one input feature value by one convolution kernel value. The group of products computed by the multipliers is then fed into the addition register, which accumulates the products and stores the accumulated result.
For example, as shown in fig. 2, the computing-unit array is a 16×4 array in which each computing unit is composed of two multipliers and one addition register. Each multiplier receives one convolution kernel value W and one input feature value F, the two multipliers feed their products into the addition register, and the addition register sums them to obtain multiply-accumulate data. In the 16×4 array of fig. 2, circles represent adders and the ellipses connected to them represent multipliers.
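The multiply-accumulate behavior of a single computing unit can be sketched as follows. This is an illustrative model of the two-multiplier configuration of fig. 2; the class and method names are assumptions, not part of the patent.

```python
class ComputeUnit:
    """One cell of the array: a pair of multipliers feeding one addition register."""

    def __init__(self, num_multipliers=2):
        self.num_multipliers = num_multipliers
        self.acc = 0  # the addition register: accumulates across cycles

    def step(self, weights, features):
        # Each multiplier takes one convolution-kernel value W and one
        # input-feature value F; the products are summed into the register.
        assert len(weights) == len(features) == self.num_multipliers
        self.acc += sum(w * f for w, f in zip(weights, features))
        return self.acc


pe = ComputeUnit()
pe.step([1, 2], [3, 4])  # 1*3 + 2*4 = 11
pe.step([5, 6], [7, 8])  # 11 + 5*7 + 6*8 = 94
```

Because the register keeps its running sum between cycles, partial products for successive channels of the same kernel can be folded into one result without leaving the unit.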
It can be understood that, since each computing unit in the computing-unit array can perform multiply-accumulate and output operations, the array can implement different data scheduling modes and configure different data flows according to different algorithm parameters.
In the embodiment of the application, input feature data is propagated into the computing-unit array sequentially along the horizontal direction in a systolic fashion, where the systolic pattern is determined by the number of convolution kernels of the first algorithm layer and the size information of the input feature data.
For example, as shown in fig. 3, suppose there are 8 convolution kernels K0-K7 and the input feature data is small. The 16×4 computing array is then divided into 8 groups, each containing two columns of computing units that share one kernel's data. For the two columns of the K0 group, the input feature data corresponds to two window positions of the input feature map, such as the first window and the second window in fig. 3; in one cycle the K0 group can produce the multiply-add result of the first column with the first window of K0 and the multiply-add result of the second column with the second window of K0. Likewise, for the two columns of the K1 group, in one cycle the group can produce the multiply-add result of the first column with the first window of K1 and of the second column with the second window of K1, and so on. The data of the first window then pulses through the 16×4 array from right to left along the first, third, fifth, seventh, ninth, eleventh, thirteenth, and fifteenth columns, and the data of the second window pulses from right to left along the second, fourth, sixth, eighth, tenth, twelfth, fourteenth, and sixteenth columns.
For example, as shown in fig. 4, when the number of convolution kernels decreases and the input feature map grows, the grouping of the computing units can be changed. With four convolution kernels K0-K3, the 16×4 computing array is divided into 4 groups, each corresponding to one kernel and containing four columns of computing units that share that kernel's data. For the four columns of the K0 group, the input feature data corresponds to four window positions of the feature map in the X direction, such as the first, second, third, and fourth windows in fig. 4; in one cycle the K0 group can produce the multiply-add results of the first column with the first window of K0, the second column with the second window of K0, the third column with the third window of K0, and the fourth column with the fourth window of K0. Likewise, for the four columns of the K1 group, whose input feature data corresponds to four windows of the feature map in the X direction, one cycle can produce the multiply-add results of the four columns with the four windows of K1, and so on.
The data of the first window then pulses through the 16×4 array from right to left along the first, fifth, ninth, and thirteenth columns; the second window along the second, sixth, tenth, and fourteenth columns; the third window along the third, seventh, eleventh, and fifteenth columns; and the fourth window along the fourth, eighth, twelfth, and sixteenth columns.
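The grouping rule common to the two examples can be summarized in a short sketch: the 16 columns are divided evenly among the kernels, and within a group each column carries a different sliding window. The function name and the 0-based right-to-left column indexing are illustrative assumptions.

```python
def column_layout(num_columns, num_kernels):
    """Map each column index (0 = rightmost) to (kernel index, window index)."""
    cols_per_group = num_columns // num_kernels
    return [(c // cols_per_group, c % cols_per_group) for c in range(num_columns)]


# Fig. 3 case: 8 kernels K0-K7, two columns (= two windows) per group.
layout8 = column_layout(16, 8)   # columns 0,1 -> K0; columns 2,3 -> K1; ...

# Fig. 4 case: 4 kernels K0-K3, four columns (= four windows) per group.
layout4 = column_layout(16, 4)   # columns 0..3 -> K0; columns 4..7 -> K1; ...
```

Columns with the same window index form one systolic chain, which is why the first window in fig. 3 visits every odd-numbered column while in fig. 4 it visits every fourth column.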
In the embodiment of the application, convolution kernel data is also loaded into each computing unit in the array, with each computing unit loading kernel data according to the number of convolution kernels of the first algorithm layer and the size information of the input feature data.
It should be noted that for the two data scheduling modes of weight data multiplexing and input feature data multiplexing, the systolic pattern of the input feature data in the horizontal direction of the array and the grouping pattern of the convolution kernel data in the horizontal direction can both be determined from the number of convolution kernels of the first algorithm layer and the size information of the input feature data.
For example, referring to fig. 3, in the 16×4 computing array, in order from right to left, the first and second columns of computing units share the K0 kernel data, so the multi-channel data of kernel K0 is input sequentially into the first and second columns; the third and fourth columns share the K1 kernel data, so the multi-channel data of kernel K1 is input sequentially into the third and fourth columns; the fifth and sixth columns share the K2 kernel data, so the multi-channel data of kernel K2 is input sequentially into the fifth and sixth columns; and so on, until the multi-channel data of kernel K7 is input sequentially into the fifteenth and sixteenth columns.
For example, referring to fig. 4, in the 16×4 computing array, in order from right to left, the first to fourth columns of computing units share the K0 kernel data, so the multi-channel data of kernel K0 is input sequentially into the first to fourth columns; the fifth to eighth columns share the K1 kernel data, so the multi-channel data of kernel K1 is input sequentially into the fifth to eighth columns; the ninth to twelfth columns share the K2 kernel data, so the multi-channel data of kernel K2 is input sequentially into the ninth to twelfth columns; and the thirteenth to sixteenth columns share the K3 kernel data, so the multi-channel data of kernel K3 is input sequentially into the thirteenth to sixteenth columns.
It should be noted that for the output feature data multiplexing mode, two distribution patterns of the convolution kernel data can be determined from the number of convolution kernels of the first algorithm layer and the size information of the input feature data. In the first pattern, the number of convolution kernels is large and the input feature data is small: one kernel's data can be loaded into each computing unit, and the input feature data is then fed sequentially into each computing unit for convolution. In the second pattern, the number of convolution kernels is small and the input feature data is large: the same plurality of kernel data can be loaded into each row of computing units, and one input feature value is then fed into each row of computing units for convolution.
In the embodiment of the present application, if the data scheduling mode of the first algorithm layer is weight data multiplexing or input feature data multiplexing, a plurality of computing units along the vertical direction sequentially process the convolution kernel data of a group of channels of a kernel together with the input feature data of the corresponding group of channels, and the multiply-accumulate feature data generated in each round is accumulated and output in the plurality of addition registers along the vertical direction. The number of these computing units is determined by the number of channels and the convolution kernel size information of the first algorithm layer, together with the number of multipliers in one computing unit. For example, with 4 channels, 3×3 convolution kernels, and two multipliers per computing unit, 4×3×3/2 = 18 computing units are used in the vertical direction.
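The vertical-unit count implied by this rule can be written as a small sketch; the function name is an assumption, and the ceiling division guards the case where the multiply-accumulate count is not a multiple of the multipliers per unit.

```python
def vertical_units(channels, kernel_h, kernel_w, multipliers_per_unit=2):
    """Computing units needed along the vertical (channel) direction."""
    macs = channels * kernel_h * kernel_w   # multiplications per output point
    # ceiling division: -(-a // b) rounds up without importing math
    return -(-macs // multipliers_per_unit)


vertical_units(4, 3, 3)  # 4*3*3 / 2 = 18, matching the patent's example
```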
In the embodiment of the present application, if the data scheduling mode of the first algorithm layer is output feature data multiplexing, each computing unit in the array processes the convolution kernel data of a group of channels of a kernel together with the input feature data of the corresponding group of channels, and the addition register of each computing unit accumulates the output feature data over multiple rounds and then outputs it. In this case each computing unit can output one final output feature value.
Optionally, as shown in fig. 5, the data processing apparatus 1 further includes a multiplexer 11 used to select the data source of the input feature data for each computing unit 10. The multiplexer 11 includes one output 110 and at least one input 111; the output 110 of the multiplexer 11 is connected to a multiplier 100 of one computing unit, and the at least one input 111 of the multiplexer 11 is connected to the multipliers 100 of at least one other computing unit, namely computing units that lie in the same horizontal direction as that computing unit, receive input feature data in the same order, and perform their operations earlier in the sequence.
The number of the at least one input is determined by the systolic pattern of the input feature data in the horizontal direction. Referring to fig. 3, the data of the first window of the input feature data pulses along the first, third, fifth, seventh, ninth, eleventh, thirteenth, and fifteenth columns. For the third column, the input of its multiplexer is connected to a multiplier of the first column and the output to a multiplier of the third column; for the fifth column, the inputs of its multiplexer are connected to a multiplier of the first column and a multiplier of the third column, and the output to a multiplier of the fifth column. That is, the input feature data of the fifth column may pulse in from either the first column or the third column.
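Under the fig. 3 flow, the set of legal multiplexer sources for a column can be sketched as follows. This is illustrative: column indices are 0-based counted from the right, and `stride` is the number of windows per group (2 in fig. 3, 4 in fig. 4).

```python
def upstream_sources(column, stride):
    """0-based indices of earlier columns carrying the same window,
    i.e. the candidate multiplexer inputs for `column`."""
    return list(range(column % stride, column, stride))


upstream_sources(4, 2)   # fifth column: may receive from first or third column
upstream_sources(2, 2)   # third column: may receive only from the first column
```

The number of list entries is exactly the number of multiplexer inputs the column needs, which is how the systolic pattern fixes the multiplexer fan-in.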
Optionally, the data processing apparatus further includes an accumulation output unit 12, which accumulates the feature data stored in at least one addition register 101 along the vertical direction of the computing-unit array and generates output feature data.
The accumulation output unit 12 includes multiple layers of addition registers 120. The two inputs of each bottom-layer addition register in the multi-layer structure 120 are connected to the addition registers 101 of two computing units 10 in the vertical direction, and the two inputs of each addition register in the other layers are connected to the outputs of two addition registers in the next layer down;
the output of each addition register in the multi-layer structure, as well as at least one addition register, can also produce output result data directly.
In this embodiment, referring to fig. 6, one accumulation output unit is provided for a column of 4 computing units, where P3, P4, P5, and P6 each represent the accumulation result of one computing unit, P0 and P1 each represent the accumulation result of two multiply-accumulate units, and P2 represents the accumulation result of all four multiply-accumulate units, so that the computing-unit array can adapt to different numbers of multiply-accumulate operations along the input-channel direction.
It should be noted that the accumulation output unit provided in the embodiment of the present application may let each computing unit output its data separately, or may accumulate and output a preset number of computing units along the vertical direction according to the number of channels of the convolution kernel.
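The tap structure of fig. 6 can be sketched as a tiny adder tree; the function name is illustrative, and the P0-P6 labels follow the figure's description.

```python
def adder_tree(partials):
    """Tap points of a binary adder tree over one column of 4 computing units.

    `partials` are the per-unit accumulation results P3, P4, P5, P6;
    returns the taps (P0, P1, P2): the two pairwise sums and the
    full-column sum, each of which can be read out independently.
    """
    p3, p4, p5, p6 = partials
    p0 = p3 + p4        # sum over the first pair of units
    p1 = p5 + p6        # sum over the second pair of units
    p2 = p0 + p1        # sum over the whole column
    return p0, p1, p2


adder_tree([1, 2, 3, 4])  # -> (3, 7, 10)
```

Reading P3-P6 directly corresponds to per-unit output, while P0/P1 and P2 correspond to accumulating two or four units, which is what lets the same hardware match different channel counts.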
It can be understood that each computing unit can perform multiply-accumulate and store operations, so that the data processing device can flexibly perform configuration of different data flow directions according to different algorithm layer parameters, so that the computing units can be fully utilized, and the resource utilization rate is improved.
Based on the above data processing apparatus, the embodiment of the present application further provides a data processing method, as shown in fig. 7, where the method may include:
s101, acquiring algorithm layer parameters and a data scheduling mode of a first algorithm layer in a neural network model.
The data processing method provided by the embodiment of the application is suitable for a scene of running a neural network model in a computing unit array.
In this embodiment of the present application, each algorithm layer in the neural network model has its own algorithm layer parameters, and the data scheduling mode suited to the first algorithm layer differs according to those parameters. As the above embodiment shows, the computing-unit array running the neural network model can implement different data scheduling modes, so the algorithm layer parameters and data scheduling mode of the first algorithm layer are acquired before the array implements the layer's algorithm logic.
In this embodiment of the present application, the first algorithm layer may be a neural network layer such as a convolutional layer, and specifically may be selected according to actual situations, which is not specifically limited in this embodiment of the present application.
In this embodiment of the present application, the algorithm layer parameters of the first algorithm layer include parameters such as the number of convolution kernels, input feature size information, the number of channels, and convolution kernel size information, which may be specifically selected according to actual situations, and the embodiment of the present application is not specifically limited.
The input feature size information includes the width and height of the input feature, and the convolution kernel size information includes the width and height of the convolution kernel.
In this embodiment of the present application, the data scheduling manner may be input feature data multiplexing, weight data multiplexing, or output feature data multiplexing, which is specifically selected according to actual situations, and the embodiment of the present application is not specifically limited.
It can be understood that configuring different data flow directions for the input feature data and convolution kernel data, according to the algorithm layer parameters and data scheduling mode of each algorithm layer, reduces the execution time of the algorithm and thus improves overall computing performance.
S102, determining a loading mode of input characteristic data and convolution kernel data on a computing unit array and a corresponding result output mode of the computing unit array according to algorithm layer parameters and a data scheduling mode.
In an alternative embodiment, if the data scheduling mode is input feature data multiplexing or weight data multiplexing, the systolic pattern of the input feature data in the horizontal direction of the array and the grouping pattern of the convolution kernel data in the horizontal direction are determined from the number of convolution kernels and the input feature size information; the first number of computing units that perform accumulation in each vertical direction of the array is determined from the number of channels and the convolution kernel size information; the systolic pattern and the grouping pattern are taken as the loading mode of the input feature data and convolution kernel data on the array; and accumulating and outputting every first number of computing units along the vertical direction is taken as the result output mode.
For example, as shown in fig. 3, if there are 8 convolution kernels K0-K7 and the size of the input feature data is small, the 16×4 computing array is divided into 8 groups, each group includes two columns of computing units, and the two columns share one set of convolution kernel data. In this case, the input feature data of two windows is input at a time: the data of the first window pulses from right to left through the first, third, fifth, seventh, ninth, eleventh, thirteenth and fifteenth columns of the computing unit array, and the data of the second window pulses from right to left through the second, fourth, sixth, eighth, tenth, twelfth, fourteenth and sixteenth columns.
As shown in fig. 4, if there are 4 convolution kernels and the size of the input feature data is small, the 16×4 computing array is divided into 4 groups, each group includes four columns of computing units, and the four columns share one set of convolution kernel data. In this case, the input feature data of four windows is input at a time: the data of the first window pulses from right to left through the first, fifth, ninth and thirteenth columns of the computing unit array, the data of the second window through the second, sixth, tenth and fourteenth columns, the data of the third window through the third, seventh, eleventh and fifteenth columns, and the data of the fourth window through the fourth, eighth, twelfth and sixteenth columns.
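The column interleaving in the two examples above can be sketched as a small mapping function. This is an illustrative reconstruction, not the patent's implementation; the function name and the 1-based column numbering are assumptions chosen to match the text.

```python
# Hypothetical sketch of the column grouping described above: a 16-column
# array is split into groups that share one convolution kernel, and the
# columns visited by each window interleave across the groups.
def window_columns(num_columns, num_groups):
    """Map each input window to the (1-based) columns it pulses through.

    With `num_groups` groups, each group spans `num_columns // num_groups`
    adjacent columns, and window w uses the w-th column of every group.
    """
    group_width = num_columns // num_groups  # windows fed per cycle
    return {
        w: [g * group_width + w + 1 for g in range(num_groups)]
        for w in range(group_width)
    }

# 8 kernels -> 8 groups of 2 columns: window 0 uses the odd columns and
# window 1 the even columns, matching the first example above.
print(window_columns(16, 8)[0])  # [1, 3, 5, 7, 9, 11, 13, 15]
print(window_columns(16, 8)[1])  # [2, 4, 6, 8, 10, 12, 14, 16]
# 4 kernels -> 4 groups of 4 columns, matching the second example.
print(window_columns(16, 4)[0])  # [1, 5, 9, 13]
```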
The above two examples describe the process of determining the data pulsation mode of the input feature data in the horizontal direction of the computing unit array and the grouping mode of the convolution kernel data in the horizontal direction of the computing unit array based on the number of convolution kernels and the input feature size information. Next, for the process of determining the first number of computing units performing the accumulation operation in each vertical direction of the computing unit array based on the number of channels and the convolution kernel size information: the number of channels multiplied by the convolution kernel size is divided by the number of multipliers in one computing unit. For example, with 4 channels, a 3×3 convolution kernel, and two multipliers in one computing unit, 4×3×3/2=18 computing units are used in the vertical direction.
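The "first number" computation just described can be written out directly; a ceiling division is assumed here (my addition, not stated in the text) so that a channel/kernel combination that does not divide evenly still gets a whole computing unit.

```python
# A minimal sketch of the "first number" computation described above:
# channels * kernel_h * kernel_w multiplications are spread across
# computing units that each hold `multipliers` multipliers.
def units_per_column(channels, kernel_h, kernel_w, multipliers):
    total_products = channels * kernel_h * kernel_w
    # Ceiling division: leftover products still occupy a computing unit.
    return -(-total_products // multipliers)

print(units_per_column(4, 3, 3, 2))  # 18, as in the example above
```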
In another alternative embodiment, if the data scheduling mode is output feature data multiplexing, the data flow direction of the input feature data to each horizontal computing unit of the computing unit array and the distribution mode of the convolution kernel data in the computing unit array are determined according to the number of convolution kernels and the input feature size information; the data flow direction and the distribution mode are determined as the loading mode of the input feature data and the convolution kernel data on the computing unit array; and accumulating and outputting the multi-round multiply-accumulate data of each computing unit is determined as the result output mode of the computing unit array.
In this embodiment of the present application, if the number of convolution kernels is large and the input feature size is small, one piece of convolution kernel data is loaded in each computing unit, and the window data of the input feature data is sequentially input into the computing unit array in order from right to left. As illustrated in fig. 8, when the input feature data size is small, one convolution kernel is loaded in each computing unit of the computing unit array, and the window data of the first window of the input feature data is input into the computing unit array in order from right to left.
If the number of convolution kernels is small and the input feature size is large, the same set of convolution kernel data is loaded in each row of computing units, while different window data of the input feature data are input into the respective rows. As shown in fig. 9, for example, with 16 convolution kernels K0-K15 and a large input feature data size, the 16 convolution kernels K0-K15 are loaded in each row of computing units; the window data of the first window of the input feature data is input into the first row of computing units, the window data of the second window into the second row, the window data of the third window into the third row, and the window data of the fourth window into the fourth row, each in order from right to left.
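The two kernel-distribution cases above can be contrasted in a short sketch. The function and the flag name are illustrative assumptions, not the patent's terminology; the grid simply records which kernel id each computing unit holds.

```python
# Hypothetical sketch of the two distribution modes described above for
# output-feature-data multiplexing (names are illustrative only).
def distribute_kernels(kernels, rows, cols, many_kernels):
    """Return a rows x cols grid of kernel ids loaded into the array."""
    if many_kernels:
        # Many kernels, small input: each unit holds its own kernel.
        return [[r * cols + c for c in range(cols)] for r in range(rows)]
    # Few kernels, large input: every row holds the same kernel sequence,
    # and each row receives a different window of the input feature data.
    return [[kernels[c % len(kernels)] for c in range(cols)]
            for r in range(rows)]

# 16 kernels K0-K15 on a 4x16 array: all four rows share the kernels.
grid = distribute_kernels(list(range(16)), 4, 16, many_kernels=False)
print(grid[0] == grid[1] == grid[2] == grid[3])  # True
```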
It can be understood that configuring the data flow directions of the input feature data and the convolution kernel data according to the number of convolution kernels, the number of channels, the input feature data size and the convolution kernel size of the first algorithm layer can reduce the number of memory accesses and improve the MAC utilization rate, thereby reducing power consumption.
S103, according to the loading mode of the input characteristic data and the convolution kernel data on the computing unit array, loading the input characteristic data and the convolution kernel into the computing unit array, and performing multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data.
In the embodiment of the application, after determining the loading mode of the input feature data and the convolution kernel data on the computing unit array, the input feature data and the convolution kernel data are loaded into the computing unit array according to this loading mode, and the multiply-accumulate operation is performed on the computing unit array to obtain multiply-accumulate data.
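The multiply-accumulate performed inside one computing unit — a group of multipliers feeding one addition register — can be sketched as follows. This is a rough software analogue of the described hardware, not the patent's circuit.

```python
# A rough sketch (not the patent's hardware) of the multiply-accumulate
# one computing unit performs: a group of multipliers whose products are
# accumulated into a single addition register.
class ComputingUnit:
    def __init__(self):
        self.acc = 0  # the addition register

    def mac(self, weights, activations):
        # One round: multiply pairwise, then accumulate into the register.
        self.acc += sum(w * a for w, a in zip(weights, activations))
        return self.acc

unit = ComputingUnit()
unit.mac([1, 2], [3, 4])         # 1*3 + 2*4 = 11
print(unit.mac([5, 6], [1, 1]))  # 11 + 5 + 6 = 22
```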
S104, processing the multiply-accumulate data based on a result output mode to obtain output characteristic data.
In the embodiment of the application, after determining the result output mode corresponding to the computing unit array according to the algorithm layer parameters and the data scheduling mode, the multiply-accumulate data is processed based on the result output mode to obtain the output characteristic data.
In an alternative embodiment, if the data scheduling mode is weight data multiplexing or input feature data multiplexing, multiply-accumulate data output by the computing unit in the vertical direction is accumulated to obtain output feature data.
In another alternative embodiment, if the data scheduling mode is output characteristic data multiplexing, each computing unit accumulates the multiple-round multiply-accumulate data therein to directly obtain the output characteristic data.
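The contrast between the two result-output modes in the preceding paragraphs can be illustrated with two small helper functions. These are illustrative sketches; the function names are my own, not the patent's.

```python
# Illustrative contrast of the two result-output modes: with weight or
# input-feature multiplexing, partial sums from the units stacked in one
# vertical column are added together; with output-feature multiplexing,
# each unit accumulates its own multi-round data internally.
def vertical_output(column_partials):
    # column_partials: multiply-accumulate data from each unit in a column.
    return sum(column_partials)

def output_multiplexed(rounds_per_unit):
    # Each unit adds its rounds internally; no cross-unit adder is needed.
    return [sum(rounds) for rounds in rounds_per_unit]

print(vertical_output([3, 5, 7]))            # 15
print(output_multiplexed([[1, 2], [3, 4]]))  # [3, 7]
```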
It can be understood that each computing unit can perform multiply-accumulate and store operations, so that the data processing device can flexibly perform configuration of different data flow directions according to different algorithm layer parameters and data scheduling modes, so that the computing units can be fully utilized, and the resource utilization rate is improved.
Based on the above data processing method, the embodiment of the present application further provides a data processing apparatus 1. As shown in fig. 10, the data processing apparatus 1 includes:
the acquiring unit 13 is used for acquiring algorithm layer parameters and a data scheduling mode of a first algorithm layer in the neural network model;
the determining unit 14 is configured to determine a loading mode of the input feature data and the convolution kernel data on the computing unit array and a result output mode corresponding to the computing unit array according to the algorithm layer parameter and the data scheduling mode;
the loading unit 15 is configured to load the input feature data and the convolution kernel into the computing unit array according to a loading manner of the input feature data and the convolution kernel data on the computing unit array, and perform a multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data;
And the processing unit 16 is used for processing the multiply-accumulate data based on the result output mode to obtain output characteristic data.
Optionally, the algorithm layer parameters of the first algorithm layer include a number of convolution kernels, input feature size information, a number of channels, and convolution kernel size information.
Optionally, the determining unit 14 is further configured to: if the data scheduling manner is input feature data multiplexing or weight data multiplexing, determine a data pulsation manner of the input feature data in the horizontal direction of the computing unit array and a grouping manner of the convolution kernel data in the horizontal direction of the computing unit array according to the number of convolution kernels and the input feature size information; determine a first number of computing units performing the accumulation operation in each vertical direction of the computing unit array according to the number of channels and the convolution kernel size information; determine the data pulsation manner and the grouping manner as the loading manner of the input feature data and the convolution kernel data on the computing unit array; and determine accumulating and outputting the results of the first number of computing units in each vertical direction as the result output mode.
Optionally, the determining unit 14 is further configured to: if the data scheduling manner is output feature data multiplexing, determine a data flow direction of the input feature data to each horizontal computing unit of the computing unit array and a distribution manner of the convolution kernel data in the computing unit array according to the number of convolution kernels and the input feature size information; determine the data flow direction and the distribution manner as the loading manner of the input feature data and the convolution kernel data on the computing unit array; and determine accumulating and outputting the multi-round multiply-accumulate data of each computing unit as the result output mode of the computing unit array.
The data processing device acquires algorithm layer parameters and a data scheduling mode of a first algorithm layer in a neural network model; determines a loading mode of input feature data and convolution kernel data on a computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode; loads the input feature data and the convolution kernel data into the computing unit array according to the loading mode, and performs a multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data; and processes the multiply-accumulate data based on the result output mode to obtain output feature data. Therefore, in the data processing device provided by this embodiment, each computing unit can perform multiply-accumulate and store operations, so that the data processing device can flexibly configure different data flow directions according to different algorithm layer parameters, the computing units can be fully utilized, and the resource utilization rate is improved.
Fig. 11 is a schematic diagram of a second component structure of the data processing apparatus 1 according to the embodiment of the present application, in practical application, based on the same disclosure concept of the above embodiment, as shown in fig. 11, the data processing apparatus 1 of the present embodiment includes: a processor 17, a memory 18 and a communication bus 19.
In a specific embodiment, the acquiring unit 13, the determining unit 14, the loading unit 15, and the processing unit 16 may be implemented by a processor 17 located on the apparatus 1, where the processor 17 may be at least one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a CPU, a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronic device for implementing the above processor functions may be different for different apparatuses, and the present embodiment is not particularly limited.
In the embodiment of the present application, the above-mentioned communication bus 19 is used to implement connection communication between the processor 17 and the memory 18; the processor 17 executes an operation program stored in the memory 18 to realize the following data processing method:
Acquiring algorithm layer parameters and a data scheduling mode of a first algorithm layer in a neural network model; determining a loading mode of input feature data and convolution kernel data on a computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode; loading the input feature data and the convolution kernel data into the computing unit array according to the loading mode, and performing a multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data; and processing the multiply-accumulate data based on the result output mode to obtain output feature data.
Further, the algorithm layer parameters of the first algorithm layer include the number of convolution kernels, input feature size information, the number of channels, and convolution kernel size information.
Further, the above processor 17 is further configured to: if the data scheduling manner is input feature data multiplexing or weight data multiplexing, determine a data pulsation manner of the input feature data in the horizontal direction of the computing unit array and a grouping manner of the convolution kernel data in the horizontal direction of the computing unit array according to the number of convolution kernels and the input feature size information; determine a first number of computing units performing the accumulation operation in each vertical direction of the computing unit array according to the number of channels and the convolution kernel size information; determine the data pulsation manner and the grouping manner as the loading manner of the input feature data and the convolution kernel data on the computing unit array; and determine accumulating and outputting the results of the first number of computing units in each vertical direction as the result output mode.
Further, the above processor 17 is further configured to: if the data scheduling manner is output feature data multiplexing, determine a data flow direction of the input feature data to each horizontal computing unit of the computing unit array and a distribution manner of the convolution kernel data in the computing unit array according to the number of convolution kernels and the input feature size information; determine the data flow direction and the distribution manner as the loading manner of the input feature data and the convolution kernel data on the computing unit array; and determine accumulating and outputting the multi-round multiply-accumulate data of each computing unit as the result output mode of the computing unit array.
The embodiment of the application provides a storage medium on which a computer program is stored. The computer-readable storage medium stores one or more programs, the one or more programs being executable by one or more processors and applied to a data processing apparatus; when executed, the computer program implements the data processing method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present disclosure may be embodied essentially or in a part contributing to the related art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing an image display device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present disclosure.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application.

Claims (15)

1. A data processing apparatus, wherein an array of computational units is arranged in the data processing apparatus, each computational unit in the array of computational units consisting of a set of multipliers and an addition register connected to the set of multipliers; wherein, a multiplier is used for multiplying a convolution kernel data and an input characteristic data, and an addition register is used for accumulating and storing a group of data output by a group of multipliers;
Each group of computing units of the computing unit array in the horizontal direction transmits input characteristic data in a data pulsation mode; loading convolution kernel data in each computing unit in the array of computing units;
the computing unit array performs an accumulation and output operation along at least one addition register of at least one computing unit in a vertical direction.
2. The apparatus of claim 1, wherein the data processing apparatus further comprises: a multiplexer for selecting a data source of the input characteristic data in each computing unit, the multiplexer including an output end and at least one input end; the output end of the multiplexer is connected with a multiplier of one computing unit, the at least one input end of the multiplexer is connected with the multipliers of at least one computing unit, and the at least one computing unit is a computing unit which is in the same horizontal direction as the one computing unit, receives the same input characteristic data, and has an execution order earlier than that of the one computing unit.
3. The apparatus of claim 2, wherein the number of the at least one input terminal is determined by a pulsing manner of the input characteristic data in a horizontal direction, the pulsing manner being determined by a number of convolution kernels of the first algorithm layer and size information of the input characteristic data.
4. A device according to any one of claims 1-3, wherein the data processing device further comprises: and the accumulation output device is used for accumulating the multiply-accumulate data stored in at least one addition register in the vertical direction in the computing unit array and generating output characteristic data.
5. The apparatus of claim 4 wherein said accumulation output comprises a multi-layer addition register; two input ends of each bottom layer addition register in the multi-layer addition registers are respectively connected with two addition registers of two calculation units in the vertical direction, and two input ends of each other layer addition register in the multi-layer addition registers are respectively connected with output ends of two next layer addition registers;
the output end of each layer of addition register and at least one addition register in the multi-layer addition register also respectively generate output multiply-accumulate data.
6. The apparatus of claim 5, wherein if the data scheduling manner of the first algorithm layer is weight data multiplexing or input feature data multiplexing, the plurality of computing units in the vertical direction sequentially process convolution kernel data corresponding to a group of channels and input feature data corresponding to a group of channels in a convolution kernel; the number of the plurality of calculation units is determined by the number of channels and convolution kernel size information of the first algorithm layer and the number of multipliers in one calculation unit;
And the accumulation output device is used for accumulating and outputting the multiply-accumulate characteristic data generated by each round in the plurality of addition registers along the vertical direction.
7. The apparatus of claim 5, wherein if the data scheduling manner of the first algorithm layer is output feature data multiplexing, each computing unit in the computing unit array processes convolution kernel data corresponding to a set of channels and input feature data corresponding to a set of channels in a convolution kernel;
an addition register in each of the compute units in the compute unit array performs an accumulation operation on the multi-round multiply-accumulate data and an output operation through the accumulation output device.
8. The apparatus of claim 5, wherein each compute unit of the array of compute units loads the convolution kernel data in accordance with a number of convolution kernels of the first algorithm layer, size information of the input feature data, and a data scheduling manner.
9. A data processing method, applied to the data processing apparatus of any one of claims 1 to 7, the method comprising:
acquiring algorithm layer parameters and a data scheduling mode of a first algorithm layer in a neural network model;
Determining a loading mode of input characteristic data and convolution kernel data on a computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode;
according to the loading mode of the input characteristic data and the convolution kernel data on the computing unit array, loading the input characteristic data and the convolution kernel data into the computing unit array, and performing multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data;
and processing the multiply-accumulate data based on the result output mode to obtain output characteristic data.
10. The method of claim 9, wherein the algorithm layer parameters of the first algorithm layer include a number of convolution kernels, input feature size information, a number of channels, and convolution kernel size information.
11. The method of claim 10, wherein determining a loading mode of input feature data and convolution kernel data on the computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode comprises:
if the data scheduling mode is input characteristic data multiplexing or weight data multiplexing, determining a data pulsation mode of the input characteristic data in the horizontal direction of the computing unit array and a grouping mode of convolution kernel data in the horizontal direction of the computing unit array according to the convolution kernel number and the input characteristic size information;
Determining a first number of computing units performing accumulation operation in each vertical direction in the computing unit array according to the channel number and the convolution kernel size information;
determining the data pulsation mode and the grouping mode as the loading mode of the input characteristic data and the convolution kernel data on the computing unit array; and determining accumulating and outputting the results of the first number of computing units in each vertical direction as the result output mode.
12. The method according to claim 10, wherein determining a loading mode of input feature data and convolution kernel data on the computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode includes:
if the data scheduling mode is output characteristic data multiplexing, determining a data flow direction of the input characteristic data to each horizontal computing unit in the computing unit array and a distribution mode of the convolution kernel data in the computing unit array according to the number of convolution kernels and the input characteristic size information;
determining the data flow direction and the distribution mode as a loading mode of input characteristic data and convolution kernel data on a computing unit array;
And determining a result output mode of the computing unit array to accumulate and output the multi-round multiply-accumulate data for each computing unit.
13. A data processing apparatus, the apparatus comprising:
the acquisition unit is used for acquiring algorithm layer parameters and a data scheduling mode of a first algorithm layer in the neural network model;
the determining unit is used for determining a loading mode of input characteristic data and convolution kernel data on the computing unit array and a corresponding result output mode of the computing unit array according to the algorithm layer parameters and the data scheduling mode;
the loading unit is used for loading the input characteristic data and the convolution kernel into the computing unit array according to the loading mode of the input characteristic data and the convolution kernel data on the computing unit array, and performing multiply-accumulate operation on the computing unit array to obtain multiply-accumulate data;
and the processing unit is used for processing the multiply-accumulate data based on the result output mode to obtain output characteristic data.
14. A data processing apparatus, the apparatus comprising: a processor, a memory, and a communication bus; the processor, when executing a memory-stored operating program, implements the method according to any one of claims 9-12.
15. A storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 9-12.
CN202111224717.0A 2021-10-20 2021-10-20 Data processing method and device and storage medium Pending CN116009813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111224717.0A CN116009813A (en) 2021-10-20 2021-10-20 Data processing method and device and storage medium


Publications (1)

Publication Number Publication Date
CN116009813A true CN116009813A (en) 2023-04-25

Family

ID=86025336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111224717.0A Pending CN116009813A (en) 2021-10-20 2021-10-20 Data processing method and device and storage medium

Country Status (1)

Country Link
CN (1) CN116009813A (en)

Similar Documents

Publication Publication Date Title
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN108205701B (en) System and method for executing convolution calculation
US20220138577A1 (en) Batch Processing In A Neural Network Processor
US11816532B2 (en) Performing kernel striding in hardware
US11531540B2 (en) Processing apparatus and processing method with dynamically configurable operation bit width
CN107862374B (en) Neural network processing system and processing method based on assembly line
EP3179415B1 (en) Systems and methods for a multi-core optimized recurrent neural network
US20140344203A1 (en) Neural network computing apparatus and system, and method therefor
CN110807522B (en) General calculation circuit of neural network accelerator
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN108446761A (en) A kind of neural network accelerator and data processing method
EP3674982A1 (en) Hardware accelerator architecture for convolutional neural network
TW202123093A (en) Method and system for performing convolution operation
CN108960414B (en) Method for realizing single broadcast multiple operations based on deep learning accelerator
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN108985449B (en) Control method and device for convolutional neural network processor
CN110909872B (en) Integrated circuit chip device and related products
CN116009813A (en) Data processing method and device and storage medium
CN112308217B (en) Convolutional neural network acceleration method and system
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN112418417A (en) Convolution neural network acceleration device and method based on SIMD technology
CN111985628B (en) Computing device and neural network processor comprising same
CN113627587A (en) Multichannel convolutional neural network acceleration method and device
CN116805155B (en) LSTM network processing method, device, equipment and readable storage medium
CN110765413A (en) Matrix summation structure and neural network computing platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination