CN115469826B - Data processing method and device, computer equipment and computer readable storage medium - Google Patents


Info

Publication number
CN115469826B
Authority
CN
China
Prior art keywords
unit
weight
register
operation unit
arithmetic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211128343.7A
Other languages
Chinese (zh)
Other versions
CN115469826A (en)
Inventor
钱祎剑
张斌
沈小勇
吕江波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Smartmore Technology Co Ltd
Original Assignee
Shenzhen Smartmore Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Smartmore Technology Co Ltd filed Critical Shenzhen Smartmore Technology Co Ltd
Priority to CN202211128343.7A priority Critical patent/CN115469826B/en
Publication of CN115469826A publication Critical patent/CN115469826A/en
Application granted granted Critical
Publication of CN115469826B publication Critical patent/CN115469826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a data processing method and apparatus, a computer device and a computer-readable storage medium, relating to the technical field of data processing. The method comprises the following steps: in response to a pulse signal in the current operation cycle, inputting a target feature map data set to the multiplier of a multiplication unit and inputting the target weight data set in the tail register of the multiplication unit to the same multiplier, to obtain the dot product data set of the operation unit under the pulse signal; determining a dot product matrix from the dot product data sets of the operation unit under the plurality of pulse signals of the current operation cycle; determining the output result corresponding to each operation unit column from the dot product matrices of the operation units in that column; and rearranging the output results corresponding to the operation unit columns according to the rearrangement signal of the current operation cycle to obtain the target result. The method and apparatus reduce the consumption of logic resources and the number of signal lines, and simplify the routing of the convolutional neural network acceleration circuit.

Description

Data processing method, data processing device, computer equipment and computer readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, a computer device, and a computer-readable storage medium.
Background
Convolutional neural networks are applied in a wide range of scenarios and involve a large number of multiplications. The Winograd convolution algorithm applies a specific domain transformation to the feature map and the weights, completing an equivalent convolution while reducing the number of multiplications in the convolution process.
In the prior art, the internal structure of a multiplication unit in a convolutional neural network acceleration circuit is shown in Fig. 1: the weight data are loaded into the multiplier through parallel weight inputs and a data selector, which increases the consumption of logic resources and the number of signal lines. Since the number of multiplication units in the acceleration circuit is related to the numbers of input and output channels of the convolutional neural network, the extra logic resources and signal lines of each multiplication unit increase the routing difficulty of the acceleration circuit, which is unfavourable for use in small programmable logic arrays and application-specific integrated circuits.
Disclosure of Invention
The application provides a data processing method and apparatus, a computer device and a computer-readable storage medium, which can reduce the consumption of logic resources and the number of signal lines and thereby simplify the routing of a convolutional neural network acceleration circuit.
In a first aspect, the present application provides a data processing method, including:
in response to a pulse signal in the current operation cycle, for each multiplication unit in each operation unit, inputting a target feature map data set of the feature map matrix into the multiplier of the multiplication unit, and inputting the target weight data set in the tail register of the multiplication unit into the same multiplier, to obtain a dot product data set of the operation unit under the pulse signal; wherein the pulse signal is sent according to a preset pulse period, and the target weight data set in the tail register was transferred to the tail register, in response to the previous pulse signal, from the previous register connected in series with the tail register;
determining a dot product matrix of the operation unit based on all dot product data sets of the operation unit under a plurality of pulse signals included in the current operation period;
for each arithmetic unit column, determining an output result corresponding to the arithmetic unit column based on a plurality of dot product matrixes corresponding to a plurality of arithmetic units in the arithmetic unit column one by one; the arithmetic unit row comprises a plurality of arithmetic units distributed into a row;
and determining a rearrangement signal based on the current operation period, and rearranging the output result corresponding to each operation unit column based on the rearrangement signal to obtain a target result.
In a second aspect, the present application further provides a data processing apparatus comprising a processing unit and a multiplier array;
the multiplier array includes: a rearrangement data selector, a plurality of output operators and a plurality of operation units distributed in an array; the rearrangement data selector is connected with the plurality of output operators; the plurality of output operators correspond one-to-one to the plurality of operation unit columns, and the operation units in each operation unit column are connected with the corresponding output operator; the operation units in each operation unit row are connected in series end to end, and each operation unit comprises a plurality of multiplication units connected in parallel; each multiplication unit comprises a multiplier and a plurality of registers connected in series end to end, the tail register of the plurality of registers being connected with the multiplier;
the processing unit is used for responding to the pulse signal in the current operation cycle, and for each multiplication unit in each operation unit, inputting the target characteristic diagram data group in the characteristic diagram matrix to the multiplier of the multiplication unit; wherein, the pulse signal is sent according to a preset pulse period;
a tail register of each multiplication unit in each operation unit, configured to input the target weight data set in the tail register of the multiplication unit to the multiplier of the multiplication unit; the target weight data set in the tail register was transferred to the tail register, in response to the previous pulse signal, from the previous register connected in series with the tail register;
the multiplier of each multiplication unit in each operation unit is used for calculating a target characteristic diagram data set and a target weight data set to obtain a dot product data set of the operation unit under a pulse signal;
each operation unit is used for determining a dot product matrix of the operation unit based on all dot product data groups of the operation unit under a plurality of pulse signals in the current operation period;
an output operator corresponding to each operation unit column, for determining an output result corresponding to the operation unit column based on a plurality of dot product matrixes corresponding to a plurality of operation units in the operation unit column one by one; the arithmetic unit row comprises a plurality of arithmetic units distributed into a row;
the processing unit is also used for determining a rearrangement signal based on the current operation period;
and the rearrangement data selector is used for rearranging the output result corresponding to each operation unit column based on the rearrangement signal to obtain a target result.
In a third aspect, the present application further provides a computer device, where the computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the data processing method when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the data processing method described above.
In a fifth aspect, the present application further provides a computer program product, which comprises a computer program, and when being executed by a processor, the computer program realizes the steps of the data processing method.
In the above data processing method, in response to a pulse signal in the current operation cycle, for each multiplication unit in each operation unit, the target feature map data set and the target weight data set in the tail register are input to the multiplier of the multiplication unit to obtain the dot product data set of the operation unit under that pulse signal; the dot product matrix of each operation unit is determined from its dot product data sets in the current operation cycle; the output result corresponding to each operation unit column is determined from the dot product matrices of the operation units in that column; a rearrangement signal is determined from the current operation cycle, and the output results corresponding to the operation unit columns are rearranged based on the rearrangement signal to obtain the target result. Since the target weight data set in the tail register was transferred there, in response to the previous pulse signal, from the previous register connected in series with it, the weight matrix is loaded through the serially connected registers; each multiplication unit therefore has only one weight input and contains no data selector, which reduces the consumption of logic resources and the number of signal lines, simplifies the routing of the convolutional neural network acceleration circuit, and makes the circuit usable in small programmable logic arrays and application-specific integrated circuits.
Drawings
FIG. 1 is a schematic diagram of a multiplication unit in the prior art;
fig. 2 is a schematic structural diagram of a multiplier array according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an arithmetic unit according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a serial connection of the operation units of the operation unit row from beginning to end according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a connection of multiplication units in two operation units connected in series according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating the connection of registers in two multiplication units connected in series according to an embodiment of the present application;
FIG. 8 is a diagram illustrating the connection between the tail register of the tail ALU column and the head register of the head ALU column according to an embodiment of the present application;
fig. 9 is a schematic connection diagram of a weight data selection unit according to an embodiment of the present application;
FIG. 10 is a block diagram of a multiplication unit row and a corresponding weight data selector according to an embodiment of the present application;
FIG. 11 is a block diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment of the present invention;
fig. 13 is an internal structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data processing method provided by the embodiment of the application can be applied to a data processing device, and the data processing device comprises: a processing unit and a multiplier array; as shown in fig. 2, the multiplier array includes: the device comprises a rearranged data selector MUX-P, a plurality of output operators S, a plurality of weight data selection units MUX-Q and a plurality of operation units X distributed in an array; the rearranged data selector MUX-P is connected with a plurality of output operators S; the plurality of output operators S correspond to the plurality of operation unit columns one by one, the plurality of operation units in each operation unit column are connected with the corresponding output operators S, and the plurality of weight data selection units MUX-Q correspond to the plurality of operation unit rows one by one; a plurality of arithmetic units X in each arithmetic unit row are connected in series end to end; as shown in fig. 3, each arithmetic unit X includes a plurality of multiplication units C connected in parallel; for each multiplication unit C in each arithmetic unit X, the multiplication unit C comprises a multiplier Mul and a plurality of registers R connected in series end to end, the tail register of the plurality of registers being connected to the multiplier.
In some embodiments, as shown in fig. 4, there is provided a data processing method comprising the steps of:
Step S402, in response to a pulse signal in the current operation cycle, for each multiplication unit in each operation unit, inputting a target feature map data set of the feature map matrix to the multiplier of the multiplication unit and inputting the target weight data set in the tail register of the multiplication unit to the multiplier of the multiplication unit, to obtain a dot product data set of the operation unit under the pulse signal; wherein the pulse signal is sent according to a preset pulse period, and the target weight data set in the tail register was transferred to the tail register, in response to the previous pulse signal, from the previous register connected in series with the tail register.
The data processing method is used for completing the calculation shown in the formula (1).
S = A^T[(GgG^T) ⊙ (BdB^T)]A    (1)
where g is the convolution kernel, G is the convolution kernel transform matrix, G^T is the transpose of the convolution kernel transform matrix, and GgG^T represents the weight matrix; d is the input signal, B is the input transform matrix, B^T is the transpose of the input transform matrix, and BdB^T represents the feature map matrix; A is the output transform matrix and A^T is the transpose of the output transform matrix.
The embodiments of the present application take the Winograd-optimized F(2, 3) convolution as an example of the data processing method; in this case the parameters in formula (1) are the F(2, 3) transform matrices: G is the 4 × 3 convolution kernel transform matrix, B is the 4 × 4 input transform matrix, and A is the 4 × 2 output transform matrix.
For ease of description, the feature map matrix BdB^T is denoted B_d and the weight matrix GgG^T is denoted g'; B_d is a 4 × 4 matrix and g' is a 4 × 4 matrix. Taking one row or one column of the feature map matrix as a feature map data set, B_d contains 4 feature map data sets; likewise, taking one row or one column of g' as a weight data set, g' contains 4 weight data sets. The number of multiplication units connected in parallel in each operation unit and the number of registers in each multiplication unit can be determined according to the size of the convolution operation to be performed by the data processing apparatus.
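For orientation, the following is a minimal Python sketch of formula (1) for F(2, 3). The matrix values below are the commonly used Winograd F(2, 3) transforms and the function and variable names are illustrative assumptions; they are not taken from the patent figures.

```python
import numpy as np

# Commonly used Winograd F(2, 3) transform matrices (illustrative values).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Compute S = A^T[(G g G^T) ⊙ (B d B^T)]A for a 4x4 input tile d and a 3x3 kernel g."""
    U = G @ g @ G.T          # 4x4 weight matrix (GgG^T, denoted g' in the text)
    V = B_T @ d @ B_T.T      # 4x4 feature map matrix (BdB^T, denoted B_d in the text)
    M = U * V                # element-wise (Hadamard) product
    return A_T @ M @ A_T.T   # 2x2 output tile

# Sanity check against a direct 2D sliding-window product of a 4x4 tile with a 3x3 kernel.
d = np.arange(16, dtype=float).reshape(4, 4)
g = np.arange(9, dtype=float).reshape(3, 3)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(winograd_f23(d, g), direct)
```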
The pulse signal is used to trigger the weight data sets to flow through the multiplier array. The pulse signal is sent according to a preset pulse period, i.e. once every pulse period; the pulse period can be set according to actual requirements, for example a pulse signal every 0.1 second, i.e. a pulse period of 0.1 second.
The current operation cycle comprises a plurality of pulse signals, and the target feature map data set is obtained from the feature map matrix according to the index of the pulse signal within the current operation cycle; for example, in response to the first pulse signal sent in the current operation cycle, the feature map data of the first row of the feature map matrix are taken as the target feature map data set, and in response to the third pulse signal sent in the current operation cycle, the feature map data of the third row of the feature map matrix are taken as the target feature map data set.
Since the dot product of the feature map matrix and the weight matrix is computed as the dot product of one feature map data set with the weight data set at the corresponding position of the weight matrix, the target weight data set matches the position of the target feature map data set: if the target feature map data set under the pulse signal is the first row of the feature map matrix, the target weight data set is the first row of the weight matrix; if the target feature map data set is the third row of the feature map matrix, the target weight data set is correspondingly the third row of the weight matrix.
The target weight data set in the tail register was transferred to the tail register, in response to the previous pulse signal, from the previous register connected in series with the tail register; the target weight data set is one weight data set of the weight matrix.
Specifically, for each multiplication unit in each operation unit, the previous register (connected in series with the tail register) transfers its weight data set to the tail register in response to the previous pulse signal of the current operation cycle, and the tail register transfers its weight data set (the target weight data set) to the multiplier in response to the current pulse signal of the current operation cycle.
For the multiplier of each multiplication unit in each operation unit, the multiplier computes the product of the target feature map data and the target weight data to obtain an initial dot product value under the pulse signal; the initial dot product values obtained by the multiplication units of the operation unit form the dot product data set of the operation unit under that pulse signal.
Illustratively, let the target feature map data set be B_d_1 = (B_d1.1, B_d1.2, B_d1.3, B_d1.4) and the target weight data set be g'_1 = (g'1.1, g'1.2, g'1.3, g'1.4). The four multiplication units included in the operation unit are a first multiplication unit c1, a second multiplication unit c2, a third multiplication unit c3 and a fourth multiplication unit c4, and B_d_1 and g'_1 are input to c1, c2, c3 and c4 respectively: c1 computes the product of g'1.1 and B_d1.1 to obtain m1.1, the initial dot product value obtained by c1 under the pulse signal from the target feature map data set and the target weight data set; c2 computes the product of g'1.2 and B_d1.2 to obtain m1.2; c3 computes the product of g'1.3 and B_d1.3 to obtain m1.3; c4 computes the product of g'1.4 and B_d1.4 to obtain m1.4. m1.1, m1.2, m1.3 and m1.4 form the dot product data set m_1 = (m1.1, m1.2, m1.3, m1.4).
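The per-pulse behaviour of one operation unit described above can be sketched as follows; this is a behavioural Python model with illustrative names and values, assuming four parallel multiplication units as in the F(2, 3) case.

```python
def operation_unit_pulse(feature_row, weight_row):
    """One pulse signal: each of the four multiplication units c1..c4 multiplies one
    element of the target feature map data set by the weight element currently held
    in its tail register, giving the dot product data set of the operation unit."""
    assert len(feature_row) == len(weight_row) == 4
    return [f * w for f, w in zip(feature_row, weight_row)]

# Example values (illustrative only): B_d_1 and g'_1
Bd_1 = [1.0, 2.0, 3.0, 4.0]
g1_prime = [0.5, 0.5, -0.5, 1.0]
m_1 = operation_unit_pulse(Bd_1, g1_prime)   # (m1.1, m1.2, m1.3, m1.4)
```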
In step S404, a dot product matrix of the operation unit is determined based on all the dot product data sets of the operation unit under the plurality of pulse signals included in the current operation cycle.
Wherein the current operation cycle comprises a plurality of pulse signals.
Specifically, the dot product data set is a row data set or a column data set in the dot product matrix. For each operation unit, under a plurality of pulse signals, a plurality of dot product data sets are calculated and obtained through a plurality of multiplication units included in the operation unit, and a dot product matrix of the operation unit is obtained according to the plurality of dot product data sets.
Illustratively, one operation cycle comprises 4 pulse signals. In response to the first pulse signal of the current operation cycle, the dot product data set m_1 of the first feature map data set B_d_1 of the feature map matrix and the first weight data set g'_1 of the weight matrix is obtained; in response to the second pulse signal, the dot product data set m_2 of the second feature map data set B_d_2 and the second weight data set g'_2 is obtained; in response to the third pulse signal, the dot product data set m_3 of the third feature map data set B_d_3 and the third weight data set g'_3 is obtained; in response to the fourth pulse signal, the dot product data set m_4 of the fourth feature map data set B_d_4 and the fourth weight data set g'_4 is obtained; and the dot product matrix m of the operation unit in the current operation cycle is formed from m_1, m_2, m_3 and m_4.
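Continuing the behavioural sketch, one operation cycle of four pulse signals assembles the operation unit's dot product matrix row by row; the model and values below are illustrative only.

```python
def operation_unit_cycle(Bd, g_prime):
    """One operation cycle: in response to the k-th pulse signal, the k-th feature map
    data set of B_d and the k-th weight data set of g' are multiplied element-wise;
    the four resulting dot product data sets m_1..m_4 form the 4x4 dot product matrix m."""
    return [[f * w for f, w in zip(Bd[k], g_prime[k])] for k in range(4)]

# Illustrative 4x4 inputs: B_d (transformed feature map tile) and g' (transformed weights).
Bd = [[float(4 * r + c) for c in range(4)] for r in range(4)]
g_prime = [[1.0, 0.5, -0.5, 1.0] for _ in range(4)]
m = operation_unit_cycle(Bd, g_prime)        # dot product matrix of this operation unit
```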
Step S406, determining an output result corresponding to the operation unit column based on a plurality of dot product matrixes corresponding to a plurality of operation units in the operation unit column one by one for each operation unit column; the arithmetic unit row comprises a plurality of arithmetic units distributed into a row.
Specifically, the number of operation units included in the multiplier array is related to the numbers of input and output channels of the convolutional neural network: if the convolutional neural network has N_I input channels and N_O output channels, the multiplier array includes N_I × N_O operation units. An operation unit column comprises a plurality of operation units arranged in a column, and an operation unit row comprises a plurality of operation units arranged in a row; taking the operation unit as the unit, the multiplier array therefore comprises N_O operation unit columns and N_I operation unit rows.
For each arithmetic unit column, the output arithmetic unit corresponding to the arithmetic unit column performs accumulation processing on the dot product matrixes of the arithmetic units in the arithmetic unit column to obtain the dot product result of the arithmetic unit column, and obtains the output result of the arithmetic unit column based on the output conversion matrix, the output conversion transpose matrix and the dot product result of the arithmetic unit column.
Illustratively, for the operation unit X_(i,o) in the i-th row and o-th column, the dot product matrix calculated in the current operation cycle is m_(i,o). The dot product matrices m_(·,o) calculated in the current cycle by the operation units X_(·,o) of the o-th column are accumulated to obtain the dot product result M_O of the o-th column in the current cycle, and the output result YO of the o-th operation unit column is obtained from the output transform matrix A, the output transform transpose matrix A^T and M_O as YO = A^T·M_O·A. In the same manner, the output result Y1 corresponding to the operation units X_(·,1) of the 1st column, the output result Y2 corresponding to the operation units X_(·,2) of the 2nd column, ..., and the output result YN_O corresponding to the operation units X_(·,N_O) of the N_O-th column are obtained.
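Under the same illustrative setup, the column-level combination can be sketched as follows: the dot product matrices of the operation units in column o are accumulated over the input channels and then passed through the output transform. A_T below is the commonly used F(2, 3) output transform and the names are assumptions for the sketch.

```python
import numpy as np

A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)   # output transform transpose for F(2, 3)

def column_output(dot_product_matrices):
    """Output operator of one operation unit column: accumulate the 4x4 dot product
    matrices m_(i,o) over the input-channel index i, then apply Y_o = A^T * M_o * A."""
    M_o = np.sum(np.asarray(dot_product_matrices, dtype=float), axis=0)
    return A_T @ M_o @ A_T.T                    # 2x2 output result Y_o

# Example: three input channels (N_I = 3), each contributing a 4x4 dot product matrix.
Y_o = column_output([np.ones((4, 4)), 2 * np.ones((4, 4)), 3 * np.ones((4, 4))])
```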
Step S408, determining a rearrangement signal based on the current operation cycle, and rearranging the output result corresponding to each operation unit column based on the rearrangement signal to obtain a target result.
Specifically, if the output result corresponding to each operation unit column is in the preset sequence in the current operation period, the output result corresponding to each operation unit column does not need to be rearranged, and the output result corresponding to each operation unit column is directly used as the target result; if the output result corresponding to each arithmetic unit column is not in the preset sequence in the current arithmetic cycle, determining a rearrangement signal based on the cycle number between the current arithmetic cycle and the preset initial arithmetic cycle and the total column number of all the arithmetic unit columns, and rearranging the output result corresponding to each arithmetic unit column based on the rearrangement signal and the rearrangement data selector to obtain a target result.
The preset order means that the subscript of the output result corresponding to each operation unit column is the same as the column index of that operation unit column.
In the preset initial operation cycle, i.e. operation cycle w = 0, the output result of the 1st operation unit column is Y1, the output result of the o-th column is YO, and the output result of the N_O-th column is YN_O, so the output results corresponding to the operation unit columns are {Y1, Y2, ..., YO, ..., YN_O}; they are in the preset order and do not need to be rearranged.
In the next operation cycle after the initial operation cycle, i.e. operation cycle w = 1, the output result of the 1st operation unit column is YN_O, the output result of the o-th column (o ≥ 2) is Y(o-1), and the output result of the N_O-th column is Y(N_O-1), so the output results corresponding to the operation unit columns are {YN_O, Y1, ..., Y(o-1), ..., Y(N_O-1)}; they are not in the preset order and need to be rearranged.
Thus, in the N_O-th operation cycle after the initial operation cycle, i.e. operation cycle w = N_O, the output result of the 1st column is again Y1, the output result of the o-th column is YO, and the output result of the N_O-th column is YN_O; the output results {Y1, Y2, ..., YO, ..., YN_O} corresponding to the operation unit columns are again in the preset order and do not need to be rearranged.
That is, if the number of operation cycles between the current operation cycle and the preset initial operation cycle is an integer multiple of the total number of operation unit columns of the multiplier array, the output results corresponding to the operation unit columns do not need to be rearranged; otherwise, the rearrangement signal is determined based on that number of cycles and the total number of operation unit columns, and the output results corresponding to the operation unit columns are rearranged based on the rearrangement signal so that they are in the preset order, thereby obtaining the target result.
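A behavioural sketch of this rearrangement is given below, assuming the one-column rotation per operation cycle described above; the modulo formulation is an interpretation of the text, not a description of the rearrangement data selector circuit.

```python
def rearrange_outputs(column_outputs, cycle, num_columns):
    """Restore the preset order {Y1, ..., YN_O} from the per-column outputs of
    operation cycle `cycle`, where cycle 0 is the preset initial operation cycle
    and the results rotate by one column per cycle."""
    shift = cycle % num_columns
    if shift == 0:
        return list(column_outputs)           # already in the preset order
    return [column_outputs[(j + shift) % num_columns] for j in range(num_columns)]

# Example with N_O = 4 at cycle w = 1: the columns currently hold {Y4, Y1, Y2, Y3}.
assert rearrange_outputs(['Y4', 'Y1', 'Y2', 'Y3'], 1, 4) == ['Y1', 'Y2', 'Y3', 'Y4']
```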
In the above data processing method, in response to a pulse signal in the current operation cycle, for each multiplication unit in each operation unit, the target feature map data set and the target weight data set in the tail register are input to the multiplier of the multiplication unit to obtain the dot product data set of the operation unit under that pulse signal; the dot product matrix of each operation unit is determined from its dot product data sets in the current operation cycle; the output result corresponding to each operation unit column is determined from the dot product matrices of the operation units in that column; a rearrangement signal is determined from the current operation cycle, and the output results corresponding to the operation unit columns are rearranged based on the rearrangement signal to obtain the target result. Since the target weight data set in the tail register was transferred there, in response to the previous pulse signal, from the previous register connected in series with it, the weight matrix is loaded through the serially connected registers; each multiplication unit therefore has only one weight input and contains no data selector, which reduces the consumption of logic resources and the number of signal lines, simplifies the routing of the convolutional neural network acceleration circuit, and makes the circuit usable in small programmable logic arrays and application-specific integrated circuits.
In some embodiments, the data processing method further comprises:
in response to the pulse signal in the current operation cycle, for each register in the multiplication unit, the weight data set in the register is transferred to the next register connected in series with the register.
Specifically, the weight data sets of the weight matrix flow through the serially connected registers in step with the pulse signals of the current operation cycle.
Each multiplication unit comprises a multiplier and a plurality of registers connected in series; the first register of the series is called the head register and the last register is called the tail register, and the tail register is connected with the multiplier of the multiplication unit. Within the registers of a multiplication unit, a weight data set is transferred from the head register toward the tail register in turn.
Illustratively, the multiplication unit includes 4 registers in series: the head register, a second register, a third register and the tail register. In response to a pulse signal, the weight data set in the tail register is transferred to the next register (described in detail in the next embodiment), the weight data set in the third register is transferred to the tail register, the weight data set in the second register is transferred to the third register, and the weight data set in the head register is transferred to the second register.
Before the first pulse signal of the current operation cycle, the weight data set in the head register of the multiplication unit is g'_4, the weight data set in the second register is g'_3, the weight data set in the third register is g'_2, and the weight data set in the tail register is g'_1.
In response to the first pulse signal of the current operation cycle, the tail register transfers g'_1 as the target weight data set to the multiplier of the multiplication unit and transfers g'_1 to the next register; the third register transfers g'_2 into the tail register; the second register transfers g'_3 into the third register; the head register transfers g'_4 into the second register; correspondingly, B_d_1 of the feature map matrix is transferred to the multiplier of the multiplication unit as the target feature map data set.
In response to the second pulse signal of the current operation cycle, the tail register transfers g'_2 as the target weight data set to the multiplier of the multiplication unit and transfers g'_2 to the next register; the third register transfers g'_3 into the tail register; the second register transfers g'_4 into the third register; correspondingly, B_d_2 of the feature map matrix is transferred to the multiplier of the multiplication unit as the target feature map data set.
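The register chain behaviour inside a single multiplication unit can be modelled as below; this is a minimal Python sketch, and the deque-based shift and the preload order are illustrative assumptions consistent with the example above.

```python
from collections import deque

class MultiplicationUnitRegisters:
    """Four registers in series: head, second, third, tail. The tail register feeds
    the multiplier and, on each pulse signal, the next multiplication unit in the chain."""
    def __init__(self):
        self.registers = deque([None] * 4)       # index 0 = head register, 3 = tail register

    def pulse(self, incoming_weight_set):
        """One pulse signal: the tail content leaves (to the multiplier and the next
        unit); every other weight data set shifts one register toward the tail."""
        outgoing = self.registers.pop()          # tail register content
        self.registers.appendleft(incoming_weight_set)
        return outgoing

# Preload g'_1..g'_4 in operation order; the tail register then holds g'_1, the head g'_4.
unit = MultiplicationUnitRegisters()
for weight_set in ["g'_1", "g'_2", "g'_3", "g'_4"]:
    unit.pulse(weight_set)
assert list(unit.registers) == ["g'_4", "g'_3", "g'_2", "g'_1"]
```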
In some embodiments, transferring the set of weight data in the register to a next register in series with the register comprises:
when the register is the tail register of a multiplication unit, the weight data set in the register is transferred to the head register of the next multiplication unit, the tail register of the multiplication unit being connected in series with the head register of the next multiplication unit; or,
when the register is the tail register of the tail operation unit column, the weight data set in the register is transferred to the head register of the head operation unit, the tail register of the tail operation unit column being connected in series with the head register of the head operation unit.
Specifically, the operation units of each operation unit row are connected in series end to end, that is, the weight output of the tail operation unit of each operation unit row is connected with the weight input of the head operation unit of that row; illustratively, as shown in Fig. 5, the weight output of the operation unit X_(1,N_O) is connected to the weight input of the operation unit X_(1,1).
For two operation units connected in series in an operation unit row, the multiplication units of the first of the two operation units are respectively connected with the multiplication units of the second. As shown in Fig. 6, the multiplication unit c1_(1,1) of operation unit X_(1,1) is connected with the multiplication unit c1_(1,2) of operation unit X_(1,2); the multiplication unit c2_(1,1) of X_(1,1) is connected with the multiplication unit c2_(1,2) of X_(1,2); ...; the multiplication unit c4_(1,1) of X_(1,1) is connected with the multiplication unit c4_(1,2) of X_(1,2).
A first multiplication unit of the first operation unit comprises four registers connected in series and the corresponding second multiplication unit of the second operation unit comprises four registers connected in series, the tail register of the first multiplication unit being connected in series with the head register of the second multiplication unit. As shown in Fig. 7, the first multiplication unit c1_(1,1) of the first operation unit includes four registers r11_(1,1), r12_(1,1), r13_(1,1) and r14_(1,1), where r14_(1,1) is the tail register of c1_(1,1); the first multiplication unit c1_(1,2) of the second operation unit includes four registers r11_(1,2), r12_(1,2), r13_(1,2) and r14_(1,2), where r11_(1,2) is the head register of c1_(1,2); r14_(1,1) and r11_(1,2) are connected in series.
When the first operation unit is the one in the tail operation unit column, the second operation unit is the one in the head operation unit column. It can be understood that the tail register of the tail operation unit column is the tail register of any multiplication unit of the tail operation unit of an operation unit row, and the head register of the head operation unit column is the head register connected in series with that tail register in the head operation unit of the same row. As shown in Fig. 8, the first operation unit is X_(1,N_O); its first multiplication unit c1_(1,N_O) includes four registers r11_(1,N_O), r12_(1,N_O), r13_(1,N_O) and r14_(1,N_O), where r14_(1,N_O) is the tail register of c1_(1,N_O); the second operation unit is X_(1,1), whose first multiplication unit c1_(1,1) includes four registers r11_(1,1), r12_(1,1), r13_(1,1) and r14_(1,1), where r11_(1,1) is the head register of c1_(1,1); the weight output of r14_(1,N_O) is connected to the weight input of r11_(1,1).
In the above embodiment, each multiplication unit includes a plurality of registers connected in series, and the tail register of one multiplication unit is connected in series with the head register of the next multiplication unit, realizing the series connection between operation units; the tail operation unit of each operation unit row is connected in series with the head operation unit of the same row, forming operation unit rows connected end to end. The weight data sets flow through these end-to-end rows in step with the pulse signals to load the weight matrices, so that each multiplication unit has only one weight input and contains no data selector. This reduces the consumption of logic resources and the number of signal lines, simplifies the routing of the convolutional neural network acceleration circuit, and allows the circuit to be used in small programmable logic arrays and application-specific integrated circuits.
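The end-to-end series connection of one multiplication unit row, including the recirculation from the tail register back to the head register, can be sketched as a single flat shift register; this is a software analogy only, with illustrative names.

```python
def pulse_multiplication_unit_row(register_chain, recirculate=True, new_weight_set=None):
    """One pulse signal over a multiplication unit row modelled as a flat list of
    registers (head register of the head unit first, tail register of the tail unit
    last). With recirculation, the tail output re-enters the head register."""
    tail_output = register_chain[-1]                       # leaves toward the multipliers
    fed_in = tail_output if recirculate else new_weight_set
    register_chain[:] = [fed_in] + register_chain[:-1]     # shift one register toward the tail
    return tail_output

# Example: a row of two multiplication units (8 registers) that is already fully loaded.
chain = ["w8", "w7", "w6", "w5", "w4", "w3", "w2", "w1"]
out = pulse_multiplication_unit_row(chain)   # "w1" leaves the tail and re-enters the head
assert chain == ["w1", "w8", "w7", "w6", "w5", "w4", "w3", "w2"] and out == "w1"
```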
In some embodiments, the data processing method further comprises:
for each arithmetic unit row, inputting the weight matrix output by the tail arithmetic unit of the plurality of arithmetic units in the arithmetic unit row to the first input end of the weight data selection unit corresponding to the arithmetic unit row; the arithmetic unit row comprises a plurality of arithmetic units distributed in a row;
determining a weight matrix set corresponding to the operation unit row in a plurality of pre-stored weight matrix sets; inputting a plurality of weight matrices in the weight matrix set to a second input end of the weight data selection unit;
under the condition that the signal input end of the weight data selection unit is a first weight loading signal, outputting a weight matrix set through the weight data selection unit;
based on the plurality of pulse signals, the weight matrix set comprising the plurality of weight matrices is sequentially input to the operation unit rows, so that each operation unit in the operation unit rows comprises a corresponding weight matrix.
Specifically, this embodiment is a process performed before step S402; before this embodiment is performed, no operation unit holds a weight matrix, and by performing this embodiment each operation unit comes to hold its corresponding weight matrix.
The plurality of pre-stored weight matrix sets can be obtained from a storage unit, the storage unit being an off-chip memory or an on-chip cache; the second input of the weight data selection unit is connected with the off-chip memory or on-chip cache to obtain the weight matrix set corresponding to the operation unit row from the off-chip memory or on-chip cache. The weight matrix sets correspond one-to-one to the operation unit rows of the multiplier array; illustratively, the multiplier array comprises N_I operation unit rows, corresponding to N_I weight matrix sets. When the signal input of the weight data selection unit carries the first weight loading signal, the weight data selection unit takes the data at its second input as its output, that is, the weight matrix set corresponding to the operation unit row is input to the head operation unit of the row.
Each weight matrix set comprises a plurality of weight matrices corresponding one-to-one to the operation units of the operation unit row corresponding to that set; illustratively, the N_I-th operation unit row comprises N_O operation units, so the weight matrix set corresponding to the N_I-th operation unit row comprises N_O weight matrices in one-to-one correspondence with those operation units.
Illustratively, as shown in Fig. 9, for the 1st operation unit row X_(1,·), the tail operation unit of X_(1,·) is X_(1,N_O) and the head operation unit is X_(1,1). The first input q1 of the weight data selection unit Q_(1,·) is connected with the weight output of X_(1,N_O); the second input q2 of Q_(1,·) is connected with the off-chip memory or on-chip cache, and the weight matrix set g'_(1,·) corresponding to the operation unit row X_(1,·) is input to q2; the output q3 of Q_(1,·) is connected with the weight input of X_(1,1); and the signal input q4 of Q_(1,·) receives the first weight loading signal or the second weight loading signal.
Sequentially inputting the plurality of weight matrices of the weight matrix set to the operation unit row according to the plurality of pulse signals comprises: arranging the weight matrices of the set in a preset flow order, and arranging the weight data sets of each weight matrix in a preset operation order, to obtain a weight data set sequence.
The weight data set sequence contains a number of weight data sets equal to the product of the number of rows (or columns) of the weight matrix and the number of operation units in the operation unit row. Each time a pulse signal is responded to, one weight data set of the sequence is input to the operation unit row; after the corresponding number of pulse signals, all weight data sets of the sequence have been input, so that each operation unit of the row holds its corresponding weight matrix.
Arranging the weight matrices in the preset flow order means ordering them by the column index of their corresponding operation units, from largest to smallest: the weight matrix corresponding to the operation unit with the largest column index (the tail operation unit) is placed first, and the weight matrix corresponding to the operation unit with the smallest column index (the head operation unit) is placed last.
Arranging the weight data sets of each weight matrix in the preset operation order means ordering them according to the order in which they are used when computing the dot product matrix of the weight matrix and the feature map matrix. For example, if the first dot product data set of the dot product matrix is computed from the first row of the weight matrix and the second dot product data set from the second row, then in the preset operation order the weight data set of the first row precedes the weight data set of the second row.
Each weight data set of the weight data set sequence is then input to the operation unit row in turn, one per pulse signal. In the embodiments of the present application, each weight matrix includes 4 weight data sets and one operation unit row includes N_O operation units, so the weight data set sequence includes 4 × N_O weight data sets.
Illustratively, for the 1st operation unit row X_(1,·): X_(1,·) comprises X_(1,1), X_(1,2), ..., X_(1,N_O), where X_(1,1) is the head operation unit and X_(1,N_O) is the tail operation unit, and initially none of them holds a weight matrix. The weight matrix set corresponding to X_(1,·) includes g'_(1,1), g'_(1,2), ..., g'_(1,N_O), and the weight data set sequence is:
(g'_(1,N_O)_1, g'_(1,N_O)_2, g'_(1,N_O)_3, g'_(1,N_O)_4, ..., g'_(1,2)_1, g'_(1,2)_2, g'_(1,2)_3, g'_(1,2)_4, g'_(1,1)_1, g'_(1,1)_2, g'_(1,1)_3, g'_(1,1)_4),
where (g'_(1,N_O)_1, g'_(1,N_O)_2, g'_(1,N_O)_3, g'_(1,N_O)_4) is the matrix data set of the weight matrix g'_(1,N_O) corresponding to X_(1,N_O), (g'_(1,2)_1, g'_(1,2)_2, g'_(1,2)_3, g'_(1,2)_4) is the matrix data set of the weight matrix g'_(1,2) corresponding to X_(1,2), and (g'_(1,1)_1, g'_(1,1)_2, g'_(1,1)_3, g'_(1,1)_4) is the matrix data set of the weight matrix g'_(1,1) corresponding to X_(1,1).
In response to the 1st pulse signal, g'_(1,N_O)_1 is input to X_(1,1); in response to the 4th pulse signal, g'_(1,N_O)_4 is input to X_(1,1), at which point X_(1,1) holds g'_(1,N_O) and the other operation units hold no weight matrix. In response to the 5th pulse signal, g'_(1,N_O)_1 is input to the head register of X_(1,2) and g'_(1,N_O-1)_1 is input to the head register of X_(1,1); in response to the 8th pulse signal, g'_(1,N_O)_4 is input to X_(1,2) and g'_(1,N_O-1)_4 is input to the head register of X_(1,1), at which point X_(1,2) holds g'_(1,N_O), X_(1,1) holds g'_(1,N_O-1), and the other operation units hold no weight matrix. In response to the 4 × (N_O - 1)-th pulse signal, X_(1,N_O-1) holds g'_(1,N_O), X_(1,2) holds g'_(1,3), X_(1,1) holds g'_(1,2), and X_(1,N_O) holds no weight matrix. In response to the 4 × N_O-th pulse signal, X_(1,N_O) holds g'_(1,N_O), X_(1,N_O-1) holds g'_(1,N_O-1), X_(1,2) holds g'_(1,2) and X_(1,1) holds g'_(1,1); that is, each operation unit of the operation unit row X_(1,·) holds its corresponding weight matrix.
In some embodiments, a weight data selection unit includes a plurality of weight data selectors, and the number of the plurality of weight data selectors in a weight data selection unit is the same as the number of the plurality of multiplication units in an operation unit.
Illustratively, each arithmetic unit comprises 4 parallel multiplication units, each multiplication unit comprises 4 serial registers, and for two arithmetic units connected in series in each arithmetic unit row, the plurality of multiplication units of the first arithmetic unit in the two arithmetic units are respectively connected with the plurality of multiplication units of the second arithmetic unit in the two arithmetic units. It is understood that each operation unit row comprises 4 multiplication unit rows, the multiplication units in each multiplication unit row are connected in series end to end, and the weight data groups are synchronously transmitted in the four multiplication unit rows according to the pulse signals.
Correspondingly, each weight data selection unit comprises 4 weight data selectors connected in parallel, and each multiplication unit row corresponds to one weight data selector; the first input of the weight data selection unit comprises: a first input of 4 weight data selectors, a second input of the weight data selection unit comprising: 4 second inputs of the weight data selector, the signal input of the weight data selector unit comprising: signal inputs of 4 weight data selectors.
For each multiplication unit row, the weight output end of the tail register of the tail multiplication unit of the multiplication unit row is connected with the first input end of the weight data selector corresponding to the multiplication unit row, so that the weight matrix output by the tail multiplication unit of the multiplication unit row is input to the first input end of the weight data selector of the multiplication unit row.
In the weight matrix set corresponding to the operation unit row, determining a weight data set corresponding to each multiplication unit row included in the operation unit row; for each multiplication unit row, the weight data group set corresponding to the multiplication unit row is input to the second input end of the weight data selector corresponding to the multiplication unit row.
In the weight matrix set corresponding to the operation unit row, determining the weight data set corresponding to each multiplication unit row included in the operation unit row comprises: dividing each weight matrix in the weight matrix set into a plurality of rows of weight data sets (or a plurality of columns of weight data sets), and determining a weight data set according to the weight data sets with the same rows (or the weight data sets with the same columns); and determining a plurality of weight data set sets respectively corresponding to a plurality of multiplication unit rows in the operation unit row according to the weight matrix set corresponding to the operation unit row.
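The regrouping described above can be sketched in illustrative Python: each weight matrix is given as its list of row data sets, and data sets with the same row index are gathered for the same multiplication unit row, in the preset flow order.

```python
def weight_data_group_sets(weight_matrix_set):
    """weight_matrix_set: one weight matrix per operation unit (head unit first), each
    matrix given as its 4 weight data sets. Returns the weight data group set of each
    multiplication unit row, ordered with the tail operation unit's data first."""
    num_rows = len(weight_matrix_set[0])
    return [[matrix[r] for matrix in reversed(weight_matrix_set)]
            for r in range(num_rows)]

matrices = [[f"g'(1,{k})_{r}" for r in range(1, 5)] for k in range(1, 4)]
sets = weight_data_group_sets(matrices)
# sets[0] -> ["g'(1,3)_1", "g'(1,2)_1", "g'(1,1)_1"], fed to multiplication unit row c1.
```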
Illustratively, the weight matrix set corresponding to the operation unit row X_(1,·) includes g'_(1,1), g'_(1,2), ..., g'_(1,N_O), and the weight data set sequence is (g'_(1,N_O)_1, ..., g'_(1,N_O)_4, ..., g'_(1,2)_1, ..., g'_(1,2)_4, g'_(1,1)_1, ..., g'_(1,1)_4). The weight data group set g'_(1,·)_1 corresponding to the multiplication unit row c1_(1,·) of X_(1,·) is (g'_(1,N_O)_1, ..., g'_(1,2)_1, g'_(1,1)_1); the weight data group set g'_(1,·)_2 corresponding to the multiplication unit row c2_(1,·) is (g'_(1,N_O)_2, ..., g'_(1,2)_2, g'_(1,1)_2); the weight data group set g'_(1,·)_3 corresponding to the multiplication unit row c3_(1,·) is (g'_(1,N_O)_3, ..., g'_(1,2)_3, g'_(1,1)_3); and the weight data group set g'_(1,·)_4 corresponding to the multiplication unit row c4_(1,·) is (g'_(1,N_O)_4, ..., g'_(1,2)_4, g'_(1,1)_4). According to the pulse signals, g'_(1,·)_1 is input to the multiplication unit row c1_(1,·), g'_(1,·)_2 is input to c2_(1,·), g'_(1,·)_3 is input to c3_(1,·), and g'_(1,·)_4 is input to c4_(1,·).
In one weight data selection unit, the weight selection signals at the signal inputs of all the weight data selectors are the same. When the signal input of the weight data selector corresponding to a multiplication unit row carries the first weight loading signal, that weight data selector outputs the weight data group set corresponding to the multiplication unit row, and the set is input to the multiplication unit row sequentially according to the pulse signals.
As shown in Fig. 10, the operation unit row X_(1,·) comprises 4 multiplication unit rows: c1_(1,·), c2_(1,·), c3_(1,·) and c4_(1,·).
c1_(1,·) corresponds to the weight data selector Q1_(1,·); the first input of Q1_(1,·) is connected with the tail register of the tail multiplication unit c1_(1,N_O) of the tail operation unit X_(1,N_O), and the second input of Q1_(1,·) receives the weight data group set g'_(1,·)_1 corresponding to c1_(1,·).
c2_(1,·) corresponds to the weight data selector Q2_(1,·); the first input of Q2_(1,·) is connected with the tail register of the tail multiplication unit c2_(1,N_O) of X_(1,N_O) (not shown), and the second input of Q2_(1,·) receives the weight data group set g'_(1,·)_2 corresponding to c2_(1,·).
c3_(1,·) corresponds to the weight data selector Q3_(1,·); the first input of Q3_(1,·) is connected with the tail register of the tail multiplication unit c3_(1,N_O) of X_(1,N_O) (not shown), and the second input of Q3_(1,·) receives the weight data group set g'_(1,·)_3 corresponding to c3_(1,·).
c4_(1,·) corresponds to the weight data selector Q4_(1,·); the first input of Q4_(1,·) is connected with the tail register of the tail multiplication unit c4_(1,N_O) of X_(1,N_O) (not shown), and the second input of Q4_(1,·) receives the weight data group set g'_(1,·)_4 corresponding to c4_(1,·).
In some embodiments, after sequentially inputting the weight matrix set including the plurality of weight matrices to the operation unit row based on the plurality of pulse signals, so that each operation unit in the operation unit row includes the corresponding weight matrix, the method further includes:
and inputting the acquired second weight loading signal to a signal input end of the weight data selection unit so that the weight data selection unit outputs the weight matrix output by the tail operation unit.
Specifically, when the signal input end of the weight data selection unit receives the second weight loading signal, the weight data selection unit outputs the data at its first input end; when the signal input end receives the first weight loading signal, the weight data selection unit outputs the data at its second input end.
For each operation unit row, after each operation unit comprises the corresponding weight matrix, inputting a second weight loading signal to the signal input end of the weight data selection unit so that the weight data selection unit outputs the weight matrix output by the tail operation unit of the operation unit row.
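As a hedged illustration of the selector behaviour described above (the signal names FIRST_LOAD and SECOND_LOAD are invented for the sketch and are not the patent's signal values), a minimal Python model:

FIRST_LOAD, SECOND_LOAD = 0, 1


def weight_data_select(signal, tail_output, storage_output):
    """Model of one weight data selection unit: the first input carries the weight
    matrix recirculated from the tail operation unit, the second input carries the
    weight matrix set streamed from the storage unit."""
    if signal == SECOND_LOAD:
        return tail_output      # keep the already-loaded weights circulating in the row
    return storage_output       # first weight loading signal: load new weights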
Illustratively, when the signal input end of the weight data selection unit carries the first weight loading signal, after N_O operation cycles (4×N_O pulse signals) each operation unit in each operation unit row holds its corresponding weight matrix, and the second weight loading signal is then input to the weight data selection unit of each operation unit row. For example, X(1,1) holds g'(1,1), X(1,2) holds g'(1,2), …, X(1,o) holds g'(1,o), …, and the tail operation unit X(1,N_O) holds g'(1,N_O). In response to the (4×N_O)+1-th pulse signal, the weight data group output by the tail register of X(1,N_O) is input to X(1,1), so that the weight matrices circulate within the operation unit row X(1,·). After N_O+1 operation cycles, X(1,1) holds g'(1,N_O), X(1,2) holds g'(1,1), …, X(1,o) holds g'(1,o-1), …, and the weight matrix of the tail operation unit X(1,N_O) is g'(1,N_O-1).
It should be noted that the preset initial operation cycle is the operation cycle in which each operation unit in each operation unit row first holds its corresponding weight matrix. For example, if the operation cycle in which the first weight loading signal is input to the weight data selection unit of each operation unit row is taken as the 1st operation cycle, the preset initial operation cycle is the N_O-th operation cycle. Counting from the initial operation cycle, the operation cycle in which the second weight loading signal is input to the weight data selection unit of each operation unit row is taken as the 1st operation cycle.
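The circulation in the example above can be sketched, under the assumption that register-level timing is abstracted to whole operation cycles, as a rotation of the weight matrices held by the operation units of one row (Python, illustrative only):

from collections import deque


def circulate(weight_matrices, cycles):
    """weight_matrices[i] is the matrix held by operation unit i (head unit first).
    Each operation cycle the tail unit's matrix re-enters at the head and every
    other matrix moves one unit toward the tail."""
    row = deque(weight_matrices)
    for _ in range(cycles):
        row.appendleft(row.pop())  # tail output is fed back to the head unit
    return list(row)


# With N_O = 4 operation units holding [g1, g2, g3, g4], one more cycle gives:
assert circulate(["g1", "g2", "g3", "g4"], 1) == ["g4", "g1", "g2", "g3"]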
In the data processing method, in response to the pulse signal in the current operation cycle, for each multiplication unit in each operation unit, a target feature map data group and the target weight data group in the tail register are input to the multiplier of the multiplication unit to obtain the dot product data group of the operation unit under that pulse signal; the dot product matrix of each operation unit is determined based on all of its dot product data groups in the current operation cycle; the output result corresponding to each operation unit column is determined; a rearrangement signal is determined based on the current operation cycle; and the output results corresponding to the operation unit columns are rearranged based on the rearrangement signal to obtain the target result. In this method, the target weight data group in the tail register is transmitted to the tail register, in response to the previous pulse signal, by the previous register connected in series with the tail register; that is, the weight matrix is loaded through the serially connected registers, so that each multiplication unit needs only one weight input end and does not include a data selector. This reduces the consumption of logic resources and the number of signal lines, simplifies the wiring of the convolutional neural network acceleration circuit, and allows the circuit to be used in small programmable logic arrays and application-specific integrated circuits.
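As a rough, non-authoritative sketch of this dataflow, the Python fragment below models one operation cycle of a single operation unit and a column-wise combination; the summation used for the output operator and the index-based rearrangement are assumptions made for the sketch, since this summary does not fix those details.

import numpy as np


def dot_product_data_set(feature_groups, weight_groups):
    """One pulse signal: each multiplication unit multiplies its target feature map
    data group with the weight data group currently in its tail register."""
    return [float(np.dot(f, w)) for f, w in zip(feature_groups, weight_groups)]


def dot_product_matrix(feature_stream, weight_stream):
    """All pulse signals of one operation cycle yield the unit's dot product matrix."""
    return np.array([dot_product_data_set(f, w)
                     for f, w in zip(feature_stream, weight_stream)])


def column_output(dot_product_matrices):
    """Assumed combination: the output operator of a column sums the dot product
    matrices of the operation units in that column."""
    return sum(dot_product_matrices)


def rearrange(column_outputs, rearrangement_signal):
    """Assumed rearrangement: reorder the per-column outputs according to the signal."""
    return [column_outputs[i] for i in rearrangement_signal]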
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least a part of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a data processing apparatus. The implementation scheme by which the data processing apparatus solves the problem is similar to that described for the above method; therefore, for the specific limitations in the embodiments of the data processing apparatus provided below, reference may be made to the limitations on the data processing method above, which are not repeated here.
In some embodiments, as shown in fig. 11, there is provided a data processing apparatus comprising: a processing unit 1101 and a multiplier array 1102;
the multiplier array 1102 includes: a rearrangement data selector MUX-P, a plurality of output operators S, and a plurality of operation units X distributed in an array; wherein the rearrangement data selector MUX-P is connected with the plurality of output operators S; the plurality of output operators S correspond one-to-one to the plurality of operation unit columns, and the plurality of operation units in each operation unit column are connected with the corresponding output operator S; the operation units in each operation unit row are connected in series end to end, and each operation unit includes a plurality of multiplication units connected in parallel; each multiplication unit C in each operation unit X includes a multiplier Mul and a plurality of registers R connected in series end to end, and the tail register of the plurality of registers is connected with the multiplier Mul.
A processing unit 1101, configured to, in response to the pulse signal in the current operation cycle, for each multiplication unit in each operation unit, input a target feature map data set in the feature map matrix to the multiplier Mul of the multiplication unit; wherein the pulse signal is sent according to a preset pulse period;
a tail register of each multiplication unit in each operation unit X, for inputting the target weight data set in the tail register of the multiplication unit to the multiplier of the multiplication unit; wherein the target weight data set in the tail register is transmitted to the tail register, in response to the previous pulse signal, by the previous register connected in series with the tail register;
the multiplier Mul of each multiplication unit in each operation unit X is used for calculating a target characteristic diagram data group and a target weight data group to obtain a dot product data group of the operation unit under a pulse signal;
each operation unit X is used for determining a dot product matrix of the operation unit X based on all dot product data sets of the operation unit under a plurality of pulse signals included in the current operation period;
an output operator S corresponding to each operation unit column, for determining an output result corresponding to the operation unit column based on a plurality of dot product matrixes corresponding one-to-one to a plurality of operation units in the operation unit column; wherein the operation unit column comprises a plurality of operation units distributed in a column;
a processing unit 1101, further configured to determine a rearrangement signal based on the current operation cycle;
and the rearrangement data selector MUX-P is used for rearranging the output result corresponding to each operation unit column based on the rearrangement signal to obtain a target result.
In some embodiments, each register in each operation unit is configured to transmit the weight data set in the register to the next register connected in series with the register in response to the pulse signal in the current operation cycle.
In some embodiments, for a first arithmetic unit and a second arithmetic unit of two arithmetic units in series, a tail register in the first arithmetic unit is in series with a corresponding head register in the second arithmetic unit; when the first arithmetic unit is in the tail arithmetic unit column, the second arithmetic unit is in the head arithmetic unit column;
and the tail register in the first operation unit is used for transmitting the weight data group in the tail register in the first operation unit to the corresponding head register in the second operation unit.
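A minimal sketch (hypothetical helper name; illustrative only) of the per-pulse shift along such a chain of serially connected registers, where the group pushed out of the tail register is what the weight data selection unit can feed back to the head register:

def pulse_shift(register_chain, incoming_group):
    """register_chain[0] is the head register of the head operation unit and
    register_chain[-1] is the tail register of the tail operation unit.
    One pulse signal shifts every weight data group one register toward the tail;
    the group leaving the tail register is returned for possible recirculation."""
    pushed_out = register_chain[-1]
    return [incoming_group] + register_chain[:-1], pushed_out


chain = ["w0", "w1", "w2", "w3"]
chain, out = pulse_shift(chain, "w_new")
assert chain == ["w_new", "w0", "w1", "w2"] and out == "w3"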
In some embodiments, multiplier array 1102 further includes: the device comprises a storage unit CA and a plurality of weight data selection units MUX-Q, wherein the weight data selection units MUX-Q correspond to the operation unit rows one by one; the storage unit CA is off-chip storage or on-chip cache.
For the weight data selection unit MUX-Q corresponding to each operation unit row, a first input end of the weight data selection unit MUX-Q is connected with a weight output end of a tail operation unit of the corresponding operation unit row, a second input end of the weight data selection unit MUX-Q is connected with the storage unit CA, and an output end of the weight data selection unit MUX-Q is connected with a weight input end of a head operation unit of the corresponding operation unit row; the weight data selection unit MUX-Q comprises a signal input end;
the processing unit 1101 is further configured to obtain a first weight loading signal, and input the first weight loading signal to a signal input end of the weight data selecting unit MUX-Q;
under the condition that the signal input end of the weight data selection unit MUX-Q is a first weight loading signal, the weight data selection unit MUX-Q is used for outputting a weight matrix set output by the storage unit;
and the storage unit CA is used for sequentially inputting the plurality of weight matrixes included by the weight matrix set to the operation unit row based on the plurality of pulse signals, so that each operation unit in the operation unit row comprises a corresponding weight matrix.
In some embodiments, the processing unit 1101 is further configured to obtain a second weight loading signal, and input the second weight loading signal to a signal input terminal of the weight data selecting unit MUX-Q;
and under the condition that the signal input end of the weight data selection unit MUX-Q receives the second weight loading signal, the weight data selection unit MUX-Q is used for inputting the weight matrix output by the tail operation unit of the corresponding operation unit row to the head operation unit of the corresponding operation unit row.
In some embodiments, the weight data selection unit MUX-Q includes a plurality of weight data selectors MUX-q, and the number of weight data selectors MUX-q in the weight data selection unit MUX-Q is the same as the number of multiplication units C in one operation unit X.
For each multiplication unit row, the weight output end of the tail register of the tail multiplication unit C of the multiplication unit row is connected with the first input end of the weight data selector MUX-q corresponding to the multiplication unit row, so that the weight matrix output by the tail multiplication unit C of the multiplication unit row is input to the first input end of the weight data selector MUX-q of the multiplication unit row. A second input of the weight data selector MUX-q is connected to the memory unit CA.
The various modules in the data processing apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 12. The computer equipment comprises a processor, a memory, an Input/Output (I/O) interface, an Input device, a display unit and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the input device and the display unit are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and an external device, and the communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement the steps of the data processing method described above.
It will be appreciated by those skilled in the art that the configuration shown in fig. 12 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In some embodiments, a computer device is provided, comprising a memory in which a computer program is stored and a processor for implementing the steps of the above-described data processing method when the processor executes the computer program.
In some embodiments, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned data processing method.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps of the above-described data processing method.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing associated hardware; the computer program may be stored in a computer-readable storage medium, as shown in fig. 13. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases, and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., without limitation.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered as falling within the scope of this specification.
The above embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (11)

1. A data processing method, comprising:
in response to a pulse signal in a current operation cycle, for each multiplication unit in each operation unit, inputting a target feature map data set in a feature map matrix to a multiplier of the multiplication unit, and inputting a target weight data set in a tail register of the multiplication unit to the multiplier of the multiplication unit to obtain a dot product data set of the operation unit under the pulse signal; wherein the pulse signal is sent according to a preset pulse period; the target weight data set in the tail register is transmitted to the tail register, in response to a previous pulse signal, by a previous register connected in series with the tail register;
determining a dot product matrix of the operation unit based on all dot product data sets of the operation unit under a plurality of pulse signals included in the current operation period;
for each operation unit column, determining an output result corresponding to the operation unit column based on a plurality of dot product matrixes in one-to-one correspondence with a plurality of operation units in the operation unit column; wherein the operation unit column comprises a plurality of operation units distributed in a column;
and determining a rearrangement signal based on the current operation cycle, and rearranging the output result corresponding to each operation unit column based on the rearrangement signal to obtain a target result.
2. The method of claim 1, further comprising:
and responding to the pulse signal in the current operation period, and transmitting the weight data group in each register to the next register connected with the register in series.
3. The method of claim 2, wherein transferring the set of weight data in the register to a next register in series with the register comprises:
when the register is a tail register in the multiplication unit, transmitting the weight data group in the register to a head register in the next multiplication unit; wherein the tail register in the multiplication unit is connected in series with the head register in the next multiplication unit; or,
when the register is a tail register in a tail arithmetic unit column, transmitting the weight data group in the register to a head register in a head arithmetic unit column; wherein the tail register in the tail arithmetic unit column is connected in series with the head register in the head arithmetic unit column.
4. The method of claim 1, further comprising:
for each arithmetic unit row, inputting the weight matrix output by the tail arithmetic unit of a plurality of arithmetic units in the arithmetic unit row to the first input end of the weight data selection unit corresponding to the arithmetic unit row; the arithmetic unit row comprises a plurality of arithmetic units distributed in a row;
determining a weight matrix set corresponding to the operation unit row in a plurality of pre-stored weight matrix sets; inputting a plurality of weight matrices in the set of weight matrices to a second input of the weight data selection unit;
under the condition that the signal input end of the weight data selection unit is a first weight loading signal, outputting the weight matrix set through the weight data selection unit;
and sequentially inputting a plurality of weight matrixes included in the weight matrix set to the operation unit rows on the basis of a plurality of pulse signals, so that each operation unit in the operation unit rows comprises a corresponding weight matrix.
5. The method of claim 4, wherein after the weight matrix set comprising a plurality of weight matrices is sequentially input to the operation unit row based on a plurality of pulse signals so that each operation unit in the operation unit row comprises a corresponding weight matrix, the method further comprises:
and inputting the acquired second weight loading signal to a signal input end of the weight data selection unit so that the weight data selection unit outputs the weight matrix output by the tail operation unit.
6. A data processing apparatus comprising a processing unit and a multiplier array;
the multiplier array comprises: the device comprises a rearranged data selector, a plurality of output operators and a plurality of operation units distributed in an array; wherein the rearranged data selector is connected with the plurality of output operators; the plurality of output arithmetic devices correspond to the plurality of arithmetic unit columns one by one, and the plurality of arithmetic units in each arithmetic unit column are connected with the corresponding output arithmetic devices; the operation units in each operation unit row are connected in series end to end, and each operation unit comprises a plurality of multiplication units connected in parallel; for each multiplication unit in each operation unit, the multiplication unit comprises a multiplier and a plurality of registers connected end to end in series, and a tail register in the plurality of registers is connected with the multiplier;
the processing unit is used for responding to a pulse signal in the current operation cycle, and for each multiplication unit in each operation unit, inputting a target characteristic diagram data set in a characteristic diagram matrix to a multiplier of the multiplication unit; wherein, the pulse signal is sent according to a preset pulse period;
a tail register of each multiplication unit in each operation unit, configured to input the target weight data set in the tail register of the multiplication unit to the multiplier of the multiplication unit; wherein the target weight data set in the tail register is transmitted to the tail register, in response to a previous pulse signal, by a previous register connected in series with the tail register;
the multiplier of each multiplication unit in each operation unit is used for operating on the target feature map data set and the target weight data set to obtain a dot product data set of the operation unit under the pulse signal;
each arithmetic unit is used for determining a dot product matrix of the arithmetic unit based on all dot product data sets of the arithmetic unit under a plurality of pulse signals included in the current arithmetic cycle;
the output operator corresponding to each operation unit column is used for determining an output result corresponding to the operation unit column based on a plurality of dot product matrixes in one-to-one correspondence with a plurality of operation units in the operation unit column; wherein the operation unit column comprises a plurality of operation units distributed in a column;
the processing unit is further configured to determine a rearrangement signal based on the current operation cycle;
and the rearrangement data selector is used for rearranging the output result corresponding to each operation unit column based on the rearrangement signal to obtain a target result.
7. The apparatus of claim 6, wherein each register in each operation unit is configured to transmit the weight data set in the register to a next register connected in series with the register in response to the pulse signal in the current operation cycle.
8. The apparatus of claim 7, wherein for a first arithmetic unit and a second arithmetic unit of two arithmetic units in series, a tail register in the first arithmetic unit is in series with a corresponding head register in the second arithmetic unit; when the first arithmetic unit is in the tail arithmetic unit column, the second arithmetic unit is in the head arithmetic unit column;
and the tail register in the first arithmetic unit is used for transmitting the weight data group in the tail register in the first arithmetic unit to the corresponding head register in the second arithmetic unit.
9. The apparatus of claim 6, wherein the multiplier array further comprises: the device comprises a storage unit and a plurality of weight data selection units, wherein the weight data selection units correspond to a plurality of operation unit rows one to one;
for the weight data selection unit corresponding to each operation unit row, the first input end of the weight data selection unit is connected with the weight output end of the tail operation unit of the corresponding operation unit row, the second input end of the weight data selection unit is connected with the storage unit, and the output end of the weight data selection unit is connected with the weight input end of the head operation unit of the corresponding operation unit row; the weight data selection unit comprises a signal input end;
the processing unit is further configured to obtain a first weight loading signal, and input the first weight loading signal to a signal input end of the weight data selection unit;
the weight data selection unit is used for outputting the weight matrix set output by the storage unit under the condition that the signal input end of the weight data selection unit is a first weight loading signal;
the storage unit is configured to sequentially input the multiple weight matrices included in the weight matrix set to the operation unit row based on multiple pulse signals, so that each operation unit in the operation unit row includes a corresponding weight matrix.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 5.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202211128343.7A 2022-09-16 2022-09-16 Data processing method and device, computer equipment and computer readable storage medium Active CN115469826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211128343.7A CN115469826B (en) 2022-09-16 2022-09-16 Data processing method and device, computer equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211128343.7A CN115469826B (en) 2022-09-16 2022-09-16 Data processing method and device, computer equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN115469826A CN115469826A (en) 2022-12-13
CN115469826B true CN115469826B (en) 2023-04-07

Family

ID=84333719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211128343.7A Active CN115469826B (en) 2022-09-16 2022-09-16 Data processing method and device, computer equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115469826B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SU1509934A1 (en) * 1987-11-17 1989-09-23 Институт Технической Кибернетики Ан Бсср Optimum filter
US5745398A (en) * 1994-11-08 1998-04-28 Sgs-Thomson Microelectronics S.A. Method for the implementation of modular multiplication according to the Montgomery method
CN108629406A (en) * 2017-03-24 2018-10-09 展讯通信(上海)有限公司 Arithmetic unit for convolutional neural networks
CN109937416A (en) * 2017-05-17 2019-06-25 谷歌有限责任公司 Low time delay matrix multiplication component
US10417460B1 (en) * 2017-09-25 2019-09-17 Areanna Inc. Low power analog vector-matrix multiplier
CN111291323A (en) * 2020-02-17 2020-06-16 南京大学 Matrix multiplication processor based on systolic array and data processing method thereof
WO2020196407A1 (en) * 2019-03-28 2020-10-01 株式会社エヌエスアイテクス Convolutional computation device
CN111897579A (en) * 2020-08-18 2020-11-06 腾讯科技(深圳)有限公司 Image data processing method, image data processing device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3085517B1 (en) * 2018-08-31 2020-11-13 Commissariat Energie Atomique CALCULATOR ARCHITECTURE OF A CONVOLUTION LAYER IN A CONVOLUTIONAL NEURON NETWORK


Also Published As

Publication number Publication date
CN115469826A (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN107689948B (en) Efficient data access management device applied to neural network hardware acceleration system
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN109800876B (en) Data operation method of neural network based on NOR Flash module
CN109074845A (en) Matrix multiplication and its use in neural network in memory
CN107704916A (en) A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
CN107909148A (en) For performing the device of the convolution algorithm in convolutional neural networks
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN102947818A (en) Neural processing unit
CN108170640B (en) Neural network operation device and operation method using same
CN111291323A (en) Matrix multiplication processor based on systolic array and data processing method thereof
CN110580519B (en) Convolution operation device and method thereof
US20230041850A1 (en) Adaptive matrix multiplication accelerator for machine learning and deep learning applications
CN112732222A (en) Sparse matrix accelerated calculation method, device, equipment and medium
CN113486298B (en) Model compression method based on Transformer neural network and matrix multiplication module
JPH07117948B2 (en) Computer equipment
KR20220154764A (en) Inference engine circuit architecture
CN115469826B (en) Data processing method and device, computer equipment and computer readable storage medium
CN112966729B (en) Data processing method and device, computer equipment and storage medium
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN113516225A (en) Neural network computing device with systolic array
CN115424038A (en) Multi-scale image processing method, system and device and computer equipment
CN109343826B (en) Reconfigurable processor operation unit for deep learning
CN114997392B (en) Architecture and architectural methods for neural network computing
WO2024032220A1 (en) In-memory computing circuit-based neural network compensation method, apparatus and circuit
WO2023116431A1 (en) Matrix calculation method, chip, and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant