CN111401522B - Pulsation array variable speed control method and variable speed pulsation array micro-frame system - Google Patents

Publication number
CN111401522B
Authority
CN
China
Prior art keywords
int
processing
precision
unit
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010171246.0A
Other languages
Chinese (zh)
Other versions
CN111401522A (en)
Inventor
宋卓然
梁晓峣
景乃锋
官惠泽
江昭明
吴飞洋
江子山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010171246.0A
Publication of CN111401522A
Application granted
Publication of CN111401522B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30094 Condition code generation, e.g. Carry, Zero flag
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F9/30134 Register stacks; shift registers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a systolic array variable speed control method and a variable speed systolic array micro-frame system. The systolic array comprises at least two processing units, and the method comprises the following steps: obtaining the to-be-converted precision modes of all the processing units in the systolic array; determining the longest processing period required by the processing units across all the to-be-converted precision modes, and taking the longest processing period as the next run period of the systolic array; and setting a number of blocking periods after the next actual processing period of the processing units, so that the duration of the next processing period of every processing unit equals the duration of the next run period of the systolic array. The invention enables flexible switching among multiple precisions, and the processing units in the systolic array may operate at different precisions within each fixed period, thereby accelerating the inference process of convolutional neural networks.

Description

Pulsation array variable speed control method and variable speed pulsation array micro-frame system
Technical Field
The invention relates to the technical field of neural networks, and in particular to a variable speed control method for a pulsation array (that is, a systolic array) and to a variable speed systolic array micro-frame system.
Background
Many hardware companies now develop neural network accelerators to improve inference performance. Systolic arrays are an efficient way to compute convolutions because they reuse fetched data as much as possible.
Google proposes a TPU architecture based on a classical systolic array, which consists of a large number of processing units. Each processing unit can hold weights or feature values stationary, and sliding the feature values or the weights between processing units reduces memory bandwidth and achieves high energy efficiency. Several neural network accelerators support mixed-precision computation to further increase computational efficiency. For example, a bit-serial multiplier can support 1-bit to 8-bit multiplication for reconfigurable computation; on the Xilinx PYNQ-Z1 board, its peak performance can reach 6.5 TOPS. NVIDIA proposes the Turing architecture, which supports INT4, INT8 and INT32 integer computation, but operations at these precisions cannot run simultaneously. Bitfusion consists of a two-dimensional systolic array of fusion units that can implement INT2, INT4, INT8 and INT16 operations, but all fusion units inside the systolic array must be set to the same precision, which reduces the flexibility of mixed-precision computation.
Therefore, existing systolic arrays cannot carry out the inference process of a mixed-precision convolutional neural network.
Disclosure of Invention
The technical problem to be solved by the invention is that existing neural network accelerators cannot carry out the inference process of a mixed-precision convolutional neural network, and existing systolic arrays cannot be speed-controlled, so the flexibility of mixed-precision computation is low.
To solve this technical problem, the invention provides a systolic array variable speed control method, wherein the systolic array comprises at least two processing units;
the method comprises the following steps:
obtaining the to-be-converted precision modes of all the processing units in the systolic array;
determining the longest processing period required by the processing units across all the to-be-converted precision modes, and taking the longest processing period as the next run period of the systolic array;
setting a number of blocking periods after the next actual processing period of the processing units, so that the duration of the next processing period of every processing unit equals the duration of the next run period of the systolic array;
wherein each blocking period processed by a processing unit means that the processing unit is blocked for one fixed period, and the number of blocking periods is a natural number.
Preferably, the to-be-converted precision modes comprise an INT N × INT N mode, an INT N × INT 2N mode, an INT 2N × INT N mode and an INT 2N × INT 2N mode.
Preferably, if the precision modes of all the processing units in the systolic array are the INT x × INT y mode, and the actual processing period in which a processing unit processes the INT x × INT y mode is n fixed periods;
when the to-be-converted precision modes are all the INT x' × INT y' mode, the next run period of the systolic array is (x'/x) × (y'/y) × n fixed periods, and the number of blocking periods set in the next processing period of every processing unit is zero.
Preferably, if the precision modes of all the processing units in the systolic array are the INT x × INT y mode, and the actual processing period in which a processing unit processes the INT x × INT y mode is n fixed periods;
when the to-be-converted precision modes comprise an INT x' × INT y' mode and an INT x'' × INT y'' mode, the values n, (x'/x) × (y'/y) × n and (x''/x) × (y''/y) × n are compared;
if n is greater than both (x'/x) × (y'/y) × n and (x''/x) × (y''/y) × n, the next run period of the systolic array is n fixed periods, n - (x'/x) × (y'/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x' × INT y' mode, and n - (x''/x) × (y''/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x'' × INT y'' mode;
if (x'/x) × (y'/y) × n is greater than both n and (x''/x) × (y''/y) × n, the next run period of the systolic array is (x'/x) × (y'/y) × n fixed periods, zero blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x' × INT y' mode, and (x'/x) × (y'/y) × n - (x''/x) × (y''/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x'' × INT y'' mode;
if (x''/x) × (y''/y) × n is greater than both n and (x'/x) × (y'/y) × n, the next run period of the systolic array is (x''/x) × (y''/y) × n fixed periods, (x''/x) × (y''/y) × n - (x'/x) × (y'/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x' × INT y' mode, and zero blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x'' × INT y'' mode.
To solve the above technical problem, the invention further provides a variable speed systolic array micro-frame system, which comprises at least one processing unit group and a global shared cache module communicatively connected to all the processing unit groups;
the processing unit group comprises a cooperative control unit, together with an input buffer unit, a variable speed systolic array, an output buffer unit, an activation unit and a prediction unit that are each connected to the cooperative control unit, wherein the input buffer unit, the variable speed systolic array, the output buffer unit, the activation unit and the prediction unit are connected in sequence;
the cooperative control unit is used for generating a control instruction based on the mask cache signal output by the prediction unit and on the systolic array variable speed control method, and sending the control instruction to the variable speed systolic array.
Preferably, the input buffer unit is configured to obtain feature value data and weight data from the global shared cache module, perform precision conversion on the feature value data, and transmit the precision-converted feature value data and the weight data to the variable speed systolic array.
Preferably, the variable speed systolic array is configured to perform convolution calculation on the feature value data and the weight data according to the control instruction to obtain partial sums, and to transmit the partial sums to the output buffer unit.
Preferably, the output buffer unit is configured to store the partial sums, accumulate all the received partial sums to obtain a final convolution value, and send the final convolution value to the activation unit.
Preferably, the activation unit is configured to perform an activation operation on the final convolution value to obtain activation data, and transmit the activation data to the prediction unit.
Preferably, the prediction unit is configured to perform a pooling operation on the activation data to obtain pooled data, obtain the feature value data of the next convolution layer and a mask cache signal based on the pooled data, transmit the mask cache signal to the cooperative control unit, and transmit the feature value data of the next convolution layer to the global shared cache module.
One or more embodiments of the above-described solution may have the following advantages or benefits compared to the prior art:
By applying the systolic array variable speed control method provided by the embodiments of the invention, variable speed control allows the systolic array to switch flexibly among multiple precisions, and the processing units in the systolic array may operate at different precisions within each fixed period, thereby accelerating the inference process of convolutional neural networks. The invention supports mixed precision not only for the weights but also for the feature values, which improves the flexibility of mixed-precision computation.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention, without limitation to the invention. In the drawings:
FIG. 1 is a flow chart of the systolic array variable speed control method according to the first embodiment of the present invention;
FIG. 2 is a schematic diagram of the variable speed systolic array micro-frame system according to the second embodiment of the present invention;
FIG. 3 is a schematic diagram showing the specific structure of a processing unit in the second embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way in which the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as no conflict arises, the embodiments of the present invention and the features of each embodiment may be combined with one another, and the resulting technical solutions all fall within the protection scope of the present invention.
Existing neural networks have become key technologies to address various problems, such as image recognition, natural language processing, and biomedical issues.
Example 1
To solve the technical problems in the prior art, this embodiment of the invention provides a systolic array variable speed control method.
FIG. 1 is a flow chart of the systolic array variable speed control method according to this embodiment. Referring to FIG. 1, the systolic array in the method comprises at least two processing units, and the processing units support mixed-precision processing. The method comprises the following steps.
Step S101: obtain the to-be-converted precision modes of all processing units in the systolic array.
The time a processing unit needs to process one precision mode is defined as one processing period. Because the time needed generally differs between precision modes, the processing period generally changes when the precision mode processed by a processing unit changes. Unifying the processing periods of all processing units is the key to varying the speed of the systolic array; therefore, before the speed is changed, the to-be-converted precision mode of every processing unit for the next run period of the systolic array must be obtained. Preferably, all processing units in the systolic array of this embodiment support mixed operation in an INT N × INT N precision mode, an INT N × INT 2N precision mode, an INT 2N × INT N precision mode and an INT 2N × INT 2N precision mode, so the to-be-converted precision modes of the processing units comprise these four modes.
Step S102: determine the longest processing period required by the processing units across all the to-be-converted precision modes, and take it as the next run period of the systolic array.
After the to-be-converted precision modes of all processing units in the systolic array have been obtained, the processing period each processing unit needs for its to-be-converted precision mode is determined, the durations of these processing periods are compared, and the longest one is selected as the next run period of the systolic array. In this embodiment, variable speed control requires the run period of the systolic array to be synchronized with the processing periods of all processing units: when the systolic array completes one run period, every processing unit completes one processing period, and when the systolic array enters its next run period, every processing unit enters its next processing period. Thus, once the next run period of the systolic array is determined, the next processing period of every processing unit is determined as well.
Step S103: set a number of blocking periods after the next actual processing period of the processing units, so that the duration of the next processing period of every processing unit equals the duration of the next run period of the systolic array.
Because the actual processing period differs between precision modes, unifying the next processing period of all processing units requires blocking periods to be set in the processing periods of some processing units. Specifically, taking the next run period of the systolic array as the reference, the actual processing period that each processing unit needs for its to-be-converted precision mode is compared with the next run period in turn; if it is shorter, a number of blocking periods are appended after the actual processing period so that the total processing period equals the next run period. Each blocking period processed by a processing unit means that the processing unit is blocked for one fixed period, and the number of blocking periods set in each processing period is a natural number. The duration of the fixed period can be set as the situation requires.
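As a non-limiting illustration, steps S101 to S103 can be sketched in Python as follows. The table PERIODS_PER_MODE (1, 2, 2 and 4 fixed periods per mode), the function name and the variable names are assumptions made for this sketch only and are not taken from the embodiment.

```python
from typing import Dict, List, Tuple

# Assumed actual processing periods per precision mode, in fixed periods.
# These values are illustrative assumptions, not figures from the embodiment.
PERIODS_PER_MODE: Dict[str, int] = {
    "INT N x INT N": 1,
    "INT N x INT 2N": 2,
    "INT 2N x INT N": 2,
    "INT 2N x INT 2N": 4,
}


def plan_next_run_period(modes: List[str]) -> Tuple[int, List[int]]:
    """Return (next run period, blocking periods per processing unit)."""
    if len(modes) < 2:
        raise ValueError("the systolic array comprises at least two processing units")
    # S101: look up the actual processing period of each to-be-converted mode.
    actual = [PERIODS_PER_MODE[mode] for mode in modes]
    # S102: the longest processing period becomes the next run period.
    run_period = max(actual)
    # S103: pad every shorter unit with blocking periods up to the run period.
    blocking = [run_period - periods for periods in actual]
    return run_period, blocking


run_period, blocking = plan_next_run_period(
    ["INT N x INT N", "INT 2N x INT 2N", "INT N x INT 2N"]
)
print(run_period, blocking)  # 4 [3, 0, 2]
```

In this sketch every blocking count is a non-negative integer by construction, which matches the requirement that the number of blocking periods is a natural number.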
To further describe the systolic array variable speed control method of this embodiment, the following two transition cases are described for to-be-converted precision modes comprising the INT N × INT N, INT N × INT 2N, INT 2N × INT N and INT 2N × INT 2N precision modes.
Transition case one: suppose the precision mode of all processing units in the current systolic array is the INT x × INT y mode, and the actual processing period in which a processing unit processes the INT x × INT y mode is n fixed periods. When the to-be-converted precision mode of all processing units in the systolic array is the INT x' × INT y' mode, the next run period of the systolic array is set to (x'/x) × (y'/y) × n fixed periods, and the number of blocking periods in the next processing period of every processing unit is set to zero.
This transition case applies to a transition between any two of the INT N × INT N, INT N × INT 2N, INT 2N × INT N and INT 2N × INT 2N precision modes.
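The scaling rule of transition case one can be checked with a short Python sketch. The function name next_run_period and the concrete bit widths in the example calls (N = 4) are assumptions chosen for illustration.

```python
from fractions import Fraction


def next_run_period(x: int, y: int, n: int, x_new: int, y_new: int) -> int:
    """Run period, in fixed periods, after every unit switches from
    the INT x × INT y mode (n fixed periods) to the INT x' × INT y' mode."""
    periods = Fraction(x_new, x) * Fraction(y_new, y) * n
    if periods.denominator != 1:
        raise ValueError("scaled period is not a whole number of fixed periods")
    return int(periods)


# INT4 × INT4 (n = 1) to INT8 × INT8: (8/4) × (8/4) × 1 fixed periods.
print(next_run_period(4, 4, 1, 8, 8))  # 4
# INT8 × INT8 (n = 4) to INT4 × INT8: (4/8) × (8/8) × 4 fixed periods.
print(next_run_period(8, 8, 4, 4, 8))  # 2
```

Because every processing unit moves to the same mode, no unit finishes early and no blocking periods are needed.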
Transition case two:
Suppose the precision mode of all processing units in the current systolic array is the INT x × INT y mode, and the actual processing period in which a processing unit processes the INT x × INT y mode is n fixed periods. When the to-be-converted precision modes comprise an INT x' × INT y' mode and an INT x'' × INT y'' mode, the values n, (x'/x) × (y'/y) × n and (x''/x) × (y''/y) × n are compared, which gives the following cases:
If n is greater than both (x'/x) × (y'/y) × n and (x''/x) × (y''/y) × n, the next run period of the systolic array is n fixed periods; n - (x'/x) × (y'/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x' × INT y' mode, and n - (x''/x) × (y''/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x'' × INT y'' mode.
If (x'/x) × (y'/y) × n is greater than both n and (x''/x) × (y''/y) × n, the next run period of the systolic array is (x'/x) × (y'/y) × n fixed periods; zero blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x' × INT y' mode, and (x'/x) × (y'/y) × n - (x''/x) × (y''/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x'' × INT y'' mode.
If (x''/x) × (y''/y) × n is greater than both n and (x'/x) × (y'/y) × n, the next run period of the systolic array is (x''/x) × (y''/y) × n fixed periods; (x''/x) × (y''/y) × n - (x'/x) × (y'/y) × n blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x' × INT y' mode, and zero blocking periods are set in the next processing period of those processing units whose to-be-converted precision mode is the INT x'' × INT y'' mode.
This transition case applies when any one of the INT N × INT N, INT N × INT 2N, INT 2N × INT N and INT 2N × INT 2N precision modes is converted into any two precision modes.
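The three branches of transition case two reduce to taking the maximum of the three candidate values and padding each group of processing units up to it. A minimal Python sketch follows; the names Mode, scaled and plan_mixed_transition, and the bit widths used in the example calls, are assumptions for illustration only.

```python
from fractions import Fraction
from typing import Dict, Tuple

Mode = Tuple[int, int]  # (x, y) bit widths of an INT x × INT y mode


def scaled(current: Mode, n: int, target: Mode) -> Fraction:
    """Actual processing period of `target`, in fixed periods."""
    return Fraction(target[0], current[0]) * Fraction(target[1], current[1]) * n


def plan_mixed_transition(
    current: Mode, n: int, target_a: Mode, target_b: Mode
) -> Tuple[int, Dict[Mode, int]]:
    """Return (next run period, blocking periods per target mode)."""
    a = scaled(current, n, target_a)
    b = scaled(current, n, target_b)
    # The largest of n and the two scaled periods is the next run period.
    run = max(Fraction(n), a, b)
    blocking = {target_a: run - a, target_b: run - b}
    if any(v.denominator != 1 for v in (run, *blocking.values())):
        raise ValueError("periods must be whole numbers of fixed periods")
    return int(run), {mode: int(v) for mode, v in blocking.items()}


# From INT8 × INT8 (n = 4) to INT4 × INT4 and INT4 × INT8: n is the largest.
print(plan_mixed_transition((8, 8), 4, (4, 4), (4, 8)))
# (4, {(4, 4): 3, (4, 8): 2})

# From INT4 × INT4 (n = 1) to INT4 × INT8 and INT8 × INT8: the second target is the largest.
print(plan_mixed_transition((4, 4), 1, (4, 8), (8, 8)))
# (4, {(4, 8): 2, (8, 8): 0})
```

The group of units whose mode defines the run period always receives zero blocking periods, which agrees with the second and third branches above.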
It should be noted that the INT N × INT N, INT N × INT 2N, INT 2N × INT N and INT 2N × INT 2N precision modes can be converted into one another arbitrarily; the details follow steps S101 to S103 and transition cases one and two, and are not repeated here.
With the systolic array variable speed control method provided by this embodiment, variable speed control allows the systolic array to switch flexibly among multiple precisions, and the processing units in the systolic array may operate at different precisions within each fixed period, thereby accelerating the inference process of convolutional neural networks. The invention supports mixed precision not only for the weights but also for the feature values, which improves the flexibility of mixed-precision computation.
Example 2
To solve the technical problems in the prior art, this embodiment of the invention provides a variable speed systolic array micro-frame system.
FIG. 2 is a schematic diagram of the variable speed systolic array micro-frame system according to the second embodiment of the invention. Referring to FIG. 2, the system comprises a plurality of processing unit groups and a global shared cache module communicatively connected to each of the processing unit groups.
The processing unit group comprises a cooperative control unit, together with an input buffer unit, a variable speed systolic array, an output buffer unit, an activation unit and a prediction unit that are each connected to the cooperative control unit, wherein the input buffer unit, the variable speed systolic array, the output buffer unit, the activation unit and the prediction unit are connected in sequence.
The global shared cache module stores the feature value data and the weight data, which can be transmitted to the input buffer unit through the Im2col/pack engine.
The cooperative control unit generates a control instruction based on the mask cache signal output by the prediction unit and on the systolic array variable speed control method, and sends the control instruction to the variable speed systolic array. The systolic array variable speed control method is the same as the method provided in the first embodiment, to which reference may be made; that is, the to-be-converted precision modes of all processing units in the systolic array are obtained from the mask cache signal. The specific way in which the control instruction is generated from the mask cache signal and the variable speed control method is not repeated here.
The input buffer unit connects the global shared cache module to the variable speed systolic array. Under the control of the cooperative control unit, the input buffer unit obtains feature value data from the global shared cache module, converts the feature value data to INT N or INT 2N precision according to the mask cache signal, and then transmits the precision-converted feature value data to the leftmost processing units of the variable speed systolic array to start the convolution operation.
It should be noted that when the weight data are stored in the global shared cache module, their storage format and precision have already been divided into INT N or INT 2N. The way the weight data are stored in the input buffer unit is therefore not determined by the mask cache signal but follows the way they are stored in the global shared cache module. In addition, the mask cache signal acquires the precision characteristics of the weight data before the convolution of each layer starts, so the mask cache signal can determine the working mode of each processing unit from the precision characteristics of both the feature value data and the weight data.
It should also be noted that the feature value data are arranged in the input buffer in a skewed im2col layout: the K×K data of each convolution window (K being the size of the convolution kernel) are aligned into one column. Each row of the input buffer starts counting one period later than the previous row, which produces the accumulation of partial sums as the data flow through the array.
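The skewed im2col layout can be pictured with a small Python sketch. It assumes stride 1, no padding and a single input channel, and the function names im2col_columns and skew_rows are hypothetical; the embodiment does not specify these details.

```python
from typing import List, Optional


def im2col_columns(fmap: List[List[int]], k: int) -> List[List[int]]:
    """One list of K*K values per convolution window (stride 1, no padding)."""
    height, width = len(fmap), len(fmap[0])
    return [
        [fmap[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(height - k + 1)
        for j in range(width - k + 1)
    ]


def skew_rows(columns: List[List[int]]) -> List[List[Optional[int]]]:
    """Row r of the buffer starts r periods later than row 0.

    rows[r][t] is the value that buffer row r presents at period t;
    None marks a period in which that row has nothing to send.
    """
    n_rows, n_windows = len(columns[0]), len(columns)
    rows = []
    for r in range(n_rows):
        stream = [columns[c][r] for c in range(n_windows)]
        rows.append([None] * r + stream + [None] * (n_rows - 1 - r))
    return rows


feature_map = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in skew_rows(im2col_columns(feature_map, 2)):
    print(row)
# [1, 2, 4, 5, None, None, None]
# [None, 2, 3, 5, 6, None, None]
# [None, None, 4, 5, 7, 8, None]
# [None, None, None, 5, 6, 8, 9]
```

Each convolution window occupies one diagonal of the printed layout, so the values belonging to the same window reach successive rows one period apart.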
Under the control of the cooperative control unit, the input buffer unit also obtains weight data from the global shared cache module. The weight data are stored in the input buffer according to their precision: low-precision data are stored in an N-bit-wide register, such as W00, while high-precision data are of INT 2N type and are therefore stored in a 2N-bit-wide register, such as W03. The weight data follow the same arrangement rule as the feature value data.
The variable speed systolic array mainly performs convolution calculation on the feature value data and the weight data according to the control instruction to obtain partial sums, and transmits the partial sums to the output buffer unit. The variable speed systolic array consists of a plurality of processing units that support mixed precision. In each processing period, the processing units in the leftmost column receive new feature value data from the input buffer, and after completing the operation of that processing period, each processing unit shifts its feature value data right to the processing unit at the next position. During the convolution calculation, the weight data stay resident in the processing units until the convolution of a complete feature map has been computed.
Fig. 3 is a schematic diagram showing a specific structure of a processing unit in the second embodiment of the present invention. Specifically, 3 registers are set in the processing unit: an 8N-bit P register for storing the partial and phase values; a 2N-bit W register for storing a weight fixed to 2 Nbit; and an F register for storing 2N-bits of the eigenvalue data of Nbit or 2 Nbit. The processing units support the mixed operation of an INT N precision mode, an INT N INT 2N precision mode, an INT 2N INT N precision mode and an INT 2N precision mode.
To further illustrate how the processing unit handles the INT N × INT N, INT N × INT 2N, INT 2N × INT N and INT 2N × INT 2N precision modes, the specific procedures are described below:
In the INT N × INT N mode, the processing unit works as a normal MAC unit: it takes the lower N bits of the W and F registers, multiplies them, adds the product to the partial sum in the upper 4N bits of the P register, and stores the result back into the upper 4N bits of the P register, completing the multiply-add operation of the INT N × INT N mode.
In the INT N × INT 2N mode, the processing unit takes 2 fixed cycles to multiply the INT N and INT 2N values and writes the result into the P register, as shown in cycle t and cycle t+1 of Fig. 3: at cycle t, the MAC unit reads the lower N bits of the W and F registers, multiplies them, and stores the result into the lower 4N bits of the P register; at cycle t+1, the MAC unit reads the lower N bits of the W register and the upper N bits of the F register, multiplies them, shifts the product left by N bits, accumulates it with the stage value stored in the lower 4N bits of the P register, and stores the sum back into the lower 4N bits of the P register; this value is then accumulated with the partial sum in the upper 4N bits and stored back into the P register to obtain the final multiply-add result.
In the INT 2N × INT N mode, the processing unit takes 2 fixed cycles to multiply the INT 2N and INT N values and writes the result into the P register, in the following steps: at cycle t, the MAC unit reads the lower N bits of the W and F registers, multiplies them, and stores the result into the lower 4N bits of the P register; at cycle t+2, the MAC unit reads the upper N bits of the W register and the lower N bits of the F register, multiplies them, shifts the product left by N bits, accumulates it with the stage value stored in the lower 4N bits of the P register, and stores the sum back into the lower 4N bits of the P register; this value is then accumulated with the partial sum in the upper 4N bits and stored back into the P register to obtain the final multiply-add result.
In the INT 2N × INT 2N mode, the processing unit takes 4 fixed cycles to multiply two INT 2N values and writes the result into the P register, as shown in Fig. 3: at cycle t, the MAC unit reads the lower N bits of the W and F registers, multiplies them, and stores the result into the lower 4N bits of the P register; at cycle t+1, the MAC unit reads the lower N bits of the W register and the upper N bits of the F register, multiplies them, shifts the product left by N bits, accumulates it with the stage value previously stored in the lower 4N bits of the P register, and stores the sum back into the P register; at cycle t+2, similarly to the previous cycle, the MAC unit reads the upper N bits of the W register and the lower N bits of the F register, multiplies them, shifts the product left by N bits, accumulates it with the stage value in the lower 4N bits of the P register, and stores the sum back into the P register; at cycle t+3, the MAC unit reads the upper N bits of the W and F registers, multiplies them, shifts the product left by 2N bits, accumulates it with the stage value in the lower 4N bits of the P register, then accumulates the result with the partial sum in the upper 4N bits and stores it back into the P register to obtain the final multiply-add result of the INT 2N × INT 2N mode.
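The four precision modes above all reuse a single N-bit by N-bit multiplier over one, two or four fixed cycles. A minimal behavioral sketch of that decomposition follows. It assumes unsigned operands, since the text does not state how signed values are handled, and it models the stage value and the partial sum as plain Python integers rather than as the two halves of the P register. The function name `pe_mac` and the mode labels are hypothetical.

```python
def pe_mac(w, f, partial_sum, n=4, mode="2Nx2N"):
    """Behavioral model of one multiply-accumulate in a processing unit.

    w, f        : unsigned operands held in the 2N-bit W and F registers (assumption)
    partial_sum : value held in the upper 4N bits of the P register
    mode        : "NxN", "Nx2N" (N-bit weight, 2N-bit feature value),
                  "2NxN" (2N-bit weight, N-bit feature value) or "2Nx2N"
    Returns (new_partial_sum, fixed_cycles_used).
    """
    mask = (1 << n) - 1
    w_lo, w_hi = w & mask, (w >> n) & mask
    f_lo, f_hi = f & mask, (f >> n) & mask

    # One tuple per fixed cycle: (weight half, feature half, left shift in bits).
    steps = {
        "NxN":   [(w_lo, f_lo, 0)],
        "Nx2N":  [(w_lo, f_lo, 0), (w_lo, f_hi, n)],
        "2NxN":  [(w_lo, f_lo, 0), (w_hi, f_lo, n)],
        "2Nx2N": [(w_lo, f_lo, 0), (w_lo, f_hi, n),
                  (w_hi, f_lo, n), (w_hi, f_hi, 2 * n)],
    }[mode]

    stage = 0  # stage value, kept in the lower 4N bits of P in the description
    for a, b, shift in steps:
        stage += (a * b) << shift  # a single N-bit x N-bit multiply per cycle
    return partial_sum + stage, len(steps)


# With N = 4: an 8-bit weight times an 8-bit feature value takes 4 cycles.
print(pe_mac(0xAB, 0xCD, 10, n=4, mode="2Nx2N"))
```

The shifted partial products sum to the full product because (w_hi·2^N + w_lo)·(f_hi·2^N + f_lo) expands into exactly the four terms listed for the INT 2N × INT 2N mode.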
The output buffer unit is used for storing the partial sum data, accumulating all the received partial sum data to obtain a final convolution value, and sending the final convolution value to the activation unit. Furthermore, it may temporarily store partial sums of the output feature value data and complete "in-situ" accumulation through an accumulation unit to obtain the final value of the convolution.
The activation unit is used for performing an activation operation on the accumulated result to obtain activation data, and transmitting the activation data to the prediction unit.
The prediction unit is used for performing a pooling operation on the activation data to obtain pooled data, obtaining the feature value data of the next-layer convolution and a mask cache signal based on the pooled data, transmitting the mask cache signal to the cooperative control unit, and transmitting the feature value data of the next-layer convolution to the global shared cache module.
It should be noted that the precision of the feature value data input to the systolic array is predicted in real time by the prediction unit; since the output feature values of the current layer are the input feature values of the next layer, the prediction result can guide the precision of the next layer. The specific feature value prediction is as follows: the prediction unit makes a prediction for each region of size x × y in the feature map; assuming that the size of the pooling region is n × n, x and y satisfy the following condition:
When x or y is smaller than n, the invention sets the prediction region sizes x and y to 1, so the activation result only needs to be input into the comparator and compared with the threshold: when the activation result is higher than the threshold, the feature value data is high-precision, and when it is lower than the threshold, the feature value data is low-precision. When x and y are both greater than n, the prediction unit needs to add up the pooling results of the x × y region and compute their mean, then input the mean into the comparator and compare it with the threshold: when the mean is higher than the threshold, the feature value data is high-precision, and when it is lower than the threshold, the feature value data is low-precision.
It should be noted that, when x and y are both greater than n, the prediction unit needs to reuse the pooling unit. The process by which the prediction unit completes the prediction for a piece of feature map data can be summarized as follows (taking pooling region n=2 and prediction region size x=y=4 as an example): the pooling unit first produces pooling results row by row and sends them into a buffer connected to the adder in the prediction unit, where each result is accumulated with the partial sum belonging to the same prediction window to obtain the updated partial sum of that prediction window. If the prediction window has completed the accumulation of all its pooling results, the accumulated value is compared with the threshold to obtain a decision; if some pooling windows in the prediction window have not yet been accumulated, the accumulated value is sent back to the buffer to wait for the remaining pooling results.
It should be noted that the prediction results for both the case where x or y is smaller than n and the case where x and y are both greater than n are sent to the mask cache subunit set in the prediction unit, so that the mask cache subunit can obtain the mask cache signal according to the prediction result for subsequent sending to the cooperative control unit. The framework determines whether each region of the feature value data of the next layer is high or low precision according to the prediction results in the mask cache.
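The threshold-based prediction described above can be sketched as follows. This is a simplified software model under stated assumptions: the function name `predict_precision_mask` is hypothetical, x is taken as the region height and y as the region width, a value equal to the threshold is treated as low precision (the text only covers "higher" and "lower"), and the row-by-row buffering of partial sums is replaced by a direct sum over each region.

```python
def predict_precision_mask(values, x, y, threshold):
    """Return a 2D mask: True = high precision, False = low precision.

    values    : 2D list of activation results (x = y = 1) or pooling results
    x, y      : prediction region height and width (assumed orientation)
    threshold : comparator threshold
    """
    rows, cols = len(values), len(values[0])
    mask = []
    for r in range(0, rows, x):
        mask_row = []
        for c in range(0, cols, y):
            # Gather the region; edge regions may be smaller than x by y.
            region = [v for line in values[r:r + x] for v in line[c:c + y]]
            mean = sum(region) / len(region)
            mask_row.append(mean > threshold)  # equality counts as low (assumption)
        mask.append(mask_row)
    return mask


pooled = [[1, 1, 9, 9],
          [1, 1, 9, 9]]
print(predict_precision_mask(pooled, 2, 2, threshold=4))
```

With x = y = 1 the mean of each region is the value itself, so the same function also covers the case in which the activation result is compared with the threshold directly.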
According to the variable speed pulsation array micro-frame system provided by the embodiment of the invention, variable speed control allows the pulsation array to switch flexibly among multiple precisions, and the precision of each processing unit in the pulsation array can differ in each fixed period, so that the inference process of a convolutional neural network is accelerated. The invention supports mixed precision not only for the weights but also for the feature values, thereby improving the flexibility of mixed-precision computation.
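The variable speed control summarized above (take the longest processing period among the precision modes to be converted as the next running period of the array, then pad the faster processing units with blocking periods) can be sketched as follows. The cycle counts per mode follow the description of the processing unit (1, 2, 2 and 4 fixed cycles); the names `CYCLES` and `plan_next_period` and the string labels for the modes are assumptions made for this example.

```python
# Fixed cycles needed per precision mode, as given in the description.
CYCLES = {"NxN": 1, "Nx2N": 2, "2NxN": 2, "2Nx2N": 4}


def plan_next_period(pending_modes):
    """Return (running period of the array, blocking periods per processing unit).

    pending_modes : precision mode to be converted for each processing unit
    """
    actual = [CYCLES[m] for m in pending_modes]
    run_period = max(actual)                     # the slowest unit sets the pace
    blocking = [run_period - a for a in actual]  # stall the faster units
    return run_period, blocking


print(plan_next_period(["NxN", "2Nx2N", "Nx2N"]))
```

Every processing unit then spends its actual processing cycles plus its blocking cycles, which always equals the running period, so the data flow through the array stays aligned.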
Although the embodiments of the present invention are disclosed above, they are provided only to facilitate understanding of the present invention and are not intended to limit it. Any person skilled in the art may make modifications and variations in form and detail without departing from the spirit and scope of the present disclosure, but the scope of protection remains as defined by the appended claims.

Claims (10)

1. A pulsation array variable speed control method is characterized in that the pulsation array comprises at least two processing units;
the method comprises the following steps:
obtaining precision modes to be converted of all the processing units in the pulsation array;
determining the longest processing period required by the processing units among all the precision modes to be converted, and taking the longest processing period as the next running period of the pulsation array;
setting a number of blocking periods after the next actual processing period of each processing unit, so that the duration of the next processing period of every processing unit is equal to the duration of the next running period of the pulsation array;
wherein a processing unit processing one blocking period means that the processing unit is blocked for one fixed period, and the number of blocking periods is a natural number.
2. The method of claim 1, wherein the precision mode to be converted comprises an INT N × INT N mode, an INT N × INT 2N mode, an INT 2N × INT N mode, and an INT 2N × INT 2N mode.
3. The method according to claim 2, wherein if the precision mode of all processing units in the systolic array is the INT x × INT y mode, and the actual processing period of the processing units for processing the INT x × INT y mode is n fixed periods;
when the precision modes to be converted are all INT x' × INT y' modes, the next running period of the systolic array is (x'/x)·(y'/y)·n fixed periods, and the number of blocking periods set in the next processing period of all the processing units is zero.
4. The method according to claim 2, wherein if the precision mode of all processing units in the systolic array is the INT x × INT y mode, and the actual processing period of the processing units for processing the INT x × INT y mode is n fixed periods;
when the precision modes to be converted comprise an INT x' × INT y' mode and an INT x'' × INT y'' mode, comparing the magnitudes of n, (x'/x)·(y'/y)·n and (x''/x)·(y''/y)·n;
if n is greater than both (x'/x)·(y'/y)·n and (x''/x)·(y''/y)·n, the next running period of the systolic array is n fixed periods, n - (x'/x)·(y'/y)·n blocking periods are set in the next processing period of the processing units whose precision mode to be converted is the INT x' × INT y' mode, and n - (x''/x)·(y''/y)·n blocking periods are set in the next processing period of the processing units whose precision mode to be converted is the INT x'' × INT y'' mode;
if (x'/x)·(y'/y)·n is greater than both n and (x''/x)·(y''/y)·n, the next running period of the systolic array is (x'/x)·(y'/y)·n fixed periods, zero blocking periods are set in the next processing period of the processing units whose precision mode to be converted is the INT x' × INT y' mode, and (x'/x)·(y'/y)·n - (x''/x)·(y''/y)·n blocking periods are set in the next processing period of the processing units whose precision mode to be converted is the INT x'' × INT y'' mode;
if (x''/x)·(y''/y)·n is greater than both n and (x'/x)·(y'/y)·n, the next running period of the systolic array is (x''/x)·(y''/y)·n fixed periods, (x''/x)·(y''/y)·n - (x'/x)·(y'/y)·n blocking periods are set in the next processing period of the processing units whose precision mode to be converted is the INT x' × INT y' mode, and zero blocking periods are set in the next processing period of the processing units whose precision mode to be converted is the INT x'' × INT y'' mode.
5. A variable speed systolic array micro-frame system comprising at least one processing unit group and a global shared cache module communicatively coupled to all of the processing unit groups;
the processing unit group comprises a cooperative control unit, and an input buffer unit, a variable speed systolic array, an output buffer unit, an activation unit and a prediction unit which are each connected with the cooperative control unit;
the cooperative control unit is configured to generate a control instruction based on the mask buffer signal output by the prediction unit and the pulsation array variable speed control method according to any one of claims 1 to 4, and send the control instruction to the variable speed systolic array.
6. The variable speed systolic array micro-frame system according to claim 5, wherein the input buffer unit is configured to obtain feature value data and weight data from the global shared buffer module, perform precision conversion on the feature value data, and transmit the precision-converted feature value data and the weight data to the variable speed systolic array.
7. The variable speed systolic array micro-frame system of claim 6, wherein the variable speed systolic array is configured to perform convolution calculation on the feature value data and the weight data according to the control instruction to obtain partial sum data, and transmit the partial sum data to the output buffer unit.
8. The variable speed systolic array micro-frame system of claim 7, wherein the output buffer unit is configured to store the partial sum data, accumulate all the received partial sum data to obtain a final convolution value, and send the final convolution value to the activation unit.
9. The variable speed systolic array micro-frame system of claim 8, wherein the activation unit is configured to perform an activation operation on the accumulated result to obtain activation data, and transmit the activation data to the prediction unit.
10. The variable speed systolic array micro-frame system according to claim 9, wherein the prediction unit is configured to perform a pooling operation on the activation data to obtain pooled data, obtain feature value data of a next-layer convolution and a mask buffer signal based on the pooled data, and transmit the mask buffer signal to the cooperative control unit, and transmit the feature value data of the next-layer convolution to the global shared buffer module.
CN202010171246.0A 2020-03-12 2020-03-12 Pulsation array variable speed control method and variable speed pulsation array micro-frame system Active CN111401522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010171246.0A CN111401522B (en) 2020-03-12 2020-03-12 Pulsation array variable speed control method and variable speed pulsation array micro-frame system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010171246.0A CN111401522B (en) 2020-03-12 2020-03-12 Pulsation array variable speed control method and variable speed pulsation array micro-frame system

Publications (2)

Publication Number Publication Date
CN111401522A CN111401522A (en) 2020-07-10
CN111401522B true CN111401522B (en) 2023-08-15

Family

ID=71428599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010171246.0A Active CN111401522B (en) 2020-03-12 2020-03-12 Pulsation array variable speed control method and variable speed pulsation array micro-frame system

Country Status (1)

Country Link
CN (1) CN111401522B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108902B (en) * 2023-02-22 2024-01-05 成都登临科技有限公司 Sampling operation implementation system, method, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6675187B1 (en) * 1999-06-10 2004-01-06 Agere Systems Inc. Pipelined linear array of processor elements for performing matrix computations
CN110232441A (en) * 2019-06-18 2019-09-13 南京大学 A kind of stacking-type based on unidirectional systolic arrays is from encoding system and method
CN110249237A (en) * 2017-12-22 2019-09-17 索尼半导体解决方案公司 Sensor chip, electronic equipment and device
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA
CN110321157A (en) * 2018-03-29 2019-10-11 英特尔公司 Instruction for the fusion-multiply-add operation with variable precision input operand
CN110705703A (en) * 2019-10-16 2020-01-17 北京航空航天大学 Sparse neural network processor based on systolic array

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2541431A1 (en) * 2005-10-07 2013-01-02 Altera Corporation Data input for systolic array processors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
熊先奎等.移动边缘计算规模部署的技术制约因素和对策.《中兴通讯技术》.2019,全文. *

Also Published As

Publication number Publication date
CN111401522A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN110390384B (en) Configurable general convolutional neural network accelerator
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
KR101781057B1 (en) Vector processing engine with merging circuitry between execution units and vector data memory, and related method
CN108509270B (en) High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
KR20160085337A (en) Vector processing engines employing a tapped-delay line for filter vector processing operations, and related vector processor systems and methods
CN111860773B (en) Processing apparatus and method for information processing
CN112905530A (en) On-chip architecture, pooled computational accelerator array, unit and control method
CN114897133A (en) Universal configurable Transformer hardware accelerator and implementation method thereof
CN111401522B (en) Pulsation array variable speed control method and variable speed pulsation array micro-frame system
CN116167419A (en) Architecture compatible with N-M sparse transducer accelerator and acceleration method
US20230252600A1 (en) Image size adjustment structure, adjustment method, and image scaling method and device based on streaming architecture
CN114707649B (en) General convolution operation device
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN116245149A (en) Accelerated computing device and method based on RISC-V instruction set expansion
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN112766453A (en) Data processing device and data processing method
CN111258641B (en) Operation method, device and related product
CN111860788A (en) Neural network computing system and method based on data flow architecture
CN114237551B (en) Multi-precision accelerator based on pulse array and data processing method thereof
CN112346703B (en) Global average pooling circuit for convolutional neural network calculation
CN116882467B (en) Edge-oriented multimode configurable neural network accelerator circuit structure
CN117291240B (en) Convolutional neural network accelerator and electronic device
CN112418419B (en) Data output circuit structure processed by neural network and scheduled according to priority

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant