KR20170052432A - Calculating method and apparatus to skip operation with respect to operator having value of zero as operand

Calculating method and apparatus to skip operation with respect to operator having value of zero as operand

Info

Publication number
KR20170052432A
Authority
KR
South Korea
Prior art keywords
matrix
buffer
row
elements
zero
Prior art date
Application number
KR1020160017819A
Other languages
Korean (ko)
Other versions
KR101843243B1 (en)
Inventor
박기호
기민관
Original Assignee
세종대학교산학협력단 (Sejong University Industry-Academic Cooperation Foundation)
Priority date
Filing date
Publication date
Application filed by 세종대학교산학협력단
Publication of KR20170052432A
Application granted
Publication of KR101843243B1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3243Power saving in microcontroller unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3005Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30069Instruction skipping instructions, e.g. SKIP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers

Abstract

An operation method and an operation apparatus that skip an operation whose operand has a zero value are disclosed. Recently, specialized MCUs (Micro Controller Units), such as sensor hub SoCs (Systems on Chip), have been adopted in mobile and wearable portable devices to process data transmitted by various sensors. Embodiments of the present invention are directed to a hardware accelerator that detects motion direction based on a six-axis sensor. The architecture of the hardware accelerator may be designed based on profiling of the sensor fusion algorithm. In the performance evaluation, the hardware accelerator according to the embodiments of the present invention improves execution time by 100% or more.

Description

TECHNICAL FIELD [0001] The present invention relates to an operation method and an operation apparatus for skipping an operation on an operator having a zero value as an operand.

The following description relates to a computation method and an arithmetic unit for skipping an operation on an operator having a zero value as an operand.

With the development of IoT (Internet of Things) services, various sensors are increasingly used in smart devices. FIG. 1 is a view showing an example of a sensor hub in the prior art. The number of sensors 110, such as a gyro sensor, a motion sensor, an ambient light sensor, an accelerometer, a temperature/humidity sensor, and a pressure sensor, keeps increasing; the idle time of the application processor (AP) included in the smart device 100 for processing the sensing data of the sensors 110 gradually decreases, and its energy consumption increases. In particular, when the processing of the sensor data is handled on the high-performance, high-power AP, the growing number of sensors 110 in the smart device 100 consumes considerably more energy. Thus, a sensor hub microcontroller unit (MCU) 120 can be introduced into the smart device 100 to process sensor data with far less energy, based on a low-power embedded processor. The sensor hub MCU 120 is a low-power MCU specialized for processing the data acquired by the various sensors. The algorithm executed on the sensor hub calculates the direction of the device or the user from the sensor signals, and a sensor fusion algorithm using a Kalman filter is commonly employed for this purpose. Such a Kalman filter has high complexity because it combines various complementary sensors.

There are prior art techniques for designing an efficient hardware accelerator for the overall Kalman filter algorithm on an FPGA (Field Programmable Gate Array). While dedicated hardware accelerators may be efficient for particular target systems or sensors, their lack of programmability limits their use in other systems, such as systems that adopt new sensors and/or algorithms. Therefore, a sensor hub MCU architecture (comprising an embedded processor and a hardware accelerator) is required that can achieve performance improvements while retaining flexibility and programmability.

Reference literature: S. Cruz, D. M. Munoz, M. Conde, C. H. Llanos, and G. A. Borges, "FPGA implementation of a sequential extended Kalman filter algorithm applied to mobile robotics localization problem," Circuits and Systems, pp. 1-4, Feb. 2013.

Embodiments of the present invention relate to a hardware accelerator for the microcontroller unit (MCU) of a sensor hub that can process the complex Kalman filter of a sensor fusion algorithm, in order to improve the accuracy of direction estimation and reduce the energy consumed for direction estimation.

Further, the present invention provides an operation method and apparatus that offer greater programmability and can improve Kalman filter processing time by more than 100%.

When a matrix is loaded from memory into registers, a zero-bit register stores whether each element of the corresponding row is zero. If an element to be operated on has a value of 0, the operation for that element is skipped. The present invention thereby provides an operation method and apparatus capable of reducing both the operation execution time and the power consumption required for the operation.

A method of operating a computing device comprises: identifying, in the computing device, a first operand having a zero value among a plurality of first operands and marking the first operand having a zero value through a zero-bit check buffer; sequentially broadcasting the plurality of first operands to a plurality of operators included in the computing device, while skipping the broadcasting of any first operand determined through the zero-bit check buffer to have a zero value; and processing, in each of the plurality of operators, an operation between the broadcasted first operand and the second operand transmitted in correspondence with that operator.

According to one aspect, the plurality of first operands are the elements of the n-th row of a first matrix composed of a rows and b columns, and the plurality of second operands are the elements of the m-th row of a second matrix composed of c rows and d columns, where a, b, c, and d are natural numbers, n is a natural number less than or equal to a, and m is a natural number less than or equal to c.

According to another aspect, the broadcasting step sequentially broadcasts the elements of the n-th row of the first matrix, and the step of processing the operation comprises, in each of the plurality of operators, a multiplication between one of the elements of the n-th row of the first matrix and the corresponding element among all the elements of the m-th row of the second matrix.

According to another aspect, the zero-bit check buffer stores a bit string indicating which elements among the elements of the n-th row of the first matrix have a zero value.

According to another aspect, the method further includes loading a third matrix, composed of e rows and f columns, into a first matrix buffer; and loading the second matrix, as the transposed matrix of the third matrix, into a second matrix buffer such that the m-th row of the third matrix is placed into the m-th column of the second matrix, where e and f are natural numbers.

According to another aspect, the method further comprises loading the first matrix into the first matrix buffer after the second matrix has been loaded into the second matrix buffer as the transposed matrix of the third matrix.

According to another aspect, the third matrix is the same matrix as the first matrix, the values of a and e are the same, and the values of b and f are the same.

According to another aspect, the calculating method further includes accumulating the operation results of each of the plurality of operators in a result buffer, the result buffer including a plurality of storages, each corresponding to one of the plurality of operators and storing the operation result of that operator.

A method for a matrix operation between a first matrix and a second matrix in a computing device, the first matrix being composed of a rows and b columns and the second matrix of c rows and d columns, comprises: loading, in the computing device, the n-th row of the first matrix into a first matrix buffer and loading the second matrix into a second matrix buffer; calculating, in the computing device, the multiplication between the i-th element of the n-th row of the first matrix and each of all the elements of the m-th row of the second matrix; and accumulating, in the j-th storage of a buffer storing the result matrix, the multiplication result between the i-th element of the n-th row of the first matrix and the j-th element of the m-th row of the second matrix. When the value of the i-th element of the n-th row of the first matrix is 0, the multiplications with all the elements of the m-th row of the second matrix are skipped. Here, a, b, c, and d are natural numbers, n is a natural number less than or equal to a, i is a natural number less than or equal to b, m is a natural number less than or equal to c, and j is a natural number less than or equal to d.
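The row-broadcast, zero-skipping multiplication described above can be illustrated with a minimal software sketch. This is not the patented hardware, only a model of the claimed method; the function and variable names (matmul_zero_skip, zero_bits) are illustrative, not from the patent.

```python
def matmul_zero_skip(A, B):
    """Multiply A (a x b) by B (b x d), skipping zero-valued elements of A.

    Software model of the claimed method: each element of a row of A is
    broadcast to all operators, which multiply it with a whole row of B
    and accumulate into the corresponding result row.
    """
    a_rows, b_cols, d = len(A), len(A[0]), len(B[0])
    C = [[0] * d for _ in range(a_rows)]
    for n in range(a_rows):                    # row of A loaded into the buffer
        zero_bits = [x == 0 for x in A[n]]     # zero-bit check buffer for the row
        for i in range(b_cols):                # broadcast a(n, i)
            if zero_bits[i]:
                continue                       # skip broadcast: no MAC needed
            for j in range(d):                 # all operators work in parallel in hardware
                C[n][j] += A[n][i] * B[i][j]   # multiply-accumulate
    return C
```

With sparse sensor-fusion matrices, every zero element of A removes an entire row's worth of multiply-accumulate work, which is the source of the claimed time and power savings.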

A computing device comprises: a zero-bit check unit for identifying a first operand having a zero value among a plurality of first operands and marking the first operand having a zero value through a zero-bit check buffer; a broadcasting unit for broadcasting the plurality of first operands sequentially to a plurality of operators included in the computing device, and for skipping the broadcasting of a first operand determined through the zero-bit check buffer to have a zero value; and a plurality of operators for processing operations between the broadcasted first operand and the second operands transmitted in correspondence with each of the plurality of operators.

A computing device for a matrix operation between a first matrix and a second matrix, the first matrix being composed of a rows and b columns and the second matrix of c rows and d columns, comprises: a first matrix buffer for loading the n-th row of the first matrix; a second matrix buffer for loading the second matrix; a multiplier for calculating the multiplication between the i-th element of the n-th row of the first matrix and each of all the elements of the m-th row of the second matrix; and an accumulator for accumulating, in a j-th storage, the multiplication result between the i-th element of the n-th row of the first matrix and the j-th element of the m-th row of the second matrix. When the value of the i-th element of the n-th row of the first matrix is 0, the multiplications are skipped. Here, a, b, c, and d are natural numbers, n is a natural number less than or equal to a, i is a natural number less than or equal to b, m is a natural number less than or equal to c, and j is a natural number less than or equal to d.

A sensor hub MCU (Micro Controller Unit) including the computing device is also disclosed.

A hardware accelerator for the microcontroller unit (MCU) of a sensor hub is provided, which can process the complex Kalman filter of a sensor fusion algorithm to improve the accuracy of direction estimation and reduce the energy consumed for direction estimation.

In addition, with greater programmability, the Kalman filter processing time can be improved by more than 100%.

When a matrix is loaded from memory into registers, a zero-bit register stores whether each element of the corresponding row is zero. If an element to be operated on has a value of 0, the operation for that element is skipped, so that both the operation execution time and the power consumption required for the operation can be reduced.

FIG. 1 is a view showing an example of a sensor hub in the prior art.
FIG. 2 is a diagram showing an example of the overall structure of a sensor hub MCU having a hardware accelerator in an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a process of transmitting matrix information in an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a process of loading data of matrices in an embodiment of the present invention.
FIG. 5 is a diagram illustrating an example of a processing procedure of a MAC operation in an embodiment of the present invention.
FIG. 6 is a diagram illustrating an example of a process of storing calculation results in an embodiment of the present invention.
FIG. 7 is a diagram showing an example of a processing element architecture (PE architecture) in an embodiment of the present invention.
FIGS. 8 to 13 are diagrams illustrating an example of a process of matrix multiplication in an embodiment of the present invention.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.

1. Structure

A. Sensor fusion algorithm analysis

Most sensor fusion mechanisms use 6- or 9-axis sensors (accelerometer, gyroscope, and magnetometer) with a Kalman filter. The Kalman filter is used to accurately predict directions based on data obtained from the various sensors. The key operations of Kalman-filter-based sensor fusion can be divided into two parts: the first part predicts the current state based on a predefined state equation, and the second part calibrates the predicted direction using the Kalman gain value. An example of such a sensor fusion algorithm is the open-source Freescale sensor fusion software (see Freescale Sensor Fusion, http://www.freescale.com/), and the DS-5 tool can be used for profiling (see ARM Development Tools, http://ds.arm.com/). The profiling results based on DS-5 identify two major performance bottleneck functions, the Kalman gain calculation and the error covariance matrix calculation, as shown in Table 1 below. The main operations of these functions are matrix multiplication and matrix transpose, and these two operations take about 80% of the total execution time.

[Table 1 image: pat00001]

Table 1 shows an example of execution time for each function.

B. Hardware Accelerator and Sensor Hub Architecture

As we have seen, the main operations of sensor fusion are matrix manipulation operations such as matrix multiplication and matrix transposition. Other sensor-related operations are usually not based on matrix manipulation, but because these operations are the main kernels of the target sensor fusion algorithm, the structure of the hardware accelerator may be chosen with a focus on matrix manipulation operations. Although matrix manipulation operations are well known and widely studied, the structure of a hardware accelerator for a sensor hub microcontroller unit (MCU) must be chosen carefully, because sensor fusion processing can use different numbers of sensors. Matrices of various sizes must be handled even within the same sensor fusion algorithm. For example, when the sensor fusion algorithm is implemented with 9-axis sensors, multiplications such as that between a 12 × 12 matrix and a 12 × 6 matrix, and that between a 6 × 12 matrix and a 12 × 12 matrix, must be performed. The architecture of the proposed hardware accelerator can be designed to take advantage of these characteristics of sensor fusion processing. To handle matrices of various sizes, the proposed hardware accelerator may employ a broadcasting scheme for the elements of matrix A together with an adaptive control mechanism.

FIG. 2 is a diagram showing an example of the overall structure of a sensor hub MCU having a hardware accelerator in an embodiment of the present invention. To multiply matrix A by matrix B, the proposed hardware accelerator 200 can store the entire matrix B and two rows of matrix A.

When the multiplication starts, the DMA (Direct Memory Access) unit 201 can load matrix B and one row of matrix A, and sequentially load the remaining rows of matrix A as the multiplication proceeds. The PEs (Processing Elements) 202 may perform multiply-accumulate (MAC) operations, and the control unit 203 may provide control signals to these elements to handle operations on various matrix sizes.

1) The embedded core 220 can transmit matrix information (matrix size and matrix address) to the control unit 203 through the host interface 204. FIG. 3 is a diagram illustrating an example of the process of transmitting matrix information in an embodiment of the present invention. The bold arrows in FIG. 3 illustrate the transmission of matrix information, such as the matrix size and matrix address needed to perform the operations, from the embedded core 220 to the control unit 203 through the host interface 204.

2) Two rows of data of matrix A and all the data of matrix B can be loaded from the static random access memory (SRAM) 230 into the matrix A buffer 205 and the matrix B buffer 206 via the DMA 201. FIG. 4 is a diagram illustrating an example of the process of loading the matrix data in an embodiment of the present invention. The bold arrows in FIG. 4 indicate that two rows of matrix A are loaded from storage such as the SRAM 230 into the matrix A buffer 205 via the DMA 201, and that matrix B is loaded into the matrix B buffer 206. Matrix A - row 207 may refer to one row of matrix A.

3) The PEs 202 may load data from the internal registers (the matrix A buffer 205 and the matrix B buffer 206) and perform multiply-accumulate (MAC) operations. FIG. 5 is a diagram illustrating an example of the processing procedure of a MAC operation in an embodiment of the present invention. Through the MUXs (multiplexers) 208 for matrix A and the MUXs 209 for matrix B, the respective elements of one row of matrix A and of one row of matrix B are delivered to the PEs 202. The MAC operation means multiplying the elements of one row of matrix A with the rows of matrix B and accumulating the products. These MAC operations are described in more detail below.

4) When the MAC operation is completed, the result buffer 210 can store the operation result in the SRAM 230 through the DMA 201. FIG. 6 is a diagram illustrating an example of the process of storing the calculation results in an embodiment of the present invention.

FIG. 7 is a diagram showing an example of a processing element architecture (PE architecture) in an embodiment of the present invention. In the overall sensor hub architecture including the proposed hardware accelerator, one element of matrix A may be broadcast to serve as an operand for all multipliers (the plurality of "MUL" units shown in FIG. 7). The other operands of the multipliers are the elements of an entire row of matrix B. When all the elements stored in the matrix A row buffer 207 have been broadcast, multiplied, and accumulated with all the rows of matrix B stored in the matrix B buffer 206, the accumulators (the plurality of adders "+" and the plurality of registers "Reg") hold one row of multiplication results for one row of matrix A against each row of matrix B. These results for one row of the output matrix may be sent to the SRAM 230 via the DMA 201. This calculation may be repeated for all the rows of matrix A to compute the matrix multiplication of the entire matrix A and matrix B. The signal "is_rawA_end" indicates that all elements of one row of matrix A have been computed.
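The PE datapath of FIG. 7 can be modeled in a few lines: each PE pairs a multiplier ("MUL") with an adder ("+") and a register ("Reg"), and the accumulators are flushed when a row of matrix A finishes (the "is_rawA_end" signal). This is a hedged software sketch; the class and method names are illustrative, not from the patent.

```python
class PEArray:
    """Software model of the PE array: one accumulator register per PE."""

    def __init__(self, width):
        self.acc = [0] * width                 # one "Reg" per PE

    def broadcast_mac(self, a_elem, b_row):
        """One broadcast step: a_elem is multiplied with a whole row of B."""
        for j, b in enumerate(b_row):
            self.acc[j] += a_elem * b          # "MUL" then "+"

    def flush_row(self):
        """Row of A finished (is_rawA_end): emit results, clear registers."""
        out, self.acc = self.acc, [0] * len(self.acc)
        return out                             # sent to the SRAM via DMA

# One row of the result = (one row of A) x B, via element broadcasting:
pes = PEArray(3)
row_a = [1, 2]
B = [[10, 20, 30], [1, 1, 1]]
for i, a in enumerate(row_a):
    pes.broadcast_mac(a, B[i])
row_c = pes.flush_row()                        # [12, 22, 32]
```

Because every PE receives the same broadcast element, the number of PEs only needs to match the row width of matrix B, which is what lets the same array serve different matrix sizes.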

The computing method of the computing device includes: identifying a first operand having a zero value among a plurality of first operands in the computing device and marking the first operand having a zero value through a zero-bit check buffer; sequentially broadcasting the plurality of first operands to a plurality of operators included in the computing device while skipping the broadcasting of a first operand determined through the zero-bit check buffer to have a zero value; and processing, in each of the plurality of operators, an operation between the broadcasted first operand and the second operand transmitted in correspondence with that operator. The computing device for this purpose includes a zero-bit check unit for identifying a first operand having a zero value among the plurality of first operands and marking it through the zero-bit check buffer; a broadcasting unit for broadcasting the plurality of first operands sequentially to the plurality of operators and skipping the broadcasting of a first operand determined through the zero-bit check buffer to have a zero value; and a plurality of operators operable to process operations between the second operands transmitted in correspondence with each operator and the broadcasted first operand. For example, the zero-bit check unit may correspond to the zero-bit verifier 213 described herein, and the broadcasting unit may correspond to the matrix A - row 207 and the MUXs 208. In addition, the plurality of operators may correspond to the multipliers included in the PEs 202.

To describe a more specific example, the first matrix may be composed of a rows and b columns, and the second matrix of c rows and d columns. Here, a, b, c, and d may all be natural numbers. For example, the first matrix may correspond to the matrix A described above, and the second matrix to the matrix B described above.

The computing device may load the nth row of the first matrix into the first matrix buffer and load the second matrix into the second matrix buffer. Here, the first matrix buffer may correspond to the matrix A buffer 205 described above, and the second matrix buffer may correspond to the matrix B buffer 206 described above.

At this time, the computing device can calculate the multiplication between the i-th element of the n-th row of the first matrix and all the elements of the m-th row of the second matrix. For example, a(1, 1), the first element of the first row of the first matrix, can be multiplied with each of all the elements of the first row of the second matrix. Likewise, a(1, 2) can be multiplied with each of all the elements of the second row of the second matrix. Here, n is a natural number less than or equal to a, i is a natural number less than or equal to b, and m is a natural number less than or equal to c. When the value of the i-th element of the n-th row of the first matrix is 0 (zero), the computing device does not broadcast it, thereby omitting the multiplication and accumulation operations with all the elements of the m-th row of the second matrix. To this end, the computing device marks, through the zero-bit check buffer 212, any element having a value of 0 among the elements of the n-th row of the first matrix, so that it can determine whether the value of the i-th element of the n-th row is 0.

The computing device may accumulate the multiplication result between the i-th element of the n-th row of the first matrix and the j-th element of the m-th row of the second matrix in the j-th storage of the buffer storing the result matrix. Here, j may be a natural number less than or equal to d. For example, the multiplication result between a(1, 1) and b(1, 1), the first element of the first row of the second matrix, may be stored in the first storage c(1, 1) of the result matrix. The multiplication result between a(1, 2) and b(2, 1) is then accumulated into the same storage c(1, 1) of the result matrix.

For computation involving a transposed matrix, the computing device may first load a third matrix of e rows and f columns into the first matrix buffer. The computing device may then load the second matrix, as the transposed matrix of the third matrix, into the second matrix buffer by placing the m-th row of the third matrix loaded in the first matrix buffer into the m-th column of the second matrix. In this case, the values of e and d are the same, and the values of f and c are the same. This can be used to handle the multiplication between the first matrix and the second matrix, which is the transposed matrix of the third matrix.

Also, the third matrix may be the same matrix as the first matrix. In this case, the values of a and e are the same, and the values of b and f are the same. This can be used to process a multiplication between the first matrix and a second matrix that is the transposed matrix of the first matrix. For example, if the third matrix is the same as the first matrix, the multiplication between the first matrix and the second matrix, which is the transposed matrix of the third matrix, amounts to multiplying the first matrix by its own transposed matrix.

The computing device may be included in a sensor hub MCU (Micro Controller Unit) as a hardware accelerator.

FIGS. 8 to 13 are diagrams illustrating an example of a process of matrix multiplication in an embodiment of the present invention.

FIG. 8 shows that a 12 × 12 result matrix 830 is generated by multiplying a 12 × 6 matrix A 810 and a 6 × 12 matrix B 820. For the multiplication, the first element a(1, 1) of the first row of matrix A 810 is multiplied with each of the elements b(1, 1), ..., b(1, 12) of the first row of matrix B 820.

FIG. 9 shows that the result of multiplying the first element a(1, 1) of the first row of matrix A 810 by the first element b(1, 1) of the first row of matrix B 820 is stored in the first element c(1, 1) of the first row of the result matrix 830.

FIG. 10 shows that the first element a(1, 1) of the first row of matrix A 810 is multiplied with each of the elements b(1, 1), ..., b(1, 12) of the first row of matrix B 820, and that the result of multiplying element a(1, 1) by element b(1, 12) may be stored in the last element c(1, 12) of the first row of the result matrix 830.

FIG. 11 shows that the second element a(1, 2) of the first row of matrix A 810 is multiplied with each of the elements b(2, 1), ..., b(2, 12) of the second row of matrix B 820, and that the result of multiplying element a(1, 2) by element b(2, 1) is accumulated into the first element c(1, 1) of the first row of the result matrix 830. Since the value "a(1, 1) * b(1, 1)" is already stored in element c(1, 1), after the result of multiplying element a(1, 2) by element b(2, 1) is accumulated, the value of element c(1, 1) becomes "a(1, 1) * b(1, 1) + a(1, 2) * b(2, 1)".

FIG. 12 shows that the second element a(1, 2) of the first row of matrix A 810 is multiplied with each of the elements b(2, 1), ..., b(2, 12) of the second row of matrix B 820, and that the result of multiplying element a(1, 2) by element b(2, 12) is accumulated into the last element c(1, 12) of the first row of the result matrix 830. Since the value "a(1, 1) * b(1, 12)" is already stored in element c(1, 12), after the result of multiplying element a(1, 2) by element b(2, 12) is accumulated, the value of element c(1, 12) becomes "a(1, 1) * b(1, 12) + a(1, 2) * b(2, 12)".

FIG. 13 shows the first element c(1, 1) of the first row of the result matrix 830 after the elements of the first row of matrix A 810 have been multiplied and accumulated with the elements of every row of matrix B 820, so that all the multiplication results are accumulated. According to the row-by-column multiplication between the matrices, the multiplication results between the elements of the first row of matrix A 810 (a(1, 1), a(1, 2), ..., a(1, 6)) and the elements of the first column of matrix B 820 are accumulated in c(1, 1).

The processes of FIGS. 8 to 13 are sequentially repeated for the remaining rows of matrix A 810 and accumulated in the result matrix, so that the complete multiplication result between matrix A 810 and matrix B 820 is obtained in the result matrix 830.
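The walkthrough of FIGS. 8 to 13 can be checked numerically with the patent's 12 × 6 (matrix A) by 6 × 12 (matrix B) shapes. The element values below are arbitrary illustrations, not taken from the figures; the point is that the broadcast-and-accumulate order produces the standard row-by-column product.

```python
# Illustrative values: a(n, i) and b(i, j) filled with small integers.
A = [[(n + i) % 7 for i in range(6)] for n in range(12)]   # 12 x 6 matrix A
B = [[(i + j) % 5 for j in range(12)] for i in range(6)]   # 6 x 12 matrix B
C = [[0] * 12 for _ in range(12)]                          # 12 x 12 result

for n in range(12):               # rows of A, as in FIGS. 8-13
    for i in range(6):            # broadcast the i-th element of the row
        for j in range(12):       # accumulate across the whole row of C
            C[n][j] += A[n][i] * B[i][j]

# c(1, 1) = a(1, 1)*b(1, 1) + a(1, 2)*b(2, 1) + ... + a(1, 6)*b(6, 1)
assert C[0][0] == sum(A[0][k] * B[k][0] for k in range(6))
```

Although the loop order differs from the textbook dot-product order (the j loop is innermost so one broadcast element updates an entire result row), the accumulated totals are identical.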

The proposed hardware accelerator can also perform the matrix transpose operation with a slight modification of the control. Each row of the target matrix to be transposed may be loaded into the row buffer for matrix A (matrix A - row 207), and each element of the row is written to the matrix B buffer 206 at the position appropriate for the transposition. This operation may be repeated for every row of the matrix to complete the transpose of the target matrix. When the transpose operation is complete, all elements of the transposed matrix are stored in the matrix B buffer 206, so the hardware accelerator can perform the matrix multiplication between matrix A and its transposed matrix A^T stored in the matrix B buffer 206 without reloading. Alternatively, B^T may first be obtained and stored in the matrix B buffer 206, after which the matrix multiplication between matrix A and the transposed matrix B^T may be performed.
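The control-modified transpose can be sketched as follows: each row m of the target matrix passes through the row buffer and lands in column m of the matrix B buffer. The function name and buffer layout here are illustrative assumptions, not the patent's implementation.

```python
def load_transposed(target, rows, cols):
    """Return a (cols x rows) B-buffer holding the transpose of target.

    Models the described control flow: row m of the target matrix is
    loaded into the row buffer, and element k of that row is written
    to position (k, m) of the matrix B buffer.
    """
    b_buffer = [[0] * rows for _ in range(cols)]
    for m in range(rows):                      # row m of the target matrix
        row_buffer = target[m]                 # loaded via the matrix A row buffer
        for k in range(cols):
            b_buffer[k][m] = row_buffer[k]     # element k goes to column m
    return b_buffer

M = [[1, 2, 3], [4, 5, 6]]                     # 2 x 3 target matrix
MT = load_transposed(M, 2, 3)                  # 3 x 2 transposed matrix
```

Once the transpose sits in the B buffer, the same broadcast multiplication datapath computes A · A^T (or A · B^T) with no extra reload step, which is why the transpose feature adds little control overhead.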

C. Optimization for zero

The hardware accelerator may have a special feature for handling elements of matrix A that have a value of zero. Because the hardware accelerator broadcasts one element of matrix A and multiplies it with all elements of a row of matrix B, no multiplication and accumulation operations are required when the value of the element to be broadcast is zero. Thus, referring again to FIG. 2, when a row of matrix A is loaded from the SRAM via DMA, the control unit checks the values of the elements of the row (via the zero comparator 211 of FIG. 2). A zero bit register (zero bit check buffer 212 in FIG. 2) may be added for each row buffer of matrix A and, together with the zero bit verifier 213, may be used to indicate whether each element stored in the row buffer is zero. Using the zero bit register, the control unit can skip the operations (multiplication and accumulation) by not broadcasting an element whose value is zero. Therefore, the amount of computation is reduced in proportion to the number of zero values in the matrix, and the power consumed by the computation is reduced accordingly.
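A behavioral sketch of the zero-skip optimization follows, under the assumption that the zero bits are built once per row at load time (as the zero comparator 211 and zero bit check buffer 212 do) and that a broadcast step is simply not issued for a zero element:

```python
def matmul_skip_zero(A, B):
    """Row-broadcast matrix multiply that skips broadcasts of zero
    elements of A, returning the result and the number of skipped
    broadcast steps (a proxy for the saved multiply-accumulate work)."""
    cols_b = len(B[0])
    C = [[0.0] * cols_b for _ in range(len(A))]
    skipped = 0
    for i, row in enumerate(A):
        zero_bits = [x == 0 for x in row]   # zero bit register for this row
        for k, a_ik in enumerate(row):
            if zero_bits[k]:                # do not broadcast a zero element
                skipped += 1
                continue
            for j in range(cols_b):
                C[i][j] += a_ik * B[k][j]
    return C, skipped
```

The work saved is proportional to the number of zero elements in matrix A, matching the power reduction described above.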

2. Evaluation

The MacSim simulator (see http://code.google.com/p/macsim/) and the Pin trace tool (see Sion Berkowits, Tevi Devor, "Pin: Intel's Dynamic Binary Instrumentation Engine," CGO, Feb 2013) can be used for the performance evaluation. The clock frequency of the MCU, which has no cache memory, is assumed to be 100 MHz. As shown in Table 2 below, the MCU with the proposed hardware accelerator achieves twice the performance of the reference structure without the hardware accelerator. The features for the transpose operation and for zero-valued elements achieve additional performance improvements.

(Table 2: performance comparison — image not reproduced)

3. Conclusion

Thus, the sensor hub structure including the specialized hardware accelerator according to the embodiments of the present invention can effectively process sensor fusion algorithms. The performance results show that the proposed hardware accelerator achieves a large speedup. The hardware accelerator can also be extended to other sensor data processing applications, such as motion detection and context awareness applications.

As described above, according to the embodiments of the present invention, a hardware accelerator for a sensor hub MCU (Micro Controller Unit) can process a complex Kalman filter, improving the accuracy of direction estimation while reducing the energy consumed by direction estimation. In addition, with greater programmability, performance in terms of Kalman filter processing time can be improved by more than 100%. Furthermore, when a matrix is stored from memory into a register, a zero bit register stores whether each element of the corresponding row is zero; if an element to be operated on has a value of zero, the operation is skipped, reducing both the operation execution time and the power consumed by the operation.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described methods, and/or if components of the described systems, structures, devices, and circuits are combined in a different form or replaced by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims (17)

A method of operating a computing device, the method comprising:
identifying, by the computing device, a first operand having a value of zero among a plurality of first operands, and indicating the first operand having a value of zero through a zero bit verification buffer;
sequentially broadcasting, by the computing device, the plurality of first operands to a plurality of operators included in the computing device, while skipping the broadcasting of a first operand determined to have a value of zero through the zero bit verification buffer; and
processing, in each of the plurality of operators, an operation between a second operand transmitted in correspondence with the operator and the broadcast first operand.
The method according to claim 1,
Wherein the plurality of first operands are elements of an nth row of a first matrix consisting of a rows and b columns,
The plurality of second operands being elements of an m-th row of a second matrix consisting of c rows and d columns,
Wherein a, b, c, and d are natural numbers,
Wherein n is a natural number equal to or smaller than the a,
And m is a natural number equal to or smaller than c.
3. The method of claim 2,
wherein the broadcasting comprises sequentially broadcasting the elements of the nth row of the first matrix, and
wherein each of the plurality of operators performs a multiplication operation between one element of the nth row of the first matrix and a corresponding element among the elements of the mth row of the second matrix.
The method of claim 3,
wherein the zero bit verification buffer stores a bit string indicating which elements of the nth row of the first matrix have a value of zero.
5. The method of claim 2, further comprising:
loading a third matrix consisting of e rows and f columns into a first matrix buffer; and
storing each m-th row of the third matrix in a second matrix buffer so that it is substituted into the m-th column of the second matrix, thereby loading the second matrix into the second matrix buffer as a transposed matrix of the third matrix,
wherein e and f are natural numbers.
6. The method of claim 5,
loading the first matrix into the first matrix buffer after the second matrix is loaded into the second matrix buffer as the transposed matrix of the third matrix.
The method according to claim 6,
wherein the third matrix is the same matrix as the first matrix, and
wherein the values of a and e are the same and the values of b and f are the same.
The method according to claim 1, further comprising:
accumulating and storing the operation results of each of the plurality of operators in a result buffer,
wherein the result buffer comprises a plurality of arrays corresponding respectively to the plurality of operators, each array storing the operation results of its corresponding operator.
A computing device comprising:
a zero bit check unit configured to identify a first operand having a value of zero among a plurality of first operands and to indicate the first operand having a value of zero through a zero bit check buffer;
a broadcasting unit configured to sequentially broadcast the plurality of first operands to a plurality of operators included in the computing device, and to skip the broadcasting of a first operand determined to have a value of zero through the zero bit check buffer; and
the plurality of operators, each of which processes an operation between a second operand transmitted in correspondence with the operator and the broadcast first operand.
10. The computing device of claim 9,
Wherein the plurality of first operands are elements of an nth row of a first matrix consisting of a rows and b columns,
The plurality of second operands being elements of an m-th row of a second matrix consisting of c rows and d columns,
Wherein a, b, c, and d are natural numbers,
Wherein n is a natural number equal to or smaller than the a,
And m is a natural number equal to or smaller than c.
11. The computing device of claim 10,
wherein the broadcasting unit sequentially broadcasts the elements of the nth row of the first matrix, and
wherein each of the plurality of operators performs a multiplication operation between one element of the nth row of the first matrix and a corresponding element among the elements of the mth row of the second matrix.
12. The computing device of claim 10,
wherein the zero bit check unit generates a bit string indicating which elements of the nth row of the first matrix have a value of zero, and stores the bit string in the zero bit check buffer.
13. The computing device of claim 10,
wherein a third matrix consisting of e rows and f columns is loaded into a first matrix buffer, and each m-th row of the third matrix is stored in a second matrix buffer so as to be substituted into the m-th column of the second matrix, whereby the second matrix is loaded into the second matrix buffer as a transposed matrix of the third matrix, and
wherein e and f are natural numbers.
14. The computing device of claim 13,
wherein the first matrix is loaded into the first matrix buffer after the second matrix is loaded into the second matrix buffer as the transposed matrix of the third matrix.
15. The computing device of claim 14,
wherein the third matrix is the same matrix as the first matrix, and
wherein the values of a and e are the same and the values of b and f are the same.
16. The computing device of claim 9, further comprising:
a result buffer for cumulatively storing the operation results of each of the plurality of operators,
wherein the result buffer comprises a plurality of arrays corresponding respectively to the plurality of operators, each array storing the operation results of its corresponding operator.
A sensor hub MCU (Micro Controller Unit) comprising the computing device according to any one of claims 9 to 16.
KR1020160017819A 2015-10-30 2016-02-16 Calcuating method and apparatus to skip operation with respect to operator having value of zero as operand KR101843243B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020150152022 2015-10-30
KR20150152022 2015-10-30

Publications (2)

Publication Number Publication Date
KR20170052432A true KR20170052432A (en) 2017-05-12
KR101843243B1 KR101843243B1 (en) 2018-03-29

Family

ID=58740009

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160017819A KR101843243B1 (en) 2015-10-30 2016-02-16 Calcuating method and apparatus to skip operation with respect to operator having value of zero as operand

Country Status (1)

Country Link
KR (1) KR101843243B1 (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4490925B2 (en) 2006-01-16 2010-06-30 株式会社日立製作所 Calculation device, calculation method, and calculation program
GB2436377B (en) 2006-03-23 2011-02-23 Cambridge Display Tech Ltd Data processing hardware

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018228703A1 (en) * 2017-06-16 2018-12-20 Huawei Technologies Co., Ltd. Multiply accumulator array and processor device
WO2019074185A1 (en) 2017-10-12 2019-04-18 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
EP3659073A4 (en) * 2017-10-12 2020-09-30 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
US11113361B2 (en) 2018-03-07 2021-09-07 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
WO2019172685A1 (en) * 2018-03-07 2019-09-12 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
CN111819581A (en) * 2018-03-07 2020-10-23 三星电子株式会社 Electronic device and control method thereof
WO2020190807A1 (en) * 2019-03-15 2020-09-24 Intel Corporation Systolic disaggregation within a matrix accelerator architecture
US11361496B2 (en) 2019-03-15 2022-06-14 Intel Corporation Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
US11676239B2 (en) 2019-03-15 2023-06-13 Intel Corporation Sparse optimizations for a matrix accelerator architecture
US11709793B2 (en) 2019-03-15 2023-07-25 Intel Corporation Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
US11842423B2 (en) 2019-03-15 2023-12-12 Intel Corporation Dot product operations on sparse matrix elements
US11899614B2 (en) 2019-03-15 2024-02-13 Intel Corporation Instruction based control of memory attributes
US11934342B2 (en) 2019-03-15 2024-03-19 Intel Corporation Assistance for hardware prefetch in cache access
US11954062B2 (en) 2019-03-15 2024-04-09 Intel Corporation Dynamic memory reconfiguration
US11954063B2 (en) 2019-03-15 2024-04-09 Intel Corporation Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Also Published As

Publication number Publication date
KR101843243B1 (en) 2018-03-29

Similar Documents

Publication Publication Date Title
KR101843243B1 (en) Calcuating method and apparatus to skip operation with respect to operator having value of zero as operand
US10817260B1 (en) Reducing dynamic power consumption in arrays
EP3602278B1 (en) Systems, methods, and apparatuses for tile matrix multiplication and accumulation
US11816446B2 (en) Systolic array component combining multiple integer and floating-point data types
US11449745B2 (en) Operation apparatus and method for convolutional neural network
US20180329867A1 (en) Processing device for performing convolution operations
EP3451162B1 (en) Device and method for use in executing matrix multiplication operations
US11467806B2 (en) Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range
EP3394723B1 (en) Instructions and logic for lane-based strided scatter operations
EP3910503A1 (en) Device and method for executing matrix addition/subtraction operation
EP3343391A1 (en) Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distributions
US20170177352A1 (en) Instructions and Logic for Lane-Based Strided Store Operations
EP3394722A1 (en) Instructions and logic for load-indices-and-prefetch-gathers operations
US10338920B2 (en) Instructions and logic for get-multiple-vector-elements operations
WO2017172173A1 (en) Instruction, circuits, and logic for graph analytics acceleration
WO2017112246A1 (en) Instructions and logic for load-indices-and-gather operations
EP3394742A1 (en) Instructions and logic for load-indices-and-scatter operations
US9678749B2 (en) Instruction and logic for shift-sum multiplier
US20170177350A1 (en) Instructions and Logic for Set-Multiple-Vector-Elements Operations
US20170091103A1 (en) Instruction and Logic for Indirect Accesses
US20140244987A1 (en) Precision Exception Signaling for Multiple Data Architecture
US20130013283A1 (en) Distributed multi-pass microarchitecture simulation
US20160092400A1 (en) Instruction and Logic for a Vector Format for Processing Computations
US9910669B2 (en) Instruction and logic for characterization of data access
Douma et al. Fast and precise cache performance estimation for out-of-order execution

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant