WO2021246586A1

WO2021246586A1 - Method for accessing parameter for hardware accelerator from memory, and device using same

Info

Publication number: WO2021246586A1
Application number: PCT/KR2020/015480
Authority: WO
Inventors: 정태영
Original assignee: 오픈엣지테크놀로지 주식회사
Priority date: 2020-06-02
Filing date: 2020-11-06
Publication date: 2021-12-09
Also published as: KR20210149396A; KR102418794B1

Abstract

Disclosed is an operation method of a hardware accelerator, comprising steps in which: a hardware accelerator accesses a memory in which the value of one parameter is stored over a plurality of words, so as to read bits stored in some of the words; and a data calculation unit of the hardware accelerator calculates output data of the data calculation unit on the basis of only the bits stored in some of the words.

Description

Method of accessing parameters for hardware accelerator from memory and apparatus using same

The present invention relates to computing technology, and to a data transfer technology in computer hardware for effectively driving a hardware accelerator, particularly a neural network accelerator.

A neural network is a well-known technology used as one of the technologies to implement artificial intelligence.

FIG. 1 is a conceptual diagram illustrating a partial configuration of a neural network presented to aid understanding of the present invention.

The neural network 600 according to an embodiment may include a plurality of layers. Conceptually, the first layer 610 among the plurality of layers may output output data 611 called a feature map or activation. In addition, the output data 611 output from the first layer 610 may be provided as input data of the second layer 620 downstream of the first layer 610 .

Each of the above layers may be regarded as a data conversion function module that converts input data into predetermined output data. For example, the first layer 610 may be regarded as a data conversion function module that converts the input data 609 input to the first layer 610 into the output data 611. To implement such a data conversion function module, the structure of the first layer 610 must be defined. In accordance with the structure of the first layer 610, input variables in which the input data 609 input to the first layer 610 are stored should be defined, and output variables representing the output data 611 output from the first layer 610 should be defined. The first layer 610 may use a set of weights 612 to perform its function. The set of weights 612 may be values multiplied by the input variables to calculate the output variables from the input variables. The set of weights 612 may be one of the various parameters included in the first layer 610.
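As a minimal sketch of the "data conversion function module" view described above, a layer may be modeled as a function that multiplies its input variables by a set of weights to obtain its output variables. The fully connected form, the name `layer_forward`, and all numeric values are illustrative assumptions and are not taken from the specification.

```python
def layer_forward(inputs, weights):
    """Hypothetical layer: each output variable is a weighted sum of the inputs.

    inputs  -- list of input variables (the layer's input data)
    weights -- one row of weights per output variable
    """
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Illustrative values only.
input_data = [1, 2, 3]
weight_set = [[1, 0, -1],
              [2, 1, 0]]
output_data = layer_forward(input_data, weight_set)
print(output_data)  # prints [-2, 4]
```

The output of one such call may in turn be passed as the input of the next layer, which corresponds to the output data 611 being provided to the second layer 620.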

Each of the above parameters may be a number having a given resolution. In a general embodiment, all parameters included in all layers of a specific neural network may have a predetermined fixed resolution. For example, the parameters used by all the layers 610, 620, ... included in the neural network 600 may have a resolution of 8 bits.

The number of parameters used in each layer and the range of their specific values may differ from layer to layer. Also, when the neural network 600 changes from a first state to a second state through learning, the values of the parameters used by a specific layer in the first state may differ from the values of the parameters used by that layer in the second state. In a preferred embodiment, the second state may be a state evolved from the first state. The values of the parameters used by the specific layer may change as a result of the evolution.

The calculation process for calculating the output data 611 output from the first layer 610 from the input data 609 input to the first layer 610 of the neural network 600 may be implemented in software, but may also be implemented in hardware.

FIG. 2 shows a main structure of a neural network computing device including a neural network accelerator in which a function of a neural network is implemented by hardware, and a part of a computing device including the same.

The computing device 1 includes a DRAM (Dynamic Random Access Memory) 10, a neural network computing device 100, a bus 700 connecting the DRAM 10 and the neural network computing device 100, and other hardware 99 connected to the bus 700.

In addition, the computing device 1 may further include a power supply unit, a communication unit, a main processor, a user interface, a storage unit, and peripheral device units (not shown). The bus 700 may be shared by the neural network computing device 100 and other hardware 99 .

The neural network computing device 100 may include a DMA (Direct Memory Access) unit 20, a control unit 40, an internal memory 30, and a neural network accelerator 60.

In order for the neural network accelerator 60 to operate, input data 310 and weights 320 must be provided to the neural network accelerator 60 .

The input data 310 and the weights 320 provided to the neural network accelerator 60 may be output from the internal memory 30 .

The internal memory 30 may receive at least some or all of the input data 310 and the weights 320 from the DRAM 10 through the bus 700 . In this case, in order to move data stored in the DRAM 10 to the internal memory 30 , the DMA unit 20 may control the internal memory 30 and the DRAM 10 .

When the neural network accelerator 60 operates, the output data 330 may be generated based on the input data 310 and the weights 320 . The generated output data 330 may be first stored in the internal memory 30 .

The output data 330 stored in the internal memory 30 may be written to the DRAM 10 under the control of the DMA unit 20 . And/or the output data 330 stored in the internal memory 30 may be regarded as new input data for the neural network accelerator 60 and provided to the neural network accelerator 60 again.

The controller 40 may collectively control the operations of the DMA unit 20 , the internal memory 30 , and the neural network accelerator 60 .

The neural network accelerator 60 may perform, for example, the function of the first layer 610 shown in FIG. 1 during a first time period, and may perform, for example, the function of the second layer 620 shown in FIG. 1 during a second time period.

In an embodiment, a plurality of neural network acceleration units 60 shown in FIG. 2 may be provided to perform operations requested by the control unit 40 in parallel.

The above-described DRAM 10 is a representative volatile memory, and is a memory in which data can be accessed randomly. The DRAM includes a plurality of cells arranged in a matrix form. Each cell may consist of one transistor and one capacitor.

In the DRAM, one word line may be connected to a series of first cells arranged in the first direction (horizontal direction), and these first cells may constitute one word. One bit line may be connected to a series of second cells arranged in the second direction (vertical direction) in the DRAM. That is, a plurality of word lines may be disposed along the first direction (horizontal direction), and a plurality of bit lines may be disposed along the second direction (vertical direction).

In the process of reading a DRAM cell, a plurality of bit lines are first precharged to an intermediate voltage between HIGH and LOW, and a specific word line is set to HIGH. Depending on whether the voltage of the capacitor of each cell connected to the specific word line is HIGH or LOW, the voltage of the corresponding precharged bit line becomes slightly higher or lower. A sense amplifier provided in the DRAM can detect this small change in the voltage of each bit line and accordingly determine whether the value stored in each cell is 1 or 0.

The DRAM reading process may be performed using one word as a basic unit.

While the neural network accelerator 60 shown in FIG. 2 performs an operation to generate the output data 330 from the input data 310 and the weights 320 during a first time period, it is desirable for the internal memory 30 to obtain new weights from the DRAM 10.

That is, for example, the neural network accelerator 60 may perform the function of the first layer 610 using the input data 609 of FIG. 1 and the first set of weights 612 during the first time period. In addition, the neural network accelerator 60 may perform the function of the second layer 620 using the input data 611 of FIG. 1 and the second set of weights 622 during the second time period. In this case, it is preferable that the internal memory 30 obtains the second set of weights 622 from the DRAM 10 while the neural network accelerator 60 performs the function of the first layer 610 . In addition, the input data 611 necessary to perform the function of the second layer 620 may be provided by the output data 330 generated by the neural network accelerator 60 calculation during the first time period.

Here, the time during which the neural network accelerator 60 performs the function of the first layer 610 is referred to as the operation time (TO), and the time required to transfer predetermined data, for example the second set of weights 622 of the second layer 620, from the DRAM 10 to the internal memory 30 is referred to as the transfer time (TT). If transfer time (TT) > operation time (TO), an idle duration (T_idle) occurs in the neural network accelerator 60. This is not preferable because the neural network accelerator 60 cannot be used to the maximum.

FIG. 3 is a timing diagram for explaining a problem that occurs when the time for preparing the input data required for one operation of the neural network accelerator is longer than the time required for that operation.

For example, it may be assumed that the neural network accelerator 60 performs the function of the first layer 610 of FIG. 1. In this case, the neural network accelerator 60 may perform an operation during the first operation time TO1 using the first set of weights 612, which is already prepared. At the same time as the first operation time TO1 starts, the control unit 40 may begin preparing in advance the second set of weights 622 that the neural network accelerator 60 needs to perform the function of the second layer 620 of FIG. 1. It may take a first transfer time TT1 from the start to the completion of the preparation of the second set of weights 622. If the first transfer time TT1 is longer than the first operation time TO1, the neural network accelerator 60 can start the operation for the function of the second layer 620 only after waiting for the first idle time T_idle1.
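The relationship between the transfer time, the operation time, and the idle time described above may be sketched as follows. The sketch assumes that the transfer of the next set of weights starts at the same moment as the current operation, as stated above. The function names `idle_time` and `utilization` and the numeric values are hypothetical.

```python
def idle_time(transfer_time, operation_time):
    """Idle duration T_idle of the accelerator.

    The transfer of the next weights starts together with the current
    operation, so the accelerator waits only for the part of the transfer
    that outlasts the operation.
    """
    return max(0.0, transfer_time - operation_time)


def utilization(operation_times, transfer_times):
    """Fraction of the total time that the accelerator spends computing.

    transfer_times[i] is the time needed to fetch the weights for the layer
    that follows layer i. Both lists use the same arbitrary time unit.
    """
    busy = sum(operation_times)
    idle = sum(idle_time(tt, to)
               for tt, to in zip(transfer_times, operation_times))
    return busy / (busy + idle)


# Illustrative values only: TO1 = 4, TT1 = 6.
print(idle_time(6.0, 4.0))  # prints 2.0, the accelerator waits
print(idle_time(3.0, 4.0))  # prints 0.0, the transfer is hidden
```

Under these assumptions the idle time is zero whenever TT is not greater than TO, which is the condition the invention seeks to reach by shortening TT.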

On the other hand, in order to perform an operation for generating output data of a specific layer from input data of a specific layer of the neural network, (1) time to prepare the input data, (2) time to prepare a set of weights to be used in the calculation; and (3) the time it takes to perform the arithmetic process using the prepared input data and the set of weights may be considered.

Here, the transfer time TT for reading and preparing a set of weights to be used for the calculation from the DRAM 10 increases as the amount of data of the set of weights increases. In addition, the data amount of the set of weights may change for each layer of the neural network.

In addition, the calculation time (TO) required to perform the calculation processing using the prepared data may also vary for each layer of the neural network.

However, for example, in a convolutional neural network (CNN), the ratio of transfer time (TT) to operation time (TO) may be larger for a convolutional layer located further downstream than for a convolutional layer located upstream.
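The following sketch gives one possible reason for this tendency. It assumes that the transfer time is roughly proportional to the number of weights of a convolutional layer and that the operation time is roughly proportional to its number of multiply-accumulate operations. These proportionality assumptions, the function names, and the layer shapes are illustrative and are not taken from the specification.

```python
def conv_weight_count(k, c_in, c_out):
    """Number of weights of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def conv_mac_count(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulate operations: every weight is used once per output position."""
    return conv_weight_count(k, c_in, c_out) * h_out * w_out


def weights_per_mac(k, c_in, c_out, h_out, w_out):
    """Proxy for TT / TO under the stated proportionality assumptions."""
    return conv_weight_count(k, c_in, c_out) / conv_mac_count(k, c_in, c_out, h_out, w_out)


# Hypothetical layer shapes: feature maps shrink and channels grow downstream.
layers = [
    ("upstream", 3, 64, 64, 224, 224),
    ("middle", 3, 256, 256, 56, 56),
    ("downstream", 3, 512, 512, 14, 14),
]
for name, k, c_in, c_out, h, w in layers:
    print(f"{name}: weights per MAC = {weights_per_mac(k, c_in, c_out, h, w):.2e}")
```

In this model the proxy equals 1 / (h_out * w_out), so it grows as the feature map becomes smaller in downstream layers.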

Therefore, for a specific layer of the neural network, the situation of transfer time (TT) > operation time (TO) described above may occur, which is a problem.

The above contents are known by the inventor of the present invention as background knowledge for creating the present invention, and all of the above contents should not be considered as known to an unspecified number of people at the time of filing this patent application.

An object of the present invention is to provide a technique for increasing the utilization of a neural network accelerator composed of hardware by solving the above-described problems.

An object of the present invention is to provide a technique for increasing the utilization of a hardware accelerator.

In the present specification, a hardware accelerator may refer to hardware, particularly hardware optimized for a specific operation of a computing device. A hardware accelerator may refer to a hardware device provided for a specific operation rather than a general purpose. A hardware accelerator may function as part of a computing device. The hardware accelerator may be controlled by a general purpose processor in some cases.

In this specification, a word may mean a unit read from a memory at a time.

In the present specification, activation data or activation may refer to output data output by an arbitrary layer constituting a neural network. The activation may be provided as input data of another layer. Activation may also be referred to as a tensor or a feature map.

In the present specification, generating information about a parameter may include preparing that information by storing, as they are, some or all of the bits constituting one parameter.

A method of operating a hardware accelerator according to an aspect of the present invention includes: a hardware accelerator accessing a memory to obtain only some bits of a plurality of bits constituting one parameter; and calculating, by a data operation part of the hardware accelerator, output data of the data operation part based on the obtained partial bits.

In this case, only the partial bits among the plurality of bits constituting the one parameter may be transmitted from the memory to the hardware accelerator through a bus.

In this case, the memory may support a function of transmitting only the partial bits among the plurality of bits through the bus according to the request of the hardware accelerator.

In this case, the one parameter may be stored across a plurality of words of the memory, and some of the bits may be bits stored in some of the plurality of words.

In this case, the calculating may include: reconstructing, by the hardware accelerator, the value of the one parameter from the values of the obtained partial bits; and calculating, by the data operation unit, output data of the data operation unit based on the restored one parameter.

In this case, the data operation unit is a neural network accelerating part, and an input-to-output characteristic (transfer function) of the neural network accelerator may correspond to an input/output characteristic of one layer defined in the neural network.

In this case, the one parameter may be one value among input data input to the one layer.

In this case, the input data may be weight data used by the one layer or activation data input to the one layer.

According to another aspect of the present invention, there is provided a method of operating a hardware accelerator, comprising the steps of: obtaining, by the hardware accelerator, information about a plurality of parameters by accessing a memory; and calculating, by the data operation unit of the hardware accelerator, output data of the data operation unit based on the obtained information on the plurality of parameters. In this case, the acquiring may include, for each of the parameters, acquiring only some bits of a plurality of bits constituting each of the parameters; for each of the parameters, generating individual information about the parameter using only the obtained partial bits; and generating information on the plurality of parameters by using the set of individual information generated for the plurality of parameters.

In this case, for each of the parameters, only the partial bits among the plurality of bits constituting each of the parameters may be transmitted from the memory to the hardware accelerator through a bus.

In this case, each of the parameters may be stored across a plurality of words of the memory, and some of the obtained bits may be bits stored in some of the plurality of words.

In this case, the transfer time, which is the time from when the hardware accelerator starts accessing the memory to obtain the information about the plurality of parameters until the access ends, may be shorter than the calculation time, which is the time from the start to the end of the calculation process for calculating the output data based on the information on the plurality of parameters.

According to another aspect of the present invention, a method of operating a hardware accelerator includes: acquiring, by a neural network accelerator, information about a first parameter required to perform a function of a first layer defined in a neural network from a memory according to a first read mode; calculating, by the neural network accelerator, the output data of the first layer based on the obtained information on the first parameter; obtaining, by the neural network accelerator, information about a second parameter necessary for performing a function of a second layer defined in the neural network, from the memory according to a second read mode; and calculating, by the neural network accelerator, the output data of the second layer based on the obtained information on the second parameter. In this case, the acquiring of the information on the first parameter from the memory according to the first read mode may include: acquiring only some bits among a plurality of bits constituting the first parameter; and generating information about the first parameter by using only the bits obtained with respect to the first parameter.

In this case, the acquiring of the information about the second parameter from the memory according to the second read mode includes: acquiring only some bits among a plurality of bits constituting the second parameter; and generating information about the second parameter using only the bits obtained with respect to the second parameter. The first read mode is a mode for acquiring only n1 bits among the bits constituting the first parameter, the second read mode is a mode for acquiring only n2 bits among the bits constituting the second parameter, and n1 and n2 may have different values.

In this case, the maximum value of n1 may be the total number of bits constituting the first parameter stored in the memory, or the maximum value of n2 may be the total number of bits constituting the second parameter stored in the memory.

In this case, the first parameter may be a feature map input to the first layer or a first weight used in the first layer, and the second parameter may be a feature map input to the second layer or a second weight used in the second layer.

In this case, the acquiring of the information about the second parameter from the memory according to the second read mode includes: acquiring only some bits among a plurality of bits constituting the second parameter; and generating information about the second parameter by using only the bits obtained with respect to the second parameter. In this case, n1, which is the number of bits obtained among the plurality of bits constituting the first parameter, may be smaller than n2, which is the number of bits obtained among the plurality of bits constituting the second parameter, and the first layer may exist downstream of the second layer.
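The two read modes described above may be sketched as follows. The sketch assumes unsigned 8-bit parameters and models a read mode simply as the number n of most significant bits that are acquired. The names `truncate_to_top_bits` and `fetch_layer_parameters` and all values are hypothetical.

```python
def truncate_to_top_bits(value, total_bits, n):
    """Keep the n most significant bits of an unsigned value; zero the rest."""
    if not 0 < n <= total_bits:
        raise ValueError("n must satisfy 0 < n <= total_bits")
    shift = total_bits - n
    return (value >> shift) << shift


def fetch_layer_parameters(stored, total_bits, n):
    """Model a read mode that acquires only n bits per parameter.

    Returns the parameter information seen by the accelerator and the
    number of bits moved over the bus.
    """
    seen = [truncate_to_top_bits(v, total_bits, n) for v in stored]
    return seen, n * len(stored)


# Second (upstream) layer: read mode with n2 = 8, all bits are acquired.
print(fetch_layer_parameters([200, 37, 129], total_bits=8, n=8))
# prints ([200, 37, 129], 24)

# First (downstream) layer: read mode with n1 = 4 < n2.
print(fetch_layer_parameters([182, 91, 15], total_bits=8, n=4))
# prints ([176, 80, 0], 12)
```

With n1 smaller than n2, the downstream layer moves half as many bits over the bus in this example, at the cost of a coarser value for each parameter.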

According to an aspect of the present invention, a hardware accelerator including a data operation unit and a control unit may be provided. The control unit may be configured to control the hardware accelerator so that the hardware accelerator accesses a memory to obtain only some bits of a plurality of bits constituting one parameter, and to control the data operation unit so that the data operation unit calculates output data of the data operation unit based on the obtained partial bits.

In this case, the controller may be configured to control the memory so that only the partial bits among the plurality of bits constituting the one parameter are transmitted from the memory to the hardware accelerator through a bus.

According to an aspect of the present invention, a computing device including the hardware accelerator and the memory may be provided.

According to the present invention, it is possible to provide a technique for increasing the utilization of a neural network accelerator composed of hardware.

According to the present invention, it is possible to provide a technique for increasing the utilization of a hardware accelerator.

FIG. 4 shows the main structure of a hardware accelerator for performing a predetermined operation and a part of the computing device 1 including the hardware accelerator.

FIG. 5 is a flowchart illustrating a method of operating a hardware accelerator provided according to an embodiment of the present invention.

FIG. 6 illustrates a logical structure of a memory capable of read and write operations in units of words.

FIG. 7 is a flowchart illustrating a method of operating a hardware accelerator provided according to another embodiment of the present invention.

FIG. 8 is a flowchart illustrating a method of operating a hardware accelerator provided according to another embodiment of the present invention.

FIG. 9 is a flowchart illustrating a method of operating a hardware accelerator provided according to another embodiment of the present invention.

FIG. 10 is a flowchart illustrating a method of operating a hardware accelerator provided according to another embodiment of the present invention.

FIG. 11 illustrates a method of operating a hardware accelerator provided according to an embodiment of the present invention.

FIG. 12 illustrates a method of operating a hardware accelerator provided according to another embodiment of the present invention.

FIG. 13 illustrates a method of operating a hardware accelerator provided according to another embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. However, the present invention is not limited to the embodiments described herein and may be implemented in various other forms. The terminology used herein is for the purpose of helping the understanding of the embodiments, and is not intended to limit the scope of the present invention. Also, singular forms used hereinafter include plural forms unless the phrases clearly indicate the opposite.

Hereinafter, an operating method of a hardware accelerator provided according to an embodiment of the present invention will be described with reference to the drawings.

The computing device 1 may include a memory 11, a hardware accelerator 110, a bus 700 connecting the memory 11 and the hardware accelerator 110, and other hardware 99 connected to the bus 700.

In addition, the computing device 1 may further include a power supply unit, a communication unit, a main processor, a user interface, a storage unit, and peripheral device units (not shown). The bus 700 may be shared by the hardware accelerator 110 and other hardware 99 .

The hardware accelerator 110 may include a DMA unit 20 , a control unit 40 , an internal memory 30 , and a data operation unit 610 .

The internal memory 30 may be, for example, a static memory.

In order for the data operation unit 610 to operate, the input data 310 must be provided to the data operation unit 610 .

The input data 310 provided to the data operation unit 610 may be output from the internal memory 30 .

The internal memory 30 may receive at least some or all of the input data 310 from the memory 11 through the bus 700 . In this case, in order to move the data stored in the memory 11 to the internal memory 30 , the DMA unit 20 may control the internal memory 30 and the memory 11 .

When the data operation unit 610 operates, the output data 330 may be generated based on the input data 310 . The generated output data 330 may be first stored in the internal memory 30 .

The output data 330 stored in the internal memory 30 may be written to the memory 11 under the control of the DMA unit 20 . And/or, the output data 330 stored in the internal memory 30 may be regarded as new input data for the data operation unit 610 and provided to the data operation unit 610 again.

The controller 40 may collectively control the operations of the DMA unit 20 , the internal memory 30 , and the data operation unit 610 .

In an embodiment, the data operation unit 610 may perform a function of a first function module having a first input/output characteristic during a first time period, and may perform a function of a second function module having a second input/output characteristic during a second time period. Here, the format of the input data and the format of the output data of the first function module and the second function module may be the same.

In one embodiment, the data operation unit 610 may perform, for example, the function of the first layer 610 shown in FIG. 1 during the first time period, and may perform, for example, the function of the second layer 620 shown in FIG. 1 during the second time period.

In an embodiment, a plurality of data operation units 610 shown in FIG. 4 may be provided to perform operations requested by the control unit 40 in parallel.

In one preferred embodiment, the memory 11 of FIG. 4 may be a DRAM. However, the present invention is not limited thereto, and any memory that can distribute and store the value of one independent parameter over a plurality of words and read each word individually may be used in the present invention.

In a preferred embodiment, the components in the hardware accelerator 110 may be arranged on a single wafer.

In a preferred embodiment, the data operation unit 610 may be a neural network accelerating part 60 shown in FIG. 2 . In this case, the hardware accelerator 110 may be referred to as a neural network computing device shown in FIG. 2 or may also be referred to as a neural network accelerator. In this case, in addition to the input data 310 shown in FIG. 4 , weights 320 may be further provided to the data operation unit 610 . That is, in order for the neural network accelerator 60 to operate, the input data 310 and the weights 320 may be provided to the neural network accelerator 60 . In addition, the input data 310 and the weights 320 provided to the neural network accelerator 60 may be output from the internal memory 30 . The internal memory 30 may receive at least some or all of the input data 310 and the weights 320 from the memory 11 through the bus 700 . In this case, in order to move the data stored in the memory 11 to the internal memory 30 , the DMA unit 20 may control the internal memory 30 and the memory 11 . When the neural network accelerator 60 operates, the output data 330 may be generated based on the input data 310 and the weights 320 . The generated output data 330 may first be stored in the internal memory 30 .

In this case, the input-to-output characteristic (transfer function) of the neural network accelerator may correspond to the input/output characteristic of one layer defined in the neural network to be implemented by the neural network accelerator. The input/output characteristic may mean the conversion rule by which input data input to the one layer is converted into output data output by the one layer, and may be referred to as a transfer function in the present specification.

The one neural network accelerator may perform functions of different layers of the neural network at different times during its operation. Each layer of the neural network may evolve over time according to a predetermined learning rule for the neural network. The specific values of the parameters used by the evolved layer may be different from the specific values of the parameters used by the layer before it evolved. In one embodiment, the values of the parameters before the evolution may already be stored in the memory 11, and the values of the parameters updated as the evolution progresses may either be stored additionally in the memory 11 or be stored by updating the previous values.

In step S100, the hardware accelerator 110 may access the memory 11, in which the value of one parameter is stored over a plurality of words, and read only the bits stored in some of the plurality of words.

Hereinafter, the relationship between the value of the one parameter, the plurality of words, and the bits stored in some of the words will be described with reference to FIG. 6.

Cells in which information constituting a plurality of words may be stored may be provided in the memory 11 . Each word may consist of, for example, n bits. It can be assumed that a parameter to be input to the data operation unit 610 provided according to an embodiment of the present invention can be expressed by a number of bits greater than the n bits. That is, the first parameter of FIG. 6 cannot be expressed by n bits, but may be quantized to a first resolution that can be expressed using a greater number of bits. That is, the first parameter cannot be expressed using one word, but may be expressed using at least two words.

At this time, in order for the DMA unit 20 to obtain the first parameter having the first resolution, both the first word and the second word must be read. The read first word and the second word may sequentially move to the hardware accelerator 110 through the bus 700 .

In an embodiment of the present invention, among the first word and the second word provided to express the first parameter at the first resolution, the DMA unit 20 may read only the word containing the MSB (Most Significant Bit) of the first parameter, for example, the first word. In this way, since only the first word moves to the hardware accelerator 110 through the bus 700, the data movement time is reduced compared to sequentially moving both the first word and the second word.
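The storage layout and the partial read described above may be sketched as follows. The sketch assumes a word width of n = 4 bits, an unsigned 8-bit first parameter, and a layout in which the first word holds the MSB side. The names `store_parameter` and `read_words` are hypothetical.

```python
WORD_BITS = 4  # assumed word width n


def store_parameter(value, num_words, word_bits=WORD_BITS):
    """Split an unsigned value into words, most significant word first."""
    mask = (1 << word_bits) - 1
    return [(value >> (word_bits * i)) & mask
            for i in reversed(range(num_words))]


def read_words(words, count):
    """Read only the first `count` words, i.e. those on the MSB side."""
    return words[:count]


first_parameter = 0b10110110            # 8-bit value, needs two 4-bit words
words = store_parameter(first_parameter, num_words=2)
print(words)                            # prints [11, 6]

full_read = read_words(words, 2)        # two word transfers over the bus
partial_read = read_words(words, 1)     # one word transfer over the bus
print(partial_read)                     # prints [11]
print(len(partial_read) / len(full_read))  # prints 0.5
```

In this model, reading only the first word halves the number of word transfers for the parameter.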

The description now returns to FIG. 5.

In step S200, the data operation unit 610 of the hardware accelerator 110 may calculate the output data of the data operation unit 610 based on only the bits stored in some of the words.

In step S10, the hardware accelerator 110 may access the memory 11, in which the value of one parameter is stored over a plurality of words, and read only the bits stored in some of the plurality of words.

In step S20 , the hardware accelerator 110 may reconstruct the value of the one parameter from the values of the read bits.

The read bits may mean only bits constituting the first word among the first word and the second word.

For example, assuming that each word is composed of 4 bits, the first parameter having the first resolution may be expressed with 8 bits. However, since the bits stored in some of the words (e.g., the first word) total only 4 bits, the value represented by these 4 bits alone cannot be called the first parameter. In the present invention, four bits may be appended below the LSB of the bits constituting the first word, and the first parameter may be restored under a strategy that tolerates an error. Each of the four appended bits may have a value of 0.

If the first parameter stored across the first and second words of the memory 11 has a first resolution of 2^8, the restored first parameter can be understood to have a second resolution of 2^4 (when n = 4). This is because the lower four bits of the restored first parameter are substantially fixed dummy values. That is, the second resolution is lower than the first resolution.

The restored first parameter may not have exactly the same value as the first parameter stored in the memory, and may have a value close to the first parameter stored in the memory.
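The zero-padding restoration described above can be sketched as follows. This is only a minimal illustration, assuming 4-bit words and a parameter stored most significant word first; the names `split_into_words` and `restore_from_first_word` are hypothetical and do not appear in the specification.

```python
WORD_BITS = 4  # assumed word width, matching the 4-bit example in the text


def split_into_words(value, num_words=2, word_bits=WORD_BITS):
    """Split a parameter into words, most significant word first."""
    mask = (1 << word_bits) - 1
    return [(value >> (word_bits * i)) & mask
            for i in reversed(range(num_words))]


def restore_from_first_word(first_word, num_words=2, word_bits=WORD_BITS):
    """Append zero bits below the LSB of the first word (hypothetical helper)."""
    return first_word << (word_bits * (num_words - 1))


original = 0b10110110                      # 8-bit parameter as stored in memory
first_word, second_word = split_into_words(original)
restored = restore_from_first_word(first_word)
error = original - restored                # equals the value of the unread word
```

In this sketch the original value 182 is restored as 176, so the error is 6, which is exactly the content of the unread second word. With 4-bit words the error can never exceed 2^4 - 1 = 15.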

By performing the above-described steps S10 and S20, the resolution of the first parameter provided to the data operation unit 610 is reduced, but it can be understood that there is an advantage in that the time required to read the first parameter from the memory 11 and prepare it is reduced.

In step S30 , the data calculating unit 610 of the hardware accelerator 110 may calculate the output data of the data calculating unit 610 based on the one restored parameter.

The step S20 may be performed in any one of the internal memory 30 , the DMA unit 20 , and the data operation unit 610 according to a command of the control unit 40 .

The method of operating the hardware accelerator is a method of operating the hardware accelerator by accessing a memory in which the value of one parameter is stored over a plurality of words.

In step S210, the hardware accelerator 110 may access the memory 11 to obtain information about a plurality of parameters.

Step S210 may include: for each of the parameters, reading only the bits stored in some of the plurality of words in which the value of the parameter is stored (S211); and, for each of the parameters, generating information about the parameter using only the read bits (S212).

In step S220 , the data calculating unit 610 of the hardware accelerator 110 may calculate the output data of the data calculating unit 610 based on the obtained information on the plurality of parameters.

In step S110 , the hardware accelerator 110 may access the memory 11 to obtain a plurality of parameters.

Any one parameter 510 among the plurality of parameters obtained in step S110 may be obtained by performing a step (S111) in which the hardware accelerator 110 reads only the bits stored in some 512 of the plurality of words 511 and 512 in which the value of that parameter 510 is stored, and a step (S112) in which the hardware accelerator 110 restores the value of that parameter from the read bits.

In step S120 , the data operation unit 610 of the hardware accelerator 110 may calculate the output data of the data operation unit 610 based on the plurality of acquired parameters.

Here, let the transfer time TT be the time from when the hardware accelerator 110 starts accessing the memory 11 to obtain the plurality of parameters until that access ends, and let the calculation time TO be the time the data operation unit 610 requires from starting the calculation process for calculating the output data based on the obtained plurality of parameters to ending that calculation process. The transfer time TT may then be shorter than the calculation time TO.
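The relationship between TT and TO can be illustrated with a toy cycle count. The sketch below assumes that the transfer time is simply proportional to the number of words moved over the bus; the parameter count, the calculation time, and the function name `transfer_cycles` are hypothetical values chosen only for illustration and are not taken from the specification.

```python
def transfer_cycles(num_params, words_read_per_param, cycles_per_word=1):
    """Bus cycles needed to move the requested words (toy model, assumed linear)."""
    return num_params * words_read_per_param * cycles_per_word


NUM_PARAMS = 1000    # hypothetical number of parameters to transfer
CALC_CYCLES = 1500   # hypothetical calculation time TO, in cycles

tt_full = transfer_cycles(NUM_PARAMS, words_read_per_param=2)     # both words
tt_partial = transfer_cycles(NUM_PARAMS, words_read_per_param=1)  # first word only

full_read_is_shorter = tt_full < CALC_CYCLES        # TT < TO when reading everything?
partial_read_is_shorter = tt_partial < CALC_CYCLES  # TT < TO when reading one word?
```

Under these assumed numbers, reading both words gives a TT longer than TO, while reading only the first word brings TT below TO.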

This operation method is a method of operating a neural network accelerator by accessing a memory in which the value of each parameter is stored over a plurality of words.

In step S310 , the neural network accelerator may acquire information about a first parameter required to perform the function of the first layer defined in the neural network from the memory according to the first read mode.

Step S310 may include reading only the bits stored in some of the plurality of words in which the value of the first parameter is stored (S311), and generating information about the first parameter using only the read bits (S312).

In step S320, the neural network acceleration unit of the neural network accelerator may calculate the output data of the first layer based on the obtained information on the first parameter.

In step S330, the neural network accelerator may acquire a second parameter necessary for performing the function of the second layer defined in the neural network from the memory according to the second read mode.

In step S340 , the neural network accelerator may calculate the output data of the second layer based on the obtained second parameter.

In this case, the second read mode may be a mode in which the neural network accelerator accesses the memory and reads all the bits stored in a plurality of words in which the value of the second parameter is stored.

In this case, the step of calculating the output data of the first layer (S320) may include a step (S321) of restoring, by the neural network accelerator, the value of the first parameter from the information generated about the first parameter, and a step (S322) of calculating, by the neural network accelerator, output data of the neural network accelerator on the basis of the restored first parameter.

In this case, the first parameter is the input data of the first layer or a first weight used in the first layer, and the second parameter is the input data of the second layer or a second weight used in the second layer. In addition, the first layer may be a layer existing downstream of the second layer. The neural network may be, for example, a Convolutional Neural Network (CNN).
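The two read modes can be sketched as a single read routine that takes the mode as an argument. This is a hedged illustration, assuming 4-bit words stored most significant word first and zero-padding for unread words; `ReadMode` and `read_parameter` are hypothetical names, not part of the specification.

```python
from enum import Enum


class ReadMode(Enum):
    FIRST = "partial"   # read only some of the words holding the parameter
    SECOND = "full"     # read every word holding the parameter


def read_parameter(words, mode, word_bits=4, partial_words=1):
    """Return (value, words_read) for a parameter stored MSB word first."""
    words_read = partial_words if mode is ReadMode.FIRST else len(words)
    value = 0
    for word in words[:words_read]:
        value = (value << word_bits) | word
    # Unread low-order words are treated as zeros (assumed restoration rule).
    value <<= word_bits * (len(words) - words_read)
    return value, words_read


stored_words = [0b1011, 0b0110]   # one 8-bit parameter held as two 4-bit words
first_mode_value, first_mode_words = read_parameter(stored_words, ReadMode.FIRST)
second_mode_value, second_mode_words = read_parameter(stored_words, ReadMode.SECOND)
```

In the first read mode only one word crosses the bus and the value is restored as 176; in the second read mode both words are read and the exact value 182 is obtained.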

In step S400 , a hardware accelerator may access a memory to acquire only some bits of a plurality of bits constituting one parameter.

In step S410 , the data operation part of the hardware accelerator may calculate output data of the data operation part based on the obtained partial bits.

In step S500 , the hardware accelerator may access the memory to obtain information about a plurality of parameters.

Step S500 may include the following steps S501, S502, and S503.

In step S501, for each of the parameters, only some bits among a plurality of bits constituting each of the parameters may be obtained.

In step S502 , for each of the parameters, individual information about the parameters may be generated using only some of the obtained bits.

In step S503, information about the plurality of parameters may be generated by using the set of individual information generated for the plurality of parameters.

Next, in step S510 , the data calculating unit of the hardware accelerator may calculate output data of the data calculating unit based on the obtained information on the plurality of parameters.
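Steps S501 to S503 and S510 can be sketched for a small set of parameters as follows. The sketch assumes 8-bit parameters of which the upper 4 bits are obtained, takes the zero-padded value as the "individual information", and uses a weighted sum as the calculation of step S510; all of these choices, and the function names, are assumptions made for illustration.

```python
def acquire_parameter_info(params, total_bits=8, kept_bits=4):
    """Hypothetical S500: build information about all parameters from partial bits."""
    dropped = total_bits - kept_bits
    # S501: obtain only the upper bits of each parameter.
    partial_bits = [p >> dropped for p in params]
    # S502: individual information per parameter (assumed: zero-padded value).
    individual_info = [bits << dropped for bits in partial_bits]
    # S503: the set of individual information is the information on all parameters.
    return individual_info


def compute_output(inputs, parameter_info):
    """Hypothetical S510: a weighted sum standing in for the data operation unit."""
    return sum(x * w for x, w in zip(inputs, parameter_info))


weights = [182, 37, 200]   # illustrative 8-bit parameters
inputs = [1, 2, 3]         # illustrative input data

info = acquire_parameter_info(weights)
approx_output = compute_output(inputs, info)
exact_output = compute_output(inputs, weights)
```

With these illustrative values the output computed from partial bits is 816, compared with 856 when every bit is used.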

In step S610, the neural network accelerator may acquire information about a first parameter required to perform a function of the first layer defined in the neural network from the memory according to the first read mode.

Step S610 may include the following steps S611 and S612.

In step S611, only some bits (N1 bits) among a plurality of bits constituting the first parameter may be obtained.

In step S612, information about the first parameter may be generated using only some bits obtained with respect to the first parameter.

In step S620, the neural network acceleration unit of the neural network accelerator may calculate the output data of the first layer based on the obtained information on the first parameter.

In step S630 , the neural network accelerator may acquire information about a second parameter required to perform a function of the second layer defined in the neural network from the memory according to the second read mode.

Step S630 may include the following steps S631 and S632.

In step S631, only some bits (N2 bits) of a plurality of bits constituting the second parameter may be obtained.

In step S632, information about the second parameter may be generated using only some bits obtained for the second parameter.

In step S640, the neural network accelerator may calculate the output data of the second layer based on the obtained information on the second parameter.

Here, N1 may be greater than N2, and in this case, the first layer may be a layer downstream of the second layer.
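Obtaining a different number of bits per layer can be sketched with a helper that keeps only the n most significant bits of a parameter. This is a minimal sketch assuming 8-bit parameters and zero-filling of the bits that were not obtained; `keep_top_bits` is a hypothetical name, and the bit counts 3 and 6 are arbitrary examples rather than values given in the specification.

```python
def keep_top_bits(value, total_bits, n):
    """Keep the n most significant bits of value and zero the remaining bits."""
    if not 0 <= n <= total_bits:
        raise ValueError("n must be between 0 and total_bits")
    dropped = total_bits - n
    return (value >> dropped) << dropped


parameter = 0b10110110                          # illustrative 8-bit parameter
with_3_bits = keep_top_bits(parameter, 8, 3)    # 0b10100000
with_6_bits = keep_top_bits(parameter, 8, 6)    # 0b10110100
with_all_bits = keep_top_bits(parameter, 8, 8)  # unchanged
```

The more bits a layer obtains, the closer the value it works with is to the stored parameter: here 160 with three bits, 180 with six bits, and 182 with all eight.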

By using the above-described embodiments of the present invention, those skilled in the art will be able to easily implement various changes and modifications without departing from the essential characteristics of the present invention. The content of each claim may be combined with other claims that it does not cite, within the scope that can be understood through this specification.

<Acknowledgment>

The present invention was developed by Open Edge Technology Co., Ltd. (the task execution organization) in the course of carrying out the research project for variable-precision high-speed multi-object recognition deep learning processor technology development (task unique number 2020-0-01080, task number 2020-0-01080, research period 2020.04.01 ~ 2024.12.31), which belongs to the next-generation intelligent semiconductor technology development (design) - artificial intelligence processor business, a research program supported by the Ministry of Science and ICT and the Information and Communication Planning and Evaluation Institute affiliated with the National Research Foundation.

Claims

A hardware accelerator comprising a data operation unit and a control unit, wherein

the control unit is configured to:

control the hardware accelerator so that the hardware accelerator performs a step of accessing a memory and obtaining only some bits of a plurality of bits constituting one parameter; and

control the data operation unit so that the data operation unit of the hardware accelerator performs a step of calculating output data of the data operation unit based on the obtained partial bits.
The hardware accelerator according to claim 1, wherein the control unit controls the memory so that only the partial bits of the plurality of bits constituting the one parameter are transmitted from the memory to the hardware accelerator through a bus.
The hardware accelerator according to claim 1, wherein

the calculating step comprises:

restoring, by the hardware accelerator, the value of the one parameter from the values of the obtained partial bits; and

calculating, by the data operation unit, output data of the data operation unit based on the restored one parameter.
The hardware accelerator according to claim 1, wherein the data operation unit is a neural network accelerating part, and the input-to-output characteristic (transfer function) of the neural network accelerating part corresponds to the input-output characteristic of one layer defined in a neural network.
The hardware accelerator according to claim 4, wherein

the one parameter is a value of one of the input data input to the one layer, and

the input data is weight data used by the one layer or activation data input to the one layer.
The hardware accelerator according to claim 1, wherein

the obtaining step is a step in which the hardware accelerator accesses the memory to obtain information about a plurality of parameters including the one parameter,

the calculating step is a step of calculating, by the data operation unit of the hardware accelerator, output data of the data operation unit based on the obtained information about the plurality of parameters, and

the obtaining step comprises:

obtaining, for each of the parameters, only some bits of a plurality of bits constituting the parameter;

generating, for each of the parameters, individual information about the parameter using only the obtained partial bits; and

generating the information about the plurality of parameters by using the set of individual information generated for the plurality of parameters.
The hardware accelerator according to claim 6, wherein a transfer time, which is the time from when the hardware accelerator starts accessing the memory to obtain the information about the plurality of parameters until the access ends, is shorter than a calculation time, which is the time the data operation unit requires from starting a calculation process for calculating the output data based on the information about the plurality of parameters to ending the calculation process.
A method of operating a hardware accelerator, the method comprising:

acquiring, by a neural network accelerator, information about a first parameter necessary to perform a function of a first layer defined in a neural network, from a memory according to a first read mode;

calculating, by a neural network acceleration unit of the neural network accelerator, output data of the first layer based on the obtained information on the first parameter;

acquiring, by the neural network accelerator, information about a second parameter necessary to perform a function of a second layer defined in the neural network, from the memory according to a second read mode; and

calculating, by the neural network acceleration unit, output data of the second layer based on the obtained information on the second parameter,

wherein the acquiring of the information about the first parameter from the memory according to the first read mode comprises: acquiring only some bits among a plurality of bits constituting the first parameter; and generating the information about the first parameter using only the bits obtained for the first parameter.
The method of claim 8, wherein

the acquiring of the information about the second parameter from the memory according to the second read mode comprises: acquiring only some bits among a plurality of bits constituting the second parameter; and generating the information about the second parameter using only the bits obtained for the second parameter,

the first read mode is a mode for acquiring only n1 bits among the bits constituting the first parameter,

the second read mode is a mode for acquiring only n2 bits among the bits constituting the second parameter, and

n1 and n2 are different values.
The method of claim 9, wherein

the maximum value of n1 is the total number of bits constituting the first parameter stored in the memory, or

the maximum value of n2 is the total number of bits constituting the second parameter stored in the memory.
The method of claim 8, wherein

the first parameter is a feature map input to the first layer or a first weight value used in the first layer, and

the second parameter is a feature map input to the second layer or a second weight value used in the second layer.
The method of claim 8, wherein

the acquiring of the information about the second parameter from the memory according to the second read mode comprises: acquiring only some bits among a plurality of bits constituting the second parameter; and generating the information about the second parameter using only the bits obtained for the second parameter,

n1, which is the number of bits obtained among the plurality of bits constituting the first parameter, is smaller than n2, which is the number of bits obtained among the plurality of bits constituting the second parameter, and

the first layer is present downstream of the second layer.
A computing device comprising the hardware accelerator of claim 1 and the memory.