WO2023013649A1

WO2023013649A1 - Data cache device and program

Info

Publication number: WO2023013649A1
Application number: PCT/JP2022/029689
Authority: WO
Inventors: 崇吉田
Original assignee: 株式会社エヌエスアイテクス; 株式会社デンソー
Priority date: 2021-08-06
Filing date: 2022-08-02
Publication date: 2023-02-09
Also published as: JPWO2023013649A1

Abstract

A data cache device for temporarily holding digital data comprising: a data holding unit (101) that holds data; a data input interface unit (102) for inputting data from outside of the data cache device; a data output interface unit (103) for outputting data to outside of the data cache device; a selector unit (104) that selects one or a plurality of paths simultaneously among a path for transferring the data from the data input interface unit to the data holding unit, a path for transferring the data from the data input interface unit to the data output interface unit, and a path for transferring the data from the data holding unit to the data output interface unit; a selector control unit (105) that controls the selector unit; and a read prediction table unit (107) that holds the number of times of planned reading of the data which is read during processing. The selector unit is controlled in accordance with the number of times of reading that is set in the read prediction table unit.

Description

Data cache device and program

Cross-reference to related applications

This application is based on Japanese Patent Application No. 2021-129531 filed on August 6, 2021, and claims the benefit of priority thereto. The entire contents of that patent application are incorporated herein by reference.

The present disclosure relates to a data cache device that temporarily holds digital data.

For example, in an arithmetic unit that executes neural network processing, inference is performed by continuously executing convolutional operations. That is, the convolution operation is performed on the original input data by a predetermined algorithm, and the convolution operation is further performed on the output data by the following algorithm.

Here, the process of performing a convolution operation on one piece of input data and obtaining an output is called a layer. In a neural network, an algorithm for convolution operation is defined for each layer, and a network is constructed by combining these layers.

To perform the convolution operation, the input data is multiplied by weight values and the results are added, that is, a sum-of-products operation is performed. The weight values are determined by the kernel size defined by the convolution algorithm. Since this convolution process is required for each element of the input data, the convolution operation of a neural network requires reading a large amount of weight data.

JP 2018-88256 A

If these weights are read from an external storage device each time they are needed, a huge number of read accesses occur, which increases read power. In many cases, the read time from the external storage device also becomes the bottleneck of the overall processing. As a semiconductor device capable of reducing power consumption, the device of Patent Document 1 is known, for example. In Patent Document 1, the power supply to sets with a low access frequency among the sets in the cache memory is cut off.

A neural network processing device may be equipped with a cache device. A typical cache device stores data once read from the outside in an internal cache device based on temporal proximity or positional proximity. When the same data is read again, the data is read from the cache device without accessing to the outside. This cache device can be expected to have the effect of hiding the time delay caused by reading data from the outside. Also, it is possible to reduce the power consumption due to the difference in implementation between the external storage device and the cache device.

However, in neural network processing, multiple layers may be processed together. In other words, input data may be cut out in processing units, and multiple layers of processing may be continuously performed on the cut out input data. As a result, the output data of a layer can be immediately reused as the next input data, and unnecessary memory accesses and delays can be reduced.

As a result of detailed study, the inventor found that such processing requires frequently switching the weights of each layer and performing complicated reads. Consequently, the temporal or positional proximity on which a typical cache device relies is not always guaranteed, resulting in an increase in cache misses.

Therefore, an object of the present disclosure is to provide a data cache device that is effective in reducing power consumption.

The present disclosure employs the following technical means to solve the above problems. The scope of claims is an example showing the corresponding relationship with specific means described in the embodiment described later as one aspect, and does not limit the technical scope of the present invention.

A data cache device of the present disclosure is a data cache device that temporarily holds digital data, and includes: a data holding unit that holds data; a data input interface unit for inputting data from outside the data cache device; a data output interface unit for outputting data to outside the data cache device; a selector unit that simultaneously selects one or more of a path for transferring data from the data input interface unit to the data holding unit, a path for transferring data from the data input interface unit to the data output interface unit, and a path for transferring data from the data holding unit to the data output interface unit; a selector control unit that controls the selector unit; and a read prediction table unit that holds the scheduled number of times the data is read during processing. The selector unit is controlled according to the number of reads set in the read prediction table unit.

A program of the present disclosure is a program for controlling a data cache device that includes: a data holding unit that holds data; a data input interface unit that inputs data from outside the data cache device; a data output interface unit that outputs data to outside the data cache device; and a selector unit that simultaneously selects one or more of a path for transferring data from the data input interface unit to the data holding unit, a path for transferring data from the data input interface unit to the data output interface unit, and a path for transferring data from the data holding unit to the data output interface unit. The program causes a computer to function as read prediction table means for holding the scheduled number of times the data is read during processing, and as selector control means for controlling the selector unit according to the number of reads set in the read prediction table means.

A program according to another aspect of the present disclosure is a program that creates a table which sets, for the data necessary for an arithmetic device to execute an operation, whether the data is to be read from a data cache device or from an external storage device. The program causes a computer to function as read count means for counting the number of times each piece of data is read during execution of the operation, by performing an arithmetic simulation of the arithmetic device with an arithmetic simulator, and as table creation means for creating the table.

The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description with reference to the accompanying drawings. The drawings are as follows:
FIG. 1 is a diagram showing the configuration of a data cache device according to the first embodiment;
FIG. 2 is a diagram showing the overall configuration of an arithmetic device including a data cache device according to the first embodiment;
FIG. 3 is a flowchart showing the operation of the data cache device of the first embodiment;
FIG. 4 is a diagram showing the configuration of a data cache device according to the second embodiment;
FIG. 5 is a diagram showing the configuration of a data cache device according to the third embodiment;
FIG. 6 is a flowchart showing the operation of the data cache device of the third embodiment;
FIG. 7 is a diagram showing the configuration of a data cache device according to the fourth embodiment.

A data cache device according to an embodiment of the present disclosure will be described below with reference to the drawings.
(First embodiment)
FIG. 1 shows the configuration of the data cache device 1 of the first embodiment, and FIG. 2 shows the overall configuration including the data cache device 1, the external storage device 2, the arithmetic device 3, and the arithmetic algorithm control unit 4. As shown in FIG. 2, the data cache device 1 is connected to the external storage device 2, the arithmetic device 3, and the arithmetic algorithm control unit 4. The data cache device 1 inputs data from the external storage device 2, holds part of the data, and outputs the data to the arithmetic device 3. The arithmetic algorithm control unit 4 manages the layer parameters and algorithms of the neural network, and controls the data cache device 1 and the arithmetic device 3 in conjunction with each other.

Next, the internal configuration of the data cache device 1 of this embodiment will be described. 101 is a data holding unit. This data holding unit 101 is a cache memory that temporarily holds the Weight data of the neural network in this embodiment. The data holding unit 101 is implemented with an SRAM or flip-flop circuit. Note that the data holding unit 101 is also called a "local buffer".

102 is a data input interface unit. The data input interface unit 102 is connected to the external storage device 2 and receives data from the external storage device 2. A bus interface or the like may be provided between the data input interface unit 102 and the external storage device 2, as long as the data can be logically acquired. The data input interface unit 102 functions as a bus master, and acquires data from the external storage device 2 in response to a request transmitted from the arithmetic device 3 via the selector unit 104.

103 is a data output interface. The data output interface section 103 is connected to the arithmetic device 3 .

　104 is a selector unit. The selector unit 104 is a circuit that selects the destination of input data from the data input interface unit 102 and the source of request data from the data output interface unit 103 .

105 is a selector control unit. The selector control unit 105 controls the selector unit 104, and controls the connection of the selector unit 104 in units of instruction execution cycles. The selector control unit 105 simultaneously selects one or more paths from among a path for transferring data from the data input interface unit 102 to the data holding unit 101, a path for transferring data from the data input interface unit 102 to the data output interface unit 103, and a path for transferring data from the data holding unit 101 to the data output interface unit 103.
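The simultaneous selection of one or more of the three paths can be sketched in software as a set of flags. This is only an illustrative model, not the circuit of the disclosure; the names `Path` and `select_paths` and the three boolean inputs are assumptions introduced for the example.

```python
from enum import Flag, auto


class Path(Flag):
    """The three transfer paths the selector unit can enable (names are hypothetical)."""
    NONE = 0
    INPUT_TO_HOLD = auto()    # data input interface unit -> data holding unit
    INPUT_TO_OUTPUT = auto()  # data input interface unit -> data output interface unit
    HOLD_TO_OUTPUT = auto()   # data holding unit -> data output interface unit


def select_paths(fill_holding_unit, serve_external, serve_holding_unit):
    """Return the combination of paths to enable in one cycle."""
    paths = Path.NONE
    if fill_holding_unit:
        paths |= Path.INPUT_TO_HOLD
    if serve_external:
        paths |= Path.INPUT_TO_OUTPUT
    if serve_holding_unit:
        paths |= Path.HOLD_TO_OUTPUT
    return paths
```

For example, `select_paths(True, False, True)` enables filling the data holding unit while data already held is output in the same cycle.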

106 is a read counter unit. The read counter unit 106 has a function of counting the number of times the data output interface unit 103 is accessed. It internally holds a plurality of counters, and independently counts readout for each address requested by the arithmetic unit 3 via the data output interface unit 103 .

　107 is a read prediction table section. The reading prediction table unit 107 is implemented by an SRAM or a flip-flop circuit, and stores the number of times the weight value in each layer is read. The predicted number of times may be written as an initial value when the system is started, and then updated at any timing. Also, depending on the implementation policy, only the magnitude relationship with the previously set read number threshold may be retained.

108 is a read prediction table control unit. The read prediction table control unit 108 acquires the algorithm information specified by the arithmetic algorithm control unit 4, reads the predicted number of weight data reads for the corresponding layer from the read prediction table unit 107, and issues a selector selection signal to the selector control unit 105. In addition, the read prediction table control unit 108 has a function of acquiring read counter information from the read counter unit 106 and updating the values of the read prediction table unit 107 based on instructions from the arithmetic algorithm control unit 4.

FIG. 3 is a flowchart showing the operation of the data cache device 1. In the example described below, the arithmetic device 3 performs computation for moving image processing using a neural network. The arithmetic device 3 performs image processing on each frame constituting the moving image. More specifically, a frame image or feature map to be processed is divided into a plurality of processing unit regions called tiles, and inference processing by the neural network is performed on each tile. The inference processing performed on each frame is the same.

In step S001, the read prediction table control unit 108 receives a list of algorithms to be processed from the arithmetic algorithm control unit 4. For example, when multiple layers of a neural network are processed collectively, the information includes the number of layers, the number of kernels in each layer, the number of input data channels, the number of output data channels, and the input activation size. Based on the received information, the read prediction table control unit 108 calculates the required weight holding amount and the number of tiles over which the computation is repeated. Alternatively, the arithmetic algorithm control unit 4 itself may calculate the weight holding amount, and the read prediction table control unit 108 may receive the result.
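The calculation in step S001 can be illustrated with a small sketch. The disclosure does not give the formulas, so the following assumes a standard convolution layer whose weight count is kernels x input channels x kernel height x kernel width, and a frame divided into fixed-size tiles; `LayerParams`, `weight_bytes`, `tile_count`, and the parameter `bytes_per_weight` are hypothetical names.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class LayerParams:
    num_kernels: int   # number of kernels in the layer
    in_channels: int   # number of input data channels
    kernel_h: int      # kernel height (assumed parameter)
    kernel_w: int      # kernel width (assumed parameter)


def weight_bytes(layer, bytes_per_weight=1):
    """Weight holding amount of one layer under the assumed layout."""
    count = layer.num_kernels * layer.in_channels * layer.kernel_h * layer.kernel_w
    return count * bytes_per_weight


def tile_count(act_h, act_w, tile_h, tile_w):
    """Number of tiles needed to cover an activation of act_h x act_w."""
    return ceil(act_h / tile_h) * ceil(act_w / tile_w)
```

The sum of `weight_bytes` over the layers processed together is the quantity compared with the capacity of the data holding unit 101.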

The read prediction table control unit 108 determines whether or not the required weight holding amount is equal to or less than the capacity of the data holding unit. If the amount of weight required for multi-layer processing is equal to or less than the capacity of the data holding unit 101 ("Yes" in S001), the reading prediction table control unit 108 continues the process of step S002.

In step S002, the reading prediction table control unit 108 receives information on the number of tiles when processing multiple layers from the arithmetic algorithm control unit 4, and determines whether the number of tiles is two or more.

When the number of tiles to be processed is 1 ("No" in S002), there is no power or performance advantage in transferring the data to the data holding unit 101 and holding it temporarily. The process therefore proceeds to step S003, and the read prediction table control unit 108 updates the read prediction table unit 107 so that the data is transferred directly from the external storage device 2 without going through the data holding unit 101. If the number of tiles is 2 or more ("Yes" in S002), the process proceeds to step S004, and the read prediction table unit 107 is updated with a flag indicating that the entire region of the prediction table is to be transferred to the data holding unit 101. Note that the order of the determinations in steps S001 and S002 may be reversed.

Subsequently, in step S012, data is transferred from the external storage device 2 according to the read prediction table unit 107. Specifically, when the number of tiles to be processed is 1, the data is transferred directly without going through the data holding unit 101, and when the number of tiles to be processed is 2 or more, the data is transferred from the external storage device 2 to the data holding unit 101.

On the other hand, if the amount of weight data exceeds the capacity of the data holding unit 101 in step S001 ("No" in S001), the process proceeds to step S005. In step S005, the read prediction table unit 107 is updated so that weight data is transferred to the data holding unit 101 in descending order of the number of tiles processed, for as many layers as fit within the capacity of the data holding unit 101.
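The decisions in steps S001 to S005 can be summarized in a short sketch. It is a software model under stated assumptions rather than the actual control logic: each layer is given as a hypothetical tuple `(name, weight_bytes, tiles_processed)`, the table is a dictionary whose value is True when the weights are to be held in the data holding unit, and in S005 a layer that does not fit is skipped so that smaller layers can still be placed, which the text does not specify.

```python
def init_prediction_table(layers, num_tiles, capacity):
    """Build the initial read prediction table (True = hold in the data holding unit)."""
    total = sum(size for _, size, _ in layers)
    if total <= capacity:
        # S001 "Yes": all weights fit. S002 decides whether holding them pays off.
        hold_all = num_tiles >= 2          # S004 if True, S003 if False
        return {name: hold_all for name, _, _ in layers}

    # S001 "No" -> S005: fill the holding unit in descending order of tiles processed.
    table = {name: False for name, _, _ in layers}
    free = capacity
    for name, size, _tiles in sorted(layers, key=lambda item: item[2], reverse=True):
        if size <= free:                   # assumption: skip layers that do not fit
            table[name] = True
            free -= size
    return table
```

With three layers of 100, 300, and 200 bytes and a capacity of 350 bytes, only the layers with the highest tile counts that fit are marked for the data holding unit.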

Step S006 is a step of transferring the Weight data to be read from the data holding unit 101 from the external storage device 2 to the data holding unit 101. The readout prediction table control unit 108 sends a readout request to the data input interface unit 102 for weight data that needs to be read out from the data holding unit 101 . Alternatively, the request may be notified via the arithmetic algorithm control unit 4 .

The data input interface unit 102 transfers Weight data to be newly read from the data holding unit 101 from the external storage device 2 to the data holding unit 101 . At this time, the data input interface unit 102 functions as a bus master, generates the address of the external storage device 2 from the weight position information held in the read prediction table, and issues a data transfer request. Alternatively, by providing a DMA controller or the like between the external storage device 2 and the data input interface unit 102, equivalent functions may be replaced.

Step S007 and the subsequent steps are performed after processing of the first tile has started. Step S007 is the timing at which the arithmetic algorithm control unit 4 sends a processing start signal to the arithmetic device 3, and the arithmetic device 3 issues, to the data output interface unit 103, a request to acquire the necessary weight data together with its address. The data output interface unit 103 receives the weight data transfer request from the arithmetic device 3 together with the address. At this time, the data output interface unit 103 transmits the requested address information to the selector control unit 105.

The selector control unit 105 refers to the read prediction table unit 107 to determine whether the requested data exists in the external storage device 2 or is held in the data holding unit 101, controls the selector unit 104, and connects a path to the corresponding storage area so that the data output interface unit 103 can access the data. As a path establishment signal, the selector control unit 105 notifies the data output interface unit 103 of map information indicating whether the data exists in the external storage device 2 or in the data holding unit 101.

The data output interface unit 103 increments the read counter unit 106 by the requested transfer data amount, separately for each combination of the layer to which the requested weight data belongs and the map information returned from the selector control unit 105. Based on the map information, the read counter unit 106 counts and holds the number of times the weight data is read from the data holding unit 101 and the number of times the weight data is read directly from the data input interface unit 102.
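The lookup and counting performed for each request in step S007 can be modeled as follows. This is a minimal sketch, assuming the read prediction table is a dictionary from layer name to a flag and that counts are kept per (layer, source) pair; `ReadCounter` and `serve_request` are hypothetical names, not part of the disclosure.

```python
from collections import Counter


class ReadCounter:
    """Counts weight reads per (layer, source); a software stand-in for unit 106."""

    def __init__(self):
        self.counts = Counter()

    def record(self, layer, source, amount=1):
        self.counts[(layer, source)] += amount

    def total(self, layer):
        """Reads of one layer, regardless of where the data came from."""
        return sum(n for (name, _), n in self.counts.items() if name == layer)


def serve_request(layer, prediction_table, counter):
    """Pick the source for one weight request and count the read."""
    # prediction_table maps layer -> True when its weights sit in the data holding unit.
    source = "holding_unit" if prediction_table.get(layer, False) else "external"
    counter.record(layer, source)
    return source
```

After one frame has been processed this way, `counter.total(layer)` gives the per-layer read count that is evaluated in the later steps.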

In step S008, the arithmetic unit 3 continues to acquire Weight data via the data output interface unit 103 until all tiles are completed. During this time, data output interface section 103 continues step S007. If the processing of all tiles is completed in step S008 ("Yes" in S008), the process proceeds to step S009. In the processing up to this point, the read counter unit 106 counts the number of times the weight data is read in each layer when one frame is processed.

In step S009, the read prediction table control unit 108 reads the count data for each layer from the read counter unit 106, and compares it with a threshold specified in advance by the arithmetic algorithm control unit 4. The threshold is calculated from the amount of power consumed for reading and transferring data. That is, the threshold is determined based on the amount of power required to transfer data from the external storage device 2 without going through the data holding unit 101, and the amount of power required to transfer data from the external storage device 2 to the data holding unit 101 and then read the data from the data holding unit 101.

When the number of readings is equal to or greater than the threshold, reading the weight data from the data holding unit 101 consumes less power than obtaining the data directly from the external storage device 2 . The reading prediction table control unit 108 compares the weight data reading count with the threshold value, and sets a flag for the weight data of the layer whose reading count is equal to or greater than the threshold value. In step S010 , the flag information set in step S009 is read out and written in the readout prediction table section 107 .
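One way to derive such a threshold is sketched below. The disclosure does not state the formula, so this assumes a simple energy model: each direct read from the external storage device costs `e_ext`, filling the data holding unit once costs `e_fill`, and each read from the data holding unit costs `e_buf`. Holding the data pays off when e_fill + n * e_buf <= n * e_ext, that is, when n >= e_fill / (e_ext - e_buf). All names are hypothetical.

```python
from math import ceil


def read_threshold(e_ext, e_buf, e_fill):
    """Smallest read count at which using the data holding unit costs no more energy.

    Returns None when a read from the holding unit is not cheaper than an
    external read, in which case holding the data never pays off.
    """
    if e_ext <= e_buf:
        return None
    return ceil(e_fill / (e_ext - e_buf))


def flag_layers(read_counts, threshold):
    """Flag layers whose read count is equal to or greater than the threshold."""
    return {layer: count >= threshold for layer, count in read_counts.items()}
```

With `e_ext=10`, `e_buf=2`, and `e_fill=12` in arbitrary energy units, the break-even point is 1.5 reads, so the threshold is 2.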

Note that a flag may be set in the read prediction table unit 107 to read from the data holding unit 101 in descending order of the read count, that is, in descending order of the read amount, without performing threshold evaluation in step S009.

Step S011 is a step of transferring data to the data holding unit 101 based on the information in the read prediction table unit 107. The read prediction table control unit 108 sends a read request to the data input interface unit 102 for the weight data that newly needs to be read from the data holding unit 101. Alternatively, the request may be notified via the arithmetic algorithm control unit 4. The data input interface unit 102 writes the weight data that is newly to be read from the data holding unit 101 over the area occupied by the weight data that is newly to be read directly from the external storage device 2. Alternatively, step S006 may be executed again based on the newly updated contents of the read prediction table unit 107.

In step S013, the remaining frames are processed using the readout prediction table section 107 and the data holding section 101 prepared by the above process.

In the present embodiment, weight data has been described as an example of the data held in the data holding unit 101. However, it goes without saying that the present disclosure is not limited to weight data, and may also be applied to other data such as activation data and other weight data.

The configuration of the data cache device 1 of this embodiment has been described above. A program that realizes the functions described above is stored in storage means, and a CPU executes the program, thereby realizing the data cache device 1 that performs the above-described control. Such programs are also included within the scope of this disclosure.

When processing the first frame, the data cache device 1 of the present embodiment sets a flag for the weight data of each layer whose read count is equal to or greater than the threshold, and writes the flag to the read prediction table unit 107. Since the weight data is transferred to the data holding unit 101 based on the read prediction table unit 107, the weight data stored in the data holding unit 101 can be used when processing the remaining frames, which makes it possible to improve arithmetic performance while suppressing power consumption. In addition, when processing the remaining frames, the power spent on evaluation may be reduced by stopping the counting and comparison processing of the present embodiment.

(Second embodiment)
FIG. 4 is a diagram showing the configuration of the data cache device 5 according to the second embodiment. The basic configuration of the data cache device 5 of the second embodiment is the same as that of the data cache device 1 of the first embodiment, but the data cache device 5 of the second embodiment differs in that it handles compressed weight data.

The data cache device 5 is connected to the external storage device 2, arithmetic device 3, and arithmetic algorithm control section 4 described in FIG. The data cache device 5 inputs data from the external storage device 2 , holds a part of the data, and outputs the data to the arithmetic device 3 . The arithmetic algorithm control unit 4 manages layer parameters and algorithms of the neural network, and controls the data cache device 5 and the arithmetic device 3 in conjunction.

Next, the internal configuration of the data cache device 5 of the second embodiment will be explained.
501 is a data holding unit. This data holding unit 501 is a cache memory that temporarily holds Weight data. The data holding unit 501 is implemented with an SRAM or flip-flop circuit.

　502 is a data input interface unit. Data input interface unit 502 is connected to external storage device 2 and receives data from external storage device 2 . As for the connection between the data input interface section 502 and the external storage device 2, a bus interface or the like may be provided in the middle as long as the data can be logically acquired. The data input interface section 502 includes a function as a bus master, and acquires data from the external storage device 2 in response to a request transmitted from the arithmetic device 3 via the selector section 504 .

　503 is a data output interface unit. The data output interface section 503 is connected to the arithmetic device 3 .

　504 is a selector part. A selector unit 504 is a circuit that selects the destination of input data from the data input interface unit 502 and the source of request data from the data output interface unit 503 .

505 is a selector control unit. A selector control unit 505 is a portion that controls the selector unit 504, and controls connection of the selector unit 504 in units of instruction execution cycles.

　506 is a read counter unit. A read counter unit 506 has a function of counting the number of accesses to the data output interface unit 503 . It internally holds a plurality of counters, and independently counts readout for each address requested by the arithmetic unit 3 via the data output interface unit 503 .

　507 is a read prediction table section. The reading prediction table section 507 is implemented by an SRAM or a flip-flop circuit, and stores the number of times weight data is read for each layer. The predicted number of times may be written as an initial value when the system is started, and then updated at any timing. Also, depending on the implementation policy, only the magnitude relationship with the previously set read number threshold may be retained.

508 is a read prediction table control unit. The read prediction table control unit 508 acquires the algorithm information specified by the arithmetic algorithm control unit 4, reads the predicted number of weight data reads for the corresponding layer from the read prediction table unit 507, and issues a selector selection signal to the selector control unit 505. In addition, the read prediction table control unit 508 has a function of acquiring read counter information from the read counter unit 506 and updating the values of the read prediction table unit 507 based on instructions from the arithmetic algorithm control unit 4.

　509 is a data decompression unit. A data decompression unit 509 decompresses the compressed weight data transferred based on the data request of the arithmetic unit 3 by a predetermined method. The method of compression is not limited to any particular method, and its algorithm does not affect the validity of this disclosure, but may be, for example, run-length encoding.
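As a concrete illustration of the run-length encoding mentioned as one possible compression method, the following sketch encodes a weight sequence as (value, run length) pairs and decodes it again. The pair format and the names `rle_encode` and `rle_decode` are assumptions made for the example; the disclosure does not fix the encoding.

```python
from itertools import groupby


def rle_encode(values):
    """Compress a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out
```

Sparse weight data with long runs of zeros compresses well under this scheme, which reduces the amount of data that has to be held or transferred before decompression.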

The operation of the data cache device 5 of the second embodiment follows the flowchart described for the first embodiment. However, when the prediction table is created in steps S002 and S003 to S005, the amount of weight data after compression must be calculated.

In the present embodiment, weight data has been described as an example. However, it goes without saying that the present disclosure is not limited to weight data, and may also be applied to other data such as activation data and other weight data.

Similarly to the data cache device 1 of the first embodiment, the data cache device 5 of the second embodiment can improve the computational performance while suppressing power consumption.

(Third Embodiment)
FIG. 5 is a diagram showing the configuration of the data cache device 6 according to the third embodiment. The data cache device 6 is connected to the external storage device 2 , arithmetic device 3 and arithmetic algorithm control section 4 . The data cache device 6 inputs data from the external storage device 2 , holds a part of the data, and outputs the data to the arithmetic device 3 . The arithmetic algorithm control unit 4 manages layer parameters and algorithms of the neural network and controls the arithmetic unit 3 .

　7 is the Weight placement program. This weight placement program 7 is a program for determining whether to place the weight data in the external storage device 2 or in the data holding unit 601 .

　8 is a calculation simulator. The arithmetic simulator 8 is software that logically simulates the operations of the external storage device 2 , the arithmetic device 3 , the arithmetic algorithm control section 4 and the data cache device 6 .

Next, the internal configuration of the data cache device 6 of this embodiment will be described.
601 is a data holding unit. This data holding unit 601 is a cache memory that temporarily holds Weight data. The data holding unit 601 is implemented with an SRAM or flip-flop circuit.

602 is a data input interface unit. The data input interface unit 602 is connected to the external storage device 2 and receives data from the external storage device 2 . As for the connection between the data input interface unit 602 and the external storage device 2, a bus interface or the like may be provided in the middle as long as the data can be logically acquired. The data input interface unit 602 includes a function as a bus master, and acquires data from the external storage device 2 in response to requests transmitted from the arithmetic unit 3 via the selector unit 604 .

603 is a data output interface unit. The data output interface section 603 is connected to the arithmetic device 3 .

604 is a selector unit. The selector unit 604 is a circuit that selects the destination of input data from the data input interface unit 602 and the source of request data from the data output interface unit 603. According to the request address issued by the data output interface unit 603, the selector unit 604 connects the data output interface unit 603 to either the data holding unit 601 or the data input interface unit 602.

605 is a selector control unit. A selector control unit 605 is a portion that controls whether data is transferred from the data input interface unit 602 to the data holding unit 601 or data is sent according to a request from the data output interface unit 603 .

606 is a data arrangement table. The data arrangement table section 606 is implemented by an SRAM or a flip-flop circuit, and records which layer's Weight value is held in the data holding section 601 .

Next, the internal configuration of the weight placement program 7 will be described.
Reference numeral 701 denotes weight arrangement table creation means. Weight arrangement table creation means 701 determines whether to arrange the weight data of each layer in the external storage device 2 or hold it in the data holding unit 601 based on the access frequency prediction.

　702 is read count means, which calculates the number of weight readouts for each layer. This read count calculation is performed by causing the calculation simulator 8 to execute layer processing, thereby acquiring the read count of each weight data from the calculation simulator 8 .

The weight placement table creation means 701 and the read count means 702 are configured by program modules, and their functions are exhibited by executing the weight placement program 7 by a computer.

Next, FIG. 6 is a diagram showing the operation of the data cache device 6 of the third embodiment.
S101 is a step in which the weight placement program 7 performs an arithmetic simulation using the arithmetic simulator 8. The weight placement program 7 uses the arithmetic simulator 8 to calculate the number of accesses to the weight data, based on the information input to the arithmetic algorithm control unit 4, such as the algorithm, the number of kernels, the number of layers, and the input data size. In step S102, the weight arrangement table creation means 701 executes this process for all tile layers, and lists the number of accesses for each layer in the read count means 702. After step S102 is completed ("Yes" in step S102), the process proceeds to step S103.

Step S103 is a step of counting the number of times of Weight access for each layer. In step S103, the weight arrangement table creation means 701 sorts the read counts for each layer held in the read count means 702 in descending order.

Steps S104 to S106 are steps for creating a weight arrangement table. The weight allocation table creation means 701 sets the address of the allocation table so as to arrange data in the data holding unit 601 in descending order of the number of weight accesses based on the capacity of the data holding unit 601 previously acquired. For example, when the address of the data holding unit 601 and the address of the external storage device 2 are designed independently, the address of the data holding unit 601 should be mapped to the weight area with the high access frequency. The processing in step S104 is performed in descending order of the number of accesses, and in step S105, it is determined whether the amount of data held in the data holding unit 601 is full. If the area of the data holding unit 601 is used up in step S105 ("Yes" in step S105), the process proceeds to step S106.

Step S106 is a step of specifying an area to be allocated in the external storage device 2. In step S106, an address in the external storage device 2 is specified for the weight data of the layers that could not be held in the data holding unit 601.
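The table creation in steps S103 to S106 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the weight placement program 7: the names `Placement` and `build_placement_table`, the per-layer size dictionary, and the rule that the data holding unit is treated as full as soon as one layer does not fit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    location: str  # "holding_unit" or "external" (hypothetical labels)
    address: int   # start offset inside the chosen memory


def build_placement_table(read_counts, sizes, holding_capacity):
    """Assign each layer's weights to the holding unit or external storage.

    read_counts: layer name -> simulated number of weight reads (S101/S102)
    sizes: layer name -> weight size, in the same unit as holding_capacity
    """
    table = {}
    holding_addr = 0
    external_addr = 0
    holding_full = False
    # S103: visit layers in descending order of read count.
    ordered = sorted(read_counts, key=lambda name: read_counts[name], reverse=True)
    for layer in ordered:
        size = sizes[layer]
        # S104/S105: place in the holding unit until its capacity is used up.
        if not holding_full and holding_addr + size <= holding_capacity:
            table[layer] = Placement("holding_unit", holding_addr)
            holding_addr += size
        else:
            # S106: every remaining layer gets an external storage address.
            holding_full = True
            table[layer] = Placement("external", external_addr)
            external_addr += size
    return table
```

For example, with read counts of 50, 10, and 5 for three layers of sizes 60, 40, and 10 and a capacity of 80, only the most frequently read layer is mapped to the data holding unit. The two remaining layers receive consecutive external addresses.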

Step S107 is a step of transferring the created weight arrangement table to the data arrangement table section 606 of the data cache device 6. In step S107, the weight placement program 7 transfers the created weight arrangement table information to the data arrangement table section 606 for actual processing. The table may be read in via the data input interface section 102, or the data arrangement table section 606 may be memory-mapped from an external area and written using an external master or the like (not shown).

Step S108 is a step of transferring, into the data holding unit 601, the Weight data that is to be read from the data holding unit 601 according to the table transferred to the data arrangement table section 606. The data input interface unit 602 reads the data arrangement table, the selector control unit 605 switches the selector unit 604 to the data holding unit 601, and the weights for which the table specifies an address in the data holding unit 601 are transferred from the external storage device 2. The transfer may be performed using the data input interface unit 102, or by connecting an external master to the data holding unit 601.

Step S109 is a step of transferring data during arithmetic processing. In this step, a request for data from the arithmetic unit 3 reaches the selector unit 604 via the data output interface unit 603, and the data is acquired from the data holding unit 601 or the data input interface unit 602. The data output interface unit 603 receives the request for the necessary weight data from the arithmetic unit 3, refers to the data arrangement table, and generates an address for the area where the weight is stored. Step S109 is repeated until the arithmetic processing is completed.
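The address generation in step S109 can be illustrated with a short sketch. It assumes, hypothetically, that each table entry is a pair of a location label and a base address and that both memories are modelled as indexable Python containers. The function name `fetch_weight` and its parameters are not part of the disclosure.

```python
def fetch_weight(layer, offset, placement_table, holding_unit, external_storage):
    """Resolve one weight request through the data arrangement table.

    placement_table: layer name -> (location, base_address), where location
    is "holding_unit" or "external" (hypothetical encoding of the table).
    """
    location, base = placement_table[layer]
    address = base + offset  # address of the area where the weight is stored
    if location == "holding_unit":
        # Served from the data holding unit, with no external access.
        return holding_unit[address]
    # Otherwise fetched from external storage through the input interface.
    return external_storage[address]
```

A call such as `fetch_weight("conv2", 1, table, holding, external)` returns the second weight word of the layer from whichever memory the table names.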

The data cache device 6 of the third embodiment includes a weight placement program 7 and a calculation simulator 8, and determines whether the weight data should be placed in the external storage device 2 or held in the data holding unit 601 based on the results of the calculation simulation. The data allocation table section 606 is updated based on the determination result. By arranging the weight data according to this data arrangement table section 606, it is possible to improve the calculation performance while suppressing the power consumption.

(Fourth embodiment)
FIG. 7 is a diagram showing the configuration of the data cache device 9 of the fourth embodiment. The basic configuration of the data cache device 9 of the fourth embodiment is the same as that of the data cache device 1 of the first embodiment, except that three arithmetic units 3 are provided as output destinations. Although FIG. 7 illustrates a case where there are three output destinations, the number of arithmetic units 3 serving as output destinations is not limited to three, and may be two, or four or more.

The data cache device 9 is connected to the external storage device 2, arithmetic device 3, and arithmetic algorithm control section 4 described in FIG. The data cache device 9 inputs data from the external storage device 2 , holds a part of the data, and outputs the data to the arithmetic device 3 . The arithmetic algorithm control unit 4 manages layer parameters and algorithms of the neural network, and controls the data cache device 9 and the arithmetic device 3 in conjunction with each other.

Next, the internal configuration of the data cache device 9 of the fourth embodiment will be explained.
901 is a data holding unit. This data holding unit 901 is a cache memory that temporarily holds Weight data. The data holding unit 901 is implemented with an SRAM or flip-flop circuit.

902 is a data input interface unit. The data input interface section 902 is connected to the external storage device 2 and receives data from the external storage device 2. As for the connection between the data input interface section 902 and the external storage device 2, a bus interface or the like may be provided in between, as long as the data can be logically acquired. The data input interface section 902 functions as a bus master, and acquires data from the external storage device 2 in response to a request transmitted from the arithmetic device 3 via the selector section 904.

903 is a data output interface unit. The data output interface section 903 is connected to the three arithmetic units 3.

904 is a selector unit. A selector unit 904 is a circuit that selects the destination of input data from the data input interface unit 902 and the source of request data from the data output interface unit 903 .

905 is a selector control unit. A selector control unit 905 is a portion that controls the selector unit 904, and controls connection of the selector unit 904 in units of instruction execution cycles.

906 is a read counter unit. A read counter unit 906 has a function of counting the number of accesses to the data output interface unit 903 . It internally holds a plurality of counters, and independently counts readout for each address requested by each arithmetic unit 3 via the data output interface unit 903 .

907 is a read prediction table section. The reading prediction table section 907 is implemented by an SRAM or a flip-flop circuit, and stores the number of times the Weight value in each layer is read. In this embodiment, the number of times of reading is stored for each arithmetic unit 3 . The predicted number of times may be written as an initial value when the system is started, and then updated at any timing. Also, depending on the implementation policy, only the magnitude relationship with the previously set read number threshold may be retained.

908 is a read prediction table control unit. The read prediction table control unit 908 acquires the algorithm information specified by the arithmetic algorithm control unit 4, reads the predicted number of weight data reads for the corresponding layer from the read prediction table section 907, and issues a selector selection signal to the selector control unit 905. In addition, the read prediction table control unit 908 has a function of acquiring read counter information from the read counter unit 906 and updating the values of the read prediction table section 907 based on instructions from the arithmetic algorithm control unit 4.
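The interplay between the read counter unit 906 and the read prediction table section 907 can be sketched as follows. This is only an illustration under assumptions. Counts are keyed by a pair of arithmetic unit and layer rather than by raw address, and the update simply overwrites the predicted value with the observed count. The text does not fix either choice, and the names `ReadCounter` and `update_prediction_table` are hypothetical.

```python
from collections import defaultdict


class ReadCounter:
    """Independent counters per (arithmetic unit, layer) pair."""

    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, unit, layer):
        # Called once for each read request seen at the output interface.
        self._counts[(unit, layer)] += 1

    def snapshot(self):
        return dict(self._counts)

    def reset(self):
        self._counts.clear()


def update_prediction_table(prediction_table, counter):
    """Copy observed counts into the prediction table, then clear the counters.

    Assumed policy: an observed count replaces the stored prediction, and
    entries with no observation keep their previous (for example initial) value.
    """
    for key, count in counter.snapshot().items():
        prediction_table[key] = count
    counter.reset()
```

Entries written as initial values at system start therefore stay in place until the corresponding arithmetic unit actually reads that layer.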

The operation of the data cache device 9 of the fourth embodiment conforms to the flowchart of FIG. However, in step S007, the data requested by the data output interface unit 103 is sent back from the selector control unit 105 together with information indicating from which arithmetic unit the data is requested and information indicating which layer Weight is. The requested transfer data amount of the read counter unit 106 is incremented for each piece of map information obtained.

In addition, in S009, the read count that is compared with the threshold is the read count linked to the arithmetic units 3 used for the processing. For example, assume that the read prediction table section 907 stores the read count of the weight data of each layer in association with each of three arithmetic units A, B, and C. If arithmetic units A and B are used and arithmetic unit C is not used in the upcoming processing, the sum of the read counts linked to arithmetic units A and B is compared with the threshold, and if this sum is equal to or greater than the threshold, the Weight data is flagged.
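The comparison in S009 for multiple arithmetic units can be sketched as follows. The function name `layers_to_flag` and the dictionary layout keyed by a pair of arithmetic unit and layer are assumptions made for illustration only.

```python
def layers_to_flag(prediction_table, active_units, threshold):
    """Return the layers whose weights should be flagged for holding.

    prediction_table: (unit, layer) -> stored read count
    active_units: arithmetic units used in the upcoming processing
    """
    totals = {}
    for (unit, layer), count in prediction_table.items():
        if unit in active_units:
            # Sum only the counts linked to the units that will be used.
            totals[layer] = totals.get(layer, 0) + count
    # Flag a layer when its summed count is at or above the threshold.
    return {layer for layer, total in totals.items() if total >= threshold}
```

With counts of 3, 4, and 10 stored for one layer under units A, B, and C, a threshold of 5, and only A and B in use, the sum is 7 and the layer is flagged. The count for C is ignored.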

Similarly to the data cache device 1 of the first embodiment, the data cache device 9 of the fourth embodiment can improve the computational performance while suppressing power consumption.

Further, since the data cache device 9 of the fourth embodiment counts the number of reads for each arithmetic unit 3 and stores these counts in the read prediction table section 907, the Weight data to be stored in the data holding unit 901 can be determined according to the arithmetic units 3 that will execute the upcoming processing.

Although the data cache device according to the embodiment of the present disclosure has been described above, the present disclosure is not limited to the above-described embodiment.
In the above-described embodiment, an example has been described in which, when the processing of all tiles is completed ("Yes" in S008 in FIG. 3), a flag is set for data whose read count is equal to or greater than the threshold (S009), the read prediction table is updated (S010), and data is transferred to the data holding unit based on the updated read prediction table (S011). In other words, the data that is effective in reducing power consumption when held in the data holding unit 101 is determined in units of frames, but this unit does not necessarily have to be a frame. That is, when the processing of one tile is completed, a flag may be set for the data of the layer whose read count is equal to or greater than the threshold. This configuration is particularly effective for a neural network whose tile size is the same regardless of the layer.

Claims

A data cache device that temporarily holds digital data, comprising:
a data holding unit (101) that holds data;
a data input interface unit (102) for inputting data from outside the data cache device;
a data output interface unit (103) for outputting data to the outside of the data cache device;
a selector unit (104) that simultaneously selects one or more of a path for transferring data from the data input interface unit to the data holding unit, a path for transferring data from the data input interface unit to the data output interface unit, and a path for transferring data from the data holding unit to the data output interface unit;
a selector control unit (105) that controls the selector unit; and
a read prediction table unit (107) that holds the scheduled number of times the data is read during processing,
wherein the selector unit is controlled according to the number of reads set in the read prediction table unit.
The data cache device according to claim 1, further comprising:
a read counter unit (106) that counts the number of reads from the data output interface unit; and
a read prediction table control unit (108) that sets, in the read prediction table unit, the number of reads according to an algorithm,
wherein the read prediction table control unit updates the read prediction table unit according to the output of the read counter unit.
A data cache device according to claim 2,
The read counter section counts and holds the number of times data is read from the data holding section and the number of times data is read directly from the data input interface section.
A data cache device according to claim 2,
The read prediction table control unit has a list of algorithms to be executed by an arithmetic unit (3) that acquires data from the data cache device and processes it, calculates the predicted number of reads according to the algorithm, and updates the read prediction table unit.
A data cache device according to claim 2,
The read counter unit counts the number of times of reading for each output destination to which the data cache device outputs data,
The read prediction table control unit updates the read prediction table unit according to the read count for each output destination.
A program for controlling a data cache device comprising: a data holding unit for holding data; a data input interface unit for inputting data from outside the data cache device; a data output interface unit for outputting data to the outside of the data cache device; and a selector unit for simultaneously selecting one or more of a path for transferring data from the data input interface unit to the data holding unit, a path for transferring data from the data input interface unit to the data output interface unit, and a path for transferring data from the data holding unit to the data output interface unit, the program causing a computer to function as:
read prediction table means for holding the scheduled number of times the data is read during processing; and
selector control means for controlling the selector unit according to the number of reads set in the read prediction table means.
The program according to claim 6, further causing the computer to function as:
read counter means for counting the number of reads from the data output interface unit; and
read prediction table control means for setting, in the read prediction table means, the number of reads according to an algorithm, and updating the read prediction table means according to the output of the read counter means.
A program according to claim 7,
A program for causing the computer to function so that the read counter means counts and holds the number of times data is read from the data holding section and the number of times data is read directly from the data input interface section.
A program according to claim 7,
A program for causing the computer to function so that the read prediction table control means has a list of algorithms executed by an arithmetic unit that acquires and processes data from the data cache device, calculates the predicted number of reads according to the algorithm, and updates the read prediction table means.
A program according to claim 7,
causing the read counter means to count the number of times of reading for each output destination to which the data cache device outputs data;
A program for making a computer function such that the readout prediction table control means updates the readout prediction table means according to the number of times of reading for each of the output destinations.
A program for creating a table that sets whether data necessary for an arithmetic device to execute an operation is to be read from a data cache device or from an external storage device, the program causing a computer to function as:
read count means (702) for performing an arithmetic simulation of the arithmetic unit with an arithmetic simulator and counting the number of times each data is read during arithmetic execution; and
table creation means (701) for creating the table based on the number of times each data is read, as counted by the read count means.
A program according to claim 11,
A program for causing a computer to function so that the table creation means creates the table based on the number of times each data has been read and the size of the data holding unit of the data cache device.