CN109902807B

CN109902807B - Many-core chip distributed thermal modeling method based on recurrent neural network

Info

Publication number: CN109902807B
Application number: CN201910148729.6A
Authority: CN
Inventors: 王海; 肖涛; 唐迪娅
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2022-07-05
Anticipated expiration: 2039-02-27
Also published as: CN109902807A

Abstract

The invention belongs to the field of electronic design automation and discloses a many-core chip distributed thermal modeling method based on a recurrent neural network. Dynamic thermal management can manage the temperature of many-core chips efficiently, and a good many-core chip thermal model is a key aid to dynamic thermal management. However, in conventional lumped thermal modeling of many-core chips, the computational cost increases exponentially as the number of cores increases. To solve the excessive computational cost of the lumped thermal model, the invention provides a many-core chip distributed thermal modeling method based on a recurrent neural network. The invention can simulate the temperature characteristics of many-core chips with high speed and high accuracy.

Description

Many-core chip distributed thermal modeling method based on recurrent neural network

Technical Field

The invention belongs to the field of electronic design automation, relates to the technical field of deep learning, and particularly relates to a many-core chip distributed thermal modeling method based on a recurrent neural network.

Background

As advances in semiconductor processing continue to shrink chip feature sizes, commercial 7 nm chips had entered volume production by 2018. Once chips reached the nanometer scale, leakage current made it difficult to raise the clock frequency further, so the path to higher performance became increasing the number of cores rather than the clock frequency, and this approach has achieved remarkable results.

The increase in the number of cores greatly improves the performance of many-core chips, but it also brings serious thermal reliability problems, mainly because high power density drives the chip temperature too high.

To address the thermal reliability problem of many-core chips, dynamic thermal management has been proposed as an efficient and low-cost solution. The scheme is based on automatic control theory and obtains the desired temperature distribution through accurate estimation and real-time adjustment of power consumption. Dynamic thermal management can manage temperature efficiently at a low performance overhead, especially when the number of cores is small. However, when the number of cores on the chip is too large, the lumped thermal model becomes too large and its computational cost increases exponentially with the number of cores, so that the processor performance overhead caused by thermal management becomes excessive.

To solve the above problems, distributed thermal modeling of many-core chips is one of the issues that most urgently needs to be addressed.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a many-core chip distributed thermal modeling method based on a recurrent neural network. The modeling method decomposes the many-core chip thermal model into several small models; in the extreme case, a thermal model is built for each core of the many-core chip, with limited information exchanged between cores. The method first builds a recurrent neural network model and then trains the network on offline temperature and power data. The trained recurrent neural network can predict the temperature of each core on the chip.

Thermal modeling is performed for each core of the many-core chip; because the cores are at different positions, their thermal models also differ. Given the position of a core, its temperature can be calculated from its own power and the temperatures of the surrounding cores. The thermal model of each core is built with a recurrent neural network, which can fit nonlinear functions and process data in the form of vector sequences. The input-layer weights of the recurrent neural network are split: the input weight matrix is divided into two parts, one applied to the power input and the other to the temperature input, which handles the fact that the input contains both power and temperature. For each core of the many-core chip, the order in which the surrounding temperatures are selected is fixed to reduce errors: the adjacent cores are not ordered by ascending core number; instead, selection starts from the core directly above and proceeds clockwise until the temperatures of all adjacent cores have been taken. This makes errors less likely when processing the outermost cores, which border the external environment and therefore have different sets of adjacent positions.
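The fixed clockwise selection order can be illustrated with a short sketch. It assumes a 4x4 chip whose cores are numbered 1 to 16 row by row, and it represents a missing neighbor at the chip edge as `None`; the function name `clockwise_neighbors` and the `None` placeholder are illustrative choices, not part of the patent text.

```python
from typing import List, Optional


def clockwise_neighbors(core: int, rows: int = 4, cols: int = 4) -> List[Optional[int]]:
    """Return the neighbors of `core` in a fixed order: above, right, below, left.

    Assumption: cores are numbered 1..rows*cols row by row. A neighbor that
    would fall outside the chip (the external environment) is reported as None.
    """
    row, col = divmod(core - 1, cols)
    # Start directly above the core, then proceed clockwise.
    order = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    result: List[Optional[int]] = []
    for d_row, d_col in order:
        r, c = row + d_row, col + d_col
        if 0 <= r < rows and 0 <= c < cols:
            result.append(r * cols + c + 1)
        else:
            result.append(None)
    return result


print(clockwise_neighbors(6))  # [2, 7, 10, 5]
print(clockwise_neighbors(1))  # [None, 2, 5, None]
```

Because every core uses the same four slots in the same order, a given position in the temperature input always refers to the same direction, including for edge and corner cores. How the empty slot of an outermost core is filled (for example with an ambient temperature) is not specified in the text and is left open here.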

The invention adopts the following technical scheme to solve the problems:

Step one: the thermal model parameters of the many-core chip, mainly the thermal capacitance and thermal resistance parameters across the whole chip, are extracted from HotSpot, and a thermal model of the many-core chip is established.

Step two: the thermal model is used to generate multiple groups of data (multiple time nodes, each containing the power and temperature of every core), which are then divided into a training set and a verification set. The training set is used to train the recurrent neural network; the verification set is used only to verify the trained network, and its data are never used for training.
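As a rough illustration of step two, the sketch below generates power and temperature traces from a toy RC thermal grid and splits them chronologically. All numeric values (`r_lat`, `r_amb`, `cap`, `dt`, `t_amb`, the random power range) are hypothetical stand-ins rather than parameters extracted from HotSpot, and the forward-Euler update and the 80/20 split are assumptions, not the patent's procedure.

```python
import random
from typing import List, Tuple


def simulate_grid(steps: int, rows: int = 4, cols: int = 4, r_lat: float = 2.0,
                  r_amb: float = 4.0, cap: float = 0.5, dt: float = 0.01,
                  t_amb: float = 45.0, seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Forward-Euler simulation of a toy RC thermal grid (hypothetical parameters)."""
    rng = random.Random(seed)
    n = rows * cols
    temps = [t_amb] * n
    powers_log, temps_log = [], []
    for _ in range(steps):
        # Random per-core power for this time node.
        powers = [rng.uniform(0.0, 5.0) for _ in range(n)]
        new_temps = []
        for i in range(n):
            r, c = divmod(i, cols)
            # Heat leaving core i toward the ambient and toward its neighbors.
            flow = (temps[i] - t_amb) / r_amb
            for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    flow += (temps[i] - temps[rr * cols + cc]) / r_lat
            new_temps.append(temps[i] + dt * (powers[i] - flow) / cap)
        temps = new_temps
        powers_log.append(powers)
        temps_log.append(temps)
    return powers_log, temps_log


def split_chronologically(samples: list, train_fraction: float = 0.8) -> Tuple[list, list]:
    """Keep the earliest samples for training and the rest for verification."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


powers, temps = simulate_grid(steps=100)
train_set, verification_set = split_chronologically(list(zip(powers, temps)))
print(len(train_set), len(verification_set))  # 80 20
```

A chronological split keeps each set a contiguous sequence, which suits a sequence model; other splitting schemes are equally compatible with the text.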

Step three: the training set is fed into an untrained recurrent neural network model (whose weight matrices are randomly initialized) to obtain the network's output. Because the network has not yet been trained, there is a large gap between its output and the true output. To obtain an accurate recurrent neural network model, the weight matrices are adjusted so that the temperature output of the network is as close as possible to the temperatures in the training set. The goal therefore becomes minimizing a loss function by adjusting the weight matrices of the recurrent neural network: the smaller the loss, the closer the network's output is to the true output.
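The text does not name a specific loss function. One common choice, shown here purely as an assumption, is the mean squared error between the predicted and reference temperatures of core i over the K time nodes of the training set, where \hat{T}_i(k; W) is the network output and W collects all weight matrices:

```latex
L(W) = \frac{1}{K} \sum_{k=1}^{K} \left( \hat{T}_i(k; W) - T_i(k) \right)^2
```

With this choice, L(W) = 0 exactly when the network reproduces every training temperature, and larger values indicate a larger gap between the network output and the true output.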

Step four: the loss function is optimized with a gradient descent algorithm. A learning rate is set, the partial derivatives of the loss function with respect to each weight matrix of the recurrent neural network are computed, and the weight matrices are updated iteratively. Training ends when the loss function no longer decreases or the set maximum number of iterations is reached, and the loss value at that point is recorded as the training error. The verification set is then fed into the trained recurrent neural network; no weight updates are performed on the verification set, and only the loss value is recorded as the verification error. Next, the number of hidden layers and the number of neurons per hidden layer are changed, a new model is retrained, and its training and verification errors are recorded. Finally, the model with the smallest verification error is selected as the thermal model of the chip core. A thermal model is trained in this way for each core on the many-core chip, and the per-core models are combined to form the distributed thermal model of the many-core chip.
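The training and selection loop of step four can be sketched as follows. This is a minimal illustration under several assumptions that are not in the source: a single hidden layer with tanh activation, a mean-squared-error loss, partial derivatives estimated by forward differences instead of analytic backpropagation, inputs and targets normalized to roughly [-1, 1], and only the number of hidden neurons varied (the text also varies the number of hidden layers). All function names are hypothetical.

```python
import math
import random


def unpack(params, n_in, n_hidden):
    """Split the flat parameter list into W_ih, W_oh and W_ho."""
    a = n_hidden * n_in
    return params[:a], params[a:a + n_hidden], params[a + n_hidden:]


def predict(params, n_in, n_hidden, inputs):
    """Run a one-hidden-layer RNN with output-to-hidden feedback over a sequence."""
    w_ih, w_oh, w_ho = unpack(params, n_in, n_hidden)
    out, outputs = 0.0, []
    for x in inputs:
        hidden = [
            math.tanh(sum(w_ih[j * n_in + i] * x[i] for i in range(n_in)) + w_oh[j] * out)
            for j in range(n_hidden)
        ]
        out = sum(w_ho[j] * hidden[j] for j in range(n_hidden))
        outputs.append(out)
    return outputs


def loss(params, n_in, n_hidden, inputs, targets):
    """Mean squared error between predicted and reference outputs (assumed loss)."""
    preds = predict(params, n_in, n_hidden, inputs)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def train(n_in, n_hidden, inputs, targets, lr=0.05, max_iters=200, eps=1e-6, seed=0):
    """Gradient descent; stops when the loss stops decreasing or at max_iters."""
    rng = random.Random(seed)
    params = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden * (n_in + 2))]
    current = loss(params, n_in, n_hidden, inputs, targets)
    for _ in range(max_iters):
        grad = []
        for idx in range(len(params)):
            bumped = list(params)
            bumped[idx] += eps
            # Forward-difference estimate of the partial derivative.
            grad.append((loss(bumped, n_in, n_hidden, inputs, targets) - current) / eps)
        candidate = [p - lr * g for p, g in zip(params, grad)]
        new = loss(candidate, n_in, n_hidden, inputs, targets)
        if new >= current:  # the loss no longer decreases: stop training
            break
        params, current = candidate, new
    return params, current  # `current` is the training error


def select_model(train_data, val_data, n_in, hidden_sizes=(1, 2, 4)):
    """Train one model per hidden size; keep the one with the lowest verification error."""
    best = None
    for n_hidden in hidden_sizes:
        params, train_err = train(n_in, n_hidden, *train_data)
        val_err = loss(params, n_in, n_hidden, *val_data)  # no weight update here
        if best is None or val_err < best["val_err"]:
            best = {"n_hidden": n_hidden, "params": params,
                    "train_err": train_err, "val_err": val_err}
    return best
```

Calling `select_model((train_inputs, train_targets), (val_inputs, val_targets), n_in)` once per core would yield one selected model per core, which together would form the distributed model. A practical implementation would more likely use an automatic-differentiation framework than finite differences; the structure of the loop stays the same.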

Compared with the prior art, the beneficial effect of the invention is as follows: because a recurrent neural network can effectively fit nonlinear functions, the many-core chip distributed thermal model built with it can simulate the temperature characteristics of the many-core chip with high accuracy and fast response.

Drawings

The invention is further illustrated with reference to the following figures and examples.

Fig. 1 shows the layout and numbering of a 16-core chip and the position of the 6th core.

Fig. 2 shows the thermal model structure of the 6th core and its positional relationship to the adjacent cores.

Fig. 3 is a structural diagram of a recurrent neural network whose recurrent connection runs from the output layer to the hidden layer.

Fig. 4 compares the temperature predicted by the thermal model of the 6th core with the actual temperature.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the examples of the present invention are described clearly and completely below with reference to the accompanying drawings. The described examples are evidently only some, not all, of the examples of the present invention. All other examples that a person skilled in the art can obtain from the examples of the present invention without inventive effort fall within the scope of the present invention.

Fig. 1 shows the layout and numbering of a 16-core chip and the position of the 6th core.

In this example of the invention, a 16-core chip is considered, numbered as shown in Fig. 1. Each core is modeled individually, and the per-core models are then combined to form the distributed thermal model of the whole many-core chip. The 6th core is highlighted in the figure, and the model-building process is explained below using the 6th core as an example.

Fig. 2 depicts the thermal model of the 6th core and its positional relationship to the adjacent cores: it is adjacent to cores 2, 5, 7, and 10, to which it is connected by thermal resistors; it also has a grounded thermal capacitor and an externally input power.
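The structure described for the 6th core corresponds to a standard thermal RC node. As a sketch, assuming a lateral thermal resistance R_{6,j} to each adjacent core j, a thermal capacitance C_6, and an input power P_6 (symbol names chosen here for illustration, and with no separate path to ambient since the text mentions none), the heat balance of the node would read:

```latex
C_6 \frac{dT_6(t)}{dt} = P_6(t) + \sum_{j \in \{2, 7, 10, 5\}} \frac{T_j(t) - T_6(t)}{R_{6,j}}
```

Discretizing in time with step \Delta t gives T_6(k+1) = T_6(k) + (\Delta t / C_6) [ P_6(k) + \sum_j (T_j(k) - T_6(k)) / R_{6,j} ], which shows why the temperature of a core can be computed from its own power and the temperatures of its adjacent cores.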

Fig. 3 is a structural diagram of the recurrent neural network. To account for nonlinear effects, the invention uses a neural network to build the thermal model of each core. Because the temperature takes a series of values over time and thus forms a vector sequence, and because a recurrent neural network is a neural network specialized for sequence modeling that handles such data well, a recurrent neural network is ultimately used to build the thermal model of each core. Here the inputs P_i(k) and T_{i_near}(k) denote, respectively, the power of the i-th core and the temperatures of the cores adjacent to the i-th core at time k; the state H_i(k), also called the hidden layer of the recurrent neural network, represents the state of the i-th core at time k; and the output T_i(k) denotes the temperature of the i-th core at time k. W_{ih} is the weight matrix from the input layer to the hidden layer, W_{ho} is the weight matrix from the hidden layer to the output layer, and W_{oh} is the weight matrix from the output layer to the hidden layer. To let the recurrent neural network fit the function better, there may be multiple hidden layers; the figure shows a recurrent neural network with only one hidden layer.
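A single time step of such a network can be sketched as below. The sketch assumes a tanh activation, no bias terms, and a scalar output; it splits W_{ih} into a power part `w_p` and a temperature part `w_t`, following the weight split described for the input layer. The names `rnn_step`, `run_sequence`, `w_p`, and `w_t` are illustrative, not taken from the source.

```python
import math
from typing import List, Sequence, Tuple


def rnn_step(power: float, neighbor_temps: Sequence[float], prev_output: float,
             w_p: Sequence[float], w_t: Sequence[Sequence[float]],
             w_oh: Sequence[float], w_ho: Sequence[float]) -> Tuple[List[float], float]:
    """One step of a one-hidden-layer RNN whose recurrence runs from output to hidden layer."""
    hidden = []
    for j in range(len(w_p)):
        pre = w_p[j] * power                                       # power part of W_ih
        pre += sum(w * t for w, t in zip(w_t[j], neighbor_temps))  # temperature part of W_ih
        pre += w_oh[j] * prev_output                               # feedback through W_oh
        hidden.append(math.tanh(pre))                              # assumed activation
    output = sum(w * h for w, h in zip(w_ho, hidden))              # W_ho: hidden to output
    return hidden, output


def run_sequence(powers, neighbor_temps_seq, weights, initial_output=0.0):
    """Apply rnn_step over a whole sequence, feeding each output back into the next step."""
    output, outputs = initial_output, []
    for power, temps in zip(powers, neighbor_temps_seq):
        _, output = rnn_step(power, temps, output, *weights)
        outputs.append(output)
    return outputs
```

Here `weights` is the tuple `(w_p, w_t, w_oh, w_ho)`. Splitting the input weights is numerically the same as multiplying one concatenated matrix by the concatenated input vector; keeping the two parts separate simply makes explicit which weights act on power and which act on temperature.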

Fig. 4 shows the trained recurrent-neural-network thermal model of the 6th core being used to predict the temperature of the 6th core; the temperature predicted by the model closely matches the real temperature.

The many-core chip distributed thermal modeling method based on a recurrent neural network has been described in detail in the above examples, but the invention is not limited to them. The technical solutions described in the foregoing examples may still be modified, provided that the essence of the resulting technical solutions does not depart from the spirit and scope of the technical solutions of the examples of the invention.

Claims

1. A many-core chip distributed thermal modeling method based on a recurrent neural network, characterized in that: thermal modeling is performed for each core of the many-core chip, the thermal models of the cores differing because the cores are at different positions; given the position of a core, the temperature of the core can be calculated from the power of the core and the temperatures of the surrounding cores; the thermal model of each core is built with a recurrent neural network, which fits a nonlinear function and processes data in the form of vector sequences; the input-layer weights of the recurrent neural network are split, the input weight matrix being divided into two parts, one applied to the power input and the other to the temperature input, thereby handling the fact that the input contains both power and temperature; and for each core of the many-core chip, the order in which the surrounding temperatures are selected is fixed to reduce errors: the adjacent cores are not ordered by ascending core number; instead, selection starts from the core directly above and proceeds clockwise until the temperatures of all adjacent cores have been taken, so that errors are less likely when processing the outermost cores, which border the external environment and have different adjacent positions.