CN115496647A

CN115496647A - GPU module low-power-consumption processing method

Info

Publication number: CN115496647A
Application number: CN202211286288.4A
Authority: CN
Inventors: 杜文静; 许文强
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2022-12-20

Abstract

The application belongs to the technical field of GPU low-power processing and relates in particular to a low-power processing method for a GPU module, comprising the following steps. Step one: determine the low-power processing direction. Step two: select a low-power standard cell library. Step three: low-power layout and optimization. Step four: build a low-power clock tree. Step five: iteratively regroup the initially grouped flip-flops using a threshold capacitance and a threshold distance to obtain the final flip-flop groups, and assign a buffer to each flip-flop group. Following the architecture of the GPU module, the method combines a flat layout method with a hierarchical layout method: the GPU module is partitioned into hierarchy levels from top to bottom, voltage domains are divided according to the function of each level, and clock gating cells are inserted. The flip-flops behind each clock gating cell are then grouped, a clock tree is built on the flip-flop groups, and cells with different threshold voltages are applied to the routed GPU module to fix timing and power problems.

Description

GPU module low-power-consumption processing method

Technical Field

The invention relates to a low-power-consumption processing method, in particular to a GPU module low-power-consumption processing method.

Background

With the continuous development of very-large-scale integrated circuits and graphics technology, GPUs have become a focus of research. Because a GPU has a highly parallel architecture, it is well suited to single-precision floating-point operations and highly parallel, data-intensive computation, and it has become an indispensable module in chip design. The "power wall" that has emerged as GPU performance keeps increasing poses a greater challenge to designers. As chip integration grows, chip surface temperature rises exponentially, and power consumption has become a design metric on the same level as performance and area in GPU design. Research shows that GPU performance is about 5 times that of a CPU, but overall GPU power consumption is 2-3 times that of a CPU. High power consumption not only means high energy use; heat accumulation and ever-increasing power density also cause GPU stability problems. Studies have shown that for every 10-degree increase in operating temperature, the failure rate of a chip doubles. In addition, to relieve heat buildup, larger packages with better heat dissipation, additional heat sinks, and thermal protection circuits have to be used, which increases the manufacturing cost of the GPU. Excessive power consumption limits GPU performance: if the core frequency or the on-chip cache capacity is increased further, power consumption keeps rising and the GPU enters a vicious circle. Low-power design has therefore become a core problem in GPU design, with important practical value and research significance in engineering.

Low-power processing of a GPU module has three aspects. First, a standard cell library matched to the low-power requirements of the GPU module. The low-power standard cell library includes multi-threshold-voltage cells, extended-channel cells and selectively extended-channel cells, multi-bit registers, retention registers, multi-size gradient cells, minimum-size cells, delay cells, timing-improved registers, and so on, which lay the foundation for low-power techniques such as power planning, multi-supply multi-voltage, power gating, and clock gating. Second, the layout of the whole low-power GPU module design. In the layout and optimization stage, the selection of standard cells, physical placement, equivalent logic transformation, power optimization of non-critical paths, and so on are closely related to chip power consumption, and reducing power while preserving chip performance and routability is the key to low-power chip design. Third, a low-power clock tree. The clock is the heart of the chip, and clock tree design plays an extremely important role in the design of the whole chip. The uncertainty introduced by process scaling interferes with the clock more and more, and the clock tree accounts for 20%-40% of total chip power, so a high-quality clock tree determines both the performance and the total power consumption of the whole chip.

The commonly used back-end low-power techniques today are clock gating, multiple voltage domains, and multiple threshold voltages. Clock gating inserts clock gating cells into the clock tree to reduce the toggle rate of the clock signal. Today's large-scale integrated circuits are essentially sequential circuits, and sequential circuits are implemented with flip-flops. Signal transfer between flip-flops is controlled by a clock signal. Because the clock network toggles periodically, its large load causes high dynamic power consumption. The use of a gated clock is shown in the figure: when the downstream flip-flop signals do not toggle, the gated clock cuts off part of the clock network, which reduces the load and toggle rate of the clock network and therefore its power consumption. The principle of multi-voltage-domain design is that the modules of an integrated circuit are assigned to different voltage domains according to their performance requirements. The supply voltage affects both the static and the dynamic power consumption of a circuit, and it also strongly affects circuit delay. In the GPU module, therefore, sub-modules with high performance requirements are given a higher voltage to reduce device delay, and the other sub-modules with lower performance requirements are given a lower voltage to reduce the power consumption of the GPU module. The principle of multi-threshold design is that low-threshold-voltage cells run at a high frequency and are fast but have large leakage current, whereas high-threshold-voltage cells run at a lower frequency but leak less. Low-threshold-voltage cells can be used on critical paths and on the clock tree so that the GPU obtains good timing and a high-quality clock tree, while non-critical paths can give up some timing margin for power, so high-threshold-voltage cells are used there to reduce power consumption.
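The effect of clock gating and of a lower supply voltage on dynamic power can be illustrated with a minimal Python sketch. It assumes the standard switching-power relation P = activity × C × Vdd² × f, which the text above does not state explicitly, and the function names and numeric values are illustrative only.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Switching power in watts, assuming P = activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def gated_clock_power(capacitance_f, vdd_v, freq_hz, enable_ratio):
    """Power of a clock net behind a gating cell (hypothetical helper).

    An ungated clock net toggles every cycle (activity 1.0). With gating,
    it toggles only in the fraction of cycles where the enable is high.
    """
    return dynamic_power(1.0 * enable_ratio, capacitance_f, vdd_v, freq_hz)


# Illustrative numbers: 2 pF of clock load at 1 GHz.
ungated = dynamic_power(1.0, 2e-12, 1.0, 1e9)
gated = gated_clock_power(2e-12, 1.0, 1e9, enable_ratio=0.25)
low_voltage = dynamic_power(1.0, 2e-12, 0.8, 1e9)
print(ungated, gated, low_voltage)
```

Under this model the gated net consumes power in proportion to its enable ratio, and the quadratic dependence on Vdd is why placing low-performance sub-modules in a lower voltage domain saves power.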

Although clock gating is a very effective way to reduce clock tree power in low-power processing of a GPU module, a single clock gating cell must not drive too few flip-flops, otherwise the power and area overhead of the clock gating circuit outweighs the benefit. On the other hand, driving too many flip-flops leads to poor clock tree synthesis results and large clock skew. Furthermore, because the EDA tool does not treat a clock gating cell as a leaf node of the clock tree during clock tree synthesis, the clock signal arrives at the clock gating cell earlier than at the flip-flops it drives, and the resulting clock skew easily causes setup and hold violations on the enable signal. In addition, when the clock tree of the GPU is built, a large number of buffers are inserted for the flip-flops behind the clock gating cells to balance the clock tree, which wastes area and power.

Disclosure of Invention

The present invention aims to provide a GPU module low-power processing method that effectively solves the problems raised in the Background above.

To achieve this purpose, the technical scheme adopted by the invention is as follows:

A low-power processing method for a GPU module comprises the following specific steps:

Step one: determine the low-power processing direction, studying mainly three aspects: the low-power standard cell library, low-power layout and optimization, and the low-power clock tree;

Step two: select a low-power standard cell library, and carry out the low-power design of the GPU module in a TSMC 6nm process;

Step three: low-power layout and optimization, in which a flat layout method and a hierarchical layout method are combined to partition the GPU module into hierarchy levels from top to bottom and to divide voltage domains;

Step four: low-power clock tree, in which clock gating cells are inserted and the flip-flops driven by each clock gating cell are grouped using a minimum spanning tree algorithm, with minimal clock tree interconnect as the criterion;

Step five: iteratively regroup the initially grouped flip-flops using a threshold capacitance and a threshold distance to obtain the final flip-flop groups, and assign a buffer to each flip-flop group;

Step six: perform routing, static timing repair, and physical verification on the GPU module.
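The minimum spanning tree criterion in step four can be sketched in Python as follows. This is only an illustration under stated assumptions: flip-flop clock pins are reduced to (x, y) points, interconnect length is approximated by Manhattan distance, and Prim's algorithm is used because the source does not say which minimum spanning tree algorithm is applied. The function names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def minimum_spanning_tree(points):
    """Prim's algorithm over a complete graph of Manhattan distances.

    Returns (edges, total_length), where each edge is a pair of indices
    into `points`. Node 0 is used as the starting node.
    """
    if not points:
        return [], 0
    # best[i] = (distance from node i to the tree, nearest tree node)
    best = {i: (manhattan(points[0], points[i]), 0)
            for i in range(1, len(points))}
    edges, total = [], 0
    while best:
        nearest = min(best, key=lambda k: best[k][0])
        dist, parent = best.pop(nearest)
        edges.append((parent, nearest))
        total += dist
        # Nodes outside the tree may now be closer via the new node.
        for j in best:
            d = manhattan(points[nearest], points[j])
            if d < best[j][0]:
                best[j] = (d, nearest)
    return edges, total


flops = [(0, 0), (0, 2), (3, 0), (3, 1)]
edges, length = minimum_spanning_tree(flops)
print(edges, length)
```

The total tree length serves here as a proxy for clock interconnect, so a grouping that yields a shorter tree is preferred under this criterion.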

Preferably, in step two, chip physical design cannot be separated from a standard cell library matched to the requirements. The low-power standard cell library includes multi-threshold-voltage cells, extended-channel cells, selectively extended-channel cells, multi-bit registers, retention registers, multi-size gradient cells, minimum-size cells, delay cells, timing-improved registers, and so on, which lay the foundation for low-power techniques such as power planning, multi-supply multi-voltage, power gating, and clock gating.

Preferably, in chip physical design, the most critical question is whether the manufactured chip can meet its timing requirements. The clock is therefore the heart of the chip, and clock tree design plays an extremely important role in the design of the whole chip. The uncertainty introduced by process scaling interferes with the clock more and more, and the clock tree accounts for 20%-40% of total chip power. Because a high-quality clock tree determines both the performance and the total power consumption of the whole chip, a clock tree network that meets both the timing and the power requirements is built.

Preferably, in step four, the performance metrics of the clock tree include clock latency, clock jitter, clock skew, and transition time. Clock latency is the delay from the clock source to any clock pin in the circuit. Clock skew is the difference between the times at which the clock signal arrives at different parts of a sequential circuit. Clock jitter is the deviation of a clock edge from its ideal position in time. Transition time is the time a signal takes to switch between two specified levels. The power consumption of the GPU clock tree is optimized by clustering the registers on the clock tree and removing buffers from the clock tree.
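The latency and skew definitions above can be made concrete with a short Python sketch. It assumes clock arrival times at each flip-flop clock pin are already known (for example from a timing report) and are given in picoseconds; the pin names and values are invented for illustration.

```python
def clock_latency(arrivals_ps):
    """Worst-case latency: the largest source-to-pin delay."""
    return max(arrivals_ps.values())


def clock_skew(arrivals_ps):
    """Global skew: the spread between the latest and earliest arrival."""
    return max(arrivals_ps.values()) - min(arrivals_ps.values())


# Hypothetical arrival times at three flip-flop clock pins.
arrivals = {"ff_a/CK": 120, "ff_b/CK": 135, "ff_c/CK": 128}
print(clock_latency(arrivals), clock_skew(arrivals))
```

Grouping flip-flops that sit close together and have similar load tends to narrow this spread, which is the motivation for the clustering described in the text.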

Preferably, in step three, layout and optimization must consider not only chip timing and routing congestion but also the power level of the whole chip. In the layout and optimization stage, the selection of standard cells, physical placement, equivalent logic transformation, power optimization of non-critical paths, and so on are closely related to chip power consumption, and reducing power while preserving chip performance and routability is the key. A flat layout method and a hierarchical layout method are combined: the GPU is partitioned into hierarchy levels according to its architecture, and the cells used by the GPU circuit are selected. The GPU has more than one clock, and these clocks drive tens of thousands of registers. Some registers sit behind inserted clock gating cells, while others are connected directly to the GPU clock signals. An inserted gating cell can drive tens or even hundreds of registers, and the registers still driven directly by the GPU clock signals leave room for optimization.

Preferably, in step three, the physical design of the GPU mainly includes floorplanning, power planning, and timing analysis. Floorplanning covers macro cell placement, IO cell placement, module hierarchy partitioning, and so on. Power planning covers the division of voltage domains, the arrangement of power rings, and so on. Timing analysis checks whether the setup and hold times of modules at different hierarchy levels are violated, whether the timing paths are reasonable, and so on.

Compared with the prior art, the beneficial effects of the invention are as follows:

In this method, the GPU module undergoes low-power processing in a TSMC 6nm process. The power consumption of the GPU module is reduced without sacrificing its performance or area, and the physical design of the GPU module is completed, yielding a low-power GPU module.

Following the architecture of the GPU module, the method combines a flat layout method with a hierarchical layout method: the GPU module is partitioned into hierarchy levels from top to bottom, voltage domains are divided according to the function of each level, and clock gating cells are inserted. The flip-flops behind each clock gating cell are then grouped, a clock tree is built on the flip-flop groups, and cells with different threshold voltages are applied to the routed GPU module to fix timing and power problems.

After layout is finished and without affecting the netlist, the flip-flops driven by the clock gating cells are grouped using a minimum spanning tree so that the interconnect capacitance of the clock tree is reduced, which lowers GPU power consumption. The invention does not change flip-flop positions or affect signal-path timing, and it reduces the clock skew and clock latency of the clock tree. The method is implemented in Tcl on IC Compiler II; compared with existing low-power methods, the total power consumption of the GPU module is reduced by 5%.

Drawings

FIG. 1 is a diagram showing the GPU module layout;

FIG. 2 is a diagram of a switch unit;

FIG. 3 is a diagram of an isolation unit;

FIG. 4 is a clock gating diagram;

FIG. 5 is a diagram of the flip-flop center coordinates;

FIG. 6 is a diagram of the Manhattan distance between flip-flops;

FIG. 7 is a diagram of initial grouping of flip-flops;

FIG. 8 is a diagram of the final grouping of flip-flops;

FIG. 9 is a diagram of buffer insertion;

FIG. 10 is a view of the completed routing;

FIG. 11 is a detailed flow chart of the present invention;

FIG. 12 is a block diagram of the buffer deletion algorithm.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIGS. 1-12, the present invention provides the following embodiments:

Step four: low-power clock tree, in which clock gating cells are inserted and the flip-flops driven by each clock gating cell are grouped using a minimum spanning tree algorithm, with minimal clock tree interconnect as the criterion;

Step five: iteratively regroup the initially grouped flip-flops using a threshold capacitance and a threshold distance to obtain the final flip-flop groups, and assign a buffer to each flip-flop group;

Preferably, in step two, chip physical design cannot be separated from a standard cell library matched to the requirements. The low-power standard cell library includes multi-threshold-voltage cells, extended-channel cells, selectively extended-channel cells, multi-bit registers, retention registers, multi-size gradient cells, minimum-size cells, delay cells, timing-improved registers, and so on, which lay the foundation for low-power techniques such as power planning, multi-supply multi-voltage, power gating, and clock gating.

Preferably, in step four, the performance metrics of the clock tree include clock latency, clock jitter, clock skew, and transition time. Clock latency is the delay from the clock source to any clock pin in the circuit. Clock skew is the difference between the times at which the clock signal arrives at different parts of a sequential circuit. Clock jitter is the deviation of a clock edge from its ideal position in time. Transition time is the time a signal takes to switch between two specified levels. The power consumption of the GPU clock tree is optimized by clustering the registers on the clock tree and removing buffers from the clock tree.

Preferably, in step three, layout and optimization must consider not only chip timing and routing congestion but also the power level of the whole chip. In the layout and optimization stage, the selection of standard cells, physical placement, equivalent logic transformation, power optimization of non-critical paths, and so on are closely related to chip power consumption, and reducing power while preserving chip performance and routability is the key. A flat layout method and a hierarchical layout method are combined: the GPU is partitioned into hierarchy levels according to its architecture, and the cells used by the GPU circuit are selected. The GPU has more than one clock, and these clocks drive tens of thousands of registers. Some registers sit behind inserted clock gating cells, while others are connected directly to the GPU clock signals. An inserted gating cell can drive tens or even hundreds of registers, and the registers still driven directly by the GPU clock signals leave room for optimization.

Preferably, in step three, the physical design of the GPU mainly includes floorplanning, power planning, and timing analysis. Floorplanning covers macro cell placement, IO cell placement, module hierarchy partitioning, and so on. Power planning covers the division of voltage domains, the arrangement of power rings, and so on. Timing analysis checks whether the setup and hold times of modules at different hierarchy levels are violated, whether the timing paths are reasonable, and so on.

The specific implementation mode of the invention is as follows:

As shown in FIG. 1, the GPU module used in the invention is the Mali-G31 from ARM. Following the GPU architecture, the GPU is divided into two levels from top to bottom: most of the logic is placed in loader_core, and the remaining logic is placed in the top level, mali_ace. Four voltage domains are divided according to the functions of the core and of mali, and the corresponding Unified Power Format (UPF) files are written.

The specific implementation steps are as follows:

(1) Estimate the area required by the GPU module from the formula util = standard cell area / (chip area - hard placement area), and define the shape.

(2) Write the UPF and divide the GPU module into voltage domains using create_power_domain.

(3) Insert switch cells and isolation cells in the voltage domains that can be switched off.
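The area estimate in step (1) can be worked through with a small Python sketch that rearranges the utilization formula to solve for chip area. The aspect-ratio helper and all numeric values are assumptions added for illustration; the source only says that the shape is defined.

```python
import math


def required_chip_area(std_cell_area, hard_area, util):
    """Solve util = std_cell_area / (chip_area - hard_area) for chip_area."""
    if not 0 < util <= 1:
        raise ValueError("util must be in the range (0, 1]")
    return std_cell_area / util + hard_area


def floorplan_shape(chip_area, aspect_ratio=1.0):
    """Return (width, height) for a rectangle with height / width = aspect_ratio.

    Hypothetical helper: a rectangular floorplan is assumed.
    """
    width = math.sqrt(chip_area / aspect_ratio)
    return width, width * aspect_ratio


# Illustrative numbers in arbitrary area units.
area = required_chip_area(std_cell_area=700.0, hard_area=300.0, util=0.7)
width, height = floorplan_shape(area, aspect_ratio=1.0)
print(area, width, height)
```

A lower target utilization gives a larger estimated area, leaving more room for routing and for the buffers inserted later during clock tree construction.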

As shown in FIG. 4, clock gating cells are inserted, and the flip-flops driven by each clock gating cell are grouped.

The specific implementation steps are as follows:

(1) Divide the flip-flops driven by the clock gating cells by hierarchy level, find the flip-flops at the same level, compute their center coordinates, and determine the threshold distance from the maximum fan-out requirement.

(2) Compute the Manhattan distance between every pair of flip-flops, and form the initial flip-flop groups according to the threshold distance.

(3) Compute the threshold capacitance of a flip-flop group from the total load capacitance of all flip-flops and the number of initial groups, and iteratively regroup the flip-flops by the threshold capacitance to obtain the final flip-flop groups.

(4) Insert a suitable buffer for each flip-flop group.

(5) Build the clock tree on the flip-flop groups.
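Steps (2) and (3) can be sketched in Python as below. The source does not give the exact grouping rules, so several choices here are assumptions: the initial grouping collects flip-flops within the threshold Manhattan distance of a seed flip-flop, the threshold capacitance is taken as total load capacitance divided by the number of initial groups, and a single splitting pass stands in for the iterative regrouping. The `Flop` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flop:
    name: str
    x: float
    y: float
    cap: float  # clock-pin load capacitance, arbitrary units


def manhattan(a, b):
    return abs(a.x - b.x) + abs(a.y - b.y)


def initial_grouping(flops, threshold_distance):
    """Group each seed flop with every remaining flop within the threshold."""
    remaining = list(flops)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group, rest = [seed], []
        for f in remaining:
            if manhattan(seed, f) <= threshold_distance:
                group.append(f)
            else:
                rest.append(f)
        remaining = rest
        groups.append(group)
    return groups


def final_grouping(groups):
    """Split groups whose load exceeds the threshold capacitance.

    Assumed rule: threshold = total load / number of initial groups.
    """
    if not groups:
        return [], 0.0
    total_cap = sum(f.cap for g in groups for f in g)
    threshold_cap = total_cap / len(groups)
    final = []
    for group in groups:
        current, load = [], 0.0
        for f in group:
            # Start a new subgroup when adding f would exceed the threshold.
            if current and load + f.cap > threshold_cap:
                final.append(current)
                current, load = [], 0.0
            current.append(f)
            load += f.cap
        final.append(current)
    return final, threshold_cap


flops = [Flop("a", 0, 0, 1.0), Flop("b", 1, 0, 1.0), Flop("c", 0, 1, 1.0),
         Flop("d", 1, 1, 1.0), Flop("e", 10, 10, 1.0), Flop("f", 11, 10, 1.0)]
groups, threshold = final_grouping(initial_grouping(flops, threshold_distance=3))
print([[f.name for f in g] for g in groups], threshold)
```

Each resulting group would then receive one buffer sized for its load, as in step (4), so that the groups present similar capacitance to the clock tree.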

As shown in FIG. 10, routing is completed, and static timing analysis and power optimization are performed.

The specific implementation steps are as follows:

(1) Route the GPU module after the clock tree has been built.

(2) Perform timing repair on the routed GPU module so that setup and hold times are not violated.

(3) Depending on the timing, use low-threshold-voltage cells on paths with tight timing and high-threshold-voltage cells where timing margin is ample.
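The threshold-voltage choice in step (3) amounts to a slack-based rule, sketched below in Python. The slack margin, the "LVT" and "HVT" labels, and the path names are assumptions made for illustration; the source does not state how tight or ample timing is decided.

```python
def choose_vt(slack_ps, margin_ps=20):
    """Pick a threshold-voltage flavor from path slack (assumed rule).

    Paths with slack below the margin are treated as timing-tight and keep
    fast, leakier low-Vt cells; the rest can use slower high-Vt cells.
    """
    return "LVT" if slack_ps < margin_ps else "HVT"


def assign_vt(path_slacks_ps, margin_ps=20):
    """Map each path name to a threshold-voltage flavor."""
    return {name: choose_vt(slack, margin_ps)
            for name, slack in path_slacks_ps.items()}


# Hypothetical post-route slacks in picoseconds.
slacks = {"p_violating": -5, "p_tight": 10, "p_loose": 150}
print(assign_vt(slacks))
```

In practice such a swap would be followed by another timing check, since replacing cells changes path delays.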

In the description herein, references to the description of "one embodiment," "an example," "a specific example," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The preferred embodiments of the invention disclosed above are intended to be illustrative only. They are not exhaustive and do not limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A GPU module low-power processing method, characterized by comprising the following specific steps:

step three: low-power layout and optimization, wherein a flat layout method and a hierarchical layout method are combined to partition the GPU module into hierarchy levels from top to bottom and to divide voltage domains;

2. The GPU module low-power processing method of claim 1, wherein: in step two, chip physical design cannot be separated from a standard cell library matched to the requirements, and the low-power standard cell library includes multi-threshold-voltage cells, extended-channel cells, selectively extended-channel cells, multi-bit registers, retention registers, multi-size gradient cells, minimum-size cells, delay cells, timing-improved registers, and so on, which lay the foundation for low-power techniques such as power planning, multi-supply multi-voltage, power gating, and clock gating.

3. The GPU module low-power processing method of claim 1, wherein: in chip physical design, the most critical question is whether the manufactured chip can meet its timing requirements, so the clock is the heart of the chip and clock tree design plays an extremely important role in the design of the whole chip; the uncertainty introduced by process scaling interferes with the clock more and more, the clock tree accounts for 20%-40% of total chip power, and a high-quality clock tree determines both the performance and the total power consumption of the whole chip, so a clock tree network that meets both the timing and the power requirements is built.

4. The GPU module low-power processing method of claim 3, wherein: in step four, the performance metrics of the clock tree include clock latency, clock jitter, clock skew, and transition time; clock latency is the delay from the clock source to any clock pin in the circuit; clock skew is the difference between the times at which the clock signal arrives at different parts of a sequential circuit; clock jitter is the deviation of a clock edge from its ideal position in time; transition time is the time a signal takes to switch between two specified levels; and the power consumption of the GPU clock tree is optimized by clustering the registers on the clock tree and removing buffers from the clock tree.

5. The GPU module low-power processing method of claim 1, wherein: in step three, layout and optimization must consider not only chip timing and routing congestion but also the power level of the whole chip; in the layout and optimization stage, the selection of standard cells, physical placement, equivalent logic transformation, power optimization of non-critical paths, and so on are closely related to chip power consumption, and reducing power while preserving chip performance and routability is the key to low chip power.

6. The GPU module low-power processing method of claim 1, wherein: in step three, the physical design of the GPU mainly includes floorplanning, power planning, and timing analysis; floorplanning covers macro cell placement, IO cell placement, module hierarchy partitioning, and so on; power planning covers the division of voltage domains, the arrangement of power rings, and so on; and timing analysis checks whether the setup and hold times of modules at different hierarchy levels are violated, whether the timing paths are reasonable, and so on.