WO2021208360A1 - Power consumption reduction circuit for GPUs in a server, and server - Google Patents

Power consumption reduction circuit for GPUs in a server, and server

Info

Publication number
WO2021208360A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
server
power consumption
psu
frequency reduction
Prior art date
Application number
PCT/CN2020/117277
Other languages
English (en)
French (fr)
Inventor
王鹏
程世超
孙珑玲
刘闻禹
叶明洋
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Priority to US17/791,310 (US11656674B2)
Publication of WO2021208360A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/325 Power saving in peripheral device
    • G06F1/3278 Power saving in modem or I/O interface
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/3243 Power saving in microcontroller unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206 Monitoring of events, devices or parameters that trigger a change in power modality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/324 Power saving characterised by the action undertaken by lowering clock frequency
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5094 Allocation of resources, e.g. of the central processing unit [CPU] where the allocation takes into account power or heat criteria
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention relates to the field of servers, and in particular to a power consumption reduction circuit for GPUs in a server, and to a server.
  • In the prior art, a server usually uses a power capping method to limit its total power consumption, so that the total power consumption stays below the upper limit supported by the PSU.
  • Power capping works as follows: a total power consumption threshold is set for the server in advance. While the server runs, its total power consumption is monitored against this threshold; if the threshold is exceeded, a power consumption reduction strategy is started, that is, a frequency reduction command is issued to the GPUs in the server over the PCIE (peripheral component interconnect express) bus, so that each GPU limits its operating frequency after receiving the command.
  • However, power capping is carried out at the operating system level and has a long delay (about 50 ms).
  • During this delay the PSU may well have already triggered over-power protection (a PSU triggers over-power protection after outputting excess power for a period of time), which causes an abnormal power failure of the server and, in turn, the loss of the user's business data.
  • The purpose of the present invention is to provide a power consumption reduction circuit for GPUs in a server, and a server, implemented directly in low-level hardware circuitry without operating system intervention. The response is fast: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.
  • To solve the above technical problem, the present invention provides a power consumption reduction circuit for GPUs in a server, including:
  • a frequency reduction control chip connected to the PSU in the server and to the PWRBRK pin of each GPU in the server, configured to, after receiving the over-power alarm signal generated by the PSU, send a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU.
  • Preferably, the power consumption reduction circuit further includes:
  • a switch chip connected to the output terminal of the frequency reduction control chip and to the PWRBRK pin of each GPU;
  • a control circuit connected to the channel control terminals of the switch chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip to open the transmission channels between the frequency reduction control chip and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs and starts their frequency reduction operation.
  • Preferably, the control circuit includes:
  • an I/O expansion chip connected to the channel control terminals of the switch chip;
  • a controller connected to the I/O expansion chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs.
  • Preferably, the controller is specifically configured to:
  • when P2+m×P3>P1, traverse the number n of target GPUs that need frequency reduction starting from n=1, and determine the first integer k that satisfies the relation P2+P3/N×n+(m-n)×P3≤P1≤P2+P3/N×(n-1)+(m-n+1)×P3, where P1 is the rated power of the PSU, P2 is the total power consumption of the components of the server other than the GPUs, P3 is the power consumption of a single GPU, m is the total number of GPUs in the server, and N is a preset parameter;
  • when k≤m, control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and k GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the k GPUs;
  • when k>m, control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and the m GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the m GPUs.
  • Preferably, the controller is connected to the PSU via a PMBus bus;
  • and the controller is further configured to lower the over-power threshold of the PSU when k≤m.
  • Preferably, the controller is specifically configured to, when k≤m, modify the over-power threshold of the PSU according to P=P2+P3/N×k+(m-k)×P3, so as to lower the over-power threshold of the PSU.
  • Preferably, the controller is further configured to, when k>m, raise an alarm indicating that the total power consumption of the server after frequency reduction still exceeds the rated power of the PSU.
  • Preferably, the frequency reduction control chip is specifically a CPLD in the server, and the controller is specifically a BMC in the server.
  • To solve the above technical problem, the present invention also provides a server, including a PSU and GPUs, and further including any of the above power consumption reduction circuits for GPUs in a server.
  • Preferably, the PSU is specifically a PSU 1+1 redundancy architecture.
  • The present invention provides a power consumption reduction circuit for GPUs in a server, which includes a frequency reduction control chip. After receiving the over-power alarm signal generated by the PSU, the frequency reduction control chip sends a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU. The application is thus implemented directly in low-level hardware circuitry, without operating system intervention, and responds quickly: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.
  • The present invention also provides a server, which has the same beneficial effects as the above power consumption reduction circuit.
  • FIG. 1 is a schematic structural diagram of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of a specific structure of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of the specific components of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention.
  • The core of the present invention is to provide a power consumption reduction circuit for GPUs in a server, and a server, implemented directly in low-level hardware circuitry without operating system intervention. The response is fast: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.
  • FIG. 1 is a schematic structural diagram of a power consumption reduction circuit for a GPU in a server according to an embodiment of the present invention.
  • The power consumption reduction circuit for GPUs in a server includes:
  • a frequency reduction control chip 1 connected to the PSU in the server and to the PWRBRK pin of each GPU in the server, configured to, after receiving the over-power alarm signal generated by the PSU, send a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU.
  • Specifically, the power consumption reduction circuit of the present application includes the frequency reduction control chip 1, and its working principle is as follows:
  • When the PSU of the server detects that its output power is greater than the preset over-power threshold (the rated power of the PSU can be used in this application), it sends an over-power alarm signal to the frequency reduction control chip 1. After receiving the over-power alarm signal, the frequency reduction control chip 1 sends a frequency reduction control signal to the PWRBRK pin (power break pin, i.e. the power control pin) of each GPU in the server.
  • After its PWRBRK pin receives the frequency reduction control signal, each GPU reduces its power consumption to about 1/N of its current power consumption (N is a positive parameter whose value depends on the power reduction strategy set inside the GPU, e.g. N=4), so that the total power consumption of the server drops quickly into the range the PSU can support and the system does not lose power.
  • In addition, the PSU of this application can use a PSU 1+1 redundancy architecture, in which case the frequency reduction control chip 1 is connected to both PSUs and generates the frequency reduction control signal after receiving an over-power alarm signal from either PSU. Since the GPUs of this application can achieve a fast frequency reduction response, this application can meet the design requirements of server products with a PSU 1+1 redundancy architecture.
  • The present invention provides a power consumption reduction circuit for GPUs in a server, which includes a frequency reduction control chip. After receiving the over-power alarm signal generated by the PSU, the frequency reduction control chip sends a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU. The application is thus implemented directly in low-level hardware circuitry, without operating system intervention, and responds quickly: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.
  • FIG. 2 is a schematic diagram of a specific structure of a GPU power consumption reduction circuit in a server according to an embodiment of the present invention.
  • As an optional embodiment, the power consumption reduction circuit further includes:
  • a switch chip 2 connected to the output terminal of the frequency reduction control chip 1 and to the PWRBRK pin of each GPU;
  • a control circuit 3 connected to the channel control terminals of the switch chip 2, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip 2 to open the transmission channels between the frequency reduction control chip 1 and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs and starts their frequency reduction operation.
  • Specifically, the power consumption reduction circuit of the present application further includes a switch chip 2 (a FET switch chip can be used) and a control circuit 3, and its working principle is as follows:
  • The switch chip 2 sits on the connection lines between the output terminal of the frequency reduction control chip 1 and the PWRBRK pin of each GPU, and switches each of these lines on or off individually. When the line between the output terminal of the frequency reduction control chip 1 and the PWRBRK pin of a GPU is connected, the frequency reduction control signal generated by the frequency reduction control chip 1 can reach the PWRBRK pin of that GPU; when the line is disconnected, the signal cannot reach the PWRBRK pin of that GPU.
  • On this basis, the control circuit 3 obtains the total power consumption of the server and the rated power of the PSU, and then, according to the comparison between the two, determines from all GPUs in the server the target GPUs that need frequency reduction. It then controls the switch chip 2 to connect the lines between the output terminal of the frequency reduction control chip 1 and the PWRBRK pins of the target GPUs, that is, to open the transmission channels between the frequency reduction control chip 1 and the target GPUs, so that the frequency reduction control signal generated by the frequency reduction control chip 1 is output to the PWRBRK pins of the target GPUs and starts their frequency reduction operation.
  • As an optional embodiment, the control circuit 3 includes:
  • an I/O expansion chip connected to the channel control terminals of the switch chip 2;
  • a controller connected to the I/O expansion chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs.
  • Specifically, the control circuit 3 of the present application includes an I/O (Input/Output) expansion chip (a PCA9555 chip can be used) and a controller, and its working principle is as follows:
  • The number of transmission channels of the switch chip 2 must be greater than or equal to the total number of GPUs in the server (as shown in FIG. 3, the server has 4 GPUs, which this application does not specifically limit), and the switch chip 2 has channel control terminals (OE1-OE4 in FIG. 3), one per transmission channel, that open or close each channel.
  • Because the controller has a limited number of I/O ports, the controller of this application is connected to each channel control terminal of the switch chip 2 through the I/O expansion chip, so that the controller opens or closes the transmission channels of the switch chip 2 by controlling the output signals of the I/O expansion chip.
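  • As an optional embodiment, the controller is specifically configured to:
  • when P2+m×P3>P1, traverse the number n of target GPUs that need frequency reduction starting from n=1, and determine the first integer k that satisfies the relation P2+P3/N×n+(m-n)×P3≤P1≤P2+P3/N×(n-1)+(m-n+1)×P3, where P1 is the rated power of the PSU, P2 is the total power consumption of the components of the server other than the GPUs, P3 is the power consumption of a single GPU, m is the total number of GPUs in the server, and N is a preset parameter;
  • when k≤m, control the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and k GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the k GPUs;
  • when k>m, control the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and the m GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the m GPUs.
  • Specifically, let the rated power of the PSU be P1, the total power consumption of the components of the server other than the GPUs be P2, the power consumption of a single GPU be P3, and the total number of GPUs in the server be m. Then:
  • 1) If P2+m×P3≤P1, the total power consumption of the server does not exceed the upper limit supported by the PSU, and the system does not need to run the frequency reduction strategy.
  • 2) If P2+m×P3>P1, the total power consumption of the server exceeds the upper limit supported by the PSU, and the system needs to run the frequency reduction strategy. Let n be the number of target GPUs that need frequency reduction; traverse n starting from n=1 and find the first integer k that satisfies the relation (written here with N=4) P2+P3/4×n+(m-n)×P3≤P1≤P2+P3/4×(n-1)+(m-n+1)×P3.
  • When k≤m, reducing the frequency of k GPUs is enough to bring the total power consumption of the server below the rated power of the PSU, so the controller controls the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and k GPUs, so that the frequency reduction control signal generated by the frequency reduction control chip 1 is output to the PWRBRK pins of the k GPUs.
  • When k>m, the controller controls the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and all GPUs, so that the frequency reduction control signal generated by the frequency reduction control chip 1 is output to the PWRBRK pins of all GPUs.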
  • As an optional embodiment, the controller is connected to the PSU through a PMBus bus;
  • and the controller is further configured to lower the over-power threshold of the PSU when k≤m.
  • Further, the controller is connected to the PSU through the PMBus bus (power management bus) so that, when k≤m, it can lower the over-power threshold of the PSU over the PMBus bus, which more effectively prevents the PSU from being overloaded before the GPU frequency reduction takes effect.
  • As an optional embodiment, the controller is specifically configured to, when k≤m, modify the over-power threshold of the PSU according to P=P2+P3/N×k+(m-k)×P3, so as to lower the over-power threshold of the PSU in a reasonable way.
  • As an optional embodiment, the controller is further configured to, when k>m, raise an alarm indicating that the total power consumption of the server after frequency reduction still exceeds the rated power of the PSU.
  • Further, when k>m, the total power consumption of the server still exceeds the rated power of the PSU even after all m GPUs have been reduced in frequency. In this case the controller therefore also raises an alarm indicating that the total power consumption after frequency reduction still exceeds the rated power of the PSU, for example by showing an alarm prompt on the web interface of the BMC (Baseboard Management Controller) for the user to view.
  • As an optional embodiment, the frequency reduction control chip 1 is specifically a CPLD in the server, and the controller is specifically a BMC in the server.
  • Specifically, the frequency reduction control chip 1 of the present application can be implemented with the CPLD (Complex Programmable Logic Device) already in the server, and the controller can be implemented with the BMC already in the server, as shown in FIG. 3. No additional devices are needed, which saves cost and simplifies the structure.
  • More specifically, the BMC can read the rated power of the PSU through the PMBus bus and read the maximum power consumption of the GPUs and of the other components in the system through the I2C bus, and then, according to the relation P2+P3/4×n+(m-n)×P3≤P1≤P2+P3/4×(n-1)+(m-n+1)×P3, obtain the number of GPUs whose frequency reduction control must be enabled and the new setting for the over-power threshold of the PSU.
  • This application also provides a server, including a PSU and a GPU, and also includes any of the foregoing GPU power consumption reduction circuits in the server.
  • the PSU is specifically a PSU 1+1 redundancy architecture.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Power Sources (AREA)

Abstract

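A power consumption reduction circuit for GPUs in a server, and a server. The power consumption reduction circuit includes a frequency reduction control chip. After receiving an over-power alarm signal generated by the PSU, the frequency reduction control chip sends a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU. The application is thus implemented directly in low-level hardware circuitry, without operating system intervention, and responds quickly: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.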

Description

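Power consumption reduction circuit for GPUs in a server, and server
This application claims priority to Chinese patent application No. 202010300844.3, entitled "Power consumption reduction circuit for GPUs in a server, and server", filed with the China Patent Office on April 16, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of servers, and in particular to a power consumption reduction circuit for GPUs in a server, and to a server.
Background
With the application of technologies such as big data and the Internet of Things, data volumes have grown exponentially in recent years, so traditional servers that rely on the CPU alone as the data processing core can no longer meet data processing demands, and servers equipped with GPUs (Graphics Processing Units) have emerged. As GPU computing capability keeps improving, GPU power consumption rises with it; together with the CPU, memory, hard disks and other hardware in the server, the total power consumption of the server can exceed the upper limit supported by the PSU (Power Supply Unit) that powers the server.
In the prior art, a server usually uses a power capping method to limit its total power consumption, so that the total power consumption stays below the upper limit supported by the PSU. Specifically, power capping works as follows: a total power consumption threshold is set for the server in advance; while the server runs, its total power consumption is monitored against this threshold; if the threshold is exceeded, a power consumption reduction strategy is started, that is, a frequency reduction command is issued to the GPUs in the server over the PCIE (peripheral component interconnect express) bus, so that each GPU limits its operating frequency after receiving the command. However, power capping is carried out at the operating system level and has a long delay (about 50 ms). During this delay the PSU may well have already triggered over-power protection (a PSU triggers over-power protection after outputting excess power for a period of time), which causes an abnormal power failure of the server and, in turn, the loss of the user's business data.
Therefore, how to provide a solution to the above technical problem is an issue that those skilled in the art currently need to address.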
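Summary of the Invention
The purpose of the present invention is to provide a power consumption reduction circuit for GPUs in a server, and a server, implemented directly in low-level hardware circuitry without operating system intervention. The response is fast: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.
To solve the above technical problem, the present invention provides a power consumption reduction circuit for GPUs in a server, including:
a frequency reduction control chip connected to the PSU in the server and to the PWRBRK pin of each GPU in the server, configured to, after receiving an over-power alarm signal generated by the PSU, send a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU.
Preferably, the power consumption reduction circuit further includes:
a switch chip connected to the output terminal of the frequency reduction control chip and to the PWRBRK pin of each GPU;
a control circuit connected to the channel control terminals of the switch chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip to open the transmission channels between the frequency reduction control chip and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs and starts their frequency reduction operation.
Preferably, the control circuit includes:
an I/O expansion chip connected to the channel control terminals of the switch chip;
a controller connected to the I/O expansion chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs.
Preferably, the controller is specifically configured to:
when P2+m×P3>P1, traverse the number n of target GPUs that need frequency reduction starting from n=1, and determine the first integer k that satisfies the relation P2+P3/N×n+(m-n)×P3≤P1≤P2+P3/N×(n-1)+(m-n+1)×P3, where P1 is the rated power of the PSU, P2 is the total power consumption of the components of the server other than the GPUs, P3 is the power consumption of a single GPU, m is the total number of GPUs in the server, and N is a preset parameter;
when k≤m, control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and k GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the k GPUs;
when k>m, control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and the m GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the m GPUs.
Preferably, the controller is connected to the PSU via a PMBus bus;
and the controller is further configured to lower the over-power threshold of the PSU when k≤m.
Preferably, the controller is specifically configured to, when k≤m, modify the over-power threshold of the PSU according to P=P2+P3/N×k+(m-k)×P3, so as to lower the over-power threshold of the PSU.
Preferably, the controller is further configured to, when k>m, raise an alarm indicating that the total power consumption of the server after frequency reduction still exceeds the rated power of the PSU.
Preferably, the frequency reduction control chip is specifically a CPLD in the server, and the controller is specifically a BMC in the server.
To solve the above technical problem, the present invention also provides a server, including a PSU and GPUs, and further including any of the above power consumption reduction circuits for GPUs in a server.
Preferably, the PSU is specifically a PSU 1+1 redundancy architecture.
The present invention provides a power consumption reduction circuit for GPUs in a server, which includes a frequency reduction control chip. After receiving the over-power alarm signal generated by the PSU, the frequency reduction control chip sends a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU. The application is thus implemented directly in low-level hardware circuitry, without operating system intervention, and responds quickly: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.
The present invention also provides a server, which has the same beneficial effects as the above power consumption reduction circuit.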
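Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required by the prior art and by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific structure of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the specific components of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention.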
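Detailed Description
The core of the present invention is to provide a power consumption reduction circuit for GPUs in a server, and a server, implemented directly in low-level hardware circuitry without operating system intervention. The response is fast: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.
To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Please refer to FIG. 1, which is a schematic structural diagram of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention.
The power consumption reduction circuit for GPUs in a server includes:
a frequency reduction control chip 1 connected to the PSU in the server and to the PWRBRK pin of each GPU in the server, configured to, after receiving the over-power alarm signal generated by the PSU, send a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU.
Specifically, the power consumption reduction circuit of the present application includes the frequency reduction control chip 1, and its working principle is as follows:
When the PSU of the server detects that its output power is greater than the preset over-power threshold (the rated power of the PSU can be used in this application), it sends an over-power alarm signal to the frequency reduction control chip 1. After receiving the over-power alarm signal, the frequency reduction control chip 1 sends a frequency reduction control signal to the PWRBRK pin (power break pin, i.e. the power control pin) of each GPU in the server. After its PWRBRK pin receives the frequency reduction control signal, each GPU reduces its power consumption to about 1/N of its current power consumption (N is a positive parameter whose value depends on the power reduction strategy set inside the GPU, e.g. N=4), so that the total power consumption of the server drops quickly into the range the PSU can support and the system does not lose power.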
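The effect of the PWRBRK signal on total power can be illustrated with a little arithmetic. The sketch below is an editorial illustration, not part of the application: it assumes every GPU draws the same power `p3` and that a throttled GPU drops to exactly `p3 / n_factor` (the text only says "about 1/N", with N=4 as an example). The function name and the wattages are hypothetical.

```python
def total_power(p2: float, p3: float, m: int, throttled: int,
                n_factor: float = 4.0) -> float:
    """Total server power when `throttled` of the m GPUs run at 1/n_factor power.

    p2 is the non-GPU load and p3 the power of one unthrottled GPU.
    """
    if not 0 <= throttled <= m:
        raise ValueError("throttled must be between 0 and m")
    return p2 + throttled * p3 / n_factor + (m - throttled) * p3


# Hypothetical server: 600 W of non-GPU load and four 300 W GPUs.
print(total_power(600, 300, 4, throttled=0))  # 1800.0
print(total_power(600, 300, 4, throttled=4))  # 900.0
```

Under these assumed numbers, asserting PWRBRK on all four GPUs halves the total load, which is the headroom the circuit relies on to stay inside the PSU's range.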
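In addition, the PSU of this application can use a PSU 1+1 redundancy architecture, in which case the frequency reduction control chip 1 is connected to both PSUs and generates the frequency reduction control signal after receiving an over-power alarm signal from either PSU. Since the GPUs of this application can achieve a fast frequency reduction response, this application can meet the design requirements of server products with a PSU 1+1 redundancy architecture.
The present invention provides a power consumption reduction circuit for GPUs in a server, which includes a frequency reduction control chip. After receiving the over-power alarm signal generated by the PSU, the frequency reduction control chip sends a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU. The application is thus implemented directly in low-level hardware circuitry, without operating system intervention, and responds quickly: the entire frequency reduction operation of the GPUs can be completed within 5 ms, a period short enough that the PSU does not trigger over-power protection, which avoids the loss of the user's business data caused by an abnormal power failure of the server.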
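On the basis of the above embodiment:
Please refer to FIG. 2, which is a schematic diagram of a specific structure of a power consumption reduction circuit for GPUs in a server according to an embodiment of the present invention.
As an optional embodiment, the power consumption reduction circuit further includes:
a switch chip 2 connected to the output terminal of the frequency reduction control chip 1 and to the PWRBRK pin of each GPU;
a control circuit 3 connected to the channel control terminals of the switch chip 2, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip 2 to open the transmission channels between the frequency reduction control chip 1 and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs and starts their frequency reduction operation.
Specifically, the power consumption reduction circuit of the present application further includes a switch chip 2 (a FET switch chip can be used) and a control circuit 3, and its working principle is as follows:
The switch chip 2 sits on the connection lines between the output terminal of the frequency reduction control chip 1 and the PWRBRK pin of each GPU, and switches each of these lines on or off individually. When the line between the output terminal of the frequency reduction control chip 1 and the PWRBRK pin of a GPU is connected, the frequency reduction control signal generated by the frequency reduction control chip 1 can reach the PWRBRK pin of that GPU; when the line is disconnected, the signal cannot reach the PWRBRK pin of that GPU.
On this basis, the control circuit 3 obtains the total power consumption of the server and the rated power of the PSU, and then, according to the comparison between the two, determines from all GPUs in the server the target GPUs that need frequency reduction. It then controls the switch chip 2 to connect the lines between the output terminal of the frequency reduction control chip 1 and the PWRBRK pins of the target GPUs, that is, to open the transmission channels between the frequency reduction control chip 1 and the target GPUs, so that the frequency reduction control signal generated by the frequency reduction control chip 1 is output to the PWRBRK pins of the target GPUs and starts their frequency reduction operation.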
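The gating role of the switch chip can be modelled in a few lines. This is a purely logical sketch with hypothetical names; it ignores electrical details such as signal polarity and pull resistors, which the application does not specify.

```python
def pwrbrk_reached(throttle_signal: bool, channel_open: list[bool]) -> list[bool]:
    """Per GPU, report whether the frequency reduction signal reaches its PWRBRK pin.

    channel_open[i] is True when the switch channel to GPU i is open (connected).
    """
    return [throttle_signal and is_open for is_open in channel_open]


# Signal asserted, channels to the first two of four GPUs opened in advance.
print(pwrbrk_reached(True, [True, True, False, False]))
# [True, True, False, False]
```

Because the channels are set up ahead of time by the control circuit, the alarm path itself stays a plain hardware connection from the frequency reduction control chip to the selected pins.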
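As an optional embodiment, the control circuit 3 includes:
an I/O expansion chip connected to the channel control terminals of the switch chip 2;
a controller connected to the I/O expansion chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs.
Specifically, the control circuit 3 of the present application includes an I/O (Input/Output) expansion chip (a PCA9555 chip can be used) and a controller, and its working principle is as follows:
The number of transmission channels of the switch chip 2 must be greater than or equal to the total number of GPUs in the server (as shown in FIG. 3, the server has 4 GPUs, which this application does not specifically limit), and the switch chip 2 has channel control terminals (OE1-OE4 in FIG. 3), one per transmission channel, that open or close each channel. Because the controller has a limited number of I/O ports, the controller of this application is connected to each channel control terminal of the switch chip 2 through the I/O expansion chip, so that the controller opens or closes the transmission channels of the switch chip 2 by controlling the output signals of the I/O expansion chip.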
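How the controller might turn a set of channels into expander output bits can be sketched as follows. The wiring is an assumption made for illustration: OE1-OE4 are taken to sit on bits 0-3 of one expander port, and a channel is taken to be open when its OE line is driven low (common for FET bus switches, but not stated in the application). The function name is hypothetical.

```python
def oe_output_byte(open_channels: set[int], num_channels: int = 4,
                   active_low: bool = True) -> int:
    """Build the expander output byte that opens the given switch channels.

    Channel i (1-based, OE1..OEn) is assumed to be wired to bit i-1 of the port.
    """
    if not 1 <= num_channels <= 8:
        raise ValueError("one 8-bit port holds at most 8 channels")
    mask = 0
    for ch in open_channels:
        if not 1 <= ch <= num_channels:
            raise ValueError(f"channel {ch} out of range")
        mask |= 1 << (ch - 1)
    all_bits = (1 << num_channels) - 1
    # With active-low enables, an open channel is a 0 bit and a closed one a 1 bit.
    return (~mask & all_bits) if active_low else mask


print(bin(oe_output_byte({1, 2})))  # 0b1100
```

On a PCA9555 such a byte would typically be written to the Output Port 0 register (command byte 0x02 in the datasheet) once the pins are configured as outputs; the actual I2C transfer is left out because it depends on the BMC platform.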
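As an optional embodiment, the controller is specifically configured to:
when P2+m×P3>P1, traverse the number n of target GPUs that need frequency reduction starting from n=1, and determine the first integer k that satisfies the relation P2+P3/N×n+(m-n)×P3≤P1≤P2+P3/N×(n-1)+(m-n+1)×P3, where P1 is the rated power of the PSU, P2 is the total power consumption of the components of the server other than the GPUs, P3 is the power consumption of a single GPU, m is the total number of GPUs in the server, and N is a preset parameter;
when k≤m, control the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and k GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the k GPUs;
when k>m, control the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and the m GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the m GPUs.
Specifically, let the rated power of the PSU be P1, the total power consumption of the components of the server other than the GPUs be P2, the power consumption of a single GPU be P3, and the total number of GPUs in the server be m. Then:
1) If P2+m×P3≤P1, the total power consumption of the server does not exceed the upper limit supported by the PSU, and the system does not need to run the frequency reduction strategy.
2) If P2+m×P3>P1, the total power consumption of the server exceeds the upper limit supported by the PSU, and the system needs to run the frequency reduction strategy. Let n be the number of target GPUs that need frequency reduction; traverse n starting from n=1 and find the first integer k that satisfies the following relation (written here with N=4):
P2+P3/4×n+(m-n)×P3≤P1≤P2+P3/4×(n-1)+(m-n+1)×P3.
When k≤m, reducing the frequency of k GPUs is enough to bring the total power consumption of the server below the rated power of the PSU, so the controller controls the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and k GPUs, so that the frequency reduction control signal generated by the frequency reduction control chip 1 is output to the PWRBRK pins of the k GPUs.
When k>m, the controller controls the switch chip 2 through the I/O expansion chip to open the transmission channels between the frequency reduction control chip 1 and all GPUs, so that the frequency reduction control signal generated by the frequency reduction control chip 1 is output to the PWRBRK pins of all GPUs.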
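The traversal described above can be sketched directly. This is a minimal illustration, assuming identical GPUs, N>1 and P3>0 so that the search terminates; the function name `find_k` and the wattages in the example are hypothetical.

```python
from typing import Optional


def find_k(p1: float, p2: float, p3: float, m: int,
           n_factor: float = 4.0) -> Optional[int]:
    """Return the first n >= 1 satisfying the relation, or None if no throttling is needed.

    Relation: p2 + p3/N*n + (m-n)*p3 <= p1 <= p2 + p3/N*(n-1) + (m-n+1)*p3
    """
    if n_factor <= 1 or p3 <= 0:
        raise ValueError("need N > 1 and P3 > 0 for the search to terminate")
    if p2 + m * p3 <= p1:
        return None  # case 1): the PSU already covers the full load
    n = 1
    while True:
        lower = p2 + p3 / n_factor * n + (m - n) * p3
        upper = p2 + p3 / n_factor * (n - 1) + (m - n + 1) * p3
        if lower <= p1 <= upper:
            return n
        n += 1


# Hypothetical server: 600 W non-GPU load, four 300 W GPUs, N = 4.
print(find_k(1600, 600, 300, 4))  # 1
print(find_k(1000, 600, 300, 4))  # 4
print(find_k(800, 600, 300, 4))   # 5
```

In the last call the result exceeds m=4, which corresponds to the k>m branch: the controller would open the channels to all four GPUs.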
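As an optional embodiment, the controller is connected to the PSU through a PMBus bus;
and the controller is further configured to lower the over-power threshold of the PSU when k≤m.
Further, the controller is connected to the PSU through the PMBus bus (power management bus) so that, when k≤m, it can lower the over-power threshold of the PSU over the PMBus bus, which more effectively prevents the PSU from being overloaded before the GPU frequency reduction takes effect.
As an optional embodiment, the controller is specifically configured to, when k≤m, modify the over-power threshold of the PSU according to P=P2+P3/N×k+(m-k)×P3, so as to lower the over-power threshold of the PSU.
Specifically, when k≤m, the controller can modify the over-power threshold of the PSU according to P=P2+P3/N×k+(m-k)×P3, which lowers the over-power threshold of the PSU in a reasonable way.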
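The threshold formula is easy to evaluate once k is known. The sketch below uses a hypothetical function name and hypothetical wattages, and assumes identical GPUs.

```python
def new_over_power_threshold(p2: float, p3: float, m: int, k: int,
                             n_factor: float = 4.0) -> float:
    """Evaluate P = P2 + P3/N*k + (m-k)*P3 for 1 <= k <= m."""
    if not 1 <= k <= m:
        raise ValueError("the threshold is only adjusted when 1 <= k <= m")
    return p2 + p3 / n_factor * k + (m - k) * p3


# Hypothetical server: 600 W non-GPU load, four 300 W GPUs, N = 4.
print(new_over_power_threshold(600, 300, 4, k=1))  # 1575.0
print(new_over_power_threshold(600, 300, 4, k=4))  # 900.0
```

P is the left-hand side of the relation that selected k, so it is never above the rated power P1; this is why applying the formula lowers the threshold whenever P1 was the previous setting.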
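As an optional embodiment, the controller is further configured to, when k>m, raise an alarm indicating that the total power consumption of the server after frequency reduction still exceeds the rated power of the PSU.
Further, when k>m, the total power consumption of the server still exceeds the rated power of the PSU even after all m GPUs have been reduced in frequency. In this case the controller therefore also raises an alarm indicating that the total power consumption after frequency reduction still exceeds the rated power of the PSU, for example by showing an alarm prompt on the web interface of the BMC (Baseboard Management Controller) for the user to view.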
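Why k>m means that throttling every GPU is still not enough can be seen by rewriting the search relation. The derivation below is an editorial sketch assuming N>1 and P3>0; f(n) is shorthand introduced here, not a symbol from the application.

```latex
\begin{aligned}
f(n) &= P_2 + \frac{P_3}{N}\,n + (m-n)\,P_3
      = P_2 + m P_3 - n P_3\left(1 - \frac{1}{N}\right),\\
&\text{so } f(n) \le P_1 \le f(n-1) \text{ first holds at the smallest } n \text{ with } f(n) \le P_1:\\
k &= \left\lceil \frac{P_2 + m P_3 - P_1}{P_3\,(1 - 1/N)} \right\rceil,\\
k > m &\iff f(m) > P_1 \iff P_2 + \frac{m P_3}{N} > P_1 .
\end{aligned}
```

The last line is exactly the alarm condition: with all m GPUs at 1/N power, the remaining load P2 + m×P3/N still exceeds the rated power P1.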
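As an optional embodiment, the frequency reduction control chip 1 is specifically a CPLD in the server, and the controller is specifically a BMC in the server.
Specifically, the frequency reduction control chip 1 of the present application can be implemented with the CPLD (Complex Programmable Logic Device) already in the server, and the controller can be implemented with the BMC already in the server, as shown in FIG. 3. No additional devices are needed, which saves cost and simplifies the structure.
More specifically, the BMC can read the rated power of the PSU through the PMBus bus and read the maximum power consumption of the GPUs and of the other components in the system through the I2C bus, and then, according to the relation P2+P3/4×n+(m-n)×P3≤P1≤P2+P3/4×(n-1)+(m-n+1)×P3, obtain the number of GPUs whose frequency reduction control must be enabled and the new setting for the over-power threshold of the PSU.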
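The application does not say how the new threshold is encoded on the bus. As one possibility, many PMBus power supplies exchange power values in the 16-bit Linear data format (often called LINEAR11: a 5-bit signed exponent and an 11-bit signed mantissa, value = mantissa × 2^exponent). The sketch below assumes that format; which PMBus command carries the threshold is device-specific and not shown.

```python
def encode_linear11(value: float) -> int:
    """Encode a real value as a 16-bit LINEAR11 word (assumed PSU data format)."""
    for exponent in range(-16, 16):  # smallest exponent first keeps the most precision
        mantissa = round(value / 2.0 ** exponent)
        if -1024 <= mantissa <= 1023:
            return ((exponent & 0x1F) << 11) | (mantissa & 0x7FF)
    raise ValueError("value out of LINEAR11 range")


def decode_linear11(word: int) -> float:
    """Decode a 16-bit LINEAR11 word back to a real value."""
    exponent = (word >> 11) & 0x1F
    mantissa = word & 0x7FF
    if exponent > 15:
        exponent -= 32   # sign-extend the 5-bit exponent
    if mantissa > 1023:
        mantissa -= 2048  # sign-extend the 11-bit mantissa
    return mantissa * 2.0 ** exponent


word = encode_linear11(1350.0)          # hypothetical threshold in watts
print(hex(word), decode_linear11(word))
```

At this magnitude the format resolves power in 2 W steps, so a BMC implementation would want to round a computed threshold downward rather than to the nearest step if it must never exceed the intended limit.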
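This application also provides a server, including a PSU and GPUs, and further including any of the above power consumption reduction circuits for GPUs in a server.
As an optional embodiment, the PSU is specifically a PSU 1+1 redundancy architecture.
For a description of the server provided by this application, please refer to the above embodiments of the power consumption reduction circuit; it is not repeated here.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.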

Claims (10)

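  1. A power consumption reduction circuit for GPUs in a server, characterized by including:
    a frequency reduction control chip connected to the PSU in the server and to the PWRBRK pin of each GPU in the server, configured to, after receiving an over-power alarm signal generated by the PSU, send a frequency reduction control signal to the PWRBRK pin of each GPU to start the frequency reduction operation of each GPU.
  2. The power consumption reduction circuit for GPUs in a server according to claim 1, characterized in that the power consumption reduction circuit further includes:
    a switch chip connected to the output terminal of the frequency reduction control chip and to the PWRBRK pin of each GPU;
    a control circuit connected to the channel control terminals of the switch chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip to open the transmission channels between the frequency reduction control chip and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs and starts their frequency reduction operation.
  3. The power consumption reduction circuit for GPUs in a server according to claim 2, characterized in that the control circuit includes:
    an I/O expansion chip connected to the channel control terminals of the switch chip;
    a controller connected to the I/O expansion chip, configured to determine, from among the GPUs, the target GPUs that need frequency reduction according to the comparison between the total power consumption of the server and the rated power of the PSU, and to control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and the target GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the target GPUs.
  4. The power consumption reduction circuit for GPUs in a server according to claim 3, characterized in that the controller is specifically configured to:
    when P2+m×P3>P1, traverse the number n of target GPUs that need frequency reduction starting from n=1, and determine the first integer k that satisfies the relation P2+P3/N×n+(m-n)×P3≤P1≤P2+P3/N×(n-1)+(m-n+1)×P3, where P1 is the rated power of the PSU, P2 is the total power consumption of the components of the server other than the GPUs, P3 is the power consumption of a single GPU, m is the total number of GPUs in the server, and N is a preset parameter;
    when k≤m, control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and k GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the k GPUs;
    when k>m, control the switch chip through the I/O expansion chip to open the transmission channels between the frequency reduction control chip and the m GPUs, so that the frequency reduction control signal is output to the PWRBRK pins of the m GPUs.
  5. The power consumption reduction circuit for GPUs in a server according to claim 4, characterized in that the controller is connected to the PSU via a PMBus bus;
    and the controller is further configured to lower the over-power threshold of the PSU when k≤m.
  6. The power consumption reduction circuit for GPUs in a server according to claim 5, characterized in that the controller is specifically configured to, when k≤m, modify the over-power threshold of the PSU according to P=P2+P3/N×k+(m-k)×P3, so as to lower the over-power threshold of the PSU.
  7. The power consumption reduction circuit for GPUs in a server according to claim 4, characterized in that the controller is further configured to, when k>m, raise an alarm indicating that the total power consumption of the server after frequency reduction still exceeds the rated power of the PSU.
  8. The power consumption reduction circuit for GPUs in a server according to claim 3, characterized in that the frequency reduction control chip is specifically a CPLD in the server, and the controller is specifically a BMC in the server.
  9. A server, characterized by including a PSU and GPUs, and further including the power consumption reduction circuit for GPUs in a server according to any one of claims 1-8.
  10. The server according to claim 9, characterized in that the PSU is specifically a PSU 1+1 redundancy architecture.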
PCT/CN2020/117277 2020-04-16 2020-09-24 Power consumption reduction circuit for GPUs in a server, and server WO2021208360A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/791,310 US11656674B2 (en) 2020-04-16 2020-09-24 Power consumption reduction circuit for GPUs in server, and server

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010300844.3A CN111475009B (zh) 2020-04-16 2020-04-16 Power consumption reduction circuit for GPUs in a server, and server
CN202010300844.3 2020-04-16

Publications (1)

Publication Number Publication Date
WO2021208360A1 (zh) 2021-10-21

Family

ID=71753762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117277 WO2021208360A1 (zh) 2020-04-16 2020-09-24 Power consumption reduction circuit for GPUs in a server, and server

Country Status (3)

Country Link
US (1) US11656674B2 (zh)
CN (1) CN111475009B (zh)
WO (1) WO2021208360A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
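CN111475009B (zh) * 2020-04-16 2022-03-22 苏州浪潮智能科技有限公司 Power consumption reduction circuit for GPUs in server, and server
CN112269466B (zh) * 2020-10-16 2022-07-08 苏州浪潮智能科技有限公司 Power supply method for a power chip, and server mainboard
CN112947720B (zh) * 2021-02-19 2022-12-09 浪潮电子信息产业股份有限公司 Safety control method and system for an AI server
CN113064479B (zh) * 2021-03-03 2023-05-23 山东英信计算机技术有限公司 Power supply redundancy control system, method and medium for a GPU server
CN113157076B (zh) * 2021-04-22 2024-01-30 中科可控信息产业有限公司 Electronic device and power consumption control method
CN113589913B (zh) * 2021-09-27 2021-12-17 苏州浪潮智能科技有限公司 CPU performance adjustment method, apparatus and medium
CN114759773B (zh) * 2022-04-22 2023-11-03 苏州浪潮智能科技有限公司 Multi-input power supply for a server, control method and storage medium
CN117369612B (zh) * 2023-12-08 2024-02-13 电子科技大学 Server hardware management system and method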

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140380073A1 (en) * 2013-06-20 2014-12-25 Quanta Computer Inc. Computer system and power management method thereof
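CN107831883A (zh) * 2017-11-24 2018-03-23 郑州云海信息技术有限公司 System and method for protecting a GPU server against power supply abnormalities
CN107908583A (zh) * 2017-11-09 2018-04-13 郑州云海信息技术有限公司 Power consumption management board for a server
CN108304295A (zh) * 2018-01-29 2018-07-20 郑州云海信息技术有限公司 Method and apparatus for controlling GPU frequency reduction, and computer-readable storage medium
CN210111685U (zh) * 2019-06-14 2020-02-21 同方国际信息技术(苏州)有限公司 Fast-response circuit for power supply switching
CN111475009A (zh) * 2020-04-16 2020-07-31 苏州浪潮智能科技有限公司 Power consumption reduction circuit for GPUs in server, and server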

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9395774B2 (en) * 2012-12-28 2016-07-19 Intel Corporation Total platform power control
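CN106919240B (zh) * 2015-12-28 2022-12-09 伊姆西Ip控股有限责任公司 Method and device for supplying power to a processor
CN107844187B (zh) * 2016-09-21 2020-06-12 龙芯中科技术有限公司 Power consumption management method and apparatus, and electronic device
CN106598814B (zh) * 2016-12-26 2019-05-14 郑州云海信息技术有限公司 Design method for implementing overheat protection of a server system
CN107450702A (zh) * 2017-06-29 2017-12-08 郑州云海信息技术有限公司 Power supply system for reducing Rack GPU voltage fluctuation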
US10761592B2 (en) * 2018-02-23 2020-09-01 Dell Products L.P. Power subsystem-monitoring-based graphics processing system
US10788876B2 (en) * 2018-07-27 2020-09-29 Dell Products L.P. System and method to maintain power cap while baseboard management controller reboots
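CN109960632A (zh) * 2019-03-20 2019-07-02 苏州浪潮智能科技有限公司 Method and system for implementing power supply redundancy of a GPU server
CN110147155A (zh) * 2019-05-21 2019-08-20 苏州浪潮智能科技有限公司 BMC-based cold redundancy control method and apparatus for server power supplies, and BMC
CN110597684A (zh) * 2019-08-02 2019-12-20 苏州浪潮智能科技有限公司 PSU for reducing system overload risk, and method for reducing system overload risk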

Also Published As

Publication number Publication date
US11656674B2 (en) 2023-05-23
US20230035371A1 (en) 2023-02-02
CN111475009B (zh) 2022-03-22
CN111475009A (zh) 2020-07-31

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20931626

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20931626

Country of ref document: EP

Kind code of ref document: A1
