CN115981853A - GPU (Graphics Processing Unit) interconnection architecture, method for realizing GPU interconnection architecture and computing equipment - Google Patents


Info

Publication number
CN115981853A
CN115981853A (application CN202211663517.XA)
Authority
CN
China
Prior art keywords
gpu
communication ports
card
interconnected
display memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211663517.XA
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202211663517.XA priority Critical patent/CN115981853A/en
Publication of CN115981853A publication Critical patent/CN115981853A/en
Pending legal-status Critical Current

Classifications

    • Y — General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02 — Technologies or applications for mitigation or adaptation against climate change
    • Y02D — Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides a GPU interconnection architecture, a method for implementing the GPU interconnection architecture, a computing device, a computer program product, and a computer-readable storage medium. The GPU interconnection architecture comprises at least two interconnected GPU cards. Each of the at least two GPU cards comprises a display memory and N communication ports, and accesses the display memory of the GPU card interconnected with it using at least two of those communication ports, where N is an integer greater than or equal to 2. Each of the at least two GPU cards is configured to access the display memory of the GPU card interconnected with it by using at least two of the N communication ports in turn, according to a count value. According to embodiments of the disclosure, load balancing across a plurality of communication ports is realized, preventing the link of one communication port from sitting idle while the links of other communication ports are congested, thereby achieving high speed and high bandwidth.

Description

GPU (graphics processing Unit) interconnection architecture, method for realizing GPU interconnection architecture and computing equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a GPU interconnection architecture, a method of implementing the GPU interconnection architecture, a computing device, a computer program product, and a computer-readable storage medium.
Background
A Graphics Processing Unit (GPU) is a microprocessor dedicated to image- and graphics-related operations on personal computers, workstations, game consoles, and some mobile devices (e.g., tablet computers and smartphones). With the development of computer technology, the data throughput of 3D operations on high-end graphics processors has become enormous, placing heavy demands on the PCIe resources of server platforms. To meet the requirement for very high computing capacity, multiple GPU cards may be integrated into a computing device.
Disclosure of Invention
The inventors have noted that, for multiple interconnected GPU cards, the data traffic (i.e., load) on the communication ports of each GPU card is typically unbalanced. The present disclosure schedules at the granularity of individual data packets, balancing the load across the multiple ports of each GPU card and thereby making fuller use of the display memory bandwidth.
According to an aspect of the present disclosure, a GPU interconnect architecture is provided. The GPU interconnection architecture comprises at least two interconnected GPU cards. Each of the at least two GPU cards comprises a display memory and N communication ports, and accesses the display memory of the GPU card interconnected with it using at least two of those communication ports, where N is an integer greater than or equal to 2. Each of the at least two GPU cards is configured to access the display memory of the GPU card interconnected with it by using at least two of the N communication ports in turn, according to a count value.
In some embodiments, each of the at least two GPU boards further comprises a counter for generating the count value.
In some embodiments, the counter starts counting from 0 and the maximum value of the count value is equal to N-1.
In some embodiments, each of the at least two GPU cards further comprises a router configured to parse address information in an access request and send the access request to one of the N communication ports.
In some embodiments, the router comprises a plurality of router units; the display memories of the GPU cards interconnected with the GPU card occupy a plurality of address ranges, and the router units correspond one-to-one with the address ranges.
In some embodiments, each of the at least two GPU boards further comprises a plurality of multiplexers; the plurality of multiplexers correspond one-to-one with the plurality of router units, and each of the plurality of multiplexers is communicably connected with the N communication ports.
In some embodiments, each of the plurality of multiplexers is configured to receive an access request sent by the corresponding router unit and send the access request to one of the N communication ports according to the count value.
In some embodiments, each of the at least two GPU cards further comprises a bridge, and each of the plurality of multiplexers is communicatively connected with the N communication ports via the bridge; the bridge comprises N connection modules, which correspond one-to-one with the N communication ports, and each of the N connection modules is communicably connected with its corresponding communication port.
In some embodiments, each of the plurality of multiplexers includes a register that stores addresses of at least two of the N communication ports.
According to another aspect of the present disclosure, a method of implementing a GPU interconnect architecture is provided. The GPU interconnection architecture comprises at least two interconnected GPU cards; each of the at least two GPU cards comprises a display memory and N communication ports, and accesses the display memory of the GPU card interconnected with it using at least two of those communication ports, where N is an integer greater than or equal to 2. The method comprises the following step:
configuring each of the at least two GPU cards to access the display memory of the GPU card interconnected with it by using at least two of the N communication ports in turn, according to the count value.
In some embodiments, each of the at least two GPU boards further comprises a counter for generating the count value.
In some embodiments, the counter starts counting from 0 and the maximum value of the count value is equal to N-1.
In some embodiments, each of the at least two GPU cards further comprises a router configured to parse address information in an access request and send the access request to one of the N communication ports.
In some embodiments, the router comprises a plurality of router units; the display memories of the GPU cards interconnected with the GPU card occupy a plurality of address ranges, and the router units correspond one-to-one with the address ranges.
In some embodiments, each of the at least two GPU boards further comprises a plurality of multiplexers; the plurality of multiplexers correspond one-to-one with the plurality of router units, and each of the plurality of multiplexers is communicably connected with the N communication ports.
In some embodiments, each of the plurality of multiplexers is configured to receive an access request sent by the corresponding router unit and send the access request to one of the N communication ports according to the count value.
In some embodiments, each of the at least two GPU cards further comprises a bridge, and each of the plurality of multiplexers is communicatively connected with the N communication ports via the bridge; the bridge comprises N connection modules, which correspond one-to-one with the N communication ports, and each of the N connection modules is communicably connected with its corresponding communication port.
In some embodiments, each of the plurality of multiplexers includes a register that stores addresses of at least two of the N communication ports.
According to yet another aspect of the present disclosure, a computing device is provided. The computing device comprises a GPU interconnect architecture according to any of the preceding embodiments.
According to another aspect of the present disclosure, a computer program product is provided. The computer program product comprises computer executable instructions, wherein the computer executable instructions, when executed by a processor, perform the method according to any of the preceding embodiments.
According to another aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-executable instructions that, when executed, perform the method according to any of the preceding embodiments.
Drawings
To more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present disclosure, and that other drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 illustrates a GPU interconnection architecture according to an embodiment of the present disclosure;
FIG. 2 illustrates a GPU interconnection architecture according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a structure of a GPU card according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram showing a GPU board card according to another embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a multiplexer in a GPU card according to an embodiment of the present disclosure;
FIG. 6 shows a flow diagram of a method of implementing a GPU interconnect architecture in accordance with an embodiment of the present disclosure; and
FIG. 7 illustrates an example system according to an embodiment of this disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments derived by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
When multiple GPU cards are integrated within a computing device, they may share computing resources (such as, but not limited to, GPUs) and/or storage resources (such as, but not limited to, display memory) through interconnection. In this configuration, by means of the interconnection technique, each GPU card can access the display memory (VRAM) of another GPU card via at least one port (for example, via N ports when it comprises a plurality of ports, where N is an integer greater than 1). For example, as shown in FIG. 1, GPU card 101 and GPU card 102 may have the same configuration. GPU card 101 may include a GPU core (GPU), a display memory (VRAM), a router, and two communication ports P0 and P1. When the two GPU cards are interconnected, each may access the display memory VRAM of the other via, for example, the two communication ports P0 and P1.
According to an aspect of the present disclosure, a GPU interconnect architecture is provided. FIG. 1 illustrates a GPU interconnection architecture according to an embodiment of the present disclosure; fig. 2 illustrates a GPU interconnect architecture according to another embodiment of the present disclosure.
As shown in fig. 1, the GPU interconnect architecture comprises at least two interconnected GPU cards 101 and 102. Each of the at least two GPU cards 101, 102 includes a display memory VRAM and N communication ports P0-P3, and accesses the display memory VRAM of the GPU card interconnected with it using at least two of those communication ports, where N is an integer greater than or equal to 2. Each of the at least two GPU cards 101, 102 is configured to use at least two communication ports (e.g., P0 and P1) of the N communication ports P0-P3 in turn to access the display memory VRAM of the interconnected GPU card, according to the count value.
In the embodiment shown in FIG. 1, the first GPU card 101 comprises a GPU core (GPU), a display memory VRAM, a router, and N communication ports P0-P3. At least two communication ports may be selected as needed to access the display memory VRAM of the second GPU card 102. For example, as shown in FIG. 1, communication ports P0 and P1 may be selected in turn to access the second GPU card 102. It should be understood that when the first GPU card 101 and the second GPU card 102 communicate through P0, P1, and P2, the communication ports P0, P1, and P2 may instead be selected in turn to access the second GPU card 102. Those skilled in the art will appreciate that the communication ports P0, P1, P2, and P3 (or more communication ports) may also be selected to access the second GPU card 102.
Similarly, the second GPU card 102 includes a GPU core (GPU), a display memory VRAM, a router, and N communication ports P0-P3. At least two communication ports may be selected as needed to access the display memory VRAM of the first GPU board 101. For example, as shown in FIG. 1, communication ports P0 and P1 may be alternately selected to access the first GPU card 101. As previously described, communication ports P0, P1, and P2 may also be selected in turn to access the first GPU card 101. Those skilled in the art will appreciate that communication ports P0, P1, P2, and P3 (or more communication ports) may also be selected to access the first GPU card 101.
In the context of the present disclosure, "in turn" (or "alternately") means sequentially and cyclically. For example, when the first GPU card 101 selects communication ports P0 and P1 to access the display memory VRAM of the second GPU card 102, that display memory may be accessed in the order P0-P1-P0-P1.... When the first GPU card 101 selects communication ports P0, P1, and P2 to access the display memory VRAM of the second GPU card 102, that display memory may be accessed in the order P0-P1-P2-P0-P1-P2.... Similarly, it may also be accessed in the order P0-P2-P1-P0-P2-P1..., and so on; that is, the accesses may be performed via the plurality of communication ports in a particular cyclic order.
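The cyclic ordering described above can be sketched in software. The following Python snippet is an illustrative model only, not the patent's hardware implementation; the function name and port labels are chosen for illustration:

```python
def access_order(ports, num_requests):
    """Return the port used for each of num_requests accesses,
    cycling through the given ports sequentially and repeatedly."""
    return [ports[i % len(ports)] for i in range(num_requests)]

# Two designated ports: accesses go P0-P1-P0-P1...
two_port_order = access_order(["P0", "P1"], 4)

# Three designated ports: accesses go P0-P1-P2-P0-P1-P2...
three_port_order = access_order(["P0", "P1", "P2"], 6)
```

Any fixed cyclic order works the same way; passing `["P0", "P2", "P1"]` would produce the P0-P2-P1-P0-P2-P1... sequence mentioned in the text.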
In an embodiment of the disclosure, each GPU card uses at least two of the N communication ports in turn to access the display memory of the GPU card interconnected with it. According to the embodiments of the disclosure, load balancing across a plurality of communication ports is realized, preventing the link of one communication port from sitting idle while the links of other communication ports are congested, thereby achieving high speed and high bandwidth.
Those skilled in the art will appreciate that, in embodiments of the present disclosure, each GPU card may include N communication ports (N being an integer greater than or equal to 2). Through address negotiation, a GPU card in the GPU interconnection architecture of the embodiments of the present disclosure may designate at least two communication ports for accessing the GPU card interconnected with it. As shown in fig. 1, the first GPU card 101 may designate communication ports P0 and P1 for accessing the VRAM of the second GPU card 102, and the second GPU card 102 may designate communication ports P0 and P1 for accessing the VRAM of the first GPU card 101. It should be understood that the GPU interconnection architecture of the present disclosure may also include a GPU card designated to access an interconnected GPU card through a single communication port. For example, a fifth GPU card including 4 ports may use its P0 port to communicate with a first GPU card, its P1 port with a second GPU card, its P2 port with a third GPU card, and its P3 port with a fourth GPU card.
In architectures with more interconnected GPU cards, each GPU card can still designate, through address negotiation, at least two communication ports for accessing each GPU card interconnected with it. FIG. 2 shows an architecture with four interconnected GPU cards. As shown in fig. 2, the first GPU card 101 may specify: communication ports P0 and P1 are used to access the VRAM of the second GPU card 102, communication ports P2 and P3 are used to access the VRAM of the third GPU card 103, and communication ports P4 and P5 are used to access the VRAM of the fourth GPU card 104. Similarly, the second GPU card 102 may specify: communication ports P0 and P1 are used to access the VRAM of the first GPU card 101, communication ports P3 and P4 are used to access the VRAM of the third GPU card 103, and communication ports P2 and P5 are used to access the VRAM of the fourth GPU card 104.
It is understood that, although each GPU card in the examples of the present application designates 2 communication ports for accessing each GPU card interconnected with it, the assignment may also be made according to the actual number of communication ports on each GPU card and the actual number of interconnected GPU cards. For example, in an architecture having 2 interconnected GPU cards in which each GPU card includes 6 communication ports, each GPU card may designate at least two communication ports, e.g., all 6 communication ports, for accessing the other GPU card.
Those skilled in the art will appreciate that each GPU card may include a GPU core (shown as "GPU" in FIGS. 1-4), which includes an adder (array) and/or a multiplier (array) made up of a plurality of logic gates for performing various types of operations.
In some embodiments, as shown in fig. 1 and 2, each of the at least two GPU boards further comprises a counter C for generating the count value.
A count operation is performed using the counter C to obtain a count value indicating a communication port. For each GPU card, the count value may be input to the router, which designates, according to the count value, the communication port used to access the display memory of the interconnected GPU card. Those skilled in the art will appreciate that the count value indicating the communication port may also come from outside the GPU card, such as from the system bus or other components; thus the counter C in FIGS. 1 and 2 is not a necessary component for implementing the embodiments of the present disclosure.
In some embodiments, the counter starts counting from 0 and the maximum value of the count value is equal to N-1.
As will be appreciated by those skilled in the art, if each GPU card designates (or activates) N communication ports for accessing one other GPU card, the maximum value of the count value is equal to N-1. Count value 0 corresponds to the 1st communication port, count value 1 to the 2nd communication port, and so on, with count value N-1 corresponding to the Nth communication port. It should be noted that the starting value and the maximum value of the count value may be flexibly configured. For example, when two GPU cards each include 6 communication ports but communicate through only 5 of them, the counter starts from 0 and the maximum count value is 4. That is, when two GPU cards are interconnected through x communication ports, the counter starts counting from 0 and the maximum count value is equal to x-1, where x is an integer between 2 and N.
In some embodiments, as shown in fig. 1, each of the at least two GPU boards 101, 102 further comprises a router configured to: and analyzing address information in the access request, and sending the access request to one of the N communication ports.
For example, as shown in FIG. 1, suppose that after address negotiation the address range of the VRAM of the first GPU card 101 is [0, 8G-1] and the address range of the VRAM of the second GPU card 102 is [8G, 16G-1]. Both communication ports P0 and P1 of the first GPU card 101 can access the entire VRAM of the second GPU card 102, and both communication ports P0 and P1 of the second GPU card 102 can access the entire VRAM of the first GPU card 101. When either GPU card accesses the VRAM of the other, the router selects the communication port for the access. For example, when the first GPU card 101 accesses the second GPU card 102, the router of the first GPU card 101 parses the address information in the access request. Upon determining that the address information belongs to the address range [8G, 16G-1], the router sends the access request to one of the communication ports P0 and P1, using P0 and P1 in turn for successive access requests matching that address range. Traffic load balancing among multiple communication ports on the same GPU card can thus be achieved by using the router.
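As a software analogy (an illustrative sketch with hypothetical names, not the hardware design itself), a router that checks an access request's address against the negotiated remote VRAM range and then alternates between the designated ports might look like:

```python
GIB = 1 << 30  # bytes per GiB; "8G" in the text means 8 GiB

class Router:
    """Route access requests whose address falls in a remote VRAM range
    to one of the designated communication ports, in turn."""
    def __init__(self, remote_range, ports):
        self.lo, self.hi = remote_range  # inclusive range, e.g. (8*GIB, 16*GIB - 1)
        self.ports = list(ports)         # designated ports, e.g. ["P0", "P1"]
        self.count = 0                   # plays the role of the count value

    def route(self, address):
        if not (self.lo <= address <= self.hi):
            raise ValueError("address is not in the remote VRAM range")
        port = self.ports[self.count]
        # advance the count value, wrapping back to 0 after the last port
        self.count = (self.count + 1) % len(self.ports)
        return port

# The first card's view of the second card's VRAM, per the example above
router = Router((8 * GIB, 16 * GIB - 1), ["P0", "P1"])
ports_used = [router.route(8 * GIB + i) for i in range(4)]
```

Successive matching requests are spread across P0 and P1, which is the traffic load balancing the text describes.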
Those skilled in the art will appreciate that each GPU card may also include a separate communication port (not shown), wherein the GPU card may use only one such communication port to access the display memory of the GPU card interconnected with the GPU card.
Fig. 3 shows a schematic structural diagram of a GPU card 301 according to an embodiment of the present disclosure. As shown in fig. 3, in some embodiments the router 3011 includes a plurality of router units R0-R5; the display memories of the GPU cards interconnected with the GPU card 301 occupy a plurality of address ranges, and the router units R0-R5 are in one-to-one correspondence with those address ranges. For example, suppose 3 GPU cards (not shown) are interconnected with the GPU card 301, and after address negotiation the address ranges of the VRAMs of the 3 GPU cards are [8G, 16G-1], [16G, 24G-1], and [24G, 32G-1], respectively. Then router unit R0 may correspond to address range [8G, 16G-1], router unit R1 to [16G, 24G-1], and router unit R2 to [24G, 32G-1]. When the router 3011 receives a (remote) access request from the GPU, it first parses the address information in the access request and then sends the request to one of the communication ports P0-P5 according to that address information, thereby accessing the VRAM of another GPU card.
It should be noted that, in the interconnect architecture of the embodiments of the present disclosure, the display memory of each GPU card may have a local address, and the address range (for example, a global address) of the display memory of each GPU card may be obtained by performing unified addressing based on the local addresses of the GPU cards. The present disclosure does not limit the size of each GPU card's local address space or the unified addressing scheme. For example, the local address ranges of different GPU cards may be the same size or different sizes; unified addressing may produce a plurality of address ranges, with the GPU cards corresponding to the address ranges one to one and no two address ranges overlapping.
In one possible implementation, the unified addressing may be performed by a processor (e.g., a central processing unit, CPU). The processor may obtain device information for the plurality of GPU cards, where the device information may include each GPU card's number and storage information (e.g., the local addresses of its display memory), determine an address range for each GPU card based on the device information, and send each address range to the corresponding GPU card, so that each GPU card addresses its display memory based on that range. That is, the address range of each GPU card may be assigned by the central processing unit through unified addressing and is determined based on the device information of the plurality of GPU cards.
Continuing the example of the CPU performing the unified addressing, the CPU may obtain port connection information (which may include, for example, card numbers and communication port numbers) for communication between the GPU cards, determine an addressing policy (including, for example, routing information) for each GPU card based on the cards' address ranges and the port connection information, and send each addressing policy to the corresponding GPU card, so that each GPU card addresses display memory based on the received policy. The addressing policy may cover how to address the GPU card's own display memory and how to address the display memories of the other GPU cards, for example, the address range and communication port(s) corresponding to each router unit; the present disclosure does not limit this.
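A minimal sketch of the unified addressing step is shown below. It assumes, purely for illustration, that the CPU lays the cards' local VRAM ranges out back-to-back in a single global address space; the function name and layout are hypothetical, since the disclosure does not fix a particular scheme:

```python
GIB = 1 << 30  # bytes per GiB

def unify_addressing(vram_sizes):
    """Assign each GPU card a non-overlapping global address range by
    placing the cards' local VRAM ranges consecutively, in card order."""
    ranges = []
    base = 0
    for size in vram_sizes:
        ranges.append((base, base + size - 1))  # inclusive [lo, hi] range
        base += size
    return ranges

# Four cards with 8 GiB of display memory each produces the ranges used
# in the text: [0, 8G-1], [8G, 16G-1], [16G, 24G-1], [24G, 32G-1].
global_ranges = unify_addressing([8 * GIB] * 4)
```

Because the ranges are disjoint, a router unit can be bound to exactly one remote card's range, matching the one-to-one correspondence described above.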
As shown in FIG. 3, in some embodiments, each of the at least two GPU cards further comprises a plurality of multiplexers M0-M5; the plurality of multiplexers M0-M5 are in one-to-one correspondence with the plurality of router units R0-R5, and each of the plurality of multiplexers M0-M5 is communicably connected with the N communication ports (P0-P5).
In some embodiments, each of the plurality of multiplexers is configured to: and receiving an access request sent by a corresponding router unit, and sending the access request to one of the N communication ports according to the count value.
As shown in fig. 3, in an embodiment of the present disclosure, for a single access request, each multiplexer (e.g., M0) receives the access request from the corresponding router unit (e.g., R0) and sends it to one of the plurality of communication ports (P0-P5) according to the address information. Furthermore, in the embodiment shown in FIG. 3, if the address range held by router unit R0 is designated to be accessed via communication ports P0 and P1, the multiplexer M0 will send successive access requests from router unit R0 to communication ports P0 and P1 in turn. Similarly, if that address range is designated to be accessed via communication ports P0-P3, the multiplexer M0 will send successive access requests from router unit R0 to ports P0-P3 in turn. Load balancing across the plurality of communication ports is thereby achieved.
Fig. 4 is a schematic structural diagram of a GPU board card according to another embodiment of the present disclosure. The embodiment shown in fig. 4 differs from the embodiment shown in fig. 3 in the bridge 3013. As shown in fig. 4, in some embodiments, each of the at least two GPU boards further comprises a bridge 3013, and each of the plurality of multiplexers M0-M5 is communicatively connected with the N communication ports (P0-P5) via the bridge 3013; the bridge 3013 includes N connection modules B0 to B5, where the N connection modules B0 to B5 correspond to the N communication ports (P0 to P5) one to one; each of the N connection modules B0-B5 is communicably connected with a corresponding communication port.
For a GPU card, if the distance between a communication port and the router is long, the wires between them may increase the difficulty of routing at the back end of the chip. Therefore, the bridge can be placed near the multiplexers, with each connection module of the bridge wired to its corresponding communication port, reducing the back-end routing difficulty of the chip. A connection module may be a conductive hardware module that transmits access requests from the multiplexer to the corresponding communication port.
In some embodiments, each of the plurality of multiplexers includes a register that stores addresses of at least two of the N communication ports.
For example, the multiplexer may provide a 32-bit register to indicate which communication ports it outputs to; for instance, the register of M0 may be filled with (1 | 1 << 1), indicating that M0 may output to P0 and P1. Furthermore, as shown in fig. 5, the multiplexer may also include a counter C to decide to which communication port the current access request is sent. In this embodiment, the counter in the multiplexer starts counting from 0, and its maximum value is the number of communication ports the multiplexer designates (or activates) minus 1. For example, in this embodiment, the maximum count value of the counter is 1, and the count value is incremented by 1 each time the multiplexer receives an access request (or packet). If the current count value is 1, it returns to the minimum value 0 after the +1 operation. Each time an access request is received, the multiplexer reads the count value of the counter; if the count value is 0, the access request is sent to communication port P0, otherwise it is sent to communication port P1. Scheduling is thus performed at the granularity of access requests (or packets), realizing load balancing across the plurality of communication ports. Those skilled in the art will appreciate that the scheduling described above also applies to a larger number of communication ports.
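The 32-bit port-mask register and per-multiplexer counter described above can be modeled as follows. This is an illustrative sketch only; the one-bit-per-port register layout follows the (1 | 1 << 1) example in the text, and the class and method names are hypothetical:

```python
class Multiplexer:
    """Dispatch access requests to the ports enabled in a 32-bit mask,
    cycling through them at request (packet) granularity."""
    def __init__(self, port_mask):
        # Decode the register: bit i set means port Pi is enabled.
        self.ports = [i for i in range(32) if port_mask & (1 << i)]
        self.count = 0  # the multiplexer's internal counter, starting at 0

    def dispatch(self):
        """Pick the port for the current request, then advance the counter,
        wrapping back to 0 past (number of enabled ports - 1)."""
        port = self.ports[self.count]
        self.count = (self.count + 1) % len(self.ports)
        return port

mux = Multiplexer(1 | 1 << 1)               # register value enabling P0 and P1
picks = [mux.dispatch() for _ in range(4)]  # port indices for 4 requests
```

With the two-port mask, requests alternate between port 0 and port 1 exactly as the counter description above prescribes; a wider mask yields the same round-robin over more ports.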
According to another aspect of the present disclosure, a method of implementing a GPU interconnect architecture is provided. As shown in fig. 6, the GPU interconnect architecture includes at least two interconnected GPU cards, each of which includes a display memory and N communication ports, where N is an integer greater than or equal to 2, and at least two of the communication ports are used to access the display memory of a GPU card interconnected with the GPU card. The method comprises: configuring each of the at least two GPU cards to use, in turn according to a count value, at least two of the N communication ports to access the display memory of the GPU card interconnected with that GPU card (S601).
In an embodiment of the disclosure, each GPU card uses at least two of the N communication ports in turn to access the display memory of the GPU card interconnected with it. This achieves load balancing across the multiple communication ports and avoids the situation in which the link of one communication port sits idle while the links of other communication ports are congested, thereby achieving high speed and high bandwidth.
In some embodiments, as shown in figs. 1 and 2, each of the at least two GPU cards further comprises a counter C for generating the count value.
A counting operation is performed using the counter C to obtain a count value that indicates a communication port. For each GPU card, the count value may be input to a router, which designates, according to the count value, a communication port for accessing the display memory of the GPU card interconnected with that GPU card. Those skilled in the art will appreciate that the count value indicating the communication port may also come from outside the GPU card, such as from the system bus or other components; therefore, the counter C in figs. 1 and 2 is not a necessary component for implementing the embodiments of the present disclosure.
In some embodiments, the counter starts counting from 0 and the maximum value of the count value is equal to N-1.
Those skilled in the art will appreciate that the number of communication ports per GPU card may be greater than N. If each GPU card designates (or activates) N communication ports for accessing other GPU cards, the maximum count value equals N-1: count value 0 corresponds to the 1st communication port, count value 1 to the 2nd communication port, and so on, with count value N-1 corresponding to the Nth communication port.
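The wrapping-counter behavior described above can be sketched as follows (an illustrative Python sketch; the class and method names are hypothetical, not from the patent):

```python
class PortCounter:
    """Wrapping counter whose value selects one of N activated communication ports."""

    def __init__(self, n_ports: int):
        self.n_ports = n_ports  # N designated (activated) communication ports
        self.value = 0          # counting starts from 0; maximum value is N - 1

    def next_port(self) -> int:
        """Return the port index for the current access request, then advance."""
        port = self.value                             # count value k -> (k+1)-th port
        self.value = (self.value + 1) % self.n_ports  # after N - 1, wrap back to 0
        return port
```

With N = 2 activated ports, successive access requests are steered to ports 0, 1, 0, 1, ..., which is the round-robin alternation the embodiment describes.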
In some embodiments, each of the at least two GPU cards further comprises a router configured to: parse address information in an access request and send the access request to one of the N communication ports.
For example, as shown in fig. 1, assume that after address negotiation the address range of the VRAM of the first GPU card 101 is [0, 8G-1] and the address range of the VRAM of the second GPU card 102 is [8G, 16G-1]. Both communication ports P0 and P1 of the first GPU card 101 can access the entire VRAM of the second GPU card 102, and both communication ports P0 and P1 of the second GPU card 102 can access the entire VRAM of the first GPU card 101. When a GPU card accesses the VRAM of another GPU card, its router can be used to select the communication port. For example, when the first GPU card 101 accesses the second GPU card 102, the router of the first GPU card 101 parses the address information in the access request. Upon determining that the address information belongs to the address range [8G, 16G-1], the router sends the access request to one of the communication ports P0 and P1, using P0 and P1 alternately for access requests matching that address range. The router can thus achieve traffic load balancing between two communication ports, or among N communication ports, on the same GPU card.
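The address-parsing and alternation logic of this example can be sketched in Python as follows (the class and method names are illustrative assumptions, not part of the patent):

```python
G = 1 << 30  # 1 GiB, so "8G" below means a byte address of 8 GiB

class Router:
    """Sketch of a per-card router: check that the address of an access request
    falls in the interconnected card's VRAM range, then forward the request to
    the designated communication ports alternately (round-robin)."""

    def __init__(self, remote_range: tuple, ports: list):
        self.lo, self.hi = remote_range  # VRAM address range of the remote card
        self.ports = ports               # designated ports, e.g. ["P0", "P1"]
        self.count = 0                   # count value used for alternation

    def route(self, address: int) -> str:
        """Return the communication port selected for this access request."""
        if not (self.lo <= address <= self.hi):
            raise ValueError("address is outside the interconnected card's VRAM")
        port = self.ports[self.count]                    # pick the current port
        self.count = (self.count + 1) % len(self.ports)  # alternate P0, P1, ...
        return port
```

For the first GPU card 101 of the example, `Router((8 * G, 16 * G - 1), ["P0", "P1"])` sends successive matching access requests to P0 and P1 alternately.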
Those skilled in the art will appreciate that each GPU card may also include a separate communication port (not shown), wherein the GPU card may use only one such communication port to access the display memory of the GPU card interconnected with the GPU card.
Fig. 3 shows a schematic structural diagram of a GPU card 301 according to an embodiment of the present disclosure. As shown in fig. 3, in some embodiments, the router 3011 includes a plurality of router units R0-R5; the display memory of the GPU cards interconnected with the GPU card 301 includes a plurality of address ranges, and the router units R0-R5 correspond one-to-one to those address ranges. For example, suppose three GPU cards (not shown) are interconnected with the GPU card 301, and after address negotiation the address ranges of their VRAMs are [8G, 16G-1], [16G, 24G-1], and [24G, 32G-1], respectively. Then router unit R0 may correspond to address range [8G, 16G-1], router unit R1 to [16G, 24G-1], and router unit R2 to [24G, 32G-1]. When the router 3011 receives a (remote) access request from the GPU, it first parses the address information in the access request and then sends the request to one of the communication ports P0-P5 according to that address information, thereby accessing the VRAM of another GPU card.
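The one-to-one mapping between router units and address ranges in this example can be sketched as a simple lookup (the function name and the table below are illustrative assumptions based on the ranges in the text):

```python
G = 1 << 30  # 1 GiB

# VRAM address ranges of the three interconnected GPU cards from the example,
# mapped one-to-one onto router units R0-R2.
UNIT_RANGES = {
    "R0": (8 * G, 16 * G - 1),
    "R1": (16 * G, 24 * G - 1),
    "R2": (24 * G, 32 * G - 1),
}

def router_unit_for(address: int) -> str:
    """Select the router unit whose address range contains the given address."""
    for unit, (lo, hi) in UNIT_RANGES.items():
        if lo <= address <= hi:
            return unit
    raise ValueError("address does not belong to any interconnected GPU card")
```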
As shown in FIG. 3, in some embodiments, each of the at least two GPU cards further comprises a plurality of multiplexers M0-M5; the plurality of multiplexers M0-M5 are in one-to-one correspondence with the plurality of router units R0-R5, and each of the plurality of multiplexers M0-M5 is communicably connected with the N communication ports (P0-P5).
In some embodiments, each of the plurality of multiplexers is configured to: receive an access request sent by the corresponding router unit and send the access request to one of the N communication ports according to the count value.
As shown in fig. 3, in an embodiment of the present disclosure, for a single access request, each multiplexer (e.g., M0) receives the access request from the corresponding router unit (e.g., R0) and sends it to one of the plurality of communication ports (P0-P5). Furthermore, in the embodiment shown in fig. 3, if the address range held by the router unit R0 is designated to be accessed through communication ports P0 and P1, the multiplexer M0 sends successive access requests from the router unit R0 to P0 and P1 in turn. Similarly, if that address range is designated to be accessed through communication ports P0-P3, the multiplexer M0 sends successive access requests from R0 to P0-P3 in turn. Load balancing of the plurality of communication ports is thereby achieved.
Although six router units R0-R5 are shown in fig. 3, those skilled in the art will understand that each GPU card may also include other router units that do not correspond to the plurality of address ranges.
As shown in fig. 4, in some embodiments, each of the at least two GPU cards further comprises a bridge 3013, and each of the plurality of multiplexers M0-M5 is communicably connected with the N communication ports (P0-P5) via the bridge 3013. The bridge 3013 includes N connection modules B0-B5, which correspond one-to-one to the N communication ports (P0-P5); each of the N connection modules B0-B5 is communicably connected with the corresponding communication port.
For the GPU card, if the communication ports are far from the router, the long wires between them can increase the difficulty of back-end physical routing of the chip. The bridge can therefore be arranged near the multiplexers, with each connection module of the bridge connected to its corresponding communication port, reducing the back-end routing difficulty of the chip. Each connection module may be a conductive hardware module that forwards access requests from a multiplexer to the corresponding communication port.
In some embodiments, each of the plurality of multiplexers includes a register that stores addresses of at least two of the N communication ports.
For example, the multiplexer may provide a 32-bit register indicating which communication ports it outputs to; for instance, the register of M0 may be set to (1 | 1 << 1), indicating that M0 may output to P0 and P1. Furthermore, as shown in fig. 5, the multiplexer may also include a counter C that decides to which communication port the current access request is sent. In this embodiment, the counter in the multiplexer starts counting from 0, and its maximum value is the number of communication ports designated (or activated) by the multiplexer minus 1. For example, in this embodiment the maximum count value of the counter is 1, and the count value is incremented by 1 each time the multiplexer receives an access request (or packet). If the current count value is 1, the count value returns to the minimum value 0 after the +1 operation. Each time an access request is received, the multiplexer reads the count value of the counter; if the count value is 0, the access request is sent to communication port P0, otherwise it is sent to communication port P1. Scheduling is thus performed at the granularity of access requests (or packets), achieving load balancing across the multiple communication ports. Those skilled in the art will appreciate that the scheduling described above also applies to a larger number of communication ports.
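The register-and-counter scheme of this embodiment can be sketched as follows (an illustrative Python sketch; the function and class names are hypothetical, not from the patent):

```python
def decode_port_mask(mask: int, total_ports: int = 32) -> list:
    """Decode the 32-bit register bitmask into the list of activated port indices."""
    return [i for i in range(total_ports) if mask & (1 << i)]

class Multiplexer:
    """Round-robin dispatch of access requests over the ports activated in the register."""

    def __init__(self, mask: int):
        self.ports = decode_port_mask(mask)  # e.g. (1 | 1 << 1) activates P0 and P1
        self.count = 0                       # starts at 0; max is len(ports) - 1

    def dispatch(self) -> int:
        """Read the count value, pick the port, then increment (wrapping to 0)."""
        port = self.ports[self.count]
        self.count = (self.count + 1) % len(self.ports)
        return port
```

Setting M0's register to `(1 | 1 << 1)` activates ports P0 and P1, and successive access requests (or packets) are dispatched to them alternately, as described above.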
According to yet another aspect of the present disclosure, a computing device is provided. The computing device comprises a GPU interconnect architecture according to any of the preceding embodiments.
According to another aspect of the present disclosure, a computer program product is provided. The computer program product comprises computer executable instructions, wherein the computer executable instructions, when executed by a processor, perform the method according to any of the preceding embodiments.
According to another aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-executable instructions that, when executed, perform the method according to any of the preceding embodiments.
Fig. 7 illustrates an example system 700 that includes an example computing device 710 that represents one or more systems and/or devices that can implement the various techniques described in this disclosure, according to an embodiment of this disclosure. Computing device 710 may be, for example, a server of a service provider, a device associated with a server, a system on a chip, and/or any other suitable computing device or computing system.
The example computing device 710 as illustrated includes a processing system 711, one or more computer-readable media 712, and one or more I/O interfaces 713 communicatively coupled to each other. Although not shown, the computing device 710 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 711 represents functionality to perform one or more operations using hardware. Thus, the processing system 711 is illustrated as including hardware elements 714 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware element 714 is not limited by the material from which it is formed or the processing mechanism employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 712 is illustrated as including a memory/storage 715. Memory/storage 715 represents memory/storage capacity associated with one or more computer-readable media. Memory/storage 715 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). Memory/storage 715 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 712 may be configured in various other ways as further described below.
One or more I/O interfaces 713 represent functionality that allows a user to enter commands and information to computing device 710 using various input devices and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that may not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 710 may be configured in various ways to support user interaction, as described further below.
Computing device 710 also includes application 716. The application 716 may, for example, be a software embodiment implementing a GPU interconnect architecture, and in combination with other elements in the computing device 710 implement the techniques described in this disclosure.
The present disclosure may describe various techniques in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used in this disclosure generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described in this disclosure are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 710. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage, tangible media, or an article of manufacture suitable for storing the desired information and which may be accessed by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to the hardware of the computing device 710, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, hardware element 714 and computer-readable medium 712 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described in this disclosure. The hardware elements may include integrated circuits or systems-on-a-chip, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described in this disclosure. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 714. The computing device 710 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, implementation of a module executable by the computing device 710 as software may be achieved at least partially in hardware, for example, through use of a computer-readable storage medium and/or the hardware elements 714 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 710 and/or processing systems 711) to implement the techniques, modules, and examples described in this disclosure.
In various implementations, the computing device 710 may assume a variety of different configurations. For example, the computing device 710 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 710 may also be implemented as a mobile device-like device including mobile devices such as mobile phones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. Computing device 710 may also be implemented as a television-like device that includes devices with or connected to a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, etc.
It is noted that, in the present disclosure, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a(n) …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element. Terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience in describing the present disclosure and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present disclosure. Unless expressly stated or limited otherwise, the terms "mounted," "connected," and "coupled" are to be understood broadly and may mean, for example, fixedly connected, detachably connected, or integrally connected; mechanically or electrically connected; directly connected, indirectly connected through an intermediate medium, or in internal communication between two elements. The specific meanings of the above terms in the present disclosure can be understood by those of ordinary skill in the art as appropriate.
In the description of the present disclosure, numerous specific details are set forth. It can be understood, however, that embodiments of the disclosure may be practiced without these specific details. In some embodiments, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
The above description is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the disclosure, and all the changes or substitutions are covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A GPU interconnect architecture, comprising: at least two interconnected GPU cards; each of the at least two GPU cards comprises a display memory and N communication ports, and accesses the display memory of a GPU card interconnected with it using at least two of the communication ports, wherein N is an integer greater than or equal to 2;
wherein each of the at least two GPU cards is configured to: use, in turn according to a count value, at least two of the N communication ports to access the display memory of the GPU card interconnected with that GPU card.
2. The GPU interconnection architecture of claim 1, wherein each of the at least two GPU cards further comprises a counter for generating the count value.
3. A GPU interconnect architecture according to claim 2, wherein the counter starts counting from 0 and the maximum value of the count value is equal to N-1.
4. The GPU interconnection architecture of claim 1, wherein each of the at least two GPU cards further comprises a router configured to: parse address information in an access request and send the access request to one of the N communication ports.
5. A GPU interconnection architecture according to claim 4, wherein the router comprises a plurality of router units; the display memory of the GPU card interconnected with the GPU card comprises a plurality of address ranges, and the plurality of router units correspond one-to-one to the plurality of address ranges.
6. A GPU interconnection architecture according to claim 5, wherein each of the at least two GPU cards further comprises a plurality of multiplexers; the plurality of multiplexers correspond one-to-one to the plurality of router units, and each of the plurality of multiplexers is communicably connected with the N communication ports.
7. A GPU interconnection architecture according to claim 6, wherein each of the plurality of multiplexers is configured to: receive an access request sent by the corresponding router unit and send the access request to one of the N communication ports according to the count value.
8. A GPU interconnection architecture according to claim 6, wherein each of the at least two GPU cards further comprises a bridge, each of the plurality of multiplexers being communicably connected with the N communication ports via the bridge; the bridge comprises N connection modules, which correspond one-to-one to the N communication ports; each of the N connection modules is communicably connected with the corresponding communication port.
9. The GPU interconnect architecture of claim 6, in which each of the plurality of multiplexers comprises a register that stores addresses of at least two of the N communication ports.
10. A method of implementing a GPU interconnect architecture, the GPU interconnect architecture comprising at least two interconnected GPU cards, each of the at least two GPU cards comprising a display memory and N communication ports and using at least two of the communication ports to access the display memory of a GPU card interconnected with it, wherein N is an integer greater than or equal to 2, the method comprising:
configuring each of the at least two GPU cards to: use, in turn according to a count value, at least two of the N communication ports to access the display memory of the GPU card interconnected with that GPU card.
11. A computing device comprising a GPU interconnect architecture according to any of claims 1-9.
12. A computer program product comprising computer executable instructions, wherein the computer executable instructions, when executed by a processor, perform the method of claim 10.
13. A computer-readable storage medium having stored thereon computer-executable instructions that, when executed, perform the method of claim 10.
CN202211663517.XA 2022-12-23 2022-12-23 GPU (graphics processing Unit) interconnection architecture, method for realizing GPU interconnection architecture and computing equipment Pending CN115981853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211663517.XA CN115981853A (en) 2022-12-23 2022-12-23 GPU (graphics processing Unit) interconnection architecture, method for realizing GPU interconnection architecture and computing equipment

Publications (1)

Publication Number Publication Date
CN115981853A true CN115981853A (en) 2023-04-18

Family

ID=85959188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211663517.XA Pending CN115981853A (en) 2022-12-23 2022-12-23 GPU (graphics processing Unit) interconnection architecture, method for realizing GPU interconnection architecture and computing equipment

Country Status (1)

Country Link
CN (1) CN115981853A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6715042B1 (en) * 2001-10-04 2004-03-30 Cirrus Logic, Inc. Systems and methods for multiport memory access in a multimaster environment
CN102227709A (en) * 2008-10-03 2011-10-26 先进微装置公司 Multi-processor architecture and method
CN107408085A (en) * 2015-01-29 2017-11-28 弩锋股份有限公司 The wide addressing of integrated system for computing system
CN109978751A (en) * 2017-12-28 2019-07-05 辉达公司 More GPU frame renderings
CN112785485A (en) * 2019-11-04 2021-05-11 辉达公司 Techniques for efficient structure-attached memory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚寿文 (Yao Shouwen) et al.: "刚体实时动态仿真基础" [Fundamentals of Real-Time Dynamic Simulation of Rigid Bodies], Beijing Institute of Technology Press, page 3 *

Similar Documents

Publication Publication Date Title
US9998558B2 (en) Method to implement RDMA NVME device
US10500505B2 (en) Techniques to interact with an application via messaging
US8661207B2 (en) Method and apparatus for assigning a memory to multi-processing unit
CN109428839B (en) CDN scheduling method, device and system
US10802879B2 (en) Method and device for dynamically assigning task and providing resources and system thereof
TW202027003A (en) Method and system for accepting blockchain evidence storage transaction
CN116601601A (en) Method for executing programmable atomic unit resource in multi-process system
CN111338745B (en) Deployment method and device of virtual machine and intelligent device
US8793420B2 (en) System on chip, electronic system including the same, and method of operating the same
KR20160081528A (en) Display controller and Semiconductor Integrated Circuit Device including the same
US20130286028A1 (en) Address generator of image processing device and operating method of address generator
CN115774620B (en) Method, device and computing equipment for realizing memory space mutual access in GPU interconnection architecture
WO2020239082A1 (en) Information displaying method and apparatus, and storage medium
CN115981853A (en) GPU (graphics processing Unit) interconnection architecture, method for realizing GPU interconnection architecture and computing equipment
CN116257320B (en) DPU-based virtualization configuration management method, device, equipment and medium
US20170239565A1 (en) Server apparatus, method, and non-transitory computer-readable medium
US9575759B2 (en) Memory system and electronic device including memory system
CN116150082A (en) Access method, device, chip, electronic equipment and storage medium
US11734007B2 (en) Address generation method, related apparatus, and storage medium
US11500802B1 (en) Data replication for accelerator
CN116529721A (en) On-demand programmable atomic kernel loading
CN112073505A (en) Method for unloading on cloud server, control device and storage medium
CN109861930A (en) Connection method, device and the host of virtual switch and virtual machine
CN115793983B (en) Addressing method, apparatus, system, computing device and storage medium
CN112905192B (en) Method for unloading on cloud server, control device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination