CN116263750A - Bridge, processing unit and computing system - Google Patents

Bridge, processing unit and computing system

Info

Publication number
CN116263750A
Authority
CN
China
Prior art keywords
circuit board
bridge
guide post
memory
connection terminals
Legal status
Pending
Application number
CN202111524992.4A
Other languages
Chinese (zh)
Inventor
杨鑫
陈强
张良锋
黄贤鹏
王胜雷
Current Assignee
Nvidia Corp
Original Assignee
Nvidia Corp
Application filed by Nvidia Corp
Priority to CN202111524992.4A
Publication of CN116263750A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 - Information transfer, e.g. on bus
    • G06F 13/40 - Bus structure
    • G06F 13/4004 - Coupling between buses
    • G06F 13/4027 - Coupling between buses using bus bridges
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/08 - Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F 11/10 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F 11/1008 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F 11/1044 - Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 - Information transfer, e.g. on bus
    • G06F 13/40 - Bus structure
    • G06F 13/4004 - Coupling between buses
    • G06F 13/4022 - Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 - Information transfer, e.g. on bus
    • G06F 13/40 - Bus structure
    • G06F 13/4063 - Device-to-bus coupling
    • G06F 13/4068 - Electrical coupling
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2213/00 - Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/0026 - PCI express
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Combinations Of Printed Boards (AREA)

Abstract

The invention discloses a bridge, a processing unit, and a computing system. The bridge comprises a first circuit board connected to a first function card through a first interface; a second circuit board connected to a second function card through a second interface; a circuit communication mechanism arranged between the first circuit board and the second circuit board for electrically connecting the two boards; and a distance adjusting mechanism arranged between the first circuit board and the second circuit board for adjusting the distance between them. In the bridge provided by the invention, the first circuit board and the second circuit board are connected through the circuit communication mechanism, and the distance between them is adjusted through the distance adjusting mechanism, so that a user can change the interface pitch of the bridge as required, allowing the bridge to be applied in a wider range of scenarios and improving its utilization.

Description

Bridge, processing unit and computing system
Technical Field
The present invention relates to the field of computer hardware, and in particular, to a bridge, a processing unit, and a computing system.
Background
With the continued development of computer technology, the computational demands of Artificial Intelligence (AI) and High Performance Computing (HPC) continue to grow. Multiprocessor systems that support seamless connections between processors (e.g., GPUs) are therefore increasingly needed, so that the processors can cooperate as one large accelerator. Because PCIe bandwidth is limited, it typically becomes a bottleneck; building a powerful end-to-end computing platform requires faster, more scalable interconnects.
NVLink is an interconnect and communication protocol developed by NVIDIA Corporation. NVLink uses a point-to-point topology with serial transmission; it is used to connect a Central Processing Unit (CPU) to a Graphics Processing Unit (GPU), and can also interconnect multiple GPUs. When two GPUs are connected through an NVLink bridge, the unidirectional transfer rate can reach 50 GB/s and the bidirectional transfer rate can reach 100 GB/s, far exceeding the bandwidth of the current PCIe bus. This extends GPU performance scaling and meets the demands of high-end visual computing workloads.
However, the interface pitch of a bridge is generally fixed (e.g., 3 slots, 4 slots, etc.), and users cannot adjust it to suit their needs. There is therefore a need for a new bridge that solves this problem.
Disclosure of Invention
This summary introduces a selection of concepts in a simplified form that are described in further detail in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The invention provides a bridge, comprising:
a first circuit board connected to a first function card through a first interface;
a second circuit board connected to a second function card through a second interface;
a circuit communication mechanism arranged between the first circuit board and the second circuit board for electrically connecting the first circuit board and the second circuit board; and
a distance adjusting mechanism arranged between the first circuit board and the second circuit board for adjusting the distance between the first circuit board and the second circuit board.
Further, at least two connection terminals are arranged on the first circuit board, at least two connection terminals are arranged on the second circuit board, and one of the connection terminals on the first circuit board is connected with one of the connection terminals on the second circuit board through the circuit communication mechanism.
Further, the circuit communication mechanism comprises a third circuit board, each end of which is provided with a connection terminal; one connection terminal of the third circuit board is connected to one of the connection terminals on the first circuit board, and the other connection terminal of the third circuit board is connected to one of the connection terminals on the second circuit board.
Further, the distance between the two connection terminals of the third circuit board is one or more slot pitches.
Further, the distance adjusting mechanism comprises a first guide post and a second guide post arranged in parallel, one end of the first guide post being fixed relative to the first circuit board and one end of the second guide post being fixed relative to the second circuit board.
Further, the distance adjusting mechanism further comprises a base in which a detent structure is arranged; the first guide post and the second guide post pass through the base, and the detent structure prevents the first guide post and the second guide post from sliding freely within the base.
Further, the detent structure comprises a cylinder whose side surface contacts the surfaces of the first guide post and the second guide post, and the contact surfaces between the cylinder and the two guide posts are provided with anti-slip features.
Further, the function card comprises a graphics card, a sound card, or a network card, and the function card is arranged in a card slot of the host.
The invention also provides a processing unit comprising a bridge as described above.
The present invention also provides a computing system comprising at least one processor and a memory coupled to the at least one processor, the computing system comprising a bridge as described above.
In the bridge provided by the invention, the first circuit board and the second circuit board are connected through the circuit communication mechanism, and the distance between them is adjusted through the distance adjusting mechanism, so that a user can change the interface pitch of the bridge as required. This allows the bridge to be applied in a wider range of scenarios and improves its utilization.
Drawings
The following drawings are included to provide an understanding of the invention and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with their description, serve to explain the principles of the invention.
In the accompanying drawings:
FIG. 1A illustrates a schematic diagram of a bridge according to one embodiment;
FIG. 1B illustrates a top view of a bridge according to one embodiment;
FIG. 1C illustrates a bottom view of a bridge according to one embodiment;
FIG. 2A illustrates a front view of a first mode of a bridge according to one embodiment;
FIG. 2B illustrates a front view of a second mode of a bridge according to one embodiment;
FIG. 3 illustrates a parallel processing unit according to one embodiment;
FIG. 4A illustrates a general processing cluster within the parallel processing unit of FIG. 3, according to one embodiment;
FIG. 4B illustrates a memory partition unit of the parallel processing unit of FIG. 3, according to one embodiment;
FIG. 5A illustrates the streaming multiprocessor of FIG. 4A, according to an embodiment;
FIG. 5B is a conceptual diagram of a processing system implemented using the Parallel Processing Unit (PPU) of FIG. 3, according to one embodiment;
FIG. 5C illustrates an exemplary system in which the various architectures and/or functions of the various previous embodiments may be implemented.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It should be apparent that the described embodiments are only some, not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the exemplary embodiments described herein. Based on the embodiments described in the present application, all other embodiments obtained by a person skilled in the art without inventive effort shall fall within the scope of the invention.
Example 1
Currently, the interface pitch of a bridge is generally fixed (e.g., 3 slots, 4 slots, etc.), and users cannot adjust it to suit their needs.
In view of this, the present embodiment provides a bridge. Referring to FIGS. 1A-2B, the bridge includes a first circuit board 110 connected to a first function card, a second circuit board 120 connected to a second function card, and a circuit communication mechanism 130 and a distance adjusting mechanism 140 disposed between the first circuit board 110 and the second circuit board 120, wherein the circuit communication mechanism 130 electrically connects the first circuit board 110 and the second circuit board 120, and the distance adjusting mechanism 140 adjusts the distance between the first circuit board 110 and the second circuit board 120.
Illustratively, the first circuit board 110 and the second circuit board 120 may be various types of circuit boards, such as ceramic circuit boards, alumina ceramic circuit boards, aluminum nitride ceramic circuit boards, printed circuit boards (PCBs), aluminum substrates, high-frequency boards, thick copper boards, and ultra-thin circuit boards; the present invention is not limited in this regard.
Illustratively, the first circuit board 110 and the second circuit board 120 are two independent circuit boards, which are capable of relative movement.
In one embodiment, the first circuit board 110 and the second circuit board 120 are both L-shaped, with long sides of the two being arranged in parallel and short sides of the two being arranged opposite.
Illustratively, the first circuit board 110 is provided with a first interface 113 for connecting to a first functional card (not shown), and the second circuit board 120 is provided with a second interface 123 for connecting to a second functional card (not shown).
Illustratively, function cards include, but are not limited to, graphics cards, sound cards, network cards, and the like; the present invention is not limited in this regard. Function cards are typically inserted into a computer motherboard (also known as a mainboard, system board, or logic board), which is the central or main circuit board of a complex electronic system such as a computer. A typical motherboard provides a series of slots for attaching devices such as processors, graphics cards, sound cards, network cards, hard disks, memory, and external devices; these are typically plugged directly into the relevant slots or connected by cables. The most important component on the motherboard is the chipset, which provides a common platform for the motherboard to connect to and control communication among different devices. The chipset also includes support for the processor and for different expansion slots, such as PCI, ISA, AGP, and PCI Express, and provides additional functionality for the motherboard, such as an integrated display core, integrated sound card, integrated infrared communication, Bluetooth, and Wi-Fi.
In one embodiment, the graphics card is composed of a Graphics Processing Unit (GPU), memory, a circuit board, BIOS firmware, and the like. Implementing NVLink interconnection of multiple GPUs through an NVLink bridge may form a Parallel Processing Unit (PPU), as described in detail below in connection with FIG. 3.
Illustratively, the first circuit board 110 is provided with at least two connection terminals 111, 112, and the second circuit board 120 is provided with at least two connection terminals 121, 122.
Further, one of the connection terminals 111, 112 on the first circuit board 110 is connected to one of the connection terminals 121, 122 on the second circuit board 120 by a circuit communication mechanism 130.
Illustratively, the circuit communication mechanism 130 includes a third circuit board, with connection terminals 131, 132 disposed at its two ends; one connection terminal 131 of the third circuit board is connected to one of the connection terminals 111, 112 on the first circuit board 110, and the other connection terminal 132 of the third circuit board is connected to one of the connection terminals 121, 122 on the second circuit board 120.
Further, the distance between the two connection terminals 131, 132 of the third circuit board is one or more slot pitches. In one embodiment, each slot pitch is about 20.3 mm. Preferably, the distance between the two connection terminals 111, 112 of the first circuit board is 1/2 slot pitch, and the distance between the two connection terminals 121, 122 of the second circuit board is 1/2 slot pitch.
In one embodiment, the third circuit board connects the short sides of the first circuit board 110 and the second circuit board 120, and the first circuit board 110, the second circuit board 120, and the third circuit board together form a U-shaped ("concave") arrangement, as shown in FIGS. 1A-1C.
Referring to FIGS. 2A and 2B, the bridge has a first mode and a second mode.
In one embodiment, as shown in FIG. 2A, when the connection terminal 131 of the third circuit board is connected to the connection terminal 111 of the first circuit board 110 and the connection terminal 132 of the third circuit board is connected to the connection terminal 121 of the second circuit board 120, the bridge is in the first mode, i.e., the narrow interface pitch mode.
In one embodiment, as shown in FIG. 2B, when the connection terminal 131 of the third circuit board is connected to the connection terminal 112 of the first circuit board 110 and the connection terminal 132 of the third circuit board is connected to the connection terminal 122 of the second circuit board 120, the bridge is in the second mode, i.e., the wide interface pitch mode.
In one embodiment, in the narrow interface pitch mode the first interface 113 is 3 slots from the second interface 123, and in the wide interface pitch mode the first interface 113 is 4 slots from the second interface 123.
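To make the pitch arithmetic above concrete, the following host-side sketch (CUDA/C++; the enum, function name, and program structure are illustrative assumptions, while the roughly 20.3 mm slot pitch and the 3-slot/4-slot modes come from the description) computes the physical interface pitch of each mode:

    #include <cstdio>

    // Interface pitch of the bridge in each mode, expressed in slots.
    enum class Mode { Narrow = 3, Wide = 4 };

    double interfacePitchMm(Mode m) {
        const double slotPitchMm = 20.3;  // approximate slot pitch
        return static_cast<int>(m) * slotPitchMm;
    }

    int main() {
        printf("narrow-pitch mode: %.1f mm\n", interfacePitchMm(Mode::Narrow)); // 60.9
        printf("wide-pitch mode:   %.1f mm\n", interfacePitchMm(Mode::Wide));   // 81.2
        return 0;
    }

Because each circuit board carries two connection terminals spaced 1/2 slot pitch apart, switching which terminal pair engages the third circuit board shifts each side by half a slot, changing the total interface pitch by exactly one slot (1/2 + 1/2 = 1).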
Illustratively, the distance adjusting mechanism 140 includes a first guide post 141, a second guide post 142, a base 143, and a detent structure 144. A first end of the first guide post 141 is fixedly connected to the first circuit board 110, a first end of the second guide post 142 is fixedly connected to the second circuit board 120, and the first guide post 141 and the second guide post 142 can move relative to each other under an external force.
Further, the first guide post 141 and the second guide post 142 pass through the base 143, and a detent structure 144 disposed in the base 143 prevents the first guide post 141 and the second guide post 142 from sliding freely within the base 143.
In one embodiment, the detent structure 144 comprises a cylinder that is spring-connected to the base 143, the sides of the cylinder contact the surfaces of the first and second guide posts 141, 142, and the contact surfaces of the cylinder and the first and second guide posts 141, 142 are each provided with anti-slip features, including but not limited to bumps, waves, gear-like protrusions, etc.
In one embodiment, when the bridge is switched from the first mode to the second mode by an external force, the first circuit board 110 and the second circuit board 120 are subjected to a tensile force, the first guide post 141 and the second guide post 142 move back to each other, the connection terminal 131 of the third circuit board is separated from the connection terminal 111 of the first circuit board 110 and connected to the connection terminal 112 of the first circuit board 110, and the connection terminal 132 of the third circuit board is separated from the connection terminal 121 of the second circuit board 120 and connected to the connection terminal 122 of the second circuit board 120.
In one embodiment, the end of the first guide post 141 and the end of the second guide post 142 are further provided with blocking members to limit the distance of the first guide post 141 and the second guide post 142 moving back to each other, so as to prevent the first guide post 141 and the second guide post 142 from sliding out of the base.
In one embodiment, when the bridge is switched from the second mode to the first mode by an external force, the first circuit board 110 and the second circuit board 120 are pressurized, the first guide post 141 and the second guide post 142 move toward each other, the connection terminal 131 of the third circuit board is separated from the connection terminal 112 of the first circuit board 110 and connected to the connection terminal 111 of the first circuit board 110, and the connection terminal 132 of the third circuit board is separated from the connection terminal 122 of the second circuit board 120 and connected to the connection terminal 121 of the second circuit board 120.
In one embodiment, the distance adjusting mechanism 140 connects the middle portions of the first circuit board 110 and the second circuit board 120, so that the force is applied more uniformly and the mode switching process is smoother.
Example 2
Referring to FIG. 3, a Parallel Processing Unit (PPU) 300 is a multi-threaded processor implemented on one or more integrated circuit devices. PPU 300 is a latency-hiding architecture designed to process many threads in parallel. A thread (i.e., an execution thread) is an instance of a set of instructions configured to be executed by PPU 300. In one embodiment, PPU 300 is a Graphics Processing Unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data in order to generate two-dimensional (2D) image data for display on a display device, such as a Liquid Crystal Display (LCD) device. In other embodiments, PPU 300 may be used to perform general-purpose computations. Although an exemplary parallel processor is provided herein for purposes of illustration, it should be noted that this processor is set forth for illustration only and that any processor may be employed in addition to and/or in place of it.
One or more PPUs 300 may be configured to accelerate thousands of High Performance Computing (HPC), data center, and machine learning applications. PPU 300 may be configured to accelerate a wide variety of deep learning systems and applications, including autonomous vehicle platforms, deep learning, high-accuracy speech, image, and text recognition systems, intelligent video analytics, molecular modeling, drug development, disease diagnosis, weather forecasting, big data analytics, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, personalized user recommendations, and the like.
As shown in FIG. 3, PPU 300 includes an input/output (I/O) unit 305, a front end unit 315, a scheduler unit 320, a work distribution unit 325, a hub 330, a crossbar (XBar) 370, one or more General Processing Clusters (GPCs) 350, and one or more partition units 380. PPU 300 may be connected to a host processor or other PPUs 300 via one or more high-speed NVLink 310 interconnects. PPU 300 may be connected to a host processor or other peripheral devices via interconnect 302. PPU 300 may also be connected to a local memory comprising a plurality of memory devices 304. In one embodiment, the local memory may include a plurality of Dynamic Random Access Memory (DRAM) devices. The DRAM devices may be configured as a High Bandwidth Memory (HBM) subsystem, in which multiple DRAM dies (die) are stacked within each device.
The NVLink 310 interconnect enables systems to scale to include one or more PPUs 300 combined with one or more CPUs, supports cache coherency between the PPUs 300 and CPUs, and supports CPU mastering. Data and/or commands may be sent by NVLink 310 through hub 330 to and from other units of PPU 300, such as one or more replication engines, video encoders, video decoders, power management units, etc. (not explicitly shown). NVLink 310 is described in more detail in connection with FIG. 5B.
The I/O unit 305 is configured to send and receive communications (e.g., commands, data, etc.) from a host processor (not shown) over the interconnect 302. The I/O unit 305 may communicate with the host processor directly via the interconnect 302 or through one or more intermediary devices such as a memory bridge. In one embodiment, I/O unit 305 may communicate with one or more other processors (e.g., one or more PPUs 300) via interconnect 302. In one embodiment, I/O unit 305 implements a peripheral component interconnect express (PCIe) interface for communicating over a PCIe bus, and interconnect 302 is a PCIe bus. In alternative embodiments, the I/O unit 305 may implement other types of known interfaces for communicating with external devices.
The I/O unit 305 decodes the data packet received via the interconnect 302. In one embodiment, the data packet represents a command configured to cause PPU 300 to perform various operations. I/O unit 305 sends decoded commands to the various other units of PPU 300 as specified by the commands. For example, some commands may be sent to the front-end unit 315. Other commands may be sent to the hub 330 or other units of the PPU 300, such as one or more replication engines, video encoders, video decoders, power management units, etc. (not explicitly shown). In other words, I/O unit 305 is configured to route communications between and among the various logical units of PPU 300.
In one embodiment, programs executed by the host processor encode the command stream in a buffer that provides the PPU 300 with a workload for processing. The workload may include a number of instructions and data to be processed by those instructions. A buffer is an area of memory that is accessible (e.g., read/write) by both the host processor and PPU 300. For example, I/O unit 305 may be configured to access buffers in system memory connected to interconnect 302 via memory requests transmitted through interconnect 302. In one embodiment, the host processor writes the command stream to the buffer and then sends a pointer to the beginning of the command stream to PPU 300. The front end unit 315 receives pointers to one or more command streams. The front-end unit 315 manages one or more streams, reads commands from the streams and forwards the commands to the various units of the PPU 300.
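As a rough, purely conceptual illustration of this buffer mechanism, the sketch below models a command stream as a ring buffer shared between the host and the PPU front end. The command format, ring size, and field names are assumptions, not the actual hardware interface, and a real implementation would need memory fences or atomics around the index updates:

    #include <cstdint>

    struct Command { uint32_t opcode; uint32_t payload[3]; };

    struct CommandStream {
        Command  ring[256];   // buffer readable/writable by host and PPU
        uint32_t write = 0;   // producer index, owned by the host
        uint32_t read  = 0;   // consumer index, owned by the front end
    };

    // Host side: append a command, then publish the new write pointer.
    bool push(CommandStream& s, const Command& c) {
        uint32_t next = (s.write + 1) % 256;
        if (next == s.read) return false;   // ring is full
        s.ring[s.write] = c;
        s.write = next;                     // front end now sees new work
        return true;
    }

    // Front-end side: consume the next command, if any.
    bool pop(CommandStream& s, Command& out) {
        if (s.read == s.write) return false;
        out = s.ring[s.read];
        s.read = (s.read + 1) % 256;
        return true;
    }

    int main() {
        CommandStream s;
        push(s, Command{1, {0, 0, 0}});
        Command c;
        return pop(s, c) ? 0 : 1;
    }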
The front end unit 315 is coupled to a scheduler unit 320 that configures the various GPCs 350 to process tasks defined by one or more streams. The scheduler unit 320 is configured to track status information related to the various tasks managed by the scheduler unit 320. The status may indicate to which GPC 350 a task is assigned, whether the task is active or inactive, a priority associated with the task, and so forth. The scheduler unit 320 manages the execution of multiple tasks on one or more GPCs 350.
The scheduler unit 320 is coupled to a work distribution unit 325 configured to dispatch tasks for execution on the GPCs 350. The work distribution unit 325 may track a number of scheduled tasks received from the scheduler unit 320. In one embodiment, the work distribution unit 325 manages a pending task pool and an active task pool for each GPC 350. The pending task pool may include a number of slots (e.g., 32 slots) containing tasks assigned to be processed by a particular GPC 350. The active task pool may include a number of slots (e.g., 4 slots) for tasks being actively processed by that GPC 350. When a GPC 350 completes execution of a task, the task is evicted from the active task pool of the GPC 350, and one of the other tasks from the pending task pool is selected and scheduled for execution on the GPC 350. If an active task on the GPC 350 has been idle, for example while waiting for a data dependency to be resolved, the active task may be evicted from the GPC 350 and returned to the pending task pool, while another task in the pending task pool is selected and scheduled for execution on the GPC 350.
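The pool bookkeeping described above can be sketched in host-style C++ as follows. Only the slot counts (32 pending, 4 active) come from the text; the types, method names, and policy details are simplified assumptions, not the hardware scheduler:

    #include <cstddef>
    #include <cstdint>
    #include <deque>

    struct Task { uint32_t id; };

    struct GpcScheduler {
        static constexpr std::size_t kPendingSlots = 32;  // pending task pool
        static constexpr std::size_t kActiveSlots  = 4;   // active task pool
        std::deque<Task> pending, active;

        bool submit(const Task& t) {            // from the work distribution unit
            if (pending.size() >= kPendingSlots) return false;
            pending.push_back(t);
            fill();
            return true;
        }
        void fill() {                           // promote pending -> active
            while (active.size() < kActiveSlots && !pending.empty()) {
                active.push_back(pending.front());
                pending.pop_front();
            }
        }
        void complete(uint32_t id) {            // finished task is evicted
            for (auto it = active.begin(); it != active.end(); ++it)
                if (it->id == id) { active.erase(it); break; }
            fill();
        }
        void stall(uint32_t id) {               // idle task returns to pending
            for (auto it = active.begin(); it != active.end(); ++it)
                if (it->id == id) { pending.push_back(*it); active.erase(it); break; }
            fill();
        }
    };

    int main() {
        GpcScheduler gpc;
        for (uint32_t i = 0; i < 10; ++i) gpc.submit(Task{i});
        gpc.complete(0);                        // a pending task is promoted
        return 0;
    }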
The work distribution unit 325 communicates with one or more GPCs 350 via the XBar (crossbar) 370. The XBar 370 is an interconnection network that couples many of the units of PPU 300 to other units of PPU 300. For example, the XBar 370 may be configured to couple the work distribution unit 325 to a particular GPC 350. Although not explicitly shown, one or more other units of PPU 300 may also be connected to the XBar 370 via the hub 330.
Tasks are managed by the scheduler unit 320 and dispatched to the GPCs 350 by the work distribution unit 325. The GPC 350 is configured to process a task and generate results. The results may be consumed by other tasks within the GPC 350, routed to a different GPC 350 via the XBar 370, or stored in memory 304. Results can be written to memory 304 via the partition units 380, which implement a memory interface for reading data from and writing data to memory 304. The results can be sent to another PPU 300 or a CPU via NVLink 310. In one embodiment, PPU 300 includes a number U of partition units 380 equal to the number of separate and distinct memory devices 304 coupled to PPU 300. Partition unit 380 is described in more detail below in conjunction with FIG. 4B.
In one embodiment, the host processor executes a driver kernel implementing an Application Programming Interface (API) that enables one or more applications to be executed on the host processor to schedule operations for execution on PPU 300. In one embodiment, multiple computing applications are executed simultaneously by PPU 300, and PPU 300 provides isolation, quality of service (QoS), and independent address space for multiple computing applications. The application may generate instructions (e.g., API calls) that cause the driver kernel to generate one or more tasks for execution by PPU 300. The driver kernel outputs tasks to one or more streams being processed by PPU 300. Each task may include one or more related thread groups, referred to herein as thread bundles (warp). In one embodiment, the thread bundle includes 32 related threads that may be executed in parallel. A cooperative thread may refer to multiple threads that include instructions to perform tasks and may exchange data through a shared memory. Threads and collaboration threads are described in more detail in connection with FIG. 5A.
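In standard CUDA (not code from this patent), a thread bundle (warp) is 32 threads that execute together and can exchange data directly through warp-level primitives. A minimal kernel illustrating the 32-thread grouping:

    #include <cstdio>

    // Each 32-thread bundle sums its lanes' values with warp shuffles;
    // lane 0 prints the bundle's result.
    __global__ void warpSum() {
        int lane = threadIdx.x % 32;           // position within the bundle
        int warp = threadIdx.x / 32;           // bundle index within the block
        int val  = threadIdx.x;                // per-thread payload

        for (int off = 16; off > 0; off /= 2)  // reduce across the 32 lanes
            val += __shfl_down_sync(0xffffffffu, val, off);

        if (lane == 0)                         // warp 0 -> 496, warp 1 -> 1520
            printf("warp %d sum = %d\n", warp, val);
    }

    int main() {
        warpSum<<<1, 64>>>();                  // one block, two thread bundles
        cudaDeviceSynchronize();
        return 0;
    }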
FIG. 4A illustrates a GPC 350 of the PPU 300 of FIG. 3, according to one embodiment. As shown in FIG. 4A, each GPC 350 includes multiple hardware units for processing tasks. In one embodiment, each GPC 350 includes a pipeline manager 410, a pre-raster operations unit (PROP) 415, a raster engine 425, a work distribution crossbar (WDX) 480, a Memory Management Unit (MMU) 490, and one or more Data Processing Clusters (DPCs) 420. It should be understood that the GPC 350 of FIG. 4A may include other hardware units instead of or in addition to the units shown in FIG. 4A.
In one embodiment, the operation of the GPC 350 is controlled by the pipeline manager 410. The pipeline manager 410 manages the configuration of the one or more DPCs 420 for processing tasks allocated to the GPC 350. In one embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement at least a portion of the graphics rendering pipeline. For example, a DPC 420 may be configured to execute a vertex shading program on the programmable Streaming Multiprocessor (SM) 440. The pipeline manager 410 may also be configured to route data packets received from the work distribution unit 325 to the appropriate logical units within the GPC 350. For example, some packets may be routed to fixed-function hardware units in the PROP 415 and/or raster engine 425, while other packets may be routed to the DPCs 420 for processing by the primitive engine 435 or SM 440. In one embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement a neural network model and/or a computational pipeline.
The PROP unit 415 is configured to route data generated by the raster engines 425 and DPC 420 to a Raster Operations (ROP) unit, described in more detail in connection with FIG. 4B. The PROP unit 415 may also be configured to perform optimization of color blending, organize pixel data, perform address translation, and so forth.
The raster engine 425 includes several fixed function hardware units configured to perform various raster operations. In one embodiment, the raster engine 425 includes a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, and a tile aggregation engine. The setup engine receives the transformed vertices and generates plane equations associated with the geometric primitives defined by the vertices. The plane equations are sent to the coarse raster engine to generate coverage information for the primitives (e.g., x, y coverage masks for the tiles). The output of the coarse raster engine is sent to a culling engine, where fragments associated with primitives that do not pass the z-test are culled, and non-culled fragments are sent to a clipping engine, where fragments that lie outside of the view cone are clipped. Those segments left after clipping and culling may be passed to a fine raster engine to generate attributes of the pixel segments based on plane equations generated by the setup engine. The output of the raster engine 425 includes segments to be processed, for example, by a fragment shader implemented within DPC 420.
Each DPC 420 included in the GPC 350 includes an M-pipeline controller (MPC) 430, a primitive engine 435, and one or more SMs 440. The MPC 430 controls the operation of the DPC 420, routing data packets received from the pipeline manager 410 to the appropriate units in the DPC 420. For example, data packets associated with vertices may be routed to the primitive engine 435, which is configured to fetch the vertex attributes associated with the vertices from memory 304. In contrast, data packets associated with a shading program may be sent to the SM 440.
SM 440 includes a programmable streaming processor configured to process tasks represented by multiple threads. Each SM 440 is multi-threaded and is configured to concurrently execute multiple threads (e.g., 32 threads) from a particular thread group. In one embodiment, SM 440 implements SIMD (single instruction, multiple data) architecture, where each thread in a thread group (e.g., warp) is configured to process a different set of data based on the same instruction set. All threads in the thread group execute the same instruction. In another embodiment, the SM 440 implements a SIMT (single instruction, multi-thread) architecture, wherein each thread in a thread group is configured to process a different set of data based on the same instruction set, but wherein the individual threads in the thread group are allowed to diverge during execution. In one embodiment, a program counter, call stack, and execution state are maintained for each thread bundle, enabling concurrency between the thread bundles and serial execution in the thread bundles when threads within the thread bundles diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, thereby achieving equal concurrency between all threads within and between thread bundles. When maintaining execution state for each individual thread, threads executing the same instructions may be converged and executed in parallel for maximum efficiency. SM 440 is described in more detail below in conjunction with fig. 5A.
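The divergence behavior just described can be observed with a small CUDA kernel (illustrative only). Lanes 0-15 and 16-31 of each thread bundle take different branches; under SIMT the two paths execute one after the other with inactive lanes masked off, and the bundle reconverges at the final store:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void divergent(int* out) {
        int lane = threadIdx.x % 32;
        int v;
        if (lane < 16)
            v = lane * 2;          // executed first; lanes 16-31 are masked
        else
            v = lane + 100;        // executed second; lanes 0-15 are masked
        out[threadIdx.x] = v;      // reconverged: all 32 lanes active again
    }

    int main() {
        int* d;
        cudaMalloc(&d, 32 * sizeof(int));
        divergent<<<1, 32>>>(d);
        int h[32];
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("lane 0 -> %d, lane 31 -> %d\n", h[0], h[31]);  // 0 and 131
        cudaFree(d);
        return 0;
    }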
MMU 490 provides an interface between GPC 350 and partition units 380. MMU 490 may provide virtual to physical address translations, memory protection, and arbitration of memory requests. In one embodiment, MMU 490 provides one or more translation look-aside buffers (TLB) for performing translations from virtual addresses to physical addresses in memory 304.
FIG. 4B illustrates a memory partition unit 380 of the PPU 300 of FIG. 3, according to one embodiment. As shown in FIG. 4B, the memory partition unit 380 includes a Raster Operations (ROP) unit 450, a level two (L2) cache 460, and a memory interface 470. The memory interface 470 is coupled to memory 304. The memory interface 470 may implement a 32-, 64-, 128-, or 1024-bit data bus, or the like, for high-speed data transfer. In one embodiment, PPU 300 incorporates U memory interfaces 470, one memory interface 470 for each pair of partition units 380, where each pair of partition units 380 is connected to a corresponding memory device 304. For example, PPU 300 may be connected to up to Y memory devices 304, such as high-bandwidth memory stacks or graphics double data rate version 5 synchronous dynamic random access memory, or other types of persistent memory.
In one embodiment, memory interface 470 implements an HBM2 memory interface, and Y equals half of U. In one embodiment, the HBM2 memory stacks are located on the same physical package as PPU 300, providing significant power and area savings over conventional GDDR5 SDRAM systems. In one embodiment, each HBM2 stack includes four memory dies and Y equals 4, with each die providing two 128-bit channels, for a total of 8 channels and a data bus width of 1024 bits.
In one embodiment, memory 304 supports a Single-Error Correcting, Double-Error Detecting (SECDED) Error Correction Code (ECC) to protect data. ECC provides higher reliability for computing applications that are sensitive to data corruption. Reliability is especially important in large-scale cluster computing environments, where PPUs 300 process very large data sets and/or run applications for extended periods.
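To illustrate the SECDED principle, here is a toy Hamming(8,4) encoder/decoder in host C++ (compiles with GCC/Clang/nvcc). Memory ECC of the kind described above protects much wider words (e.g., 64 data bits with 8 check bits), but the mechanism, a Hamming syndrome plus an overall parity bit, is the same; this is an illustration, not the PPU's implementation:

    #include <cstdint>
    #include <cstdio>

    // Encode 4 data bits into an 8-bit SECDED codeword: bit positions 1..7
    // hold a Hamming(7,4) code, and bit 0 holds the overall parity.
    uint8_t secded_encode(uint8_t data) {
        int d0 = (data >> 0) & 1, d1 = (data >> 1) & 1,
            d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
        int p1 = d0 ^ d1 ^ d3;                  // covers positions 3, 5, 7
        int p2 = d0 ^ d2 ^ d3;                  // covers positions 3, 6, 7
        int p4 = d1 ^ d2 ^ d3;                  // covers positions 5, 6, 7
        uint8_t cw = (p1 << 1) | (p2 << 2) | (d0 << 3) |
                     (p4 << 4) | (d1 << 5) | (d2 << 6) | (d3 << 7);
        return cw | __builtin_parity(cw);       // overall parity at bit 0
    }

    // Returns 0 = clean, 1 = corrected single error, 2 = detected double error.
    int secded_decode(uint8_t cw, uint8_t* data_out) {
        int s = 0;                              // Hamming syndrome
        for (int pos = 1; pos <= 7; ++pos)
            if ((cw >> pos) & 1) s ^= pos;
        int overall = __builtin_parity(cw);     // 0 if parity is consistent
        int status = 0;
        if (overall) {                          // odd parity: single error
            if (s) cw ^= (1u << s);             // flip the offending bit
            status = 1;                         // (s == 0 means bit 0 itself)
        } else if (s) {
            return 2;                           // even parity, nonzero syndrome
        }
        *data_out = ((cw >> 3) & 1) | (((cw >> 5) & 1) << 1) |
                    (((cw >> 6) & 1) << 2) | (((cw >> 7) & 1) << 3);
        return status;
    }

    int main() {
        uint8_t cw = secded_encode(0xB);
        cw ^= (1u << 5);                        // inject a single-bit error
        uint8_t data;
        int st = secded_decode(cw, &data);
        printf("status=%d data=0x%X\n", st, data);  // status=1 data=0xB
        return 0;
    }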
In one embodiment, PPU 300 implements a multi-level memory hierarchy. In one embodiment, memory partition unit 380 supports unified memory to provide a single unified virtual address space for CPU and PPU 300 memory, enabling sharing of data between virtual memory systems. In one embodiment, the frequency of access of PPU 300 to memory located on other processors is tracked to ensure that memory pages are moved to the physical memory of PPU 300 that accesses the page more frequently. In one embodiment, NVLink 310 supports an address translation service that allows PPU 300 to directly access the CPU's page tables and provides full access to CPU memory by PPU 300.
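In CUDA, this unified-memory behavior is exposed through managed allocations: a single pointer is valid on both the CPU and the GPU, and pages migrate on demand to whichever processor touches them. A minimal example using the standard CUDA runtime API (not patent-specific):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));  // one pointer, two processors
        for (int i = 0; i < n; ++i) x[i] = 1.0f;   // CPU touch
        scale<<<(n + 255) / 256, 256>>>(x, n);     // GPU touch: pages migrate
        cudaDeviceSynchronize();
        printf("x[0] = %f\n", x[0]);               // CPU touch again: 2.0
        cudaFree(x);
        return 0;
    }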
In one embodiment, the replication engine transfers data between multiple PPUs 300 or between a PPU 300 and a CPU. The replication engine may generate a page fault for an address that is not mapped in the page tables. The memory partition unit 380 may then service the page fault, mapping the address into the page table, after which the replication engine may perform the transfer. In conventional systems, memory is pinned (i.e., made non-pageable) for replication-engine operations between multiple processors, which significantly reduces the available memory. With hardware page faulting, addresses can be passed to the replication engine without concern for whether the memory pages are resident, and the replication process is transparent.
Data from memory 304 or other system memory may be retrieved by memory partition unit 380 and stored in L2 cache 460, L2 cache 460 being located on-chip and shared among the various GPCs 350. As shown, each memory partition unit 380 includes a portion of the L2 cache 460 associated with a corresponding memory device 304. Lower level caches may then be implemented in multiple units within the GPC 350. For example, each SM 440 can implement a level one (L1) cache. The L1 cache is a private memory dedicated to a particular SM 440. Data from L2 caches 460 may be fetched and stored in each L1 cache for processing in the functional units of SM 440. L2 cache 460 is coupled to memory interface 470 and XBar 370.
The ROP unit 450 performs graphics raster operations related to pixel color, such as color compression, pixel blending, and the like. The ROP unit 450 also implements depth testing in conjunction with the raster engine 425, receiving from the culling engine of the raster engine 425 the depth of a sample location associated with a pixel fragment. The depth is tested against a corresponding depth in a depth buffer for the sample location associated with the fragment. If the fragment passes the depth test for the sample location, the ROP unit 450 updates the depth buffer and sends the result of the depth test to the raster engine 425. It will be appreciated that the number of partition units 380 may differ from the number of GPCs 350, and therefore each ROP unit 450 may be coupled to each GPC 350. The ROP unit 450 tracks data packets received from the different GPCs 350 and determines to which GPC 350 a result generated by the ROP unit 450 is routed through the XBar 370. Although the ROP unit 450 is included within the memory partition unit 380 in FIG. 4B, in other embodiments the ROP unit 450 may be outside the memory partition unit 380. For example, the ROP unit 450 may reside in the GPC 350 or another unit.
FIG. 5A illustrates the streaming multiprocessor 440 of FIG. 4A, according to one embodiment. As shown in FIG. 5A, the SM 440 includes an instruction cache 505, one or more scheduler units 510, a register file 520, one or more processing cores 550, one or more Special Function Units (SFUs) 552, one or more load/store units (LSUs) 554, an interconnection network 580, and a shared memory/L1 cache 570.
As described above, the work distribution unit 325 schedules tasks for execution on the GPCs 350 of the PPU 300. A task is assigned to a particular DPC 420 within a GPC 350 and, if the task is associated with a shader program, may be assigned to an SM 440. The scheduler unit 510 receives tasks from the work distribution unit 325 and manages instruction scheduling for one or more thread blocks assigned to the SM 440. The scheduler unit 510 schedules thread blocks for execution as thread bundles (warps) of parallel threads, with each thread block being assigned at least one thread bundle. In one embodiment, each thread bundle executes 32 threads. The scheduler unit 510 may manage a plurality of different thread blocks, assigning thread bundles to the different thread blocks and then dispatching instructions from a plurality of different cooperative groups to the various functional units (i.e., cores 550, SFUs 552, and LSUs 554) during each clock cycle.
A cooperative group is a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads communicate, enabling richer, more efficient parallel decompositions to be expressed. Cooperative launch APIs support synchronization among thread blocks for the execution of parallel algorithms. Conventional programming models provide a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block (e.g., the __syncthreads() function). However, programmers often want to define thread groups at a granularity smaller than the thread block and to synchronize within the defined groups, enabling higher performance, design flexibility, and software reuse in the form of collective group-wide function interfaces.
Cooperative groups enable a programmer to explicitly define thread groups at sub-block (e.g., as small as a single thread) and multi-block granularity and to perform collective operations, such as synchronization, on the threads in a group. The programming model supports clean composition across software boundaries, so that libraries and utility functions can synchronize safely within their local context without having to make assumptions about convergence. Cooperative group primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire grid of thread blocks.
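A short sketch using CUDA's cooperative_groups header makes the sub-block granularity concrete: the thread block is partitioned into 32-thread tiles, each tile reduces its values with tile-scoped shuffles, and the block then combines the tile results. The reduction is an illustrative choice; the group operations are the standard cooperative-groups API:

    #include <cooperative_groups.h>
    #include <cstdio>
    namespace cg = cooperative_groups;

    __global__ void cgSum(const float* in, float* out, int n) {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

        float v = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            v += in[i];

        for (int off = 16; off > 0; off /= 2)   // reduce within the 32-thread tile
            v += tile.shfl_down(v, off);

        __shared__ float partial[32];
        if (tile.thread_rank() == 0)
            partial[tile.meta_group_rank()] = v;
        block.sync();                           // group-wide barrier

        if (block.thread_rank() == 0) {
            float s = 0.0f;
            for (unsigned i = 0; i < tile.meta_group_size(); ++i) s += partial[i];
            atomicAdd(out, s);
        }
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;
        *out = 0.0f;
        cgSum<<<64, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("sum = %f (expect %d)\n", *out, n);  // 1048576
        cudaFree(in); cudaFree(out);
        return 0;
    }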
The dispatch unit 515 is configured to transmit instructions to one or more functional units. In one embodiment, the scheduler unit 510 includes two dispatch units 515 that enable two different instructions from the same thread bundle to be dispatched during each clock cycle. In alternative embodiments, each scheduler unit 510 may include a single dispatch unit 515 or additional dispatch units 515.
Each SM 440 includes a register file 520 that provides a set of registers for the functional units of the SM 440. In one embodiment, register file 520 is divided between each functional unit such that each functional unit is assigned a dedicated portion of register file 520. In another embodiment, register file 520 is divided between different thread bundles executed by SM 440. Register file 520 provides temporary storage for operands connected to the functional unit's data path.
Each SM 440 includes L processing cores 550. In one embodiment, SM 440 includes a large number (e.g., 128, etc.) of different processing cores 550. Each core 550 may include fully pipelined, single-precision, double-precision, and/or mixed-precision processing units, including floating-point arithmetic logic units and integer arithmetic logic units. In one embodiment, the floating point arithmetic logic unit implements the IEEE 754-2008 standard for floating point arithmetic. In one embodiment, cores 550 include 64 single precision (32 bit) floating point cores, 64 integer cores, 32 double precision (64 bit) floating point cores, and 8 tensor cores (tensor cores).
The tensor cores are configured to perform matrix operations, and in one embodiment, one or more tensor cores are included in the core 550. In particular, the tensor core is configured to perform deep learning matrix operations, such as convolution operations for neural network training and reasoning. In one embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiply and accumulate operation d=a×b+c, where A, B, C and D are 4×4 matrices.
In one embodiment, the matrix multiplication inputs A and B are 16-bit floating point matrices, while the accumulation matrices C and D may be 16-bit or 32-bit floating point matrices. Tensor cores operate on 16-bit floating point input data with 32-bit floating point accumulation. The 16-bit floating point multiplication requires 64 operations and produces a full-precision product that is then accumulated, using 32-bit floating point addition, with the other intermediate products of a 4×4 matrix multiplication. In practice, tensor cores are used to perform much larger two-dimensional or higher-dimensional matrix operations built up from these smaller elements. An API (such as the CUDA 9 C++ API) exposes specialized matrix load, matrix multiply-accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the thread-bundle-level interface assumes 16×16 size matrices spanning all 32 threads of the thread bundle.
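The thread-bundle-level matrix interface mentioned above is exposed in CUDA through the nvcuda::wmma namespace. In the sketch below, a single 32-thread bundle computes one 16x16x16 D = A*B + C tile with fp16 inputs and fp32 accumulation (requires a tensor-core-capable GPU, compiled with, e.g., nvcc -arch=sm_70; the kernel and launch shape are illustrative, and A/B are left uninitialized for brevity):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One thread bundle (warp) computes a 16x16 output tile: acc = a*b + acc.
    __global__ void wmma16(const half* A, const half* B, const float* C, float* D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::load_matrix_sync(a, A, 16);                 // leading dimension 16
        wmma::load_matrix_sync(b, B, 16);
        wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);
        wmma::mma_sync(acc, a, b, acc);                   // tensor core MMA
        wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
    }

    int main() {
        half *A, *B; float *C, *D;
        cudaMalloc(&A, 16 * 16 * sizeof(half));
        cudaMalloc(&B, 16 * 16 * sizeof(half));
        cudaMalloc(&C, 16 * 16 * sizeof(float));
        cudaMalloc(&D, 16 * 16 * sizeof(float));
        cudaMemset(C, 0, 16 * 16 * sizeof(float));
        wmma16<<<1, 32>>>(A, B, C, D);                    // one bundle owns the tile
        cudaDeviceSynchronize();
        return 0;
    }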
Each SM 440 also includes M SFUs 552 that perform special functions (e.g., attribute evaluation, reciprocal square root, and the like). In one embodiment, the SFUs 552 may include a tree traversal unit configured to traverse a hierarchical tree data structure. In one embodiment, the SFUs 552 may include a texture unit configured to perform texture map filtering operations. In one embodiment, the texture unit is configured to load a texture map (e.g., a 2D array of texels) from memory 304 and sample the texture map to produce sampled texture values for use in shader programs executed by the SM 440. In one embodiment, the texture map is stored in the shared memory/L1 cache 570. The texture unit implements texture operations, such as filtering operations using mip-maps (i.e., texture maps of varying levels of detail). In one embodiment, each SM 440 includes two texture units.
Each SM 440 also includes N LSUs 554 that implement load and store operations between shared memory/L1 cache 570 and register file 520. Each SM 440 includes an interconnection network 580 connecting each functional unit to the register file 520 and LSU 554 to the register file 520, shared memory/L1 cache 570. In one embodiment, the interconnection network 580 is a crossbar that may be configured to connect any functional unit to any register in the register file 520, and to connect the LSU 554 to a register file and a memory location in the shared memory/L1 cache 570.
The shared memory/L1 cache 570 is an on-chip memory array that allows data storage and communication between the SM 440 and the primitive engine 435, as well as between threads in the SM 440. In one embodiment, the shared memory/L1 cache 570 comprises 128 KB of storage capacity and is in the path from the SM 440 to the partition unit 380. The shared memory/L1 cache 570 may be used to cache reads and writes. One or more of the shared memory/L1 cache 570, L2 cache 460, and memory 304 are backing stores.
Combining data caching and shared memory functionality into a single memory block provides the best overall performance for both types of memory access. The capacity is usable as a cache by programs that do not use shared memory. For example, if the shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 570 enables the shared memory/L1 cache 570 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth, low-latency access to frequently reused data.
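A classic use of the shared memory described above is to stage a tile of data on-chip so that values reused by neighboring threads are read once from DRAM and then served at low latency. A minimal 1D stencil sketch (standard CUDA; the sizes are arbitrary):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define RADIUS 3
    #define BLOCK  256

    __global__ void stencil(const float* in, float* out, int n) {
        __shared__ float tile[BLOCK + 2 * RADIUS];
        int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        int l = threadIdx.x + RADIUS;                   // local index with halo

        if (g < n) tile[l] = in[g];
        if (threadIdx.x < RADIUS) {                     // load the halo cells
            tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
            tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK] : 0.0f;
        }
        __syncthreads();                                // tile fully populated

        if (g < n) {
            float s = 0.0f;                             // 7-point sum; all reads
            for (int k = -RADIUS; k <= RADIUS; ++k)     // hit on-chip memory
                s += tile[l + k];
            out[g] = s;
        }
    }

    int main() {
        const int n = 4 * BLOCK;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;
        stencil<<<n / BLOCK, BLOCK>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[512] = %f\n", out[512]);            // 7.0 for RADIUS 3
        cudaFree(in); cudaFree(out);
        return 0;
    }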
When configured for general-purpose parallel computing, a simpler configuration can be used compared to graphics processing. Specifically, the fixed-function graphics processing units shown in FIG. 3 are bypassed, creating a much simpler programming model. In the general-purpose parallel computing configuration, the work distribution unit 325 assigns and distributes thread blocks directly to the DPCs 420. The threads in a block execute the same program, using a unique thread ID in the computation to ensure that each thread generates unique results, using the SM 440 to execute the program and perform computations, using the shared memory/L1 cache 570 to communicate between threads, and using the LSU 554 to read and write global memory through the shared memory/L1 cache 570 and memory partition unit 380. When configured for general-purpose parallel computing, the SM 440 can also write commands that the scheduler unit 320 can use to launch new work on the DPCs 420.
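The "same program, unique thread ID, unique result" model just described is the canonical CUDA pattern; a minimal vector-add sketch (illustrative only):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
        if (i < n) c[i] = a[i] + b[i];                  // one unique result
    }

    int main() {
        const int n = 1 << 16;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = float(i); b[i] = 2.0f * i; }
        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();
        printf("c[100] = %f\n", c[100]);                // 300.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }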
The PPU 300 may be included in a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., a wireless handheld device), a Personal Digital Assistant (PDA), a digital camera, a vehicle, a head-mounted display, a handheld electronic device, and the like. In one embodiment, the PPU 300 is contained on a single semiconductor substrate. In another embodiment, the PPU 300 is included on a system-on-a-chip (SoC) along with one or more other devices, such as additional PPUs 300, memory 304, a Reduced Instruction Set Computer (RISC) CPU, a Memory Management Unit (MMU), a digital-to-analog converter (DAC), and the like.
In one embodiment, PPU 300 may be included on a graphics card that includes one or more memory devices 304. The graphics card may be configured to interface with a PCIe slot on a motherboard of a desktop computer. In yet another embodiment, PPU 300 may be an Integrated Graphics Processing Unit (iGPU) or parallel processor contained within a chipset of a motherboard.
Example 3
Systems with multiple GPUs and CPUs are used in a variety of industries because developers expose and exploit more parallelism in applications such as artificial intelligence computing. High performance GPU acceleration systems with tens to thousands of compute nodes are deployed in data centers, research institutions, and supercomputers to solve even larger problems. As the number of processing devices in high performance systems increases, communication and data transmission mechanisms need to expand to support this increased bandwidth.
FIG. 5B is a conceptual diagram of a processing system 500 implemented using the PPU 300 of FIG. 3, according to one embodiment. The exemplary system 500 may be configured to implement the method 200 shown in FIG. 2A. The processing system 500 includes a CPU 530, a switch 510, and each of a plurality of PPUs 300 with a corresponding memory 304. NVLink 310 provides a high-speed communication link between each of the PPUs 300. Although a particular number of NVLink 310 and interconnect 302 connections are illustrated in FIG. 5B, the number of connections to each PPU 300 and the CPU 530 may vary. The switch 510 interfaces between the interconnect 302 and the CPU 530. The PPUs 300, memories 304, and NVLink 310 may be situated on a single semiconductor platform to form a parallel processing module 525. In one embodiment, the switch 510 supports two or more protocols to interface between various different connections and/or links.
In another embodiment (not shown), NVLink 310 provides one or more high-speed communication links between each PPU 300 and the CPU 530, and the switch 510 interfaces between the interconnect 302 and each PPU 300. The PPUs 300, memories 304, and interconnect 302 may be situated on a single semiconductor platform to form a parallel processing module 525. In yet another embodiment (not shown), the interconnect 302 provides one or more communication links between each PPU 300 and the CPU 530, and the switch 510 interfaces between each of the PPUs 300 using NVLink 310 to provide one or more high-speed communication links between the PPUs 300. In another embodiment (not shown), NVLink 310 provides one or more high-speed communication links between the PPUs 300 and the CPU 530 through the switch 510. In yet another embodiment (not shown), the interconnect 302 provides one or more communication links directly between each of the PPUs 300. One or more of the NVLink 310 high-speed communication links may be implemented as physical NVLink interconnects or as on-chip or on-die interconnects using the same protocol as NVLink 310.
In the context of this specification, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity that simulate on-chip operation and make substantial improvements over utilizing a conventional bus implementation. Of course, the various circuits or devices may also be situated separately or in various combinations of semiconductor platforms, per the desires of the user. Alternatively, the parallel processing module 525 may be implemented as a circuit board substrate, and each of the PPUs 300 and/or memories 304 may be packaged devices. In one embodiment, the CPU 530, the switch 510, and the parallel processing module 525 are situated on a single semiconductor platform.
In one embodiment, the signaling rate of each NVLink 310 is 20 to 25 gigabits/second, and each PPU 300 includes six NVLink 310 interfaces (as shown in FIG. 5B, five NVLink 310 interfaces are included for each PPU 300). Each NVLink 310 provides a data transfer rate of 25 gigabytes/second in each direction, with six links providing 300 gigabytes/second. When CPU 530 also includes one or more NVLink 310 interfaces, NVLink 310 may be dedicated to PPU-to-PPU communication as shown in FIG. 5B, or to some combination of PPU-to-PPU and PPU-to-CPU communication.
In one embodiment, the NVLink 310 allows direct load/store/atomic access from the CPU 530 to each PPU 300's memory 304. In one embodiment, the NVLink 310 supports coherency operations, allowing data read from the memories 304 to be stored in the cache hierarchy of the CPU 530, reducing cache access latency for the CPU 530. In one embodiment, the NVLink 310 includes support for Address Translation Services (ATS), allowing the PPU 300 to directly access page tables within the CPU 530. One or more of the NVLinks 310 may also be configured to operate in a low-power mode.
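The coherence and Address Translation Services described here are provided by the hardware and driver rather than through a dedicated API, so they cannot be demonstrated directly in application code. A rough software-level analogue, offered only as an assumption-laden sketch, is CUDA managed (unified) memory, where CPU and GPU dereference the same pointer and the system resolves the accesses:

#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: each thread bumps one element.
__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int* data = nullptr;
    // Managed memory: one pointer valid on both CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;  // CPU writes directly
    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                  // finish before the CPU reads
    printf("data[42] = %d\n", data[42]);      // expect 43
    cudaFree(data);
    return 0;
}

On systems where the CPU and GPU share coherent links of the kind described above, such accesses can be serviced at memory-system speed rather than through explicit copies.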
Fig. 5C illustrates an exemplary system 565 in which the various architectures and/or functions of the various previous embodiments may be implemented. The exemplary system 565 may be configured to implement the method 200 shown in fig. 2A.
As shown, a system 565 is provided that includes at least one central processing unit 530 connected to a communication bus 575. The communication bus 575 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The system 565 also includes a main memory 540. Control logic (software) and data are stored in the main memory 540, which may take the form of Random Access Memory (RAM).
The system 565 also includes an input device 560, a parallel processing system 525, and a display device 545, such as a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode), plasma display, or the like. User input may be received from an input device 560 (e.g., keyboard, mouse, touchpad, microphone, etc.). Each of the foregoing modules and/or devices may even be located on a single semiconductor platform to form the system 565. Alternatively, the individual modules may also be placed separately or in various combinations of semiconductor platforms, depending on the needs of the user.
Further, the system 565 may be coupled for communication purposes to a network (e.g., a telecommunications network, a Local Area Network (LAN), a wireless network, a Wide Area Network (WAN) (such as the internet), a peer-to-peer network, a cable network, etc.) through a network interface 535.
The system 565 may also include secondary storage (not shown). The secondary storage includes, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, a Digital Versatile Disk (DVD) drive, a recording device, or Universal Serial Bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
Computer programs or computer control logic algorithms may be stored in main memory 540 and/or secondary storage. Such computer programs, when executed, enable the system 565 to perform various functions. Memory 540, storage, and/or any other storage are possible examples of computer-readable media.
The architecture and/or functionality of the various preceding figures may be implemented in the context of a general purpose computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and/or any other desired system. For example, the system 565 may take the form of a desktop computer, laptop computer, tablet computer, server, supercomputer, smart phone (e.g., a wireless, handheld device), Personal Digital Assistant (PDA), digital camera, vehicle, head-mounted display, handheld electronic device, mobile telephone device, television, workstation, game console, embedded system, and/or any other type of logic.
Although illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above-described embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Various changes and modifications may be made therein by one of ordinary skill in the art without departing from the scope and spirit of the invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that, in order to streamline the disclosure and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention, and the scope of the present invention is not limited thereto; variations or substitutions that any person skilled in the art could readily conceive within the technical scope disclosed herein are intended to be covered. The protection scope of the present invention is defined by the appended claims.

Claims (10)

1. A bridge, comprising:
a first circuit board connected to a first function card through a first interface;
a second circuit board connected to a second function card through a second interface;
a circuit communication mechanism arranged between the first circuit board and the second circuit board and configured to communicatively connect the first circuit board and the second circuit board; and
a distance adjusting mechanism arranged between the first circuit board and the second circuit board and configured to adjust the distance between the first circuit board and the second circuit board.
2. The bridge of claim 1, wherein the first circuit board has at least two connection terminals disposed thereon, and the second circuit board has at least two connection terminals disposed thereon, one of the connection terminals on the first circuit board being connected to one of the connection terminals on the second circuit board by the circuit communication mechanism.
3. The bridge of claim 2, wherein the circuit communication mechanism includes a third circuit board having connection terminals provided at both ends thereof, one of the connection terminals of the third circuit board being connected to one of the connection terminals on the first circuit board, and the other connection terminal of the third circuit board being connected to one of the connection terminals on the second circuit board.
4. The bridge of claim 3, wherein a distance between two connection terminals of the third circuit board is one or more slot pitches.
5. The bridge of claim 1, wherein the distance adjusting mechanism comprises a first guide post and a second guide post arranged in parallel, one end of the first guide post being fixed relative to the first circuit board, and one end of the second guide post being fixed relative to the second circuit board.
6. The bridge of claim 5, wherein the distance adjusting mechanism further comprises a base in which a detent structure is provided, the first guide post and the second guide post penetrating the base, and the detent structure preventing the first guide post and the second guide post from sliding freely in the base.
7. The bridge of claim 6, wherein the detent structure comprises a cylinder, a side surface of the cylinder contacting surfaces of the first guide post and the second guide post, and the contact surfaces between the cylinder and the first and second guide posts being provided with anti-slip patterns.
8. The bridge of claim 1, wherein each of the first function card and the second function card comprises a graphics card, a sound card, or a network card, and is mounted in a card slot of a host.
9. A processing unit, characterized in that it comprises a bridge according to any of claims 1-8.
10. A computing system comprising at least one processor and a memory coupled to the at least one processor, wherein the computing system comprises the bridge of any of claims 1-8.