CN116719667A - Method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and hardware structure thereof

Method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and hardware structure thereof

Info

Publication number
CN116719667A
Authority
CN
China
Prior art keywords
mac
gpu
error correction
ecc
time consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310457019.8A
Other languages
Chinese (zh)
Inventor
赵晨
高武
王敬斯
邰宇浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-04-25
Filing date
2023-04-25
Publication date
2023-09-08
Application filed by Northwestern Polytechnical University
Priority to CN202310457019.8A
Publication of CN116719667A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0618 Block ciphers, i.e. encrypting groups of characters of a plain text message using fixed encryption transformation
    • H04L9/0631 Substitution permutation network [SPN], i.e. cipher composed of a number of stages or rounds each involving linear and nonlinear transformations, e.g. AES algorithms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0643 Hash functions, e.g. MD5, SHA, HMAC or f9 MAC
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/065 Encryption by serially and continuously modifying data stream elements, e.g. stream cipher systems, RC4, SEAL or A5/3
    • H04L9/0656 Pseudorandom key sequence combined element-for-element with data sequence, e.g. one-time-pad [OTP] or Vernam's cipher
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and a hardware structure thereof. When the memory encryption engine (MEE) in a memory controller (MC) detects an error in read data that requires correction, the encryption operations in all memory controllers are suspended, and the MAC computation load scheduler (MCLS) in the GPU distributes the MAC computations required for error correction to the MAC engines in the other memory controllers. Because the GPU contains multiple independent, concurrent DRAM channels and corresponding memory controllers, the MAC engines in these memory controllers perform the MAC computations required for error correction concurrently, which reduces the time the GPU takes to implement MAC-based ECC error correction. The proposed method is practical.

Description

Method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and hardware structure thereof
Technical Field
The invention belongs to the technical field of computer architecture and integrated circuit design, and in particular relates to a hardware structure and a method for reducing the time consumption of a GPU (Graphics Processing Unit) in implementing ECC (Error Correcting Code) error correction based on MAC (Message Authentication Code).
Background
With the widespread application of GPUs in fields such as computer vision, natural language processing, and high-performance computing, their security has drawn broad attention, and there is an urgent practical need to build a GPU trusted execution environment (TEE: Trusted Execution Environment). At present, no commercial GPU product provides a TEE, and some research work attempts to construct a GPU TEE in software or through software-hardware co-design. The memory encryption engine (MEE: Memory Encryption Engine) protects the data held in the processor's off-chip main memory, DRAM (Dynamic Random Access Memory), and is an important component of TEE technology. The architecture of a GPU with integrated MEEs is shown in Fig. 1: the AES (Advanced Encryption Standard) engine encrypts data with the AES algorithm to guarantee the confidentiality of the data, and the MAC engine generates the MAC of the data with a hash algorithm to guarantee the integrity of the data. AES encryption typically uses counter mode, as shown in Fig. 2: each data block in DRAM has a corresponding counter, and the counter values serve as the leaf nodes of a BMT (Bonsai Merkle Tree), which guarantees the freshness of the data. Memory encryption generates a large amount of security metadata (counter data, BMT data, and MAC data) that is stored in off-chip DRAM, and accessing it consumes a large share of memory bandwidth. Although an MEE typically integrates counter, MAC, and BMT caches to cache security metadata, the capacity of these caches is limited by factors such as chip area and power consumption, so the heavy security metadata traffic still severely degrades GPU system performance.
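The counter-mode scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keys, the function names, and the 8-byte MAC length are hypothetical, SHA-256 stands in for the AES engine that generates the one-time pad, HMAC-SHA256 stands in for the hardware MAC engine, and the BMT that protects the counters is not modeled.

```python
import hashlib
import hmac

ENC_KEY = b"hypothetical-encryption-key"  # assumption: illustrative key only
MAC_KEY = b"hypothetical-mac-key"         # assumption: illustrative key only


def _addr_ctr(address: int, counter: int) -> bytes:
    """Serialize the block address and its counter."""
    return address.to_bytes(8, "little") + counter.to_bytes(8, "little")


def one_time_pad(address: int, counter: int, length: int) -> bytes:
    """Pad derived from (address, counter); SHA-256 stands in for AES here."""
    pad = b""
    block = 0
    while len(pad) < length:
        seed = ENC_KEY + _addr_ctr(address, counter) + block.to_bytes(4, "little")
        pad += hashlib.sha256(seed).digest()
        block += 1
    return pad[:length]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def compute_mac(ciphertext: bytes, address: int, counter: int) -> bytes:
    """MAC over ciphertext, address, and counter, truncated to 8 bytes."""
    msg = ciphertext + _addr_ctr(address, counter)
    return hmac.new(MAC_KEY, msg, hashlib.sha256).digest()[:8]


def write_block(plaintext: bytes, address: int, counter: int):
    """Encrypt with a fresh counter; return (ciphertext, mac, new_counter)."""
    counter += 1  # bumping the counter on every write keeps pads unique
    ciphertext = xor(plaintext, one_time_pad(address, counter, len(plaintext)))
    return ciphertext, compute_mac(ciphertext, address, counter), counter


def read_block(ciphertext: bytes, tag: bytes, address: int, counter: int) -> bytes:
    """Verify integrity first, then decrypt with the same pad."""
    if not hmac.compare_digest(tag, compute_mac(ciphertext, address, counter)):
        raise ValueError("MAC mismatch: data corrupted or tampered with")
    return xor(ciphertext, one_time_pad(address, counter, len(ciphertext)))
```

A MAC mismatch on read is the event that, in the scheme discussed in this document, triggers error correction rather than an immediate failure.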
Related research (see "Analyzing Secure Memory Architecture for GPUs", "Plutus: Bandwidth-Efficient Memory Security for GPUs", etc.) shows that, among the three types of security metadata, MAC data is the largest in volume and consumes the most memory bandwidth. Some research on CPU systems has stored MAC data in ECC DRAM (see "SYNERGY: Rethinking Secure-Memory Design for Error-Correcting Memories", etc.), detecting and correcting errors based on the MAC data and exploiting the concurrency between the ECC DRAM channel and the regular-data DRAM channel so that MAC data accesses do not occupy regular-data DRAM bandwidth, thereby reducing the impact of memory encryption on processor system performance. SECDED (Single-Error Correcting, Double-Error Detecting) is the most commonly used ECC mechanism; it generates 8 bits of ECC data for every 8 B of regular data and provides single-bit error correction and double-bit error detection. When ECC error correction is implemented based on MAC, the MAC data can be stored in the ECC memory, and regular data and security metadata are distributed as shown in Fig. 3(a). A MAC can detect any number of bit-flip errors and can correct them by exhaustive search; Fig. 3(b) compares this capability with the error correction/detection capability of SECDED ECC. Let N be the number of 8 B fields and n the number of flipped bits. The maximum number M of MAC computations required for exhaustive error correction is $M = \binom{64N}{n}$; as n goes from 1 to 3, M grows from 256 to 2763520 (these values correspond to N = 4). If MAC-based error correction is only required to match the error correction capability of SECDED ECC, i.e. to correct a 1-bit flip within an 8 B field, then $M = \binom{N}{n} \cdot 64^{n}$; for n = 3, M = 1048576. When N or n is large, MAC-based error correction requires a very large number of MAC computations, which consumes a large amount of time.
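The counts quoted above can be checked with a short Python sketch. The two closed forms used here, C(64N, n) when the flipped bits may fall anywhere in the block and C(N, n)·64^n when each affected 8 B field holds exactly one flipped bit, are an assumption reconstructed from the quoted numbers, which match N = 4 (a 32 B block). The function names are illustrative only.

```python
from math import comb

BITS_PER_FIELD = 64  # one 8 B field holds 64 bits


def max_macs_any_flips(num_fields: int, flipped_bits: int) -> int:
    """Candidates when the flipped bits may fall anywhere in the block."""
    return comb(BITS_PER_FIELD * num_fields, flipped_bits)


def max_macs_secded_like(num_fields: int, flipped_bits: int) -> int:
    """Candidates when each affected 8 B field has exactly one flipped bit."""
    return comb(num_fields, flipped_bits) * BITS_PER_FIELD ** flipped_bits


if __name__ == "__main__":
    # Assumed N = 4, the value consistent with the figures in the text.
    for n in (1, 2, 3):
        print(n, max_macs_any_flips(4, n), max_macs_secded_like(4, n))
```

Restricting the search to one flip per 8 B field shrinks the n = 3 search space from 2763520 to 1048576 candidates, which is why the SECDED-equivalent capability is the cheaper target.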
According to the publicly available literature, no prior art currently addresses how to reduce the time a GPU takes to implement MAC-based ECC error correction.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and a hardware structure thereof. When the MEE in a memory controller (MC) detects an error in read data that requires correction, the encryption operations in all memory controllers are suspended, and the MAC computation load scheduler (MCLS) in the GPU distributes the MAC computations required for error correction to the MAC engines in the other memory controllers. Because the GPU contains multiple independent, concurrent DRAM channels and corresponding memory controllers, the MAC engines in these memory controllers perform the MAC computations required for error correction concurrently, which reduces the time the GPU takes to implement MAC-based ECC error correction. The proposed method is practical.
The technical solution adopted by the invention to solve the technical problem comprises the following steps:
Step 1: in the GPU, the MAC data generated by memory encryption is stored in ECC DRAM; because the GPU can access MAC data and regular data concurrently, the MEE does not need to integrate a MAC cache to cache the MAC data;
Step 2: when the MEE in a memory controller (MC) detects an error in read data that requires correction, the encryption operations in all memory controllers are suspended;
Step 3: the MAC computation load scheduler (MCLS) in the GPU distributes the MAC computations required for error correction to the MAC engines in the other memory controllers;
Step 4: because the GPU contains multiple independent, concurrent DRAM channels and corresponding memory controllers, the MAC engines in these memory controllers perform the MAC computations required for error correction concurrently, which reduces the time the GPU takes to implement MAC-based ECC error correction.
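Steps 2 and 3 can be modeled roughly as follows. This is a toy sketch, not the patented scheduler: the class and field names are hypothetical, the candidate corrections are represented only by their index ranges, and the load is assumed to be split evenly across the MAC engines of all memory controllers, including the one that detected the error.

```python
from dataclasses import dataclass, field


def split_candidates(total: int, num_engines: int) -> list:
    """Split candidate indices 0..total-1 into near-equal contiguous ranges."""
    base, extra = divmod(total, num_engines)
    ranges, start = [], 0
    for i in range(num_engines):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


@dataclass
class MemoryController:
    mc_id: int
    num_mac_engines: int = 1
    encryption_paused: bool = False
    assigned: list = field(default_factory=list)  # one candidate range per engine


class MCLS:
    """Toy model of the MAC computation load scheduler."""

    def __init__(self, controllers):
        self.controllers = controllers

    def on_error_detected(self, total_candidates: int) -> None:
        # Step 2: suspend encryption in every memory controller.
        for mc in self.controllers:
            mc.encryption_paused = True
        # Step 3: give each MAC engine a near-equal share of the candidates.
        engines = [mc for mc in self.controllers for _ in range(mc.num_mac_engines)]
        for mc, chunk in zip(engines, split_candidates(total_candidates, len(engines))):
            mc.assigned.append(chunk)
```

With four controllers of two engines each and 1024 candidates, every engine receives 128 candidates, so the worst-case search length per engine drops by the number of engines.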
A hardware structure for implementing the method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction comprises N DRAMs, N MPs (memory partitions), one MCLS, and L SMs (streaming multiprocessors);
each MP comprises an MC and an L2 cache; the N DRAMs are connected to the N MPs in one-to-one correspondence, each DRAM being connected to the MC of its MP; the MC of each MP is connected to the MCLS;
the L2 cache of each MP is connected to an interconnection network bus;
the L SMs are all connected to the interconnection network bus.
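The connectivity described above can be written down as a small data model. The sketch below is an assumption-laden illustration: the class names, the string labels, and the "NoC" label for the interconnection network bus are all hypothetical, and only the links named in the text are represented.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryPartition:
    index: int
    dram: str  # the DRAM attached to this partition's MC
    mc: str    # memory controller (hosts the MEE and its MAC engines)
    l2: str    # L2 cache of this partition


@dataclass
class GpuTopology:
    partitions: list
    sms: list
    mcls_links: list = field(default_factory=list)          # MC <-> MCLS
    interconnect_links: list = field(default_factory=list)  # L2 / SM <-> network


def build_topology(n_partitions: int, n_sms: int) -> GpuTopology:
    """Build N partitions (one DRAM each) and L SMs, wired as in the text."""
    parts = [
        MemoryPartition(i, f"DRAM{i}", f"MC{i}", f"L2_{i}")
        for i in range(n_partitions)
    ]
    sms = [f"SM{j}" for j in range(n_sms)]
    topo = GpuTopology(parts, sms)
    # Every MC has a link to the single MCLS.
    topo.mcls_links = [(p.mc, "MCLS") for p in parts]
    # Every L2 cache and every SM attaches to the interconnection network.
    topo.interconnect_links = [(p.l2, "NoC") for p in parts] + [(s, "NoC") for s in sms]
    return topo
```

The MCLS links form a side channel between memory controllers that is separate from the interconnection network, which is what lets one MC hand MAC work to the others.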
The beneficial effects of the invention are as follows:
In the prior art, if MAC-based ECC is to have the same error correction capability as SECDED ECC, then with N 8 B fields and n flipped bits, the maximum number M of MAC computations required for exhaustive error correction is $M = \binom{N}{n} \cdot 64^{n}$. Because detecting a data error through the MAC check does not reveal how many bits have flipped, the maximum number of MAC computations needed to complete error correction must be revised to $M = \sum_{n=1}^{P} \binom{N}{n} \cdot 64^{n}$, where P is the number of error correction bits supported. Taking the 128 B cache line of the GPU L2 cache as an example, N = 16; for P = 2, 3, and 4, if the MEE integrates one pipelined MAC engine operating at 1 GHz, the worst-case time T to implement MAC-based ECC error correction is approximately 0.49 ms, 147.29 ms, and 30681.83 ms, respectively. When the MEE in one MC detects an error and performs correction, the MCLS distributes the MAC computation load to the MAC engines in the other MCs. Assuming the number of MCs is N and the number of MAC engines in each MC is K, M and T can each be reduced to 1/(K×N) of their original values, which greatly reduces the time consumed in implementing MAC-based ECC error correction.
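The timing figures above can be reproduced with a short Python sketch. It assumes the worst-case count is the sum of C(N, n)·64^n for n = 1..P (a form reconstructed from the quoted times) and that a pipelined MAC engine completes one MAC per clock cycle. The function names and the example configuration of 8 MCs with 2 engines each are hypothetical.

```python
from math import comb


def max_mac_count(num_fields: int, max_flips: int) -> int:
    """Sum of C(N, n) * 64**n for n = 1..P (one flipped bit per affected field)."""
    return sum(comb(num_fields, n) * 64 ** n for n in range(1, max_flips + 1))


def worst_case_ms(num_fields, max_flips, freq_hz=1e9, num_mcs=1, engines_per_mc=1):
    """Worst-case correction time in ms, assuming one MAC per cycle per engine."""
    macs = max_mac_count(num_fields, max_flips)
    return macs / (freq_hz * num_mcs * engines_per_mc) * 1e3


if __name__ == "__main__":
    for p in (2, 3, 4):
        single = worst_case_ms(16, p)
        spread = worst_case_ms(16, p, num_mcs=8, engines_per_mc=2)
        print(f"P={p}: one engine {single:.2f} ms, 8 MCs x 2 engines {spread:.2f} ms")
```

Even with the load spread over 16 engines, P = 4 still costs on the order of seconds in the worst case, which is why keeping P small matters.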
It should be noted that when the number P of supported error correction bits is large, the time required for error correction remains very large even when the MAC engines in multiple MCs are used, and implementing MAC-based ECC on the GPU is then not practical. Fortunately, the probability that bit flips occur in multiple 8 B fields simultaneously is very low, as related research has also shown, so P can be set to a small value, which makes the proposed method practical.
Drawings
Fig. 1 is a schematic diagram of a GPU system architecture with integrated MEEs.
Fig. 2 is a schematic diagram of counter-mode encryption.
Fig. 3 is a schematic diagram of the data storage distribution and the error correction/detection capability of MAC-based ECC.
Fig. 4 is a schematic diagram of the GPU system architecture of the invention, with scheduling and distribution of the MAC computation load among MCs.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention mainly relates to a GPU memory encryption engine architecture that implements ECC based on MAC. It uses the multiple memory encryption engines in the GPU to perform MAC computations concurrently, reducing the time consumed in implementing MAC-based ECC error correction.
Storing MAC data in ECC DRAM allows MAC data and regular data to be accessed concurrently, which reduces the negative impact of memory encryption on GPU system performance. ECC error correction and detection can be implemented with MAC-based exhaustive search, but this requires considerable error correction time, especially when multiple bits flip. The invention provides a method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and a hardware architecture thereof, improving GPU system performance.
The corresponding GPU system architecture is shown in Fig. 4. The MAC data generated by memory encryption is stored in ECC DRAM, and because MAC data and regular data can be accessed concurrently, the MEE does not need to integrate a MAC cache to cache the MAC data. When the MEE in a memory controller (MC) detects an error in read data that requires correction, the encryption operations in all memory controllers are suspended, and the MAC computation load scheduler (MCLS) distributes the MAC computations required for error correction to the MAC engines in the other memory controllers. Because the GPU contains multiple independent, concurrent DRAM channels and corresponding memory controllers, the MAC engines in these memory controllers perform the MAC computations required for error correction concurrently, which significantly reduces the time the GPU takes to implement MAC-based ECC error correction.
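The exhaustive correction that the MAC engines share can be sketched for the simplest case of a single flipped bit. This is a minimal software model under stated assumptions: HMAC-SHA256 truncated to 8 bytes stands in for the hardware MAC, the key and function names are hypothetical, the MAC covers only the data (a real MEE would also bind address and counter), and a thread pool merely models the concurrent MAC engines.

```python
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

KEY = b"hypothetical-mac-key"  # assumption: illustrative key only


def mac(data: bytes) -> bytes:
    # HMAC-SHA256 truncated to 8 bytes stands in for the hardware MAC engine.
    return hmac.new(KEY, data, hashlib.sha256).digest()[:8]


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with the given bit index inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def search_range(data: bytes, stored_mac: bytes, bits: range):
    """One 'MAC engine': try flipping each bit in its share of the block."""
    for bit in bits:
        candidate = flip_bit(data, bit)
        if hmac.compare_digest(mac(candidate), stored_mac):
            return candidate
    return None


def correct_single_bit(data: bytes, stored_mac: bytes, num_engines: int = 4):
    """Split the single-bit candidates across engines and search concurrently."""
    total_bits = len(data) * 8
    step = -(-total_bits // num_engines)  # ceiling division
    shares = [range(s, min(s + step, total_bits)) for s in range(0, total_bits, step)]
    with ThreadPoolExecutor(max_workers=num_engines) as pool:
        for result in pool.map(lambda r: search_range(data, stored_mac, r), shares):
            if result is not None:
                return result
    return None  # no single-bit flip reproduces the stored MAC
```

Each share is independent of the others, so the search divides cleanly among engines; supporting more flipped bits only changes how the candidate set is enumerated, not how it is split.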
A GPU hardware structure for reducing the time consumption of a GPU in implementing MAC-based ECC error correction comprises N DRAMs, N MPs, one MCLS, and L SMs;
each MP comprises an MC and an L2 cache; the N DRAMs are connected to the N MPs in one-to-one correspondence, each DRAM being connected to the MC of its MP; the MC of each MP is connected to the MCLS;
the L2 cache of each MP is connected to an interconnection network bus;
the L SMs are all connected to the interconnection network bus.

Claims (2)

1. A method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, characterized by comprising the following steps:
step 1: in the GPU, the MAC data generated by memory encryption is stored in ECC DRAM; because the GPU can access MAC data and regular data concurrently, the MEE does not need to integrate a MAC cache to cache the MAC data;
step 2: when the MEE in a memory controller (MC) detects an error in read data that requires correction, the encryption operations in all memory controllers are suspended;
step 3: the MAC computation load scheduler (MCLS) in the GPU distributes the MAC computations required for error correction to the MAC engines in the other memory controllers;
step 4: because the GPU contains multiple independent, concurrent DRAM channels and corresponding memory controllers, the MAC engines in these memory controllers perform the MAC computations required for error correction concurrently, which reduces the time the GPU takes to implement MAC-based ECC error correction.
2. A hardware structure for implementing the method of claim 1, comprising N DRAMs, N MPs, one MCLS, and L SMs;
each MP comprises an MC and an L2 cache; the N DRAMs are connected to the N MPs in one-to-one correspondence, each DRAM being connected to the MC of its MP; the MC of each MP is connected to the MCLS;
the L2 cache of each MP is connected to an interconnection network bus;
the L SMs are all connected to the interconnection network bus.
CN202310457019.8A 2023-04-25 2023-04-25 Method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and hardware structure thereof Pending CN116719667A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310457019.8A | 2023-04-25 | 2023-04-25 | Method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and hardware structure thereof

Publications (1)

Publication Number | Publication Date
CN116719667A (en) | 2023-09-08

Family

ID=87874110

Family Applications (1)

Application Number | Legal Status | Publication | Priority Date | Filing Date | Title
CN202310457019.8A | Pending | CN116719667A (en) | 2023-04-25 | 2023-04-25 | Method for reducing the time consumption of a GPU in implementing MAC-based ECC error correction, and hardware structure thereof

Country Status (1)

Country Link
CN (1) CN116719667A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination