CN117573418A - Processing method, system, medium and equipment for video memory access exception - Google Patents

Publication number
CN117573418A
Authority
CN
China
Prior art keywords
target data
terminal
access
gpu
video memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410058593.0A
Other languages
Chinese (zh)
Other versions
CN117573418B (en)
Inventor
郭帆
王鲲
陈飞
邹懋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vita Technology Beijing Co ltd
Original Assignee
Vita Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vita Technology Beijing Co ltd
Priority to CN202410058593.0A
Publication of CN117573418A
Application granted
Publication of CN117573418B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)

Abstract

The disclosure relates to a processing method, system, medium and equipment for video memory access exceptions, and belongs to the technical field of GPU video memory. When the GPU encounters an access error because the address corresponding to the accessed target data is located at a second terminal, the target data can be acquired by the method of the disclosure, so that the GPU can access the target data normally.

Description

Processing method, system, medium and equipment for video memory access exception
Technical Field
The disclosure relates to the technical field of GPU video memories, in particular to a method, a system, a medium and equipment for processing video memory access exception.
Background
A GPU (Graphics Processing Unit) is a parallel computing device that is widely used in graphics processing, AI (Artificial Intelligence), high-performance computing, and other fields. When the GPU accesses video memory and encounters an abnormal condition such as a missing page, an illegal address, or a permission restriction, a GPU memory access exception (Page Fault) may be triggered. The GPU Page Fault handling procedure is then entered to restore GPU video memory access.
Disclosure of Invention
The disclosure aims to provide a processing method, system, medium and equipment for video memory access exceptions. When the GPU encounters an access error because the address corresponding to the accessed target data is located at a second terminal, the target data can be obtained in the manner of the disclosure, so that the GPU can access the target data normally. Further, because the GPU of the first terminal can access data of the second terminal, the video memory of the GPU of the first terminal is effectively expanded and resources are shared among multiple terminals.
In order to achieve the above object, in a first aspect, the present disclosure provides a method for processing a video memory access exception, applied to a first terminal, the method for processing a video memory access exception includes:
in response to detecting that the GPU fails to access the target data in the kernel state, determining whether the target data is located in the second terminal;
reporting access exception information corresponding to the access failure to a user state in the case that the target data is located at the second terminal, so that an I/O interface in the user state acquires the target data from the second terminal; wherein the access exception information at least comprises: address information corresponding to the target data;
copying the target data into a video memory, and controlling the GPU to re-access the target data in a kernel state.
Optionally, the determining whether the target data is located in the second terminal includes:
acquiring pre-established address terminal corresponding relation information;
and determining whether the target data is positioned at the second terminal according to the address corresponding to the target data and the address terminal corresponding relation information.
Optionally, the copying the target data to the video memory includes:
determining a first subspace for storing the target data from a video memory, and copying the target data into the first subspace;
and, the above method further comprises:
and establishing a mapping relation between the first subspace and the acquisition address corresponding to the target data.
Optionally, the controlling the GPU to revisit the target data in the kernel state includes:
and acquiring the target data from the first subspace by utilizing an acquisition address corresponding to the target data.
Optionally, after the copying the target data to the first subspace, the method further includes:
generating, in the user mode, first notification information to be reported to the running system;
and switching, by the running system, the current process from the user mode to the kernel mode based on the first notification information.
Optionally, in the case that the target data is not located in the second terminal, the method further includes:
and clearing access exception information for failure in accessing the target data, and restoring the GPU access related hardware to a state before accessing the target data.
Optionally, the method further comprises:
in the case where the I/O interface and the second terminal have established a communication link, the I/O interface acquires target data from the second terminal using the established communication link;
and generating indication information for establishing a communication link in the case that the I/O interface does not establish the communication link with the second terminal.
In a second aspect, the present disclosure further provides a processing system for a video memory access exception, applied to a first terminal, where the processing system for a video memory access exception includes:
the detection unit is used for determining whether the target data are positioned at the second terminal or not in response to the detection that the GPU fails to access the target data in the kernel state;
the acquisition unit is used for reporting access abnormality information corresponding to access failure to a user state when the target data is positioned at the second terminal, so that an I/O interface in the user state acquires the target data from the second terminal; wherein, the access anomaly information at least comprises: address information corresponding to the target data;
and the recovery unit is used for copying the target data into the video memory and controlling the GPU to revisit the target data in the kernel state.
In a third aspect, the present disclosure also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any one of the preceding aspects for handling a video memory access exception.
In a fourth aspect, the present disclosure also provides an electronic device, including: a memory having a computer program stored thereon; a processor, configured to execute the computer program in the memory, so as to implement the steps of the method for processing a video memory access exception according to any one of the foregoing first aspects.
By adopting the technical scheme, at least the following beneficial technical effects can be achieved:
When a failure of the GPU to access the target data is detected, it can be determined whether the target data is located at the second terminal. If so, the access exception information corresponding to the access failure is reported to the user state, so that the target data can be acquired from the second terminal through the I/O interface in the user state. Once acquired, the target data is copied to the video memory, and the GPU can then access the target data normally by re-executing the access step. Thus, when the GPU encounters an access error because the address corresponding to the accessed target data is located at the second terminal, the target data can still be acquired and the GPU can access it normally.
Meanwhile, because the GPU of the first terminal can access data of the second terminal, the video memory of the GPU of the first terminal is effectively expanded, and resource sharing among a plurality of terminals is realized.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate the disclosure and together with the description serve to explain, but do not limit the disclosure. In the drawings:
FIG. 1 is a flow chart illustrating a method for processing a video memory access exception according to an exemplary embodiment of the present disclosure.
FIG. 2 is an interaction diagram illustrating a method for processing a video memory access exception according to an exemplary embodiment of the present disclosure.
FIG. 3 is an interaction diagram illustrating yet another method for processing a video memory access exception according to an exemplary embodiment of the present disclosure.
FIG. 4 is a block diagram illustrating a processing system for a video memory access exception according to an exemplary embodiment of the present disclosure.
FIG. 5 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the disclosure, are not intended to limit the disclosure.
It should be noted that all actions of acquiring signals, information or data in the present disclosure are performed in compliance with the applicable data protection regulations and policies of the country concerned and with the authorization granted by the owner of the corresponding device.
A GPU Page Fault may occur when the GPU accesses a memory region that is unallocated or has been released. In the related art, when a GPU Page Fault occurs, the kernel mode performs corresponding processing to restore normal GPU access. For example, the kernel mode may run a GPU Page Fault handler, which can be understood as a program executed when the GPU encounters an invalid or unmapped address while accessing its memory. The handler generally proceeds as follows: it allocates new GPU video memory and migrates memory pages from CPU memory or from the video memory of other GPUs on the same machine. Because all of this processing takes place in kernel mode, when the target data to be accessed is not stored on the first terminal, the kernel-mode GPU Page Fault handler cannot locate the target data and can only report an error and exit.
That is, when a GPU Page Fault occurs, the related art can run the GPU Page Fault handler in kernel mode to restore access to the data, but this approach can only handle the case where the target data is local (in the local CPU memory or GPU video memory). If the target data is not local, it cannot be found in this way. In other words, the related art handles GPU Page Faults in only one way, so the probability of successfully handling a GPU Page Fault is not high.
In view of this, with the processing method for a video memory access exception according to the embodiments of the present disclosure, when the GPU encounters an access error because the address corresponding to the accessed target data is located at the second terminal, the target data can still be obtained, so that the GPU can access the target data normally. Meanwhile, because the GPU of the first terminal can access data of the second terminal, the video memory of the GPU of the first terminal is effectively expanded, and resource sharing among a plurality of terminals is realized.
Fig. 1 is a flowchart illustrating a method for processing a video memory access exception according to an exemplary embodiment of the present disclosure, where the method for processing a video memory access exception may be applied to a first terminal, and the first terminal may include a GPU, and the GPU includes a video memory. As shown in fig. 1, the processing method for the video memory access exception may include the following steps.
In S101, in response to detecting that the GPU fails to access the target data in the kernel state, it is determined whether the target data is located at the second terminal.
It should be understood that the first terminal and the second terminal are two mutually independent terminal devices. That is, any terminal other than the first terminal may be regarded as a second terminal. An association may be established in advance between the second terminal and the first terminal, the association being embodied in that the two terminals can communicate with each other.
As an example, when the GPU accesses data, it is generally queried whether the data is in the video memory, and if the data is not in the video memory, a GPU access failure (Page Fault) may occur.
It should be noted that, when a failure of the GPU to access the target data is detected, the target data may be located in the first terminal (e.g., CPU memory) or may be located in the second terminal.
If the target data is located in the first terminal, the kernel-mode GPU Page Fault handler (which can be understood as a recovery program preset in kernel mode for recovering from access exceptions) may be able to recover from the fault; if the target data is located at the second terminal, the kernel-mode GPU Page Fault handler clearly cannot recover from it.
That is, when detecting that the GPU fails to access the target data, the present disclosure determines whether the target data is located at the second terminal, so as to better implement processing for the GPU Page Fault.
In S102, when the target data is located in the second terminal, the access exception information corresponding to the access failure is reported to the user mode, so that the I/O interface in the user mode obtains the target data from the second terminal.
Here, the access abnormality information includes at least: address information corresponding to the target data.
As an example, the I/O interface may establish a communication link with the second terminal, after which the target data may be acquired from the second terminal.
As an example, the I/O interface may query the second terminal according to address information corresponding to the target data, and establish a communication link with the second terminal, and then may acquire the target data from the second terminal.
In S103, the target data is copied to the video memory, and the GPU is controlled to re-access the target data in the kernel state.
As an example, the GPU may access the data in the video memory when accessing the data, and thus, in order to make the GPU access the target data normally, the target data may be copied to the video memory.
As an example, re-executing the step in which the GPU accesses the target data can be understood as running the preset GPU Page Fault handler in kernel mode. Because the target data now exists in the video memory, the current GPU Page Fault can be cleared when the handler runs in kernel mode.
Therefore, in the method of the present disclosure, when a failure of the GPU to access the target data is detected, it can be determined whether the target data is located at the second terminal. If so, the access exception information corresponding to the access failure is reported to the user state, so that the target data can be acquired from the second terminal through the I/O interface in the user state. Once acquired, the target data is copied to the video memory, and the GPU can then access the target data normally by re-executing the access step. Thus, when the GPU encounters an access error because the address corresponding to the accessed target data is located at the second terminal, the target data can still be acquired and the GPU can access it normally.
Meanwhile, because the GPU of the first terminal can access data of the second terminal, the video memory of the GPU of the first terminal is effectively expanded, and resource sharing among a plurality of terminals is realized.
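The three steps S101 to S103 can be illustrated with a minimal simulation. The sketch below is not the implementation of the disclosure: the class `FirstTerminal`, its dictionaries, and the use of `KeyError` to stand in for a GPU Page Fault are all assumptions made purely for illustration, and no real kernel/user mode switch takes place.

```python
from dataclasses import dataclass, field


@dataclass
class FaultInfo:
    """Access exception information; it carries at least the address."""
    address: int


@dataclass
class FirstTerminal:
    # Hypothetical stand-ins: remote_owner maps an address to the name of a
    # second terminal, remote_store simulates the data held by that terminal.
    remote_owner: dict
    remote_store: dict
    video_memory: dict = field(default_factory=dict)

    def gpu_access(self, address):
        # Simulated GPU access: a KeyError models a GPU Page Fault.
        return self.video_memory[address]

    def user_mode_fetch(self, info):
        # Simulated user-mode I/O interface fetching from the second terminal.
        terminal = self.remote_owner[info.address]
        return self.remote_store[terminal][info.address]

    def handle_access(self, address):
        try:
            return self.gpu_access(address)
        except KeyError:
            # S101: decide whether the target data is at a second terminal.
            if address not in self.remote_owner:
                raise
            # S102: report the fault information and fetch the data remotely.
            data = self.user_mode_fetch(FaultInfo(address))
            # S103: copy into video memory, then re-access.
            self.video_memory[address] = data
            return self.gpu_access(address)
```

In this sketch a second access to the same address succeeds directly, because the data already resides in the simulated video memory.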
In some embodiments, the "determining whether the target data is located at the second terminal" in S101 may specifically include:
acquiring pre-established address terminal corresponding relation information; and determining whether the target data is positioned at the second terminal according to the address corresponding to the target data and the address terminal corresponding relation information.
As an example, the pre-established address-terminal correspondence information may indicate the mapping between access addresses and terminals. Specifically, the terminal corresponding to an address can be determined from that address.
It should be appreciated that when detecting that the GPU fails to access the target data in the kernel mode, the pre-established address terminal correspondence information may be obtained, so that it may be determined efficiently whether the target data is located in the second terminal.
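One possible shape for the address-terminal correspondence information is a sorted table of address ranges. The following is a minimal sketch under that assumption; the disclosure does not specify the data structure, and the names `AddressTerminalTable` and `is_on_second_terminal` are hypothetical.

```python
import bisect


class AddressTerminalTable:
    """Maps non-overlapping address ranges to terminal identifiers."""

    def __init__(self, ranges):
        # ranges: iterable of (start, end_exclusive, terminal_id)
        self._ranges = sorted(ranges)
        self._starts = [r[0] for r in self._ranges]

    def lookup(self, address):
        # Find the last range whose start is <= address, then check its end.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0:
            start, end, terminal = self._ranges[i]
            if start <= address < end:
                return terminal
        return None


def is_on_second_terminal(table, address, local_id):
    """True when the address is owned by a terminal other than the local one."""
    owner = table.lookup(address)
    return owner is not None and owner != local_id
```

A binary search keeps the lookup logarithmic in the number of ranges, which matters because the check runs on every reported fault.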
It will be appreciated that the second terminal may establish a communication link with the first terminal such that the target data may be conveniently obtained from the second terminal.
In some embodiments, in the event that the I/O interface has established a communication link with the second terminal, the I/O interface obtains the target data from the second terminal using the established communication link.
As an example, if the I/O interface in the user mode has already established a communication link with the second terminal, then the target data may be obtained from the second terminal directly based on the communication link.
Accordingly, in the case where the I/O interface has not established a communication link with the second terminal, indication information for establishing the communication link is generated; the user may then establish a communication link between the I/O interface and the second terminal as instructed by the indication information.
Of course, there are many specific ways of establishing the communication link, which, for brevity, are not described in detail herein. For example, the communication protocol used by the established communication link may include, but is not limited to, TCP (Transmission Control Protocol) and RDMA (Remote Direct Memory Access).
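The two cases above (link already established, link not yet established) can be sketched as follows. This is an illustrative model only: `IOInterface`, the dictionary-shaped results, and the use of a plain callable in place of a real TCP or RDMA connection are assumptions, not details from the disclosure.

```python
class IOInterface:
    """User-mode I/O interface holding at most one link per terminal."""

    def __init__(self):
        # terminal id -> link; here a link is any callable taking an address.
        # In practice it could wrap a TCP or RDMA connection (assumption).
        self._links = {}

    def establish_link(self, terminal_id, link):
        self._links[terminal_id] = link

    def fetch(self, terminal_id, address):
        link = self._links.get(terminal_id)
        if link is None:
            # No link yet: produce indication information instead of data.
            return {"status": "link_required", "terminal": terminal_id}
        # Link exists: use it directly to obtain the target data.
        return {"status": "ok", "data": link(address)}
```

Returning indication information rather than raising an error mirrors the text: the missing link is a condition to be resolved, after which the fetch can be retried.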
In some embodiments, copying the target data into the video memory may include:
determining a first subspace for storing target data from a video memory, and copying the target data to the first subspace; and, the above method may further include:
and establishing a mapping relation between the first subspace and the acquisition address corresponding to the target data.
As an example, when the GPU re-accesses the target data in kernel mode, it accesses the space in the video memory indicated by the address, and only then can the target data be accessed normally. Therefore, a subspace for storing the target data can be determined in the video memory, and a mapping can be established between that subspace and the address of the target data, so that the GPU can obtain the target data when it re-accesses the data in kernel mode.
In some embodiments, controlling the GPU to revisit the target data in the kernel state may specifically include:
and acquiring the target data from the first subspace by utilizing the acquisition address corresponding to the target data.
As an example, since the target data is already stored in the first subspace of the video memory at present, the first subspace may be determined based on the acquisition address corresponding to the target data, and the target data may be acquired from the first subspace. In this way, the target data can be successfully acquired when the GPU revisits the target data in the kernel mode.
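The relationship between the first subspace, the mapping, and the later re-access can be sketched with a toy page table. The page granularity, the `VideoMemory` class, and its method names are assumptions for illustration; the disclosure does not prescribe how the subspace is chosen.

```python
class VideoMemory:
    """Toy video memory with a page table keyed by acquisition address."""

    def __init__(self, size, page_size=4096):
        self.buf = bytearray(size)
        self.page_size = page_size
        self.free_pages = list(range(size // page_size))
        self.page_table = {}  # acquisition address -> physical page index

    def map_and_copy(self, address, data):
        """Pick a free page (the first subspace), copy data, record the mapping."""
        if len(data) > self.page_size:
            raise ValueError("data does not fit in one page")
        if not self.free_pages:
            raise MemoryError("no free video memory page")
        page = self.free_pages.pop(0)
        offset = page * self.page_size
        self.buf[offset:offset + len(data)] = data
        self.page_table[address] = page
        return page

    def read(self, address, length):
        """Re-access: resolve the address through the mapping, then read."""
        page = self.page_table[address]  # KeyError models an unmapped address
        offset = page * self.page_size
        return bytes(self.buf[offset:offset + length])
```

Before `map_and_copy` is called, `read` fails for the address; afterwards the same address resolves to the subspace holding the copied data.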
In some embodiments, after copying the target data to the first subspace, the following steps may also be performed:
generating first notification information for reporting the running system in a user mode;
the running system converts the current processing process from the user mode to the kernel mode based on the first notification information.
As an example, the target data is acquired and copied to the video memory in user mode, whereas re-accessing the target data needs to be performed in kernel mode. Therefore, after the target data is copied to the first subspace, the first notification information may be generated and the current process switched from user mode to kernel mode, so that the GPU can re-access the target data in kernel mode.
In some embodiments, before controlling the GPU to revisit the target data in the kernel state, the following steps may also be performed:
and clearing access exception information for failure of accessing the target data, and restoring the GPU access related hardware to a state before accessing the target data.
It should be appreciated that while a process is executing, an exception or interrupt may occur and cause the process to be suspended. At this time, the hardware saves the current hardware context (including CPU registers, memory addresses, and other information) so that the state before the exception can be correctly restored when execution resumes.
That is, the access exception information for the failed access to the target data is cleared, and the hardware related to GPU access is restored to its state before the access, so that the GPU can conveniently re-access the target data in kernel mode. In other words, the GPU can re-execute, in kernel mode, the instruction that accesses the target data, and the access can then succeed.
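The save, clear, and restore sequence can be modelled with a small simulation. Everything here is hypothetical: `HardwareContext`, `SimulatedGpu`, and the "scratch" register that represents side effects of the failed instruction are illustrative stand-ins, not real GPU state.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HardwareContext:
    program_counter: int = 0
    registers: dict = field(default_factory=dict)


class SimulatedGpu:
    def __init__(self):
        self.context = HardwareContext()
        self.pending_fault: Optional[dict] = None
        self._saved: Optional[HardwareContext] = None

    def fault(self, address):
        # Save the context as it was before the faulting access.
        self._saved = copy.deepcopy(self.context)
        self.pending_fault = {"address": address}
        # Simulate partial side effects left by the failed instruction.
        self.context.registers["scratch"] = "dirty"

    def recover(self):
        """Clear the exception info, restore the saved context, and return
        the program counter of the instruction to re-execute."""
        if self._saved is None:
            raise RuntimeError("no fault to recover from")
        self.pending_fault = None
        self.context = self._saved
        self._saved = None
        return self.context.program_counter
```

After `recover`, the simulated device looks exactly as it did before the failed access, so replaying the instruction at the returned program counter is safe.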
For ease of understanding, the description continues with reference to fig. 2, which can be understood as a specific interaction diagram corresponding to one possible implementation of the processing method for a video memory access exception of the present disclosure. As can be seen from fig. 2, when an application running on the GPU1 device of the first terminal accesses a video memory address va2 of the GPU2 device on the second terminal (an address that is not actually mapped to physical video memory on GPU1), a hardware page fault interrupt is triggered. In fig. 2, arrow 201 between the user application and the GPU1 device indicates the access to the video memory address va2 of the GPU2 device on the second terminal, after which execution continues in the order of arrows 202-209.
The hardware interrupt is monitored and filtered by the kernel-mode listening and reporting module. If the address range belongs to the target GPU (that is, the address is determined to indicate the second terminal), the information about the abnormal address (including the address, the access type, the ID of the GPU where the exception occurred, and so on) is packaged and sent to the user-mode exception handling module through a signal or system call.
After a monitoring module (not shown in fig. 2; it can monitor events in both user mode and kernel mode) detects a user fault event (an error occurring in a user program), the Page Fault exception information is parsed, and the physical server where the abnormal address is located and the target GPU ID are found through recorded index information. The data can then be read from the GPU2 device of the second terminal through a network I/O access interface, a mapping from va2 to physical video memory can be established on the physical GPU1 device, and the data read from the GPU2 device is copied into va2 on the GPU1 device.
The user-mode monitoring module can also notify the kernel-mode exception recovery execution module through a system call. That module restores the hardware context by clearing the exception information, re-executes the instruction that faulted, and re-accesses the abnormal address, so that the required data can be obtained.
Specifically, in fig. 2, the listening and reporting module may specifically perform the following functions:
The module monitors hardware interrupts in kernel mode, screens out Page Faults, packages the Page Fault information, and reports it to the user-mode handler through a system call or signal. Specifically, the exception monitoring module monitors hardware interrupts of the physical device and screens out the Page Faults to be reported (the process indicated by arrows 202-203 in fig. 2); the exception reporting module packages the Page Fault information to be reported and sends it to the user-mode handler through a signal or system call (the process indicated by arrow 204 in fig. 2).
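To make the screening and packaging steps concrete, the following is a minimal sketch of how Page Fault information (address, access type, GPU ID) might be filtered and serialized before being handed to user mode. The record layout `<QBI`, the `PageFaultInfo` class, the access-type encoding, and the `should_report` filter are illustrative assumptions and are not taken from the disclosure.

```python
import struct
from dataclasses import dataclass

# Assumed wire layout: 64-bit address, 8-bit access type, 32-bit GPU ID,
# little-endian with no padding (13 bytes in total).
FAULT_RECORD = struct.Struct("<QBI")

ACCESS_READ = 0   # assumed encoding
ACCESS_WRITE = 1  # assumed encoding


@dataclass(frozen=True)
class PageFaultInfo:
    address: int
    access_type: int
    gpu_id: int

    def pack(self) -> bytes:
        return FAULT_RECORD.pack(self.address, self.access_type, self.gpu_id)

    @classmethod
    def unpack(cls, raw: bytes) -> "PageFaultInfo":
        return cls(*FAULT_RECORD.unpack(raw))


def should_report(info, remote_ranges):
    """Screen faults: report only addresses inside a remote GPU's range."""
    return any(start <= info.address < end for start, end in remote_ranges)
```

A fixed-size record keeps the kernel-to-user handoff simple, since the receiver can read exactly `FAULT_RECORD.size` bytes per fault.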
The user-mode exception handling module may specifically perform the following functions:
It receives and parses Page Fault exception information in user mode, accesses the required data through a user-mode I/O interface, restores the abnormal address, and copies the corresponding data. Specifically, the exception receiving module receives the Page Fault information reported from kernel mode and parses related information such as the faulting address and the faulting permission (the process indicated by arrow 205 in fig. 2); the I/O access module invokes a user-mode I/O interface to obtain the required data from the network or storage (the process indicated by arrow 206 in fig. 2); and the video memory recovery and data copying module restores access to the abnormal address by allocating new memory, remapping, modifying access rights, or the like, and copies the corresponding data into the video memory (the process indicated by arrow 207 in fig. 2).
The exception recovery execution module may specifically perform the following functions:
After user-mode exception handling is completed, a system call notifies the kernel mode to resume execution of the application; the hardware context is restored by clearing the exception information, after which the instruction that faulted is re-executed and the abnormal address is re-accessed (the process indicated by arrows 208-209 in fig. 2).
Therefore, through the cooperation of the above modules, GPU access exceptions can be handled in user mode. In this way, data stored in the video memory of other devices can be obtained; that is, cross-machine video memory access that is transparent to the user can be achieved.
For ease of understanding, the description continues with reference to fig. 3, which can be understood as a schematic diagram of one possible access pattern of the present disclosure. In fig. 3, solid arrows represent direct access, and dashed arrows represent indirect access from the user state by way of the present disclosure. That is, the application program may directly access the local video memory, or may indirectly access the video memory of another terminal by executing the steps of the processing method for a video memory access exception provided by the present disclosure.
The method and the device can acquire resources in the video memory in a cross-device manner, which is equivalent to expanding the video memory space of the terminal, so that the video memory spaces of different devices can be shared, and the capability of the GPU is improved under the condition that the GPU hardware device is not changed.
In addition, a GPU Page Fault may be triggered when GPU video memory located on another server is accessed. Specifically, this may be a GPU user fault, which can be understood as the fault generated when a user program attempts to access a region of GPU memory that is protected by the system or otherwise cannot be accessed. The fault in turn triggers the processing method for a video memory access exception of the present disclosure: the GPU Page Fault information is reported to user mode, and the data is transferred to the application through a network interface such as TCP/RDMA, so as to achieve cross-machine access to video memory without the user being aware of it.
Fig. 4 is a schematic diagram of a processing system for a video memory access exception according to an exemplary embodiment of the present disclosure. As shown in Fig. 4, the processing system 400 for a video memory access exception is applied to a first terminal and includes a detection unit 410, an acquisition unit 420, and a recovery unit 430, where the detection unit 410, the acquisition unit 420, and the recovery unit 430 may correspond to the functions performed by one or more of the modules in Fig. 2, differing only slightly in naming:
the detection unit 410 is configured to determine, in response to detecting that the GPU has failed to access target data in kernel mode, whether the target data is located at a second terminal;
the acquisition unit 420 is configured to report access exception information corresponding to the access failure to user mode when the target data is located at the second terminal, so that an I/O interface in user mode acquires the target data from the second terminal, where the access exception information includes at least address information corresponding to the target data;
and the recovery unit 430 is configured to copy the target data into the video memory and to control the GPU to re-access the target data in kernel mode.
In some embodiments, the detection unit 410 is specifically further configured to:
acquire pre-established address-terminal correspondence information;
and determine, according to the address corresponding to the target data and the address-terminal correspondence information, whether the target data is located at the second terminal.
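One possible form of the address-terminal correspondence information is a table of non-overlapping address ranges, each owned by one terminal. The sketch below assumes that form; the class `AddressTerminalMap`, the helper `is_on_second_terminal`, and the terminal identifiers are hypothetical and are not specified by the disclosure.

```python
import bisect
from typing import List, Optional, Tuple


class AddressTerminalMap:
    """Maps address ranges to the terminal that holds the data (assumed layout)."""

    def __init__(self, ranges: List[Tuple[int, int, str]]):
        # Each entry is (start, end_exclusive, terminal_id); ranges must not overlap.
        self._ranges = sorted(ranges)
        self._starts = [r[0] for r in self._ranges]

    def lookup(self, address: int) -> Optional[str]:
        # Find the last range whose start is <= address, then check its end.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0:
            start, end, terminal = self._ranges[i]
            if start <= address < end:
                return terminal
        return None


def is_on_second_terminal(table: AddressTerminalMap,
                          address: int,
                          local_id: str) -> bool:
    """True only when the address is known and owned by another terminal."""
    owner = table.lookup(address)
    return owner is not None and owner != local_id
```

An address that appears in no range is treated as not remote, which corresponds to the branch in which the exception information is simply cleared.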
In some embodiments, the recovery unit 430 is specifically further configured to:
determine, from the video memory, a first subspace for storing the target data, and copy the target data into the first subspace;
and the processing system 400 for a video memory access exception is specifically further configured to:
establish a mapping relation between the first subspace and the acquisition address corresponding to the target data.
In some embodiments, the recovery unit 430 is specifically further configured to:
acquire the target data from the first subspace by using the acquisition address corresponding to the target data.
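The relationship between the first subspace and the acquisition address can be sketched as a small mapping table over a local buffer. This is a simplified model under stated assumptions: `VideoMemory` is a hypothetical class, the bump allocator is a placeholder for whatever allocation policy is actually used, and a real system would update GPU page tables rather than a Python dictionary.

```python
class VideoMemory:
    """Toy local video memory with an address-to-subspace mapping."""

    def __init__(self, size: int):
        self.buffer = bytearray(size)
        self.next_free = 0   # simple bump allocator (assumption)
        self.mapping = {}    # acquisition address -> (offset, length)

    def install(self, acquisition_address: int, data: bytes) -> int:
        """Choose a first subspace, copy the data in, and record the mapping."""
        length = len(data)
        if self.next_free + length > len(self.buffer):
            raise MemoryError("no free video memory for the first subspace")
        offset = self.next_free
        self.buffer[offset:offset + length] = data
        self.next_free += length
        self.mapping[acquisition_address] = (offset, length)
        return offset

    def read(self, acquisition_address: int) -> bytes:
        """Re-access: resolve the original address to the local subspace."""
        offset, length = self.mapping[acquisition_address]
        return bytes(self.buffer[offset:offset + length])
```

Because the mapping is keyed by the original acquisition address, the retried access uses the same address as the failed one and now resolves locally.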
In some embodiments, after the target data is copied to the first subspace, the processing system 400 for a video memory access exception is specifically further configured to:
generate, in user mode, first notification information to be reported to the running system;
so that the running system switches the current processing process from user mode to kernel mode based on the first notification information.
In some embodiments, in a case where the target data is not located in the second terminal, the processing system 400 for a video memory access exception is specifically further configured to:
clear the access exception information for the failed access to the target data, and restore the GPU access-related hardware to its state before the target data was accessed.
In some embodiments, the processing system 400 for a video memory access exception is specifically further configured to:
in the case where the I/O interface and the second terminal have established a communication link, the I/O interface acquires target data from the second terminal using the established communication link;
and generating indication information for establishing a communication link in the case that the I/O interface does not establish the communication link with the second terminal.
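The two link cases above can be sketched as a small cache of communication links keyed by terminal. The class `LinkManager`, the `connect` factory, and the `read` method on a link are hypothetical names. As a further assumption, the sketch acts on the indication immediately by establishing the link and then fetching, whereas the disclosure only states that indication information is generated.

```python
class LinkManager:
    """Reuses an established link to a terminal, or establishes one on demand."""

    def __init__(self, connect):
        self._connect = connect  # hypothetical factory: terminal_id -> link object
        self._links = {}         # terminal_id -> established link

    def fetch(self, terminal_id, address, length):
        link = self._links.get(terminal_id)
        if link is None:
            # No link yet: corresponds to the indication to establish one.
            link = self._connect(terminal_id)
            self._links[terminal_id] = link
        # Link exists: acquire the target data over it.
        return link.read(address, length)
```

Caching the link means that only the first fault against a given terminal pays the connection setup cost.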
With respect to the processing system for a video memory access exception in the above embodiment, the specific manner in which the respective modules or services perform operations has been described in detail in the embodiment regarding the processing method for a video memory access exception, and will not be described in detail herein.
Fig. 5 is a block diagram of an electronic device 500, shown in accordance with an exemplary embodiment of the present disclosure. As shown in Fig. 5, the electronic device 500 may include a processor 501 and a memory 502. The electronic device 500 may also include one or more of a multimedia component 503, an input/output (I/O) interface 504, and a communication component 505.
The processor 501 is configured to control the overall operation of the electronic device 500, so as to complete all or part of the steps in the processing method for the video memory access exception. The memory 502 is used to store various types of data to support operation at the electronic device 500, which may include, for example, instructions for any application or method operating on the electronic device 500, as well as application-related data, such as contact data, messages sent and received, pictures, audio, video, and so forth. The Memory 502 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as static random access Memory (Static Random Access Memory, SRAM for short), electrically erasable programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM for short), erasable programmable Read-Only Memory (Erasable Programmable Read-Only Memory, EPROM for short), programmable Read-Only Memory (Programmable Read-Only Memory, PROM for short), read-Only Memory (ROM for short), magnetic Memory, flash Memory, magnetic disk, or optical disk. The multimedia component 503 may include a screen and an audio component. Wherein the screen may be, for example, a touch screen, the audio component being for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 502 or transmitted through the communication component 505. The audio assembly further comprises at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, which may be a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. 
The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (Near Field Communication, NFC for short), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of these, which is not limited herein. Accordingly, the communication component 505 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASIC), digital signal processors (Digital Signal Processor, abbreviated as DSP), digital signal processing devices (Digital Signal Processing Device, abbreviated as DSPD), programmable logic devices (Programmable Logic Device, abbreviated as PLD), field programmable gate arrays (Field Programmable Gate Array, abbreviated as FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above-described processing method for a video memory access exception.
In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, where the program instructions, when executed by a processor, implement the steps of the above-described processing method for a video memory access exception. For example, the computer-readable storage medium may be the above-described memory 502 including program instructions, which are executable by the processor 501 of the electronic device 500 to perform the above-described processing method for a video memory access exception.
In another exemplary embodiment, a computer program product is also provided. The computer program product comprises a computer program executable by a programmable apparatus, and the computer program has code portions which, when executed by the programmable apparatus, perform the above-described processing method for a video memory access exception.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present disclosure within the scope of the technical concept of the present disclosure, and all the simple modifications belong to the protection scope of the present disclosure.
In addition, the specific features described in the above embodiments may be combined in any suitable manner without contradiction. The various possible combinations are not described further in this disclosure in order to avoid unnecessary repetition.
Moreover, the various embodiments of the present disclosure may be combined with one another in any manner; as long as such a combination does not depart from the spirit of the present disclosure, it should likewise be regarded as content disclosed by the present disclosure.

Claims (10)

1. A processing method for a video memory access exception, characterized in that the method is applied to a first terminal and comprises:
in response to detecting that a GPU has failed to access target data in kernel mode, determining whether the target data is located at a second terminal;
in a case where the target data is located at the second terminal, reporting access exception information corresponding to the access failure to user mode, so that an I/O interface in user mode acquires the target data from the second terminal, wherein the access exception information comprises at least address information corresponding to the target data;
copying the target data into a video memory, and controlling the GPU to re-access the target data in kernel mode.
2. The method of claim 1, wherein the determining whether the target data is located at the second terminal comprises:
acquiring pre-established address-terminal correspondence information;
and determining, according to the address corresponding to the target data and the address-terminal correspondence information, whether the target data is located at the second terminal.
3. The method of claim 1, wherein copying the target data into a video memory comprises:
determining, from the video memory, a first subspace for storing the target data, and copying the target data into the first subspace;
and the method further comprises:
establishing a mapping relation between the first subspace and the acquisition address corresponding to the target data.
4. The method of claim 3, wherein controlling the GPU to revisit the target data in the kernel state comprises:
acquiring the target data from the first subspace by using the acquisition address corresponding to the target data.
5. The method of claim 3, wherein after the copying of the target data into the first subspace, the method further comprises:
generating, in user mode, first notification information to be reported to the running system;
so that the running system switches the current processing process from user mode to kernel mode based on the first notification information.
6. The method of claim 1, wherein in the event that the target data is not located at a second terminal, the method further comprises:
clearing the access exception information for the failed access to the target data, and restoring the GPU access-related hardware to its state before the target data was accessed.
7. The method according to claim 1, wherein the method further comprises:
in the case that the I/O interface has established a communication link with the second terminal, the I/O interface acquires target data from the second terminal using the established communication link;
in case the I/O interface has not established a communication link with the second terminal, generating indication information for establishing a communication link.
8. A processing system for a video memory access exception, characterized in that the system is applied to a first terminal and comprises:
a detection unit, configured to determine, in response to detecting that a GPU has failed to access target data in kernel mode, whether the target data is located at a second terminal;
an acquisition unit, configured to report access exception information corresponding to the access failure to user mode in a case where the target data is located at the second terminal, so that an I/O interface in user mode acquires the target data from the second terminal, wherein the access exception information comprises at least address information corresponding to the target data;
and a recovery unit, configured to copy the target data into a video memory and to control the GPU to re-access the target data in kernel mode.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the steps of the processing method for a video memory access exception according to any one of claims 1-7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the processing method for a video memory access exception according to any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410058593.0A CN117573418B (en) 2024-01-15 2024-01-15 Processing method, system, medium and equipment for video memory access exception

Publications (2)

Publication Number Publication Date
CN117573418A true CN117573418A (en) 2024-02-20
CN117573418B CN117573418B (en) 2024-04-23

Family

ID=89886643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410058593.0A Active CN117573418B (en) 2024-01-15 2024-01-15 Processing method, system, medium and equipment for video memory access exception

Country Status (1)

Country Link
CN (1) CN117573418B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118069402A (en) * 2024-04-16 2024-05-24 沐曦集成电路(上海)有限公司 Task package execution error processing method
CN118503016A (en) * 2024-07-17 2024-08-16 武汉凌久微电子有限公司 GPU command exception recovery method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120144232A1 (en) * 2010-12-03 2012-06-07 International Business Machines Corporation Generation of Standby Images of Applications
WO2016127600A1 (en) * 2015-02-12 2016-08-18 中兴通讯股份有限公司 Exception handling method and apparatus
CN113674133A (en) * 2021-07-27 2021-11-19 阿里巴巴新加坡控股有限公司 GPU cluster shared video memory system, method, device and equipment
CN114595065A (en) * 2022-03-15 2022-06-07 北京有竹居网络技术有限公司 Data acquisition method and device, storage medium and electronic equipment
CN115599510A (en) * 2021-07-08 2023-01-13 华为技术有限公司(Cn) Processing method and corresponding device for page fault exception
CN115686805A (en) * 2021-07-22 2023-02-03 腾讯科技(深圳)有限公司 GPU resource sharing method and device, and GPU resource sharing scheduling method and device

Also Published As

Publication number Publication date
CN117573418B (en) 2024-04-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant