CN115686855A - Cache data access scheduling method, processor, electronic device and storage medium

Info

Publication number: CN115686855A
Application number: CN202211379879.6A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 蒲永杰, 张广勇, 段亦涛
Applicant/Assignee: Netease Youdao Information Technology Beijing Co Ltd
Classification: Memory System Of A Hierarchy Structure (AREA)

Abstract

An embodiment of the invention provides a cache data access scheduling method, a processor, an electronic device and a storage medium. The method comprises the following steps: in response to a user access, allocating a cache block in a video memory for the user; generating cache index information corresponding to the user according to the cache block allocated to the user; grouping users belonging to the same batch processing task to instruct a GPU to batch process the cache data of the users in the same group; and generating a cache number list corresponding to each group of users based on the cache index information of the users and storing the cache number list to the video memory. The method enables the GPU to perform efficient and reliable access and scheduling operations on the cache data in the video memory, greatly reduces data transfers between the video memory and main memory, and thereby improves the computing efficiency of the GPU.

Description

Cache data access scheduling method, processor, electronic device and storage medium
Technical Field
The embodiment of the invention relates to the field of computer processors, in particular to a cache data access scheduling method, a processor, an electronic device and a storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Thus, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
In deep learning applications, many inference processes employ streaming computation. In streaming computation, the input data is not fed into the computing device all at once; instead, it is divided into a plurality of segments, and one segment is input at each interval for inference computation.
In the computation process of a Graphics Processing Unit (GPU), data is read from the video memory into the compute units and then processed. If only one segment is computed each time, the model data that is read is used only once, so the memory access volume is large. If a batch of segments is computed each time, i.e., one batch is processed at a time, the model data read in a single pass is reused many times, the memory access volume is correspondingly small, and the GPU computing efficiency is improved. However, to improve model performance, the Conformer automatic speech recognition model widely used in industrial practice generally stores several computation results of the previous segment in a cache, and these results participate in the computation after the next segment arrives, thereby reducing the amount of computation.
If the cached computation results of the Conformer model are stored in main memory, a large number of transfer operations are generated and the transfer bandwidth becomes severely insufficient, leaving the GPU waiting for data transfers most of the time and greatly restricting its computing efficiency; to improve computing efficiency, the results therefore need to be stored in the video memory. However, cache data in the video memory is inconvenient to manage, and especially when it coexists with batch processing, the operations of accessing and scheduling the cache data in the video memory become even more complicated.
In view of this, it is desirable to provide an access scheduling scheme for cache data that enables efficient and reliable access and scheduling of the cache data in the video memory, thereby greatly reducing data transfers between the video memory and main memory and further improving the computing efficiency of the GPU.
Disclosure of Invention
To improve computational efficiency, the prior art usually caches the GPU's computation results in the video memory so as to facilitate data access and calls. However, because the GPU has limited capability to manage the cached data, and a large number of transfer operations are generated when data is repeatedly transferred between the video memory and main memory, managing the cached data in the video memory is often complicated.
For this reason, an improved access scheduling scheme for the cached data is highly needed to manage the cached data in the video memory efficiently and reliably.
In this context, embodiments of the present invention are intended to provide an access scheduling method for cached data, a processor, an electronic device, and a storage medium.
In a first aspect of embodiments of the present invention, a method for scheduling access to cache data is provided, including: responding to a user access, and allocating a cache block in a video memory for the user; generating cache index information corresponding to the user according to the cache block allocated to the user; the cache index information includes: the cache number index of the cache block occupied by the user; grouping users belonging to the same batch processing task to instruct a GPU to batch process the cache data of the users in the same group; and based on the cache index information of the users, generating a cache number list corresponding to each group of users and storing the cache number list to the video memory; the cache number list is used for instructing the GPU to access the cache block corresponding to each cache number index in the cache number list.
In an embodiment of the present invention, the generating cache index information corresponding to a user according to a cache block allocated to the user includes: generating cache index information corresponding to a user; updating a hash table in the memory according to cache index information corresponding to the user; the hash table is used for registering the corresponding relation between the user and the cache index information.
In an embodiment of the present invention, the method for scheduling access to cached data further includes: and in response to the disconnection of the user, clearing the information of the user in the hash table to release the cache block occupied by the user.
In an embodiment of the present invention, the method for scheduling access to cached data further includes: responding to the GPU scheduling signal, transmitting cache data in a cache block occupied by a user from a video memory of a first GPU to a memory, and then transmitting the cache data to a video memory of a second GPU; wherein the GPU scheduling signal is a signal that indicates a user to schedule from a first GPU to a second GPU; and updating the cache index information corresponding to the user in the hash table.
In an embodiment of the present invention, the length of the cache block is a fixed preset value; accordingly, each user occupies N cache blocks, where N is a positive integer.
In an embodiment of the present invention, the grouping users belonging to the same batch processing task includes: if the operator in the GPU is a cache correlation operator, grouping the users according to the number of cache blocks occupied by the users so as to enable the number of cache blocks occupied by the users in the same group to be the same; the row number of the input data matrix of the cache correlation operator is related to the input length, and the column number is related to the cache length.
In an embodiment of the present invention, the grouping users belonging to the same batch processing task further includes: if the operators in the GPU are cache-independent operators, dividing users belonging to the same batch processing task into a group; the row number of the input data matrix of the cache independent operator is related to the input length, and the column number is fixed.
In an embodiment of the present invention, after grouping users according to the number of cache blocks occupied by the users, the method further includes: if the operator is a non-sensitive operator, performing a zero-padding operation on the invalid region of the cache blocks corresponding to each group of users to make the cache lengths of the cache data within the same group of users consistent; the calculation results of the non-sensitive operator before and after the zero-padding operation are the same.
In an embodiment of the present invention, the cache index information further includes: a cache length index; correspondingly, after grouping the users according to the number of cache blocks occupied by the users, the method further includes: if the operator is a sensitive operator, generating a cache length index and adding the cache length index into the cache index information of the user; the calculation results of the sensitive operator before and after the zero-padding operation are different.
In an embodiment of the present invention, the cache index information further includes: additionally caching a number index and an input length index; correspondingly, the access scheduling method for the cache data further comprises the following steps: responding to input data of a user, generating the input length index, calculating the sum of the input length and the cache length, if the sum of the input length and the cache length is larger than the fixed preset value, allocating an additional cache block for the user, and generating an additional cache number index according to the additional cache block; and adding the input length index and the additional cache number index into cache index information of the user.
In a second aspect of embodiments of the present invention, there is provided an access scheduling processor for cached data, including: a cache allocation unit configured to: respond to a user access and allocate a cache block in a video memory for the user; an index generation unit configured to: generate cache index information corresponding to the user according to the cache block allocated to the user and store the cache index information to a memory unit; a user grouping unit configured to: group users belonging to the same batch processing task to instruct a GPU to batch process the cache data of the users in the same group; the index generation unit is further configured to: generate a cache number list corresponding to each group of users according to the grouping result of the user grouping unit and the cache index information of the users, and store the cache number list to the video memory; and the memory unit stores a correspondence table between users and cache index information; wherein the cache index information includes: a cache number index of a cache block occupied by a user; the cache number list is used for instructing the GPU to access the cache block corresponding to each cache number index in the cache number list.
In a third aspect of embodiments of the present invention, there is provided an electronic device, comprising: a processor; and a memory storing executable program instructions that, when executed by the processor, cause the electronic device to implement the method of any of the first aspects.
In a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon computer program instructions that, when executed by one or more processors, cause the electronic device to implement the method of any one of the first aspect.
According to the cache data access scheduling method of the embodiment of the invention, a corresponding cache block can be allocated in the video memory for each user, and the cache index information of the cache block occupied by the user is associated with the user, so that the user is bound to the information of its cache blocks. When a batch processing requirement arises, a cache number list can be generated according to the grouping result of the users belonging to the same batch processing task and the cache index information of the users in the same group, and each cache number index in the list guides the GPU to access and call the cache data in the corresponding cache block. This process not only saves the time consumed by reading data from main memory in the traditional GPU computing flow, but also allows the access and scheduling operations of the GPU to be planned and controlled as a whole, achieving efficient and reliable cache management, significantly reducing the difficulty of managing the cache in the video memory, and guaranteeing the efficiency and stability of the GPU computing process.
Further, in some embodiments, in order to balance the workload among the multiple GPUs, the memory may be used as a transfer station for scheduling the cache data, the data of one user is moved from the video memory of the current GPU to the video memory of another GPU, and the cache index information corresponding to the user is updated to ensure that the cache data can be normally accessed and scheduled subsequently. Moreover, since the interval time between two segments of a user is enough to complete two transmission tasks, the task delay of the user is not affected by the scheduling operation.
Further, in some embodiments, users are grouped according to the number of cache blocks they occupy, so that users whose cache data falls within the same cache length range are placed in one group before the zero-padding operation is performed. This avoids the situation where cache data with a smaller cache length is padded with excessive invalid data merely to match cache data with a larger cache length, which would waste computing resources unnecessarily.
Furthermore, in some embodiments, a cache length index may be added to the cache index information; the cache length index guides operators that are sensitive to the zero-padding operation to perform the corresponding calculation, preventing data in the invalid region of a cache block from affecting the operator's calculation result.
Furthermore, users can be grouped according to the number of cache blocks they occupy while the operator calculation is guided by the cache length indexes, which alleviates the increase in computation caused by index-guided operator calculation; the grouping thus compensates for the loss of computing efficiency introduced by the indexes during operator calculation.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the present invention;
FIG. 2 is a flow chart of a method for scheduling access to cached data according to an embodiment of the present invention;
FIG. 3 schematically shows a diagram of a cache block of a user according to one embodiment of the invention;
FIG. 4 schematically shows a diagram of a cache block of a user according to another embodiment of the invention;
FIG. 5 schematically illustrates a flow diagram of a method of grouping users in a same batch according to one embodiment of the invention;
FIG. 6 schematically shows a diagram of a user's cache block according to a further embodiment of the invention;
FIG. 7 is a block diagram schematically illustrating an access scheduling processor for caching data according to an embodiment of the present invention;
FIG. 8 schematically shows a block diagram of an electronic device of an embodiment of the invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 illustrates a block diagram of an exemplary computing system 100 suitable for implementing embodiments of the present invention. As shown in fig. 1, computing system 100 may include: a Central Processing Unit (CPU) 101, a Random Access Memory (RAM) 102, a Read Only Memory (ROM) 103, a system bus 104, a hard disk controller 105, a keyboard controller 106, a serial interface controller 107, a parallel interface controller 108, a display controller 109, a hard disk 110, a keyboard 111, a serial external device 112, a parallel external device 113, and a display 114. Among these devices, coupled to the system bus 104 are a CPU 101, a RAM 102, a ROM 103, a hard disk controller 105, a keyboard controller 106, a serial controller 107, a parallel controller 108, and a display controller 109. The hard disk 110 is coupled to the hard disk controller 105, the keyboard 111 is coupled to the keyboard controller 106, the serial external device 112 is coupled to the serial interface controller 107, the parallel external device 113 is coupled to the parallel interface controller 108, and the display 114 is coupled to the display controller 109. It should be understood that the block diagram of the architecture depicted in FIG. 1 is for purposes of illustration only and is not intended to limit the scope of the present invention. In some cases, certain devices may be added or subtracted as the case may be.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) or a combination of hardware and software, and is referred to herein generally as a "circuit," "module," or "system." Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied in the medium.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive example) of the computer readable storage medium may include, for example: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Embodiments of the present invention will be described below with reference to flowchart illustrations of methods and block diagrams of apparatuses (or systems) of embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for distinction only and not limitation.
The principles and spirit of the present invention are explained in detail below with reference to several exemplary embodiments of the present invention.
Summary of The Invention
The inventor finds that, for the GPU calculation process, due to the large amount of cache data and frequent access, if all cache data are transmitted back to the memory, the transmission bandwidth will be seriously insufficient, so that the GPU is in a state of waiting for data transmission most of the time, and the calculation efficiency of the GPU is greatly restricted. However, if the cache data is stored in the video memory, the GPU is difficult to adapt to such complicated cache management work.
To this end, the cache data is stored in the video memory using an index-based cache data access scheduling method: the CPU computes the movement and storage information of the cache data in advance, generates cache index information corresponding to each user to indicate the storage addresses of that user's cache data, and then transmits the cache index information to the video memory. This directs the GPU to access the cache data at the corresponding locations for computation, so that the processing capability of the CPU can be used indirectly to perform the complex cache management.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
In deep learning applications, many inference processes employ streaming computation. In streaming computation, the input data is not fed into the computing device all at once; instead, it is divided into a plurality of segments, and one segment is input at each interval for inference computation.
In particular, the Conformer model, an automatic speech recognition model with a high recognition rate in streaming automatic speech recognition, is widely used in industrial practice. The encoding process of the Conformer model is computationally heavy; to improve performance, several computation results of the previous segment are generally stored and, when the next segment arrives, they participate in the computation together, thereby reducing the amount of computation.
In practical applications, in order to improve the encoding performance of the Conformer model, the memory access bottleneck of the cache data needs to be reduced by exploiting data locality. Therefore, multiple segments from multiple users need to be combined into one batch of data for computation on the GPU and returned to each user after the computation is completed, which makes the access scheduling of cache data in the video memory more complex.
Exemplary method
It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
In the technical scheme of the invention, the acquisition, storage, application and the like of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
An access scheduling method of cached data according to an exemplary embodiment of the present invention is described below with reference to the accompanying drawings.
Fig. 2 schematically shows a flowchart of an access scheduling method for cache data according to an embodiment of the present invention. Referring to fig. 2, a method for scheduling access to cache data according to an embodiment of the present invention may include:
in step 201, in response to a user access, a buffer block in a video memory is allocated to the user.
Each time a user establishes a connection, the CPU allocates a cache block for the newly accessed user and records it, for subsequently storing the input data and cache data of that user.
It should be noted that, in some embodiments, a user may initiate a simple connection establishment request, where the connection establishment request is only used to apply for the CPU to allocate a cache block in the video memory for the user, and has no substantial data content. In other embodiments, the user may be considered to be accessed when the user generates input data, and at this time, the CPU allocates a cache block to the newly accessed user and stores the cache data obtained by calculating the input data of the user into the cache block allocated to the user.
In step 202, cache index information corresponding to the user is generated according to the cache blocks allocated to the user.
The cache index information includes the cache number index of the cache block occupied by the user, which may also be regarded as the cache number index of the cache block allocated to the user. Cache number indexes correspond to cache blocks one to one; a cache number index can be regarded as the identity (ID) or cache address number of a cache block, and under the guidance of the cache number index the GPU can locate the cache block and access or schedule the cache data in it.
In some embodiments, the generation process of the cache index information may be regarded as an update process of a correspondence table between users and cache index information in the memory.
Illustratively, the process of generating the cache index information corresponding to the user according to the cache blocks allocated to the user includes: generating cache index information corresponding to a user; and updating the hash table in the memory according to the cache index information corresponding to the user.
A hash table is stored in the memory of the CPU and is used for registering the correspondence between users and cache index information. In some embodiments, before allocating cache blocks to users, the CPU may divide the video memory into a plurality of cache blocks and establish the hash table based on the cache number index of each cache block. The hash table holds the correspondence between user IDs and cache number indexes, so the corresponding cache number index can be found quickly given a user ID; the process of allocating a cache block by the CPU is thus equivalent to the process of updating the correspondence between the user ID and the cache number index in the hash table.
It should be noted that the memory also holds a queue of free cache block numbers. When a cache block is needed, a number is taken from the head of the queue, which determines the allocated cache block, and the cache block number is added to the hash table to facilitate subsequent lookup by user ID; when a cache block is released, the number to be released is appended to the tail of the queue, and the mapping between the user ID and the cache block number is deleted from the hash table.
Correspondingly, in some embodiments of the present invention, the method for scheduling access to the cache data may further include: and in response to the disconnection of the user, clearing the information of the user in the hash table to release the cache block occupied by the user. Correspondingly, when the user disconnects, the number of the cache block released by the user needs to be added to the queue storing the free cache block number.
Exemplarily, assuming that the user a is disconnected, the correspondence between the user a and the cache number index a is deleted in the hash table, and the cache number index a may be re-added to the queue as a free cache block. In the subsequent allocation process, the CPU can call the cache number index a from the queue and add the correspondence between the cache number index a and the new user ID to the hash table, so as to allocate the cache block corresponding to the cache number index a to the newly accessed user.
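The allocation and release bookkeeping described above can be illustrated with a minimal Python sketch (the class and all names are hypothetical, not the patented implementation): the hash table is modeled as a dict from user ID to cache number indexes, and the free blocks as a queue.

```python
from collections import deque

class CacheAllocator:
    """Hypothetical CPU-side bookkeeping: hash table (dict) + free-block queue."""

    def __init__(self, num_blocks):
        # Queue of free cache block numbers; the blocks themselves live in video memory.
        self.free_blocks = deque(range(num_blocks))
        # Hash table registering user ID -> list of cache number indexes.
        self.user_index = {}

    def on_user_access(self, user_id):
        # Allocate one cache block for a newly accessed user.
        if not self.free_blocks:
            raise RuntimeError("no free cache block in video memory")
        block_no = self.free_blocks.popleft()      # take a number from the queue head
        self.user_index[user_id] = [block_no]      # register user -> cache number index
        return block_no

    def on_user_disconnect(self, user_id):
        # Release the user's blocks: clear the hash-table entry, return numbers to the queue tail.
        for block_no in self.user_index.pop(user_id, []):
            self.free_blocks.append(block_no)

# Example: with a single block, user A's freed block number is reused by user B.
alloc = CacheAllocator(num_blocks=1)
a = alloc.on_user_access("user_A")
alloc.on_user_disconnect("user_A")
b = alloc.on_user_access("user_B")
assert a == b == 0
```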
In step 203, users belonging to the same batch processing task are grouped.
In the embodiment of the invention, batch processing refers to putting a plurality of similar calculation tasks together for calculation, so that model data read at a single time can be fully utilized, and the calculation performance of a GPU can be fully utilized.
Because the length of each user's cache data differs, the cache lengths of the cache data of users in the same batch processing task may differ greatly; therefore, the users belonging to the same batch processing task can be grouped, and the GPU performs batch processing on the cache data of the users within each group.
It should be noted that, if the cache lengths of the users in the same batch processing task are already consistent, the users in the same batch processing task may be directly divided into one group.
In step 204, based on the cache index information of the users, a cache number list corresponding to each group of users is generated and stored in the video memory.
The plurality of users are divided into the same group, which indicates that when the GPU executes the batch processing task of the group of users, cache blocks of all the users in the group need to be accessed, so that the CPU generates a cache number list in advance according to the cache index information of the group of users and stores the cache number list in the video memory for the GPU to read. The cache number list includes a cache number index of a cache block occupied by each user in the group, and may be used to instruct the GPU to access the cache block of each user in the group.
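As a concrete sketch (illustrative Python with assumed names; the copy to video memory is represented only by returning the lists), the CPU-side generation of these per-group lists could look like:

```python
def build_cache_number_lists(groups, user_index):
    """groups: list of lists of user IDs belonging to one batch task.
    user_index: dict user ID -> list of cache number indexes (the hash table).
    Returns one cache number list per group, ready to be copied to video memory."""
    lists = []
    for group in groups:
        numbers = []
        for user_id in group:
            numbers.extend(user_index[user_id])   # all blocks occupied by this user
        lists.append(numbers)
    return lists

# The GPU would then address video memory block by block for each number in the list;
# here we only print the lists as a stand-in for the copy to video memory.
if __name__ == "__main__":
    user_index = {"u1": [0], "u2": [2, 3], "u3": [1]}
    print(build_cache_number_lists([["u1", "u2"], ["u3"]], user_index))
```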
Illustratively, whenever the GPU needs to read cache data at a given layer during computation, it can go to the corresponding cache block positions according to the cache number list in the video memory and fetch the relevant cache data for computation. In this process, the GPU finds the cache blocks to be accessed directly from the cache number indexes in the cache number list, without allocating resources to look up the storage locations of the cache data.
According to the cache data access scheduling method of the embodiment of the invention, a corresponding cache block can be allocated in the video memory for each user, and the cache index information of the cache block occupied by the user is associated with the user, so that the user is bound to the information of its cache blocks. When a batch processing requirement arises, a cache number list can be generated according to the grouping result of the users belonging to the same batch processing task and the cache index information of the users in the same group, and each cache number index in the list guides the GPU to access and call the cache data in the corresponding cache block. This process not only saves the time consumed by reading data from main memory in the traditional GPU computing flow, but also allows the access and scheduling operations of the GPU to be planned and controlled as a whole, achieving efficient and reliable cache management, significantly reducing the difficulty of managing the cache in the video memory, and guaranteeing the efficiency and stability of the GPU computing process.
In practical application, even if a user has established a connection with one GPU, it cannot be guaranteed that all tasks generated by the user can be calculated in the GPU in time, and when the GPU is congested, the response time of the tasks may be long, so that in the practical application process, the user who has established the connection may need to be dispatched from one GPU to another GPU.
In this case, the access scheduling method for the cache data according to some embodiments of the present invention may transmit the cache data in the cache block occupied by the user from the video memory of the first GPU to the memory and then to the video memory of the second GPU in response to the GPU scheduling signal. Wherein the GPU scheduling signal is a signal indicating that a user is scheduled from a first GPU to a second GPU.
That is, in the scheduling process of the user, the memory of the CPU plays a role of a transfer station, and bridges the transmission of the cache data of the user between the two video memories.
After the user cache data is transferred successfully, the cache index information corresponding to the user in the hash table needs to be updated in time.
It should be noted that, since the interval time between two segments of a user is sufficient to complete two transmission tasks, the task delay of the user is not affected by the scheduling operation.
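As an illustration only (assuming hypothetical transfer helpers: copy_device_to_host and copy_host_to_device stand in for whatever device-to-host and host-to-device copy API the platform provides), the relay through main memory might be sketched as:

```python
def reschedule_user(user_id, user_index_gpu1, user_index_gpu2,
                    copy_device_to_host, copy_host_to_device,
                    alloc_blocks_on_gpu2):
    """Move one user's cached data from GPU 1 to GPU 2 via main memory (the transfer station)."""
    old_blocks = user_index_gpu1.pop(user_id)                    # blocks on the first GPU
    staged = [copy_device_to_host(1, b) for b in old_blocks]     # video memory -> main memory
    new_blocks = alloc_blocks_on_gpu2(len(staged))               # allocate blocks on the second GPU
    for data, b in zip(staged, new_blocks):
        copy_host_to_device(2, b, data)                          # main memory -> video memory
    user_index_gpu2[user_id] = new_blocks                        # update the hash table afterwards
    return new_blocks

# Toy demo with in-memory stand-ins for the two video memories.
gpu1, gpu2 = {0: b"cache-of-u1"}, {}
idx1, idx2 = {"u1": [0]}, {}
reschedule_user("u1", idx1, idx2,
                copy_device_to_host=lambda dev, b: gpu1.pop(b),
                copy_host_to_device=lambda dev, b, d: gpu2.__setitem__(b, d),
                alloc_blocks_on_gpu2=lambda n: list(range(n)))
assert idx2 == {"u1": [0]} and gpu2[0] == b"cache-of-u1"
```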
In some embodiments of the present invention, the CPU may divide the video memory into cache blocks before allocating them to users. Unlike cache blocks in main memory, which can simply use dynamic lengths, cache blocks in the video memory need to be given a dedicated length mechanism.
In the process of continuously establishing and closing connections, if a dynamic-length mechanism were adopted, the video memory would be cut into more and more fragments, so that when a subsequent user accesses, the CPU can hardly allocate video memory space for that user, or needs a long time to search for a space of suitable length, which is detrimental to the performance of the GPU.
The following is an exemplary illustration of how cache blocks may be partitioned according to some embodiments of the present invention.
In the cache block division of this embodiment, the length of a cache block is set to a fixed preset value; accordingly, the video memory space occupied by each user is an integer number of cache blocks.
Further, as a user's cache data accumulates over time, its length may exceed the length of one cache block; at this time, the CPU allocates an additional cache block to the user to store the data.
Specifically, the cache index information of the user further includes: the number index and the input length index are additionally cached.
When the user is a newly accessed user, the input length index is generated in response to the user's input data, and whether an additional cache block needs to be allocated to the user is judged according to the input length. If the input length is greater than the fixed preset value, indicating that a single cache block is not enough to store the user's input data, an additional cache block is allocated for the user, an additional cache number index is generated according to the additional cache block, and the input length index and the additional cache number index are added to the cache index information of the user.
When the user is an already accessed user, an input length index is generated in response to the user's input data; the sum of the input length and the cache length is calculated, and if this sum is greater than the fixed preset value, an additional cache block is allocated for the user, an additional cache number index is generated according to the additional cache block, and the input length index and the additional cache number index are added to the cache index information of the user.
It should be noted that, in the above process, the cache length of a newly accessed user may also be regarded as 0, so that whether an additional cache block needs to be allocated can likewise be determined by calculating the sum of the input length and the cache length: if the sum is greater than the fixed preset value, an additional cache block is allocated to the newly accessed user and an additional cache number index is generated according to it.
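For illustration, a small helper (assumed names; BLOCK_LEN stands in for the fixed preset value) can derive how many blocks a user needs from the sum of the cache length and the input length:

```python
import math
from collections import deque

BLOCK_LEN = 64  # hypothetical fixed preset cache block length

def update_user_blocks(cache_len, input_len, owned_blocks, allocate_block):
    """Allocate additional cache blocks when cache_len + input_len outgrows the owned blocks.
    For a newly accessed user, cache_len is simply 0."""
    needed = max(1, math.ceil((cache_len + input_len) / BLOCK_LEN))
    extra_indexes = []
    while len(owned_blocks) < needed:
        b = allocate_block()            # take a free block number from the queue
        owned_blocks.append(b)
        extra_indexes.append(b)         # these become the additional cache number indexes
    return extra_indexes                # added to the user's cache index information

# Example: 50 cached frames plus 30 new frames exceed one 64-frame block, so one extra block is taken.
free = deque([1, 2, 3])
owned = [0]
assert update_user_blocks(50, 30, owned, free.popleft) == [1]
```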
Since fixed-length cache blocks are used, it is inevitable that, as shown in fig. 3, the cache blocks occupied by some users' cache data contain invalid areas in the column dimension where the data is empty. In contrast to the valid areas that store cache data, these invalid areas can affect cache correlation operators when the GPU performs batch processing calculation.
In the embodiment of the present invention, the cache correlation operator refers to an operator in which the number of rows of the input data matrix is correlated with the input length, and the number of columns is correlated with the cache length.
Since the number of frames in each user's segment differs, the number of rows occupied by each user's cache data differs; as shown in fig. 4, the CPU needs to calculate the corresponding cache number indexes in advance and transmit them to the GPU, so that each operator can perform the corresponding processing based on these indexes.
If the numbers of columns occupied by different users' cache data differ, additional processing is required, because the number of columns of the input data matrix of a cache correlation operator is related to the cache length. Normally, a zero-padding (0-complementing) operation is performed on the invalid region; however, among the cache correlation operators there are sensitive operators for which zero padding affects the calculation result and makes it wrong. For example, the softmax operator needs to compute ∑e^x; since e^x = 1 when x = 0, padded zeros affect the summation result.
For such sensitive operators, some embodiments of the invention introduce a cache length index into the cache index information; correspondingly, after grouping the users, if the operator is a sensitive operator, a cache length index is generated and added to the user's cache index information. The cache length index indicates the actual length of the cache data to be processed by the sensitive operator, which avoids the situation where the sensitive operator also treats the invalid region as part of the cache data and thereby produces a wrong calculation result.
If the cache correlation operator is a non-sensitive operator, that is, its calculation results before and after zero padding are the same (for example, a matrix multiplication operator), then after grouping the users a zero-padding operation is performed on the invalid region of the cache blocks corresponding to each group of users, so that the cache lengths of all cache data within the same group are consistent.
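The distinction can be checked with a small numerical sketch (plain-Python stand-ins for the GPU operators, for illustration only): zero padding does not change a matrix product over the padded columns, whereas a softmax over the same row must be restricted to the valid cache length, for example via the cache length index.

```python
import math

def softmax(row, valid_len=None):
    # A sensitive operator: padded zeros contribute e^0 = 1 to the sum,
    # so the valid cache length (from the cache length index) must bound the computation.
    n = valid_len if valid_len is not None else len(row)
    exps = [math.exp(x) for x in row[:n]]
    s = sum(exps)
    return [e / s for e in exps]

cached = [2.0, 1.0]            # real cache data of one user
padded = cached + [0.0, 0.0]   # zero-padded up to the group's cache length

assert softmax(padded, valid_len=2) == softmax(cached)   # index-guided: unaffected
assert softmax(padded) != softmax(cached)                # naive: padding corrupts the result

# A non-sensitive operator such as matrix multiplication is unaffected,
# because the padded zeros contribute nothing to the dot product:
weights = [0.5, -1.0, 3.0, 3.0]
assert sum(c * w for c, w in zip(padded, weights)) == sum(c * w for c, w in zip(cached, weights[:2]))
```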
Further, a grouping mode when an operator in the GPU is a cache correlation operator is introduced:
in this embodiment, if the operator in the GPU is a cache correlation operator, the users are grouped according to the number of cache blocks occupied by the users, so that the number of cache blocks occupied by the users in the same group is the same.
Whether the invalid region is handled by the zero-padding operation or the operator calculation is guided by the cache length index, the amount of computation increases if the users are not properly grouped.
Taking the zero-padding operation as an example: if users are not grouped according to the number of cache blocks they occupy and zero padding is performed directly, then in order to keep the cache lengths of users in the same batch consistent, the user with the smallest cache length must be padded up to the largest cache length in the batch. This can cause the proportion of zeros in that user's cache data to surge, and the additionally occupied cache blocks may even consist entirely of zeros, reducing the GPU's data reading speed and increasing the amount of computation.
After the users are grouped according to the number of cache blocks they occupy, the zero-padding operation only fills the remaining area of the cache blocks occupied by the current user, that is, the zero-padding length of each user does not exceed the length of one cache block.
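A sketch of this grouping rule (illustrative Python; the fixed block length BLOCK_LEN and the frame-based cache lengths are assumptions for the example):

```python
import math
from collections import defaultdict

BLOCK_LEN = 64  # hypothetical fixed cache block length

def group_by_block_count(cache_lengths):
    """cache_lengths: dict user ID -> current cache length.
    Users occupying the same number of cache blocks go into one group, so the
    zero padding applied later never exceeds the tail of a single block."""
    groups = defaultdict(list)
    for user_id, length in cache_lengths.items():
        blocks = max(1, math.ceil(length / BLOCK_LEN))
        groups[blocks].append(user_id)
    return list(groups.values())

# Users with 30 and 60 frames (1 block) form one group; 70 and 100 frames (2 blocks) another.
print(group_by_block_count({"u1": 30, "u2": 60, "u3": 70, "u4": 100}))
```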
The above is an explanation of the processing method of the cache-associated operator, and the following is an explanation of the processing method of the cache-unrelated operator.
For a cache independent operator, the number of rows of the input data matrix is related to the input length, while the number of columns is fixed. Therefore, differences between cache blocks in the column dimension do not affect the number of columns of the input data matrix of a cache independent operator; consequently, if the operator in the GPU is a cache independent operator, users belonging to the same batch processing task can be divided into one group and sent to the operator in the GPU for calculation without additional processing.
Based on the above description, the method for grouping users in the same batch provided by some embodiments of the present invention is explained with reference to fig. 5.
Fig. 5 is a flowchart schematically illustrating a method for grouping users in the same batch according to an embodiment of the present invention.
Referring to fig. 5, the method for grouping users in the same batch provided by this embodiment includes:
in step 501, it is determined whether an operator corresponding to the batch processing task is a cache correlation operator.
If yes, sequentially executing steps 502 to 505;
if not, go to step 506.
In step 502, users are grouped according to the number of cache blocks occupied by the users.
After grouping, the number of cache blocks occupied by the users in the same group is the same. In the cache block division process, the length of a cache block is set to a fixed preset value, so the video memory space occupied by each user is an integer number of cache blocks; grouping can therefore be performed according to the number of cache blocks occupied by each user.
In step 503, it is determined whether the operator is a sensitive operator.
If yes, go to step 504;
if not, go to step 505.
In step 504, a cache length index is generated and added to the user's cache index information.
The cache length index reflects the actual length of the user cached data and may indicate the number of columns of the operator input data matrix to avoid being affected by the invalid region.
In step 505, a zero-padding operation is performed on the invalid area of the cache blocks corresponding to each group of users.
The invalid regions in the cache data of the same group are completely filled with zeros, so that each user's cache data fully occupies the cache blocks in which it resides.
In step 506, the users belonging to the same batch processing task are grouped into a group for the GPU to calculate the cache data of the users in the same group.
For ease of understanding, the following description is made with reference to the accompanying drawings.
Taking fig. 6 as an example, fig. 6 schematically provides cache blocks of users at three time points, wherein, at the first time point, users 1 to 4 are grouped into a group, and their tasks are calculated together in the GPU; at a second time point, the user 2 is divided into a group separately, and the user 1, the user 3 and the user 4 are divided into a group for calculation; at a third point in time, user 2 and user 4 form a group, and user 1 and user 3 form a group for calculation.
Note that the blank block in fig. 6 indicates an invalid area in the user cache data.
The detailed operation of steps 501 to 506 has already been described in detail in the foregoing, and will not be further described here.
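Purely as an illustrative composition of the steps above (a Python sketch with assumed names; the padding and index generation are stand-ins for the GPU-side work), the branch structure of steps 501 to 506 could be expressed as:

```python
import math
from collections import defaultdict

BLOCK_LEN = 64  # hypothetical fixed cache block length

def prepare_batch(users, cache_lengths, cache_related, sensitive):
    """Return (groups, per-user cache length indexes) for one batch task (steps 501-506)."""
    if not cache_related:
        return [list(users)], {}                       # step 506: one group, no extra handling

    by_blocks = defaultdict(list)                      # step 502: group by occupied block count
    for u in users:
        by_blocks[max(1, math.ceil(cache_lengths[u] / BLOCK_LEN))].append(u)
    groups = list(by_blocks.values())

    length_index = {}
    for group in groups:
        if sensitive:                                  # steps 503-504: record cache length indexes
            for u in group:
                length_index[u] = cache_lengths[u]
        else:                                          # step 505: pad invalid areas with zeros
            target = max(cache_lengths[u] for u in group)
            for u in group:
                cache_lengths[u] = target              # conceptual stand-in for writing the zeros
    return groups, length_index

print(prepare_batch(["u1", "u2", "u3"], {"u1": 30, "u2": 60, "u3": 70}, True, False))
```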
It should be noted that the method for grouping users in the same batch shown in fig. 5 is only an exemplary illustration in the present invention, and should not be construed as the only limitation to the present invention.
With the above method for grouping users in the same batch, users can be grouped according to the number of cache blocks they occupy, so that users whose cache data falls within the same cache length range are placed in one group before the zero-padding operation is performed; this avoids padding cache data with a smaller cache length with excessive invalid data merely to match cache data with a larger cache length, which would waste computing resources unnecessarily.
Moreover, a cache length index is added to the cache index information, and operators sensitive to the zero-padding operation are guided by the cache length index to perform the corresponding calculation, which effectively prevents data in the invalid region of a cache block from affecting the operator's calculation result.
Furthermore, the same-batch user grouping method can group users according to the number of cache blocks they occupy while guiding operator calculation through the cache length indexes, which alleviates the increase in computation caused by index-guided operator calculation; the grouping thus compensates for the loss of computing efficiency introduced by the indexes during operator calculation.
Exemplary device
Having described the method of the exemplary embodiment of the present invention, an access scheduling processor for caching data of the exemplary embodiment of the present invention will be described with reference to fig. 7.
Fig. 7 schematically shows a block diagram of an access scheduling processor for caching data according to an embodiment of the present invention. As shown in fig. 7, the processor includes:
a cache allocation unit 701 configured to: respond to a user access and allocate a cache block in a video memory for the user;
an index generation unit 702, connected to the cache allocation unit 701 and configured to: generate cache index information corresponding to the user according to the cache block allocated to the user by the cache allocation unit, and store the cache index information to the memory unit;
a user grouping unit 703, connected to the index generation unit 702 and configured to: group users belonging to the same batch processing task to instruct a GPU to batch process the cache data of the users in the same group;
accordingly, the index generation unit 702 is further configured to: generating a cache number list corresponding to each group of users according to the grouping result of the user grouping unit and the cache index information of the users, and storing the cache number list to a video memory;
a memory unit 704 storing a correspondence table between users and cache index information;
wherein the cache index information includes: a cache number index of a cache block occupied by a user; the cache number list is used for indicating the GPU to access the cache block corresponding to each cache number index in the cache number list.
It should be noted that the processor is disposed in the CPU; it performs overall management of the user data stored in the video memory, generates the cache number list according to the storage locations of the cache data, and stores the cache number list into the video memory. The cache index information in the cache number list can direct the GPU to access and schedule the cache data at the corresponding locations, thereby realizing indirect management of the video memory by the CPU and making full use of the CPU's data processing capability.
In correspondence with the foregoing functional embodiment, the embodiment of the present invention further provides an electronic device as shown in fig. 8. Fig. 8 schematically shows a block diagram of the electronic device of the embodiment of the present invention. The electronic device 800 shown in fig. 8 includes: a processor 810; and a memory 820, the memory 820 having stored thereon executable program instructions that, when executed by the processor 810, cause the electronic device to implement any of the methods as previously described.
In the electronic device 800 of fig. 8, only constituent elements related to the present embodiment are shown. Thus, it will be apparent to those of ordinary skill in the art that: electronic device 800 may also include common constituent elements that are different from the constituent elements shown in fig. 8.
Processor 810 may control the operation of electronic device 800. For example, the processor 810 controls the operation of the electronic device 800 by executing programs stored in the memory 820 on the electronic device 800. The processor 810 may be implemented by a Central Processing Unit (CPU), an Application Processor (AP), an artificial intelligence processor chip (IPU), etc., provided in the electronic device 800. However, the present disclosure is not limited thereto. In this embodiment, the processor 810 may be implemented in any suitable manner. For example, the processor 810 may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
The memory 820 may be used to store various data and instructions processed in the electronic device 800. For example, the memory 820 may store processed data and data to be processed in the electronic device 800. The memory 820 may store data sets that have been processed or are to be processed by the processor 810, such as user input data, cache index information, and the like. Further, the memory 820 may store applications, drivers, and the like to be run by the electronic device 800. For example, the memory 820 may store various programs related to task type recognition, operator type recognition, and the like to be executed by the processor 810. The memory 820 may be a DRAM, but the disclosure is not limited thereto. The memory 820 may include at least one of volatile memory or nonvolatile memory. The non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, phase change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), and the like. Volatile memory may include dynamic RAM (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), PRAM, MRAM, RRAM, ferroelectric RAM (FeRAM), and the like. In an embodiment, the memory 820 may include at least one of a hard disk drive (HDD), a solid state drive (SSD), a compact flash (CF) card, a secure digital (SD) card, a micro secure digital (Micro-SD) card, a mini secure digital (Mini-SD) card, an extreme digital (xD) card, caches, or a memory stick.
In summary, specific functions implemented by the memory 820 and the processor 810 of the electronic device 800 provided in the embodiments of the present disclosure may be explained in comparison with the foregoing embodiments in the present disclosure, and technical effects of the foregoing embodiments may be achieved, and are not described herein again.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or a computer-readable storage medium, or a machine-readable storage medium) having stored thereon computer program instructions (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or an electronic device, a server, etc.), cause the processor to perform part or all of the steps of the above-described method according to the present application.
It should be noted that although several means or sub-means of the electronic device are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the means described above may be embodied in a single means; conversely, the features and functions of one means described above may be further divided into and embodied by a plurality of means.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
Use of the verbs "comprise", "include" and their conjugations in this application does not exclude the presence of elements or steps other than those stated in this application. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects is made merely for convenience of description and does not imply that features in those aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (10)

1. An access scheduling method for cache data, comprising:
in response to a user access, allocating a cache block in a video memory for the user;
generating cache index information corresponding to the user according to the cache blocks allocated to the user; the cache index information includes: a cache number index of the cache blocks occupied by the user;
grouping users belonging to the same batch processing task, so as to instruct a Graphics Processing Unit (GPU) to batch process the cache data of users in the same group;
generating, based on the cache index information of the users, a cache number list corresponding to each group of users and storing the cache number list to the video memory; the cache number list is used for instructing the GPU to access the cache block corresponding to each cache number index in the cache number list.
2. The access scheduling method for cache data according to claim 1, wherein the generating of cache index information corresponding to the user according to the cache blocks allocated to the user comprises:
generating cache index information corresponding to a user;
updating a hash table in the memory according to the cache index information corresponding to the user; wherein the hash table is used for registering the correspondence between the user and the cache index information.
3. The access scheduling method for cache data according to claim 1, wherein
the length of the cache block is a fixed preset value;
accordingly, each user occupies N cache blocks, where N is a positive integer.
4. The access scheduling method for cache data according to claim 3, wherein the grouping of users belonging to the same batch processing task comprises:
if the operator in the GPU is a cache-related operator, grouping the users according to the number of cache blocks they occupy, so that users in the same group occupy the same number of cache blocks; wherein the number of rows of the input data matrix of the cache-related operator is related to the input length, and the number of columns is related to the cache length.
5. The access scheduling method for cache data according to claim 4, wherein, after the grouping of users according to the number of cache blocks occupied by the users, the method further comprises:
if the operator is a non-sensitive operator, performing a zero-padding operation on the invalid regions of the cache blocks corresponding to each group of users, so that the cache lengths of the cache data of users in the same group are consistent; wherein the calculation results of the non-sensitive operator before and after the zero-padding operation are the same.
6. The access scheduling method for cache data according to claim 4, wherein
the cache index information further includes: a cache length index;
correspondingly, after grouping the users according to the number of cache blocks occupied by the users, the method further comprises the following steps:
if the operator is a sensitive operator, generating a cache length index and adding the cache length index into the cache index information of the user; wherein the calculation results of the sensitive operator before and after the zero-padding operation are different.
7. The access scheduling method for cache data according to claim 3, wherein
the cache index information further includes: an additional cache number index and an input length index;
correspondingly, the access scheduling method for the cache data further comprises the following steps:
in response to input data of a user, generating the input length index and calculating the sum of the input length and the cache length; if the sum is larger than the fixed preset value, allocating an additional cache block for the user and generating an additional cache number index according to the additional cache block;
and adding the input length index and the additional cache number index into the cache index information of the user.
8. An access scheduling processor for caching data, comprising:
a cache allocation unit configured to: allocate, in response to a user access, a cache block in a video memory for the user;
an index generation unit configured to: generate cache index information corresponding to the user according to the cache blocks allocated to the user, and store the cache index information to a memory unit;
a user grouping unit configured to: group users belonging to the same batch processing task, so as to instruct a GPU to batch process the cache data of users in the same group;
the index generation unit is further configured to: generate a cache number list corresponding to each group of users according to the grouping result of the user grouping unit and the cache index information of the users, and store the cache number list to the video memory;
wherein the memory unit stores a correspondence table between users and cache index information;
wherein the cache index information includes: a cache number index of the cache blocks occupied by a user; and the cache number list is used for instructing the GPU to access the cache block corresponding to each cache number index in the cache number list.
9. An electronic device, comprising:
a processor; and
a memory storing executable program instructions that, when executed by the processor, cause the electronic device to implement the method of any of claims 1-7.
10. A computer-readable storage medium having stored thereon computer program instructions that, when executed by one or more processors, cause the processors to implement the method of any one of claims 1-7.
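The zero-padding and overflow handling recited in claims 5 to 7 can be pictured with the following minimal Python sketch; the block length, array shapes and function names are assumptions made for this illustration, not the claimed implementation.

import numpy as np

BLOCK_LENGTH = 128   # fixed preset length of one cache block (assumed)

def pad_group_to_full_blocks(cache_rows, n_blocks, operator_is_non_sensitive):
    # Claim 5: for a non-sensitive operator, zero-pad the invalid region of each user's
    # cache so every row in the group has length n_blocks * BLOCK_LENGTH.
    if not operator_is_non_sensitive:
        # Claim 6: a sensitive operator instead relies on a cache length index
        # recorded in the cache index information.
        return cache_rows
    width = n_blocks * BLOCK_LENGTH
    return [np.pad(row, (0, width - len(row))) for row in cache_rows]

def needs_additional_block(cache_length, input_length):
    # Claim 7: if the sum of the input length and the cache length exceeds the fixed
    # preset block length, an additional cache block is allocated for the user and an
    # additional cache number index is generated for it.
    return cache_length + input_length > BLOCK_LENGTH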
CN202211379879.6A 2022-11-04 2022-11-04 Cache data access scheduling method, processor, electronic device and storage medium Pending CN115686855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211379879.6A CN115686855A (en) 2022-11-04 2022-11-04 Cache data access scheduling method, processor, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211379879.6A CN115686855A (en) 2022-11-04 2022-11-04 Cache data access scheduling method, processor, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115686855A true CN115686855A (en) 2023-02-03

Family

ID=85049194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211379879.6A Pending CN115686855A (en) 2022-11-04 2022-11-04 Cache data access scheduling method, processor, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115686855A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117850730A (en) * 2024-03-08 2024-04-09 青岛罗博数码科技有限公司 Method and system for displaying pictures by intelligent pen box
CN117850730B (en) * 2024-03-08 2024-05-28 青岛罗博数码科技有限公司 Method and system for displaying pictures by intelligent pen box

Similar Documents

Publication Publication Date Title
US10467152B2 (en) Dynamic cache management for in-memory data analytic platforms
US10572383B2 (en) Caching a block of data in a multi-tenant cache storage device based on space usage boundary estimates
US10204175B2 (en) Dynamic memory tuning for in-memory data analytic platforms
JP5516744B2 (en) Scheduler, multi-core processor system, and scheduling method
CN111124951B (en) Method, apparatus and computer program product for managing data access
US11150949B2 (en) Resource release method, resource allocation method, devices, and computer program products
US11366758B2 (en) Method and devices for managing cache
US20190188239A1 (en) Dual phase matrix-vector multiplication system
US9547520B1 (en) Virtual machine load balancing
US11928580B2 (en) Interleaving memory requests to accelerate memory accesses
US20180004698A1 (en) Network-accessible data volume modification
CN114911596B (en) Scheduling method and device for model training, electronic equipment and storage medium
US8566532B2 (en) Management of multipurpose command queues in a multilevel cache hierarchy
US20160210171A1 (en) Scheduling in job execution
US10204060B2 (en) Determining memory access categories to use to assign tasks to processor cores to execute
CN111124270A (en) Method, apparatus and computer program product for cache management
CN115543965A (en) Cross-machine-room data processing method, device, storage medium, and program product
CN115686855A (en) Cache data access scheduling method, processor, electronic device and storage medium
CN118012788A (en) Data processor, data processing method, electronic device, and storage medium
US10754773B2 (en) Selection of variable memory-access size
US9189405B2 (en) Placement of data in shards on a storage device
CN108228323B (en) Hadoop task scheduling method and device based on data locality
Chen et al. Data prefetching and eviction mechanisms of in-memory storage systems based on scheduling for big data processing
JP5776813B2 (en) Multi-core processor system, control method and control program for multi-core processor system
CN111459402A (en) Magnetic disk controllable buffer writing method, controller, hybrid IO scheduling method and scheduler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination