CN115022405B - Intelligent cache acceleration system and method of deep learning cloud platform - Google Patents
- Publication number
- CN115022405B CN115022405B CN202210957648.2A CN202210957648A CN115022405B CN 115022405 B CN115022405 B CN 115022405B CN 202210957648 A CN202210957648 A CN 202210957648A CN 115022405 B CN115022405 B CN 115022405B
- Authority
- CN
- China
- Prior art keywords
- cache
- data set
- node
- path
- component
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45595—Network integration; Enabling network access in virtual machine instances
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an intelligent cache acceleration system and method for a deep learning cloud platform. The intelligent cache acceleration system of the deep learning cloud platform comprises: a data set use frequency statistics component, a multi-level cache automatic deployment component, a mounted data set path component, and a data-set-aware cache optimization scheduling component. The cache optimization scheduling component must be adapted to the specific scheduler used by the platform; the remaining parts are independent components.
Description
Technical Field
The invention relates to the field of computers, deep learning platforms and cache systems, in particular to an intelligent cache acceleration system and method of a deep learning cloud platform.
Background
With the rapid development of deep learning technology, the computing power demanded by model training has grown steadily, and deep learning cloud platforms have come into public view. At present, deep learning platforms are largely containerized: training tasks run on the platform as Docker containers. In deep learning, a training task must read a large-capacity data set from the back end, and such data sets consist mainly of large numbers of small files, such as pictures, so the I/O throughput of the storage system becomes the performance bottleneck of the training task. To solve this problem, a cache system is needed that improves I/O performance by caching the data set locally (on the compute nodes), thereby speeding up model training.
For the problem of low read/write I/O performance of applications in scenes with large numbers of small files, a solution is proposed in the classic paper "Finding a needle in Haystack: Facebook's photo storage", and several implementations, such as the open-source distributed file system SeaweedFS, are based on that paper. The approach aggregates a large number of small picture files (Needles) into a large back-end file (Volume) and reads a specified file through the Volume ID and the logical offset of the small file. This solution has some problems. First, performance is poor: after SeaweedFS is mounted as a shared file system through a filer, the read/write I/O performance loss is significant because an access-protocol conversion is involved. Second, integration with existing systems is difficult: SeaweedFS is an independent distributed storage whose files must be accessed through URLs, similar to object storage, which is inconsistent with widely used deep learning frameworks (TensorFlow, PyTorch, etc.); these frameworks read and write data set files through the file system, and the object-storage mode of SeaweedFS intrudes on users' read/write logic. Addressing these shortcomings, the invention provides an intelligent cache acceleration system and method of a deep learning cloud platform with a multi-level cache system, which fully fits the working process of a containerized deep learning platform; the cache behavior is completely transparent to users. Moreover, the invention can automatically cache frequently used hot-spot data sets using the various media of a node (i.e., a physical server), such as memory and SSD (solid-state disk), achieving efficient read/write acceleration.
Disclosure of Invention
In order to overcome the above technical defects in the prior art and to realize monitoring of data set use frequency, an automatic data set cache deployment strategy, and data set affinity optimization in scheduling, the invention provides an intelligent cache acceleration system and method of a deep learning cloud platform.
In one aspect, an intelligent cache acceleration system of a deep learning cloud platform is provided, the system comprising: a data set use frequency statistics component, a multi-level cache automatic deployment component, a mounted data set path component, and a data-set-aware cache optimization scheduling component. The cache optimization scheduling component must be adapted to the specific scheduler used by the platform; the remaining parts are independent components.
The data set use frequency statistics component is used to find the data sets most frequently used by users, so as to determine the objects of cache deployment. The component is configured to first use an administrator account to obtain a user authentication token (i.e., a string of characters used by the system to confirm whether an accessing user has the corresponding authority) and request the configuration information (job_config) of all training tasks from the platform. The start time of a training task can be obtained from the job_config, and all training tasks submitted in a specified time period (for example, the past 7 days by default) can be filtered out as required;
the configuration information (jobconfig) includes the following information: data set information and detailed information of a plurality of data sets used by the training task; wherein one training task may use one or more data sets. The structure of the data set detailed information further includes a user ID to which the data set belongs, a boolean value (representing whether the data set is public or not), a 'dataseitid' field (representing a data set name used by the task), and a 'mountPath' field (recording a mounting path of the data set in a Docker container of a training task); thus, a data set name list used by the training task can be obtained from the configuration information (jobconfig) filtered according to time; in the obtained list, the using times of each data set are obtained through accumulation, and then sorting is carried out, so that the using frequency of each data set in a specified time period can be counted.
The multi-level cache automatic deployment component is used to cache data sets at multiple levels. The multi-level cache uses two storage media of the compute nodes: SSD and memory. For SSD, the cached data set is copied directly to a cache path on the SSD. For memory, because the SSD space of a compute node is limited, a cache path is mapped into memory using tmpfs (a memory-based file system in which all files are stored in memory, with much higher performance than a disk-based file system), so that data sets can be cached under that path just as with the SSD; the difference is that they occupy memory space and deliver better read/write performance;
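As a hedged illustration of the memory level, a deployment script might prepare the tmpfs cache path with a standard mount command before copying data sets into it; the mount point and size below are assumptions, not values from the invention:

```python
# Hedged sketch (not from the patent): preparing the memory-level cache path
# with tmpfs. The mount point and size are illustrative assumptions.
def tmpfs_mount_cmd(cache_path="/tmpfs/cache", size_gb=40):
    # tmpfs keeps every file in RAM, so anything copied under cache_path
    # is later served from memory instead of disk.
    return ["mount", "-t", "tmpfs", "-o", f"size={size_gb}G", "tmpfs", cache_path]

# A deployment script would execute this (as root), e.g. with
# subprocess.run(tmpfs_mount_cmd(), check=True), then copy hot data sets in.
print(" ".join(tmpfs_mount_cmd()))
```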
the automatic multi-level cache deployment component is configured to firstly obtain the data set use frequency counted by the data set use frequency counting component, calculate the gain value of each data set according to the data set use frequency and the data set capacity (namely, the data set is deployed as an expected acceleration effect brought by the cache, the larger the data set capacity is, the higher the use frequency is, the better the acceleration effect is), convert the cache deployment into a 0-1 knapsack problem (namely, the node cache capacity is of knapsack size, the data set must be completely cached on the node, and only a part of a certain data set cannot be cached), and calculate the data set to be cached by each computing node and a corresponding cache path by using a dynamic programming method. And subsequently copying the corresponding data set to the cache path of the corresponding node through a remote execution tool (such as an idle) of the Linux platform. The component is set to run periodically by means of a periodic execution tool (e.g., crontab) under the Linux platform, and may be configured to run every 7 days by default, for example, and periodically update the data set cache on the compute node.
The mounted data set path component is used to mount storage paths on the compute nodes into the Docker container of the training task (also called the task container or training container) for use by the computing task. It includes a data set path judging component, which determines, before the data set cache is mounted into the Docker container, whether a cached data set exists on the server to which the training task is scheduled, and mounts the corresponding data set storage path into the training task container. After multiple cache paths are added on the compute nodes, the node must be provided with data set cache information, and whether the data set needed by a task is cached must be determined before mounting; this cache information is maintained in a dataset_cache.json file.
The data set path judging component is configured to be called in the task-container startup script; it judges the data set cache by reading the data set cache information. If the data set specified by the task exists in the specified file (e.g., a JSON file), the corresponding cache path is mounted directly onto the data set path in the container; if it does not exist, the data set is not cached on the local node, the remote NFS (Network File System) storage path in the task configuration is mounted by default, and the data set is read from NFS. This achieves cache path mapping that is transparent to the user.
The optimization scheduling component preferentially schedules training tasks to nodes that cache the corresponding data sets, according to the cache information of each node. It is configured as follows. First, a table dataset_cache_detail represents the data set caching state of each node, with the fields: node IP address and cached data set name. Second, the scheduling logic of the scheduler used by the platform is modified: a query-node-cache function is added to the scheduler code, which obtains the data set information cached on each node by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes and returns a list of nodes containing the corresponding data set cache (when no node contains the specified data set cache, the screening step fails and is skipped). It should be noted that the screening produced in this step is only a scheduling suggestion; the actual mount still depends on the node's own data set caching state, that is, when the local cache information in dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information prevails; this is the function realized by the mounted data set path component described above.
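The node-screening preference can be sketched as follows; the dataset_cache_detail table is mocked as a list of (node IP, cached data set name) rows, and the fallback when no node holds the cache mirrors the "screening fails, normal scheduling" behavior described above:

```python
# Hedged sketch of the node-screening preference. The dataset_cache_detail
# table is mocked as a list of (node_ip, dataset_name) rows; in the real
# scheduler this would be a database query.
def screen_nodes(candidate_nodes, cache_rows, needed_dataset):
    cached_on = {ip for ip, ds in cache_rows if ds == needed_dataset}
    preferred = [n for n in candidate_nodes if n in cached_on]
    # If no node caches the data set, the screening step is skipped and the
    # full candidate list is returned for normal scheduling.
    return preferred if preferred else candidate_nodes

rows = [("10.0.0.1", "imagenet"), ("10.0.0.2", "coco")]
print(screen_nodes(["10.0.0.1", "10.0.0.2"], rows, "imagenet"))  # ['10.0.0.1']
```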
In another aspect, an intelligent cache acceleration method for a deep learning cloud platform is provided, which includes:
S01, acquiring configuration information (job_config) and counting data set use frequency:
the data set use frequency statistics component is used to find the data sets most frequently used by users, so as to determine the objects of cache deployment. The component is configured to first use an administrator account to obtain a user authentication token and request the configuration information (job_config) of all training tasks from the platform, and then filter out the training tasks submitted in a specified period (e.g., the past 7 days by default) from all job_config entries;
S02, calculating data set gains from the data set use frequency, computing node cache deployment by dynamic programming, completing cache deployment, and obtaining node cache data set information:
first, the data set use frequency counted by the statistics component is obtained, and a gain value is calculated for each data set from its use frequency; cache deployment is converted into a 0-1 knapsack problem (the node cache capacity is the knapsack size; a data set must be cached on a node in its entirety, and caching only part of a data set is not allowed), and the data sets that each compute node should cache, together with the corresponding cache paths, are computed by dynamic programming. The corresponding data sets are subsequently copied to the cache paths of the corresponding nodes through a remote execution tool (e.g., Ansible). The component is set to run periodically by crontab; for example, it may be configured by default to run every 7 days, updating the data set caches on the compute nodes;
S03, the scheduler queries the node cache data set information and performs screening:
the scheduling logic of the scheduler used by the platform is modified: a query-node-cache function is added to the scheduler code, which obtains the data set information cached on each node by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes and returns a list of nodes containing the corresponding data set cache (when no node contains the specified data set cache, the screening step fails and is skipped). It should be noted that the screening produced in this step is only a scheduling suggestion; the actual mount still depends on the node's own data set caching state, that is, when the local cache information in dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information prevails; this is the function realized by the mounted data set path component.
S04, judging whether a node contains the data set used by the task:
this function is implemented by modifying the scheduler code: a function for judging whether a node cache exists is added, which queries the dataset_cache_detail table to obtain the data set cache information of all nodes in the cluster. It then checks whether the node list contains nodes caching the data set required by the training task; if so, the list of nodes containing the required data set cache is filtered out, and the training task is scheduled to a node containing the cache. If no node contains the required data set cache, the screening process is ignored and normal scheduling is performed;
S05, reading a cache configuration file, which contains the node cache data set information;
S06, judging the data set path according to the cache configuration file;
judging the data set cache comprises: calling the data set path judging component in the task-container startup script and judging the data set cache by reading the data set cache information. If the data set specified by the task exists in the specified file (e.g., a JSON file), the corresponding cache path is mounted directly onto the data set path in the container; if it does not exist, the data set is not cached on the local node, the remote NFS (Network File System) storage path in the task configuration is mounted by default, and the data set is read from NFS. This achieves cache path mapping that is transparent to the user.
The data set path judgement is realized by reading a cache-record JSON file on the node; this file records the data set cache information actually present on the node, including the data set name, cache type, and cache path. After reading the JSON file, the program queries the name of the data set used by the task. If the data set is not in the JSON file, the cache does not exist, and the remote NFS storage path is returned to the training task; if the data set exists in the JSON file with cache type SSD, the cache is on the SSD, and the corresponding local SSD storage path is returned to the training task; if the data set exists in the JSON file with cache type Memory, the cache is in memory, and the local tmpfs storage path is returned to the training task;
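Step S06 can be sketched against the dataset_cache.json structure shown later in the embodiment (fields "name", "cachetype", "target_path"; cachetype 1 = memory, 2 = SSD); the NFS fallback path argument is an assumed example:

```python
import json

# Sketch of the S06 path judgement. It returns the recorded local cache path
# (tmpfs or SSD) when the data set appears in dataset_cache.json, otherwise
# the remote NFS path. The NFS paths below are illustrative assumptions.
def resolve_dataset_path(cache_info, dataset_name, nfs_path):
    for entry in cache_info.get("dataset_cache", []):
        if entry["name"] == dataset_name:
            # Cached locally: return the tmpfs or SSD path recorded on the node.
            return entry["target_path"]
    # Not cached on this node: fall back to the remote NFS storage path.
    return nfs_path

cache_info = json.loads("""{"dataset_cache": [
  {"name": "VOC2007", "cachetype": 1, "target_path": "/tmpfs/cache/VOC2007"},
  {"name": "MSCOCO", "cachetype": 2, "target_path": "/opt/cache/MSCOCO"}]}""")
print(resolve_dataset_path(cache_info, "VOC2007", "/nfs/VOC2007"))  # /tmpfs/cache/VOC2007
print(resolve_dataset_path(cache_info, "mnist", "/nfs/mnist"))      # /nfs/mnist
```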
and S07, the training task in the container reads the mount path from step S06 and starts training.
On a containerized machine learning training platform, all training tasks run as Docker containers. To access files on the physical machine from inside a container, a physical path must be mounted to a mount point inside the container, so that external files can be accessed in the container through that mount-point path. The data set path judging function returns the corresponding data set path information to the training task, and the physical path is mounted onto the data set path in the container, so that the training task can read the data set cache on the node for training.
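For illustration only, the mount described above corresponds to a standard Docker bind mount (-v host_path:container_path); the read-only flag and image name below are assumptions, not part of the invention:

```python
# Illustration only: build the Docker bind-mount arguments that map the
# resolved cache (or NFS) path onto the in-container data set path (/data in
# the description). The ":ro" read-only flag and image name are assumptions.
def docker_mount_args(physical_path, container_path="/data"):
    # Bind-mount the physical path so training code reads it transparently.
    return ["-v", f"{physical_path}:{container_path}:ro"]

cmd = ["docker", "run"] + docker_mount_args("/opt/cache/MSCOCO") + ["pytorch:1.3-py3"]
print(" ".join(cmd))
```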
The beneficial effects of the invention are as follows: the multi-level cache system makes full use of various idle storage media on the compute nodes and can significantly improve the read/write I/O performance of deep learning training tasks. Moreover, the system matches the working process of current mainstream containerized deep learning platforms: tasks are preferentially dispatched to compute nodes with a local cache in scheduling, the data set cache is completely transparent to users, and no change to training code is required.
Drawings
The advantages and spirit of the present invention can be further understood by the following detailed description of the invention and the accompanying drawings.
FIG. 1 is a block diagram of the structure of an intelligent cache acceleration system of a deep learning cloud platform according to the present invention;
FIG. 2 is a processing flow chart of the intelligent cache acceleration method of the deep learning cloud platform according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to fig. 1-2 in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the invention provides an intelligent cache acceleration system of a deep learning cloud platform. FIG. 1 is a structural block diagram of the intelligent cache acceleration system. Based on the design of current mainstream containerized deep learning cloud platforms, the system comprises: a data set use frequency statistics component, a multi-level cache automatic deployment component, a mounted data set path component, and a data-set-aware cache optimization scheduling component. The cache optimization scheduling component must be adapted to the specific scheduler used by the platform; the remaining parts are independent components.
The data set use frequency statistics component analyzes the log information of training tasks to obtain the number of uses of each data set within a specified time period, thereby determining the data sets to be deployed subsequently as caches.
The multi-level cache automatic deployment component calculates the set of data sets with the highest gain by a dynamic programming algorithm according to data set use frequency and actually deploys those data sets to the idle memory and SSD space of the servers.
The mounted data set path component judges whether the data set used by a training task has a local cache and mounts the corresponding physical path into the Docker container of the training task, ensuring that the data set can be accessed by the training task.
The optimization scheduling component schedules training tasks, as far as possible, to servers containing the required data set cache, so that the acceleration effect of the data set cache is maximized.
The intelligent cache acceleration system of the deep learning cloud platform is described in detail below in combination with the specific configurations of the components.
To find the data sets most frequently used by users and determine the objects of cache deployment, a data set use frequency statistics component is developed. The component is a Python script: it first uses an administrator account to obtain a user authentication token (token) and requests the configuration information (job_config) of all training tasks from the platform; the training tasks submitted within a specified period (the past 7 days by default) are then filtered out of all the job_config entries. An example of the job_config format is as follows:
{'apiVersion': '2.0', 'code': {'timeStamp': '1609894565523', 'version': '1', 'projectId': 'slowfast', 'mountPath': '/code'}, 'output': {'jobId': 1}, 'datasets': [{'owner': '_386152087e494e7ea022f90333f985aa', 'share': True, 'token': '', 'datasetId': 'imagenet', 'mountPath': '/data/linshiqi047/imagenet'}, {'owner': '_20d1931aad6549309295da81e5e1fc6c', 'share': False, 'token': '', 'datasetId': 'energy', 'mountPath': '/data/zhaoliu/energy'}], 'retryCount': 0, 'models': [], 'kind': 'execution', 'taskRoles': [{'gpuType': 'titanxp', 'cpuNumber': 2, 'command': 'e', 'taskNumber': 1, 'taskName': 'Task1', 'shmMB': 8192, 'image': '10.11.3.8:5000/bitahub/pytorch:1.3-py3', 'memoryMB': 16384, 'gpuNumber': 1}], 'jobName': 'slowfast-cmrn7q'}
The above configuration information is standard JSON in key-value form. The system is concerned with the 'datasets' item (field), which stores the data set information used by the training task. The value of 'datasets' is a list that may contain the details of multiple data sets (2 data sets in the example above), meaning that one training task may use one or more data sets. In the detailed data set structure, 'owner' indicates the user ID to which the data set belongs; 'share' is a Boolean indicating whether the data set is public; the 'datasetId' field is the name of the data set used by the task; and 'mountPath' records the mount path of the data set in the training container. The names of the data sets used are then taken from all the obtained job_config entries; accumulating the number of uses of each data set and sorting yields the use frequency of each data set in the specified time period.
Automatic deployment of the multi-level cache adopts a mechanism of caching data sets at multiple levels, using two storage media of the compute nodes: SSD and memory. For SSD, the cached data set is copied directly to a cache path on the SSD. Because the SSD space of a compute node is limited, a cache path is also mapped into memory using tmpfs (a memory-based file system in which all files are stored in memory, with much higher performance than a disk-based file system), so that data sets can be cached under that path just as with the SSD; the difference is that they occupy memory space and deliver better read/write performance. This component is also written in Python. It first obtains the data set use frequency counted by the previous component and calculates a gain value for each data set from its use frequency; cache deployment is converted into a 0-1 knapsack problem (the node cache capacity is the knapsack size; a data set must be cached on a node in its entirety, and caching only part of a data set is not allowed), and the data sets that each compute node should cache, together with the corresponding cache paths, are computed by dynamic programming. The corresponding data sets are subsequently copied to the cache paths of the corresponding nodes through a remote execution tool (e.g., Ansible). The component is set to run periodically by crontab and is configured by default to run every 7 days, updating the data set caches on the compute nodes.
On the containerized deep learning platform, all training tasks run as Docker containers, which means that a storage path on a compute node must be mounted into the Docker container to be used by the computing task. Under the cache deployment mechanism designed by this scheme, the data set caches deployed on different servers in the deep learning platform cluster differ; thus, the node to which a training task is scheduled does not necessarily hold the corresponding data set cache, and it is necessary to determine whether the cache exists on the server before mounting it into the Docker container. In the container, the data set path is mounted as /data (i.e., a local SSD storage path or a local tmpfs storage path is mounted). After multiple cache paths are added on the compute nodes, data set cache information must be added on each node, and whether the data set needed by a task is cached must be determined before mounting. The cache information of the data sets is maintained in a dataset_cache.json file, an example of which is as follows:
{"dataset_cache": [{"name": "MSCOCO", "cachetype": 2, "target_path": "/opt/cache/MSCOCO"}, {"name": "GLOVE-6B", "cachetype": 2, "target_path": "/opt/cache/GLOVE-6B"}, {"name": "VOC2007", "cachetype": 1, "target_path": "/tmpfs/cache/VOC2007"}, {"name": "GLOVE-42B-300D", "cachetype": 2, "target_path": "/opt/cache/GLOVE-42B-300D"}], "local_info": {"hostname": "GA010", "cache_size_ssd": "100GB", "cache_size_mem": "40GB"}}
This JSON file records all data set caches deployed on the server; each entry records the information of one data set cache. The present component is mainly concerned with the fields "name", "cachetype", and "target_path": "name" is the data set name; "cachetype" indicates the cache location of the data set, where a value of 1 means the cache is in memory and a value of 2 means the cache is on the SSD; and "target_path" records the storage path of the data set cache on the server.
The data set path judging component is used to judge whether a cached data set exists on the server to which the training task is scheduled and to mount the corresponding data set storage path into the training task container. The component is implemented in Python and is called in the task-container startup script; it judges the data set cache by reading the dataset_cache.json file. If the data set specified by the task exists in the JSON file, the corresponding cache path is mounted directly onto /data in the container (i.e., a local SSD storage path or a local tmpfs storage path is mounted); if not, the data set is not cached on the local node, the remote NFS storage path in the task configuration is mounted by default, and the data set is read from NFS. This achieves cache path mapping that is transparent to the user.
The components above ensure that a training task uses the local data set cache after being scheduled onto a compute node; the scheduling optimization component, in turn, preferentially schedules training tasks to nodes that cache the corresponding data sets, according to the cache information of each node. First, a table dataset_cache_detail is maintained in the database to represent the data set caching state of each node; its fields are: node IP address and cached data set name. The scheduling logic of the scheduler used by the platform is then modified: a query-node-cache function is added to the scheduler's code, which obtains the data set information cached on each node by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes, returning a list of nodes that contain the corresponding data set cache (when no node contains the specified data set cache, the screening has no effect). Note that the screening produced in this step is only a scheduling suggestion; whether to mount a cache is still decided from the node's own data set caching state. That is, when the local cache information in dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information takes precedence, which is the function realized by the previous component.
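The node-screening preference described above might be sketched as follows. The in-memory list of (node IP, data set name) rows stands in for the dataset_cache_detail database table, and the function name is an assumption:

```python
def filter_nodes_by_cache(candidate_nodes, dataset_name, cache_table):
    """Prefer nodes whose dataset_cache_detail rows list the required data set.

    cache_table: iterable of (node_ip, cached_dataset_name) rows, standing in
    for the dataset_cache_detail database table.  Returns the preferred node
    list, or the original candidates unchanged when no node caches the data
    set -- the screening is only a scheduling suggestion, never a hard filter.
    """
    cached_nodes = {ip for ip, name in cache_table if name == dataset_name}
    preferred = [n for n in candidate_nodes if n in cached_nodes]
    return preferred if preferred else candidate_nodes
```

Keeping the fallback to the full candidate list mirrors the patent's rule that the screening "has no effect" when no node holds the cache, so scheduling still succeeds.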
Fig. 2 is a process flow diagram illustrating an intelligent cache acceleration method of a deep learning cloud platform according to an exemplary embodiment, and referring to fig. 2, the process flow includes the following steps:
S01, acquiring configuration information (jobConfig) and counting the data set usage frequency:
The data set usage frequency statistics component finds the data sets most frequently used by users, so as to determine the cache deployment objects. The component is configured to first obtain a user authentication token using an administrator account and request the configuration information (job_config) of all training tasks from the platform; it then filters, from all the configuration information (job_config), the training tasks submitted within a specified period (for example, the past 7 days by default);
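The filtering and counting step could look like the following sketch. The job_config field names ("dataset", "submit_time") are assumptions, since the patent does not spell out the configuration schema:

```python
from collections import Counter
from datetime import datetime, timedelta

def dataset_usage_frequency(job_configs, days=7, now=None):
    """Count how often each data set appears among recently submitted tasks.

    job_configs: iterable of dicts with assumed "dataset" and "submit_time"
    fields (datetime values); the field names are illustrative, not taken
    from the platform API.  Returns (name, count) pairs, most used first.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    counts = Counter(
        cfg["dataset"] for cfg in job_configs
        if cfg["submit_time"] >= cutoff  # keep only tasks in the window
    )
    # Sorted most-used first, mirroring the component's ranking step.
    return counts.most_common()
```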
S02, calculating data set gains according to the data set usage frequency, computing the node cache deployment by dynamic programming, completing cache deployment, and obtaining node cache data set information:
First, the data set usage frequency counted by the statistics component is obtained, and the gain value of each data set is calculated from its usage frequency. Cache deployment is then converted into a 0-1 knapsack problem (that is, the cache capacity of a node is the knapsack size; a data set must be cached on a node in its entirety, and caching only part of a data set is not allowed), and the data sets each compute node should cache, together with the corresponding cache paths, are calculated by dynamic programming. The corresponding data sets are subsequently copied to the cache paths of the corresponding nodes through remote execution tools. The component is run periodically via crontab, for example every 7 days by default, to update the data set caches on the compute nodes;
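The 0-1 knapsack formulation can be sketched with standard dynamic programming. Sizes are taken in whole gigabytes, and the gain values are assumed to have been precomputed from the usage frequency:

```python
def plan_node_cache(datasets, capacity_gb):
    """0-1 knapsack over data sets: each data set is cached whole or not at all.

    datasets: list of (name, size_gb, gain) tuples, where gain is derived
    from the usage frequency.  capacity_gb: free cache space on the node.
    Returns (total_gain, names_to_cache) computed by dynamic programming.
    """
    # dp[c] = (best achievable gain, chosen names) within capacity c.
    dp = [(0, [])] * (capacity_gb + 1)
    for name, size, gain in datasets:
        # Iterate capacities downward so each data set is used at most once.
        for c in range(capacity_gb, size - 1, -1):
            cand_gain = dp[c - size][0] + gain
            if cand_gain > dp[c][0]:
                dp[c] = (cand_gain, dp[c - size][1] + [name])
    return dp[capacity_gb]
```

The "must cache the whole data set" constraint in the patent is exactly what makes this a 0-1 (rather than fractional) knapsack, which is why a greedy split is not applicable.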
S03, the scheduler queries the node cache data set information and performs screening:
The scheduling logic of the scheduler used by the platform is modified: a query-node-cache function is added to the scheduler's code, which obtains the data set information cached on each node by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes and then returns a list of nodes containing the corresponding data set cache (when no node contains the specified data set cache, the screening has no effect). Note that the screening produced in this step is only a scheduling suggestion; whether to mount a cache is still decided from the node's own data set caching state. That is, when the local cache information in dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information takes precedence, which is the function realized by the data set path judging component.
S04, judging whether a node contains the data set used by the task: S041, if yes, scheduling the task to a node containing the data set cache;
S042, if not, ignoring the screening and scheduling normally;
S05, reading a cache configuration file, wherein the cache configuration file comprises the node cache data set information;
Judging whether the data set cache exists comprises the following steps: calling the data set path judging component in the task container start-up script and checking the data set cache by reading the node's data set cache information. If the data set specified by the task exists in the specified file (for example, the JSON file), the corresponding cache path is mounted directly to /data in the container; if not, the data set is not cached on the local node, so the remote NFS storage path from the task configuration is mounted by default and the data set is read from NFS. This makes the cache path mapping transparent to the user.
S06, judging the data set path according to the cache configuration file;
S061, if no cache exists, mounting the remote NFS storage path;
S062, if the data set is cached on the SSD, mounting the local SSD storage path;
S063, if the data set is cached in memory, mounting the local tmpfs storage path;
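Steps S061 to S063 amount to choosing one of three host paths as the bind-mount source for /data. A sketch of building the corresponding `docker run -v` argument follows; the read-only flag and the NFS root are assumptions, not details from the patent:

```python
def build_data_mount(dataset_name, cache_entry, nfs_root="/mnt/nfs/datasets"):
    """Build the docker run -v bind argument for steps S061-S063.

    cache_entry: the node's dataset_cache.json entry for this data set,
    or None when no local cache exists (S061: fall back to remote NFS).
    cachetype 1 entries carry a tmpfs path (S063), cachetype 2 entries an
    SSD path (S062), so target_path already encodes the cache location.
    The nfs_root default and the :ro flag are illustrative choices.
    """
    if cache_entry is None:
        source = f"{nfs_root}/{dataset_name}"   # S061: remote NFS path
    else:
        source = cache_entry["target_path"]     # S062/S063: local cache path
    # The container always sees the data set at /data, mounted read-only.
    return f"-v {source}:/data:ro"
```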
S07, the training task in the container reads the path mounted in step S06 and starts training.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The embodiments described in the specification are only preferred embodiments of the present invention, and the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit the present invention. Those skilled in the art can obtain technical solutions through logical analysis, reasoning or limited experiments according to the concepts of the present invention, and all such technical solutions are within the scope of the present invention.
Claims (9)
1. An intelligent cache acceleration system of a deep learning cloud platform, characterized in that it comprises: a data set usage frequency statistics component, a multi-level cache automatic deployment component, a mount data set path component, and a component for optimized scheduling according to data set caches;
the data set usage frequency statistics component obtains the number of times each data set is used within a specified time period by analyzing the log information of training tasks, so as to determine which data sets are subsequently deployed as caches;
the multi-level cache automatic deployment component calculates, by a dynamic programming algorithm and according to the data set usage frequency, the data sets yielding the highest gain, and actually deploys those data sets to the idle memory and/or SSD space of the servers;
the mount data set path component judges whether the data set used by a training task has a local cache and mounts the corresponding physical path into the Docker container of the training task, thereby ensuring that the training task can access the data set;
the optimized scheduling component preferentially schedules training tasks to nodes caching the corresponding data sets according to the cache information of each node, scheduling training tasks as far as possible to servers containing the required data set caches, so that the acceleration effect of the data set caches is exerted to the maximum extent.
2. The intelligent cache acceleration system of the deep learning cloud platform of claim 1, wherein the data set usage frequency statistics component is configured to first obtain a user authentication token using an administrator account and request the configuration information of all training tasks from the platform; then filter out, from all the configuration information, the training tasks submitted within a specified time period; the number of uses of each data set is then obtained by accumulation and sorted, so that the usage frequency of each data set within the specified time period is counted.
3. The intelligent cache acceleration system of the deep learning cloud platform according to claim 1, wherein the multi-level cache automatic deployment component is configured to first obtain the data set usage frequency counted by the data set usage frequency statistics component, calculate the gain value of each data set according to its usage frequency, convert cache deployment into a 0-1 knapsack problem, and calculate by dynamic programming the data sets that each compute node should cache and the corresponding cache paths; the corresponding data sets are subsequently copied to the cache paths of the corresponding nodes.
4. The intelligent cache acceleration system of a deep learning cloud platform of claim 1, wherein the mount data set path component is configured to mount a storage path on a compute node into a Docker container for use by a compute task.
5. The intelligent cache acceleration system of a deep-learning cloud platform of claim 4, wherein the mount dataset path component comprises a determine dataset path component to determine whether a cache exists on a server prior to mounting a dataset cache into a Docker container.
6. The intelligent cache acceleration system of the deep-learning cloud platform of claim 5, wherein the data set path determining component is configured to be called in the Docker container start-up script of a training task and to check the data set cache by reading the data set cache information; if the data set specified by the task exists in the specified file, the corresponding cache path, i.e., a local SSD storage path or a local tmpfs storage path, is mounted directly into the container; if not, the data set is not cached on the local node, the remote NFS storage path in the task configuration is mounted by default, and the data set is read from NFS.
7. The intelligent cache acceleration system for a deep learning cloud platform of claim 1, wherein the optimized scheduling component is configured to first maintain a first list representing the data set caching state of each node, containing the following fields: node IP address and cached data set name; and secondly to modify the scheduling logic of the scheduler used by the platform by adding a query-node-cache function to the scheduler's code, the function obtaining the data set information cached on each node by querying the first list; the function adds a preference for the data set cache when screening nodes and then returns a list of nodes containing the corresponding data set cache; when no node contains the specified data set cache, the screening has no effect.
8. The intelligent cache acceleration system of the deep learning cloud platform of claim 7, wherein the screening produced when the function screens nodes is only a scheduling suggestion, and whether to mount a cache is still decided according to the node's own data set caching state; that is, when the local cache information conflicts with the information of the first list in the database, the local cache information takes precedence.
9. The acceleration method of the intelligent cache acceleration system of the deep learning cloud platform according to any one of claims 1 to 8, characterized by comprising the following steps:
S01, acquiring configuration information and counting the data set usage frequency;
S02, calculating data set gains according to the data set usage frequency, computing the node cache deployment by dynamic programming, completing cache deployment, and obtaining node cache data set information;
S03, the scheduler querying the node cache data set information and performing screening;
S04, judging whether a node contains the data set used by the task;
S041, if yes, scheduling the task to a node containing the data set cache;
S042, if not, ignoring the screening and scheduling normally;
S05, reading a cache configuration file, wherein the cache configuration file comprises the node cache data set information;
S06, judging the data set path according to the cache configuration file;
S061, if no cache exists, mounting the remote NFS storage path;
S062, if the data set is cached on the SSD, mounting the local SSD storage path;
S063, if the data set is cached in memory, mounting the local tmpfs storage path;
S07, the training task in the container reading the path mounted in step S06, and starting training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210957648.2A CN115022405B (en) | 2022-08-10 | 2022-08-10 | Intelligent cache acceleration system and method of deep learning cloud platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115022405A CN115022405A (en) | 2022-09-06 |
CN115022405B true CN115022405B (en) | 2022-10-25 |
Family
ID=83065806
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210957648.2A Active CN115022405B (en) | 2022-08-10 | 2022-08-10 | Intelligent cache acceleration system and method of deep learning cloud platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115022405B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8706971B1 (en) * | 2012-03-14 | 2014-04-22 | Netapp, Inc. | Caching and deduplication of data blocks in cache memory |
CN110825705A (en) * | 2019-11-22 | 2020-02-21 | 广东浪潮大数据研究有限公司 | Data set caching method and related device |
CN111124277A (en) * | 2019-11-21 | 2020-05-08 | 苏州浪潮智能科技有限公司 | Deep learning data set caching method, system, terminal and storage medium |
CN111860835A (en) * | 2020-07-17 | 2020-10-30 | 苏州浪潮智能科技有限公司 | Neural network model training method and device |
CN113792885A (en) * | 2021-08-20 | 2021-12-14 | 山东英信计算机技术有限公司 | Execution method and related device for deep learning training |
CN114679283A (en) * | 2020-12-09 | 2022-06-28 | 中兴通讯股份有限公司 | Block chain data request processing method and device, server and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3786783A1 (en) * | 2019-08-30 | 2021-03-03 | Bull SAS | System to assist with the design of an artificial intelligence application, executable on distributed computer platforms |
Non-Patent Citations (1)
Title |
---|
Hybrid Cache Energy Consumption Optimization and Evaluation Based on Reinforcement Learning; Fan Hao et al.; Journal of Computer Research and Development (《计算机研究与发展》); 2020-06-07 (Issue 06); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101636742B (en) | Efficient processing of time-bounded messages | |
CN110046133B (en) | Metadata management method, device and system for storage file system | |
CN101390080B (en) | Serving cached query results based on a query portion | |
US8938421B2 (en) | Method and a system for synchronizing data | |
US10922316B2 (en) | Using computing resources to perform database queries according to a dynamically determined query size | |
CN103106152B (en) | Based on the data dispatching method of level storage medium | |
CN111901294A (en) | Method for constructing online machine learning project and machine learning system | |
US9438665B1 (en) | Scheduling and tracking control plane operations for distributed storage systems | |
CN1811704A (en) | System and method for a context-awareness platform | |
US10191663B1 (en) | Using data store accelerator intermediary nodes and write control settings to identify write propagation nodes | |
CN102111438B (en) | Method and device for parameter adjustment and distributed computation platform system | |
CN111596922A (en) | Method for realizing custom cache annotation based on redis | |
US10970113B1 (en) | Systems and methods for orchestrating seamless, distributed, and stateful high performance computing | |
CN109831540A (en) | Distributed storage method, device, electronic equipment and storage medium | |
CN102045399B (en) | Cloud computing mode file system and file reading method | |
CN110351532A (en) | Video big data cloud platform cloud storage service method | |
CN108475201A (en) | A kind of data capture method in virtual machine start-up course and cloud computing system | |
KR100858157B1 (en) | System and merhod for map uapdate, storage medium recording that method program, user termianl | |
CN115022405B (en) | Intelligent cache acceleration system and method of deep learning cloud platform | |
CN108920095A (en) | A kind of data store optimization method and apparatus based on CRUSH | |
CN113297245A (en) | Method and device for acquiring execution information | |
CN115629860A (en) | Software parameter tuning method, container management platform, storage medium and system | |
CN115563075A (en) | Virtual file system implementation method based on microkernel | |
CN114116646A (en) | Log data processing method, device, equipment and storage medium | |
CN114816272A (en) | Magnetic disk management system under Kubernetes environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||