CN115022405A - Intelligent cache acceleration system and method of deep learning cloud platform - Google Patents

Intelligent cache acceleration system and method of deep learning cloud platform

Info

Publication number
CN115022405A
CN115022405A (application CN202210957648.2A; granted as CN115022405B)
Authority
CN
China
Prior art keywords
cache
data set
node
path
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210957648.2A
Other languages
Chinese (zh)
Other versions
CN115022405B (en)
Inventor
胡安
常峰
朱建
王景祥
肖玉
刘海峰
王子磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Zhongke Leinao Intelligent Technology Co ltd
Original Assignee
Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Zhongke Leinao Intelligent Technology Co ltd filed Critical Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority to CN202210957648.2A
Publication of CN115022405A
Application granted
Publication of CN115022405B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/45595 Network integration; Enabling network access in virtual machine instances
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an intelligent cache acceleration system and method for a deep learning cloud platform. The intelligent cache acceleration system of the deep learning cloud platform comprises: a data set usage frequency statistics component, a multi-level cache automatic deployment component, a data set path mounting component, and a cache-aware optimized scheduling component. The optimized scheduling component must be adapted to the specific scheduler used by the platform; the remaining components are independent.

Description

Intelligent cache acceleration system and method of deep learning cloud platform
Technical Field
The invention relates to the field of computers, deep learning platforms and cache systems, in particular to an intelligent cache acceleration system and method of a deep learning cloud platform.
Background
With the rapid development of deep learning technology, the computing demands of model training have grown steadily, and deep learning cloud platforms have entered public view. Current deep learning platforms are largely containerized: training tasks run on the platform as Docker containers. In the deep learning field, a training task must read a large data set from back-end storage, and such data sets mainly consist of large numbers of small files (e.g., pictures), so the I/O throughput of the storage system becomes the performance bottleneck of the training task. To solve this problem, a cache system is needed that caches the data set locally (on the compute nodes) to improve I/O performance and thereby accelerate model training.
For the problem of poor application read-write I/O performance with large numbers of small files, the classic paper "Finding a needle in Haystack: Facebook's photo storage" proposed a solution, and several implementations, such as the open-source distributed file system SeaweedFS, are based on it. The approach aggregates a large number of small picture files (Needles) into large back-end files (Volumes) and reads a specified file through the Volume ID and the small file's logical offset. This solution has problems. First, performance is poor: after SeaweedFS is mounted as a shared file system through its filer, the access-protocol conversion causes a significant read-write I/O performance loss. Second, integration with existing systems is difficult: SeaweedFS is an independent distributed storage whose files must be accessed through object-storage-style URLs, which is inconsistent with widely used deep learning frameworks (TensorFlow, PyTorch, etc.) that read and write data set files through the file system, so the SeaweedFS object-storage mode intrudes on users' read-write logic. Against these defects, the invention provides an intelligent cache acceleration system and method of a deep learning cloud platform with a multi-level cache: it fully fits the workflow of a containerized deep learning platform, and the caching is completely transparent to users; furthermore, it automatically caches frequently used hot-spot data sets using multiple media on the node (i.e., the physical server), such as memory and SSD, achieving an efficient read-write acceleration effect.
Disclosure of Invention
To overcome the technical defects in the prior art, and to realize monitoring of data set usage frequency, an automatic deployment strategy for data set caches, and data-set-affinity optimization in scheduling, the invention provides an intelligent cache acceleration system and method of a deep learning cloud platform.
In one aspect, an intelligent cache acceleration system of a deep learning cloud platform is provided. The system comprises: a data set usage frequency statistics component, a multi-level cache automatic deployment component, a data set path mounting component, and a cache-aware optimized scheduling component. The optimized scheduling component must be adapted to the specific scheduler used by the platform; the remaining components are independent.
The data set usage frequency statistics component is used to find data sets frequently used by users, thereby determining the objects of cache deployment. The component is configured to first use an administrator account to obtain a user authentication token (a string of characters the system uses to confirm whether the accessing user has the corresponding authority) and request the configuration information (job_config) of all training tasks from the platform. The start time of each training task can be obtained from the job_config, so all training tasks submitted within a specified time period (for example, defaulting to the past 7 days) can be filtered out as required;
The configuration information (job_config) includes data set information: the detailed information of the data sets used by the training task, where one training task may use one or more data sets. The structure of the data set detailed information includes the user ID to which the data set belongs, a Boolean value indicating whether the data set is public, a 'datasetId' field giving the name of the data set used by the task, and a 'mountPath' field recording the mounting path of the data set inside the training task's Docker container. A list of the data set names used by the training tasks can thus be obtained from the time-filtered job_configs; the usage count of each data set is accumulated over this list and then sorted, giving the usage frequency of each data set within the specified time period.
The multi-level cache automatic deployment component performs multi-level caching of data sets. The multi-level cache uses two storage media on the compute nodes: SSD and memory. For the SSD, the cached data set is copied directly to a cache path on the SSD. For memory, because the SSD space of a compute node is limited, a cache path is mapped into memory using tmpfs (a memory-based file system in which all files reside in memory, with performance far higher than a disk-based file system), so data sets can be cached under that path just as with the SSD; the difference is that they occupy memory space and offer better read-write performance.
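As an illustration of the memory-level cache, the following is a minimal Python sketch of mapping a cache path into memory with tmpfs; the mount point and size are illustrative assumptions, and the mount command requires root privileges:

import os
import subprocess

def ensure_tmpfs_cache(mount_point="/tmpfs/cache", size="40g"):
    """Map a cache path into memory via tmpfs (paths/size are assumptions).

    Data sets copied under mount_point then reside in RAM, so they behave
    like an SSD cache but with memory-level read-write performance.
    """
    os.makedirs(mount_point, exist_ok=True)
    # Equivalent shell command: mount -t tmpfs -o size=40g tmpfs /tmpfs/cache
    subprocess.run(
        ["mount", "-t", "tmpfs", "-o", f"size={size}", "tmpfs", mount_point],
        check=True,
    )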
the automatic multi-level cache deployment component is configured to firstly obtain the data set use frequency counted by the data set use frequency counting component, calculate the gain value of each data set according to the data set use frequency and the data set capacity (namely, the data set is deployed as an expected acceleration effect brought by the cache, the larger the data set capacity is, the higher the use frequency is, the better the acceleration effect is), convert the cache deployment into a 0-1 knapsack problem (namely, the node cache capacity is of knapsack size, the data set must be completely cached on the node, and only a part of a certain data set cannot be cached), and calculate the data set to be cached by each computing node and a corresponding cache path by using a dynamic programming method. And subsequently copying the corresponding data set to the cache path of the corresponding node through a remote execution tool (such as an idle) of the Linux platform. The component is set to run periodically by means of a periodic execution tool (e.g., crontab) under the Linux platform, and may be configured to run every 7 days by default, for example, and periodically update the data set cache on the compute node.
The data set path mounting component mounts storage paths on the compute node into the training task's Docker container (also called the task container or training container) for use by the computing task. It includes a data set path judging component, which determines, before the data set cache is mounted into the Docker container, whether the cache exists on the server to which the training task has been scheduled, and mounts the corresponding data set storage path into the training task container. Since multiple cache paths are added on a compute node, data set cache information must also be added on the node so that, before mounting, it can be determined whether the data set required by a task has been cached; this cache information is maintained in a dataset_cache.json file.
The data set path judging component is configured to be called in the task container start-up script. It judges the data set cache by reading the data set cache information: if the data set specified by the task exists in the specified file (e.g., a JSON file), the corresponding cache path is mounted directly to the data set path inside the container; if it does not exist, the data set is not cached on the local node, so the remote NFS (Network File System) storage path from the default task configuration is mounted and the data set is read from NFS. This achieves cache path mapping that is transparent to the user.
The optimized scheduling component preferentially schedules training tasks to nodes that cache the corresponding data sets, according to the cache information of each node. It is configured as follows. First, a table dataset_cache_detail represents the data set caching status of each node and contains the fields: node IP address and cached data set name. Second, the scheduling logic is modified according to the scheduler used by the platform: a node-cache query function is added to the scheduler code, which obtains each node's cached data set information by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes and then returns the list of nodes containing the corresponding data set cache (when no node contains the specified data set cache, the screening has no effect). Note that this screening is only a scheduling suggestion; whether to mount is still decided from the node's actual data set cache status, i.e., when the local cache information dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information prevails, which is the function realized by the data set path judging component.
In another aspect, an intelligent cache acceleration method for a deep learning cloud platform is provided, which includes:
S01. Acquire configuration information (job_config) and count the data set usage frequency:
The data set usage frequency statistics component is used to find data sets frequently used by users, thereby determining the objects of cache deployment. The component is configured to first use an administrator account to obtain a user authentication token and request the configuration information (job_config) of all training tasks from the platform; then filter out, from all the job_configs, the training tasks submitted within a specified period (for example, defaulting to the past 7 days);
S02. Calculate the gain of each data set from its usage frequency, compute the node cache deployment by dynamic programming, complete the cache deployment, and obtain the node cache data set information:
First, the data set usage frequency counted by the statistics component is obtained, and a gain value is calculated for each data set from its usage frequency. Cache deployment is converted into a 0-1 knapsack problem (the node's cache capacity is the knapsack size; a data set must be cached on a node in its entirety, never partially), and dynamic programming computes the data sets each compute node should cache and the corresponding cache paths. The corresponding data sets are subsequently copied to the cache paths of the corresponding nodes through a remote execution tool such as Ansible. The component is set to run periodically by crontab, for example every 7 days by default, updating the data set caches on the compute nodes;
S03. The scheduler queries the node cache data set information and performs screening:
The scheduling logic is modified according to the scheduler used by the platform: a node-cache query function is added to the scheduler code, which obtains each node's cached data set information by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes and then returns the list of nodes containing the corresponding data set cache (when no node contains the specified data set cache, the screening has no effect). Note that this screening is only a scheduling suggestion; whether to mount is still decided from the node's actual data set cache status, i.e., when the local cache information dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information prevails, which is the function realized by the data set path judging component.
S04, judging whether the node contains a task use data set:
the function is modified and realized in a scheduler code, a function for judging whether a node cache exists is added, and the dataset _ cache _ detail data table is inquired to obtain data set cache information of all nodes in the cluster. Then inquiring whether a node containing a data set cache required by the training task exists in the node list, if so, filtering out a node list containing the required data set cache, and scheduling the training task to the node containing the cache; if all the nodes do not contain the required data set cache, the screening process is ignored, and normal scheduling is carried out;
S05. Read the cache configuration file, which contains the node cache data set information;
S06. Judge the data set path according to the cache configuration file;
Judging the data set cache comprises: invoking the data set path judging component in the task container start-up script and judging the data set cache by reading the data set cache information. If the data set specified by the task exists in the specified file (e.g., a JSON file), the corresponding cache path is mounted directly to the data set path in the container; if not, the data set is not cached on the local node, so the remote NFS (Network File System) storage path from the default task configuration is mounted and the data set is read from NFS. This achieves cache path mapping that is transparent to the user.
The data set path judging function is implemented by reading a cache-record JSON file on the node, which records the data set caches actually present there, including the data set name, cache type and cache path. After reading the JSON file, the program looks up the name of the data set used by the task: if the data set is absent from the JSON file, no cache exists and the remote NFS storage path is returned to the training task; if the data set is present and the cache type is SSD, the cache is on the SSD and the corresponding local SSD storage path is returned to the training task; if the data set is present and the cache type is Memory, the cache is in memory and the local tmpfs storage path is returned to the training task;
S07. The training task in the container reads the path mounted in step S06 and starts training.
On a containerized machine learning training platform, all training tasks run as Docker containers; for the inside of a container to access files on the physical machine, a physical path must be mounted to a mount point inside the container, after which the external files can be accessed through the mount-point path. The data set path judging function returns the corresponding data set path information to the training task, and that physical path is mounted to the data set path inside the container, so the training task can read the data set cache on the node for training.
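For illustration, a minimal Python sketch of this mounting step follows; the image name, training command and container path are hypothetical, and the host path is whatever the data set path judging function returned (local SSD, local tmpfs, or remote NFS):

import subprocess

def start_training_container(image, dataset_host_path, container_path="/data"):
    """Launch a training container with the resolved data set path bind-mounted."""
    subprocess.run(
        ["docker", "run", "-d",
         # Bind-mount the physical path to the in-container data set path
         "-v", f"{dataset_host_path}:{container_path}:ro",
         image, "python", "train.py"],   # hypothetical training command
        check=True,
    )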
The beneficial effects of the invention are as follows: the multi-level cache system makes full use of the various idle storage media on the compute nodes and can significantly improve the read-write I/O performance of deep learning training tasks. Moreover, the system matches the workflow of current mainstream containerized deep learning platforms: in scheduling, tasks are preferentially assigned to compute nodes holding a local cache, and in use, the data set cache is completely transparent to the user, requiring no change to any training code.
Drawings
The advantages and spirit of the present invention can be further understood by the following detailed description of the invention and the accompanying drawings.
FIG. 1 is a block diagram of the structure of an intelligent cache acceleration system of a deep learning cloud platform according to the present invention;
FIG. 2 is a processing flow chart of the intelligent cache acceleration method of the deep learning cloud platform according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to fig. 1-2 in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the invention provides an intelligent cache acceleration system of a deep learning cloud platform. FIG. 1 is a structural module diagram of the system, which is designed for current mainstream containerized deep learning cloud platforms. The system comprises: a data set usage frequency statistics component, a multi-level cache automatic deployment component, a data set path mounting component, and a cache-aware optimized scheduling component. The optimized scheduling component must be adapted to the specific scheduler used by the platform; the remaining components are independent.
The data set usage frequency statistics component obtains the usage count of each data set within a specified time period by analyzing training task log information, thereby determining which data sets will subsequently be deployed as caches.
The multi-level cache automatic deployment component calculates the highest-profit data sets with a dynamic programming algorithm according to data set usage frequency, and actually deploys those data sets to the idle memory and SSD space of the servers.
The data set path mounting component judges whether a data set used by a training task has a local cache and mounts the corresponding physical path into the Docker container of the training task, ensuring that the data set can be accessed by the training task.
The optimized scheduling component schedules training tasks to servers containing the required data set cache whenever possible, maximizing the acceleration effect of the data set cache.
The intelligent cache acceleration system of the deep learning cloud platform is described in detail below in combination with the specific configuration of each component.
To find the data sets frequently used by users and determine the objects of cache deployment, a data set usage frequency statistics component was developed. The component is a Python script: it first uses an administrator account to obtain a user authentication token (token) and requests the configuration information (job_config) of all training tasks from the platform; it then filters out, from all the job_configs, the training tasks submitted within a specified period (defaulting to the past 7 days). An example of the job_config format is as follows:
{'apiVersion': '2.0',
 'code': {'timeStamp': '1609894565523', 'version': '1', 'projectId': 'slowfast', 'mountPath': '/code'},
 'output': {'jobId': 1},
 'datasets': [{'owner': '_386152087e494e7ea022f90333f985aa', 'share': True, 'token': '', 'datasetId': 'imagenet', 'mountPath': '/data/linshiqi047/imagenet'},
              {'owner': '_20d1931aad6549309295da81e5e1fc6c', 'share': False, 'token': '', 'datasetId': 'energy', 'mountPath': '/data/zhaoliu/energy'}],
 'retryCount': 0,
 'models': [],
 'kind': 'execution',
 'taskRoles': [{'gpuType': 'titanxp', 'cpuNumber': 2, 'command': 'e', 'taskNumber': 1, 'taskName': 'Task1', 'shmMB': 8192, 'image': '10.11.3.8:5000/bitahub/pytorch:1.3-py3', 'memoryMB': 16384, 'gpuNumber': 1}],
 'jobName': 'slowfast-cmrn7q'}
The above configuration information is standard JSON in key-value form. The system is concerned with the 'datasets' item (or field), which stores the data set information used by the training task. The value of 'datasets' is a list that may contain the details of multiple data sets (2 in the example above), meaning that one training task may use one or more data sets. In each data set detail structure, 'owner' is the user ID to which the data set belongs; 'share' is a Boolean value indicating whether the data set is public; the 'datasetId' field gives the name of the data set used by the task; and 'mountPath' records the mounting path of the data set inside the training container. The names of the used data sets are then taken from all the obtained job_configs, the usage count of each data set is accumulated, and sorting then gives the usage frequency of each data set within the specified time period.
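A minimal Python sketch of this counting step is given below. It assumes job_configs is the list of job_config dictionaries already fetched from the platform, and that the millisecond submission timestamp sits under ['code']['timeStamp'], as in the example above:

import time
from collections import Counter

def count_dataset_usage(job_configs, window_days=7):
    """Count per-data-set usage over tasks submitted in the time window."""
    cutoff_ms = (time.time() - window_days * 86400) * 1000
    usage = Counter()
    for cfg in job_configs:
        # Skip tasks submitted before the window (timestamp in milliseconds)
        if float(cfg["code"]["timeStamp"]) < cutoff_ms:
            continue
        for ds in cfg.get("datasets", []):
            usage[ds["datasetId"]] += 1
    # Sorted most-used first: the candidates for cache deployment
    return usage.most_common()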
Automatic multi-level cache deployment adopts a mechanism of multi-level caching of data sets, using two storage media on the compute nodes: SSD and memory. For the SSD, the cached data set is copied directly to a cache path on the SSD. Because the SSD space of a compute node is limited, a cache path is also mapped into memory using tmpfs (a memory-based file system in which all files reside in memory, with performance far higher than a disk-based file system), so data sets can be cached under that path just as with the SSD; the difference is that they occupy memory space and offer better read-write performance. This component is also written in Python. It first obtains the data set usage frequency counted by the previous component and calculates a gain value for each data set from its usage frequency; cache deployment is converted into a 0-1 knapsack problem (the node's cache capacity is the knapsack size; a data set must be cached on a node in its entirety, never partially), and dynamic programming computes the data sets each compute node should cache and the corresponding cache paths. The corresponding data sets are subsequently copied to the cache paths of the corresponding nodes through a remote execution tool such as Ansible. The component is set to run periodically by crontab, configured by default to run every 7 days, updating the data set caches on the compute nodes.
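A minimal sketch of the 0-1 knapsack step for a single node is shown below; data set sizes are assumed to be whole gigabytes, and the gain value is assumed to be usage frequency times capacity, per the gain definition above:

def plan_node_cache(datasets, capacity_gb):
    """Select the data sets to cache on one node (0-1 knapsack by DP).

    datasets: list of (name, size_gb, gain) tuples, where gain is e.g.
    usage_frequency * size_gb. capacity_gb is the node's cache capacity
    (the knapsack size); a data set is cached entirely or not at all.
    """
    n = len(datasets)
    dp = [0.0] * (capacity_gb + 1)           # best gain at each capacity
    keep = [[False] * (capacity_gb + 1) for _ in range(n)]
    for i, (_, size, gain) in enumerate(datasets):
        for c in range(capacity_gb, size - 1, -1):
            if dp[c - size] + gain > dp[c]:
                dp[c] = dp[c - size] + gain
                keep[i][c] = True
    chosen, c = [], capacity_gb              # trace back the selection
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(datasets[i][0])
            c -= datasets[i][1]
    return chosen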
On the containerized deep learning platform, all training tasks run as Docker containers, which means a storage path on the compute node must be mounted into the Docker container to be used by the computing task. Under the cache deployment mechanism designed here, different servers in the deep learning platform cluster hold different data set caches; the node to which a training task is scheduled therefore does not necessarily have the corresponding data set cache, and it is necessary to determine whether the cache exists on the server before it is mounted into the Docker container. Inside the container, the data set path is mounted as /data (i.e., either a local SSD storage path or a local tmpfs storage path is mounted there). Since multiple cache paths are added on a compute node, data set cache information must also be added on the node so that, before mounting, it can be determined whether the data set required by a task has been cached. This cache information is maintained in a dataset_cache.json file, for example:
{"dataset_cache": [{"name": "MSCOCO", "cachetype": 2, "target_path": "/opt/cache/MSCOCO"}, {"name": "GLOVE-6B", "cachetype": 2, "target_path": "/opt/cache/GLOVE-6B"}, {"name": "VOC2007", "cachetype": 1, "target_path": "/tmpfs/cache/VOC2007"}, {"name": "GLOVE-42B-300D", "cachetype": 2, "target_path": "/opt/cache/GLOVE-42B-300D"}], "local_info": {"hostname": "GA010", "cache_size_ssd": 100GB, "cache_size_mem": 40GB}}
This JSON file records all data set caches deployed on the server; each entry records the information of one data set cache. The component is mainly concerned with the "name", "cachetype" and "target_path" fields: "name" is the data set name; "cachetype" gives the cache location, where the value 1 means the cache is in memory and the value 2 means it is on the SSD; "target_path" records the storage path of the data set cache on the server.
The data set path judging component judges whether a cached data set exists on the server to which a training task has been scheduled and mounts the corresponding data set storage path into the training task container. The component is implemented in Python and is called in the task container start-up script. It judges the data set cache by reading the dataset_cache.json file: if the data set specified by the task exists in the JSON file, the corresponding cache path is mounted directly to /data inside the container (i.e., a local SSD storage path or local tmpfs storage path is mounted); if not, the data set is not cached on the local node, so the remote NFS storage path from the default task configuration is mounted and the data set is read from NFS. This achieves cache path mapping that is transparent to the user.
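A minimal Python sketch of this judgment follows; the location of dataset_cache.json and the NFS path layout are assumptions for illustration:

import json

NFS_PATH_TEMPLATE = "/nfs/datasets/{name}"      # assumed remote NFS layout

def resolve_dataset_path(dataset_name, cache_file="/opt/cache/dataset_cache.json"):
    """Return the host path to mount for a data set, preferring local caches.

    Reads the node's dataset_cache.json (format as in the example above);
    falls back to the remote NFS path when no local cache entry exists.
    """
    try:
        with open(cache_file) as f:
            entries = json.load(f).get("dataset_cache", [])
    except FileNotFoundError:
        entries = []
    for entry in entries:
        if entry["name"] == dataset_name:
            return entry["target_path"]          # local SSD or tmpfs cache
    return NFS_PATH_TEMPLATE.format(name=dataset_name)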
The above components ensure that a training task uses the local data set cache once scheduled onto a compute node; the scheduling optimization component additionally schedules training tasks preferentially to nodes caching the corresponding data sets, according to the cache information of each node. First, a table dataset_cache_detail is maintained in the database to represent the data set caching status of each node; its fields are the node IP address and the cached data set name. The scheduling logic is then modified according to the scheduler used by the platform: a node-cache query function is added to the scheduler code, which obtains each node's cached data set information by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes and returns a list of nodes containing the corresponding data set cache (when no node contains the specified data set cache, the screening has no effect). Note that this screening is only a scheduling suggestion; whether to mount is still decided from the node's actual data set cache status, i.e., when the local cache information dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information prevails, which is the function realized by the previous component.
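The screening logic can be sketched as follows; query_cache_table is a hypothetical helper assumed to wrap a query against the dataset_cache_detail table and return the set of data set names cached on a node:

def filter_nodes_by_cache(candidate_nodes, required_datasets, query_cache_table):
    """Prefer nodes whose local cache already holds the required data sets.

    candidate_nodes: node IPs that passed the scheduler's other filters.
    Returns the preferred subset; when no node holds the required caches,
    the original list is returned and the screening has no effect.
    """
    preferred = [
        node for node in candidate_nodes
        if set(required_datasets) <= query_cache_table(node)
    ]
    return preferred if preferred else candidate_nodes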
FIG. 2 is a process flow diagram illustrating the intelligent cache acceleration method of the deep learning cloud platform according to an exemplary embodiment; referring to FIG. 2, the process flow includes the following steps:
S01. Acquire configuration information (job_config) and count the data set usage frequency:
The data set usage frequency statistics component is used to find data sets frequently used by users, thereby determining the objects of cache deployment. The component is configured to first use an administrator account to obtain a user authentication token and request the configuration information (job_config) of all training tasks from the platform; then filter out, from all the job_configs, the training tasks submitted within a specified period (for example, defaulting to the past 7 days);
S02. Calculate the gain of each data set from its usage frequency, compute the node cache deployment by dynamic programming, complete the cache deployment, and obtain the node cache data set information:
First, the data set usage frequency counted by the statistics component is obtained, and a gain value is calculated for each data set from its usage frequency. Cache deployment is converted into a 0-1 knapsack problem (the node's cache capacity is the knapsack size; a data set must be cached on a node in its entirety, never partially), and dynamic programming computes the data sets each compute node should cache and the corresponding cache paths. The corresponding data sets are subsequently copied to the cache paths of the corresponding nodes through a remote execution tool such as Ansible. The component is set to run periodically by crontab, for example every 7 days by default, updating the data set caches on the compute nodes;
S03. The scheduler queries the node cache data set information and performs screening:
The scheduling logic is modified according to the scheduler used by the platform: a node-cache query function is added to the scheduler code, which obtains each node's cached data set information by querying the dataset_cache_detail table. This function adds a preference for the data set cache when screening nodes and then returns the list of nodes containing the corresponding data set cache (when no node contains the specified data set cache, the screening has no effect). Note that this screening is only a scheduling suggestion; whether to mount is still decided from the node's actual data set cache status, i.e., when the local cache information dataset_cache.json conflicts with the database table dataset_cache_detail, the local cache information prevails, which is the function realized by the data set path judging component.
S04. Judge whether a node contains the data set used by the task:
S041. If yes, schedule the task to a node containing the data set cache;
S042. If not, ignore the screening and perform normal scheduling;
S05. Read the cache configuration file, which contains the node cache data set information;
Judging whether the data set cache exists comprises: invoking the data set path judging component in the task container start-up script and judging the data set cache by reading the data set cache information. If the data set specified by the task exists in the specified file (e.g., a JSON file), the corresponding cache path is mounted directly to /data in the container; if not, the data set is not cached on the local node, so the remote NFS storage path from the default task configuration is mounted and the data set is read from NFS. This achieves cache path mapping that is transparent to the user.
S06. Judge the data set path according to the cache configuration file;
S061. If no cache exists, mount the remote NFS storage path;
S062. If cached on the SSD, mount the local SSD storage path;
S063. If cached in memory, mount the local tmpfs storage path;
S07. The training task in the container reads the path mounted in step S06 and starts training.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The embodiments described in the specification are only preferred embodiments of the present invention, and the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit the present invention. Those skilled in the art can obtain technical solutions through logical analysis, reasoning or limited experiments according to the concepts of the present invention, and all such technical solutions are within the scope of the present invention.

Claims (9)

1. An intelligent cache acceleration system of a deep learning cloud platform, characterized by comprising: a data set usage frequency statistics component, a multi-level cache automatic deployment component, a data set path mounting component, and an optimized scheduling component based on data set caching;
the data set usage frequency statistics component obtains the usage count of each data set within a specified time period by analyzing training task log information, thereby determining which data sets are subsequently deployed as caches;
the multi-level cache automatic deployment component calculates the highest-profit data sets with a dynamic programming algorithm according to data set usage frequency, and actually deploys those data sets to the idle memory and/or SSD space of the servers;
the data set path mounting component judges whether a data set used by a training task has a local cache and mounts the corresponding physical path into the Docker container of the training task, ensuring that the data set can be accessed by the training task;
the optimized scheduling component preferentially schedules training tasks, according to the cache information of each node, to nodes caching the corresponding data sets, scheduling training tasks to servers containing the required data set cache whenever possible, so that the acceleration effect of the data set cache is exerted to the maximum extent.
2. The intelligent cache acceleration system of the deep learning cloud platform of claim 1, wherein the data set usage frequency statistics component is configured to first use an administrator account to obtain a user authentication token and request the configuration information of all training tasks from the platform; then filter out, from all the configuration information, the training tasks submitted within a specified time period; and accumulate the usage count of each data set and sort, thereby counting the usage frequency of each data set within the specified time period.
3. The intelligent cache acceleration system of the deep learning cloud platform of claim 1, wherein the multi-level cache automatic deployment component is configured to first obtain the data set usage frequency counted by the data set usage frequency statistics component, calculate the gain value of each data set according to its usage frequency, convert cache deployment into a 0-1 knapsack problem, and compute, by dynamic programming, the data sets each compute node should cache and the corresponding cache paths; the corresponding data sets are subsequently copied to the cache paths of the corresponding nodes.
4. The intelligent cache acceleration system of the deep learning cloud platform of claim 1, wherein the data set path mounting component is configured to mount a storage path on a compute node into the Docker container for use by the computing task.
5. The intelligent cache acceleration system of the deep learning cloud platform of claim 4, wherein the data set path mounting component comprises a data set path judging component for determining whether a cache exists on the server before the data set cache is mounted into the Docker container.
6. The intelligent cache acceleration system of the deep learning cloud platform of claim 5, wherein the data set path judging component is configured to be invoked in the Docker container start-up script of a training task and to judge the data set cache by reading the data set cache information: if the data set specified by the task exists in a specified file, the corresponding local SSD storage path or local tmpfs storage path is mounted directly as the cache path in the container; if not, the data set is not cached on the local node, so the remote NFS storage path in the default task configuration is mounted and the data set is read from NFS.
7. The intelligent cache acceleration system of the deep learning cloud platform of claim 1, wherein the optimized scheduling component is configured to first maintain a first table representing the data set caching status of each node, containing the following fields: node IP address and cached data set name; and second, to modify the scheduling logic of the scheduler used by the platform by adding a node-cache query function to the scheduler code, the function obtaining the cached data set information of each node by querying the first table; the function adds a preference for the data set cache when screening nodes and then returns a list of nodes containing the corresponding data set cache; when no node contains the specified data set cache, the screening by the function has no effect.
8. The intelligent cache acceleration system of the deep learning cloud platform of claim 7, wherein the screening generated when the function screens nodes is only a scheduling suggestion, and whether to mount is still decided according to the data set cache status of the node, that is, when the local cache information conflicts with the information in the first database table, the local cache information prevails.
9. An acceleration method of the intelligent cache acceleration system of the deep learning cloud platform according to any one of claims 1 to 8, characterized by comprising the following steps:
S01. Acquire configuration information and count the data set usage frequency;
S02. Calculate the gain of each data set from its usage frequency, compute the node cache deployment by dynamic programming, complete the cache deployment, and obtain the node cache data set information;
S03. The scheduler queries the node cache data set information and performs screening;
S04. Judge whether a node contains the data set used by the task;
S041. If yes, schedule the task to a node containing the data set cache;
S042. If not, ignore the screening and perform normal scheduling;
S05. Read the cache configuration file, which contains the node cache data set information;
S06. Judge the data set path according to the cache configuration file;
S061. If no cache exists, mount the remote NFS storage path;
S062. If cached on the SSD, mount the local SSD storage path;
S063. If cached in memory, mount the local tmpfs storage path;
S07. The training task in the container reads the path mounted in step S06 and starts training.
CN202210957648.2A 2022-08-10 2022-08-10 Intelligent cache acceleration system and method of deep learning cloud platform Active CN115022405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210957648.2A CN115022405B (en) 2022-08-10 2022-08-10 Intelligent cache acceleration system and method of deep learning cloud platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210957648.2A CN115022405B (en) 2022-08-10 2022-08-10 Intelligent cache acceleration system and method of deep learning cloud platform

Publications (2)

Publication Number Publication Date
CN115022405A (en) 2022-09-06
CN115022405B (en) 2022-10-25

Family

ID=83065806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210957648.2A Active CN115022405B (en) 2022-08-10 2022-08-10 Intelligent cache acceleration system and method of deep learning cloud platform

Country Status (1)

Country Link
CN (1) CN115022405B (en)

Citations (7)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8706971B1 (en) * 2012-03-14 2014-04-22 Netapp, Inc. Caching and deduplication of data blocks in cache memory
US20210064346A1 (en) * 2019-08-30 2021-03-04 Bull Sas Support system for designing an artificial intelligence application, executable on distributed computing platforms
CN111124277A (en) * 2019-11-21 2020-05-08 苏州浪潮智能科技有限公司 Deep learning data set caching method, system, terminal and storage medium
CN110825705A (en) * 2019-11-22 2020-02-21 广东浪潮大数据研究有限公司 Data set caching method and related device
CN111860835A (en) * 2020-07-17 2020-10-30 苏州浪潮智能科技有限公司 Neural network model training method and device
CN114679283A (en) * 2020-12-09 2022-06-28 中兴通讯股份有限公司 Block chain data request processing method and device, server and storage medium
CN113792885A (en) * 2021-08-20 2021-12-14 山东英信计算机技术有限公司 Execution method and related device for deep learning training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN HAO et al.: "Energy Consumption Optimization and Evaluation of Hybrid Caches Based on Reinforcement Learning", Journal of Computer Research and Development *

Also Published As

Publication number Publication date
CN115022405B (en) 2022-10-25


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant