CN117931302A - Parameter file saving and loading method, device, equipment and storage medium


Info

Publication number: CN117931302A
Authority: CN (China)
Prior art keywords: container, local cache, parameter file, directory, cache directory
Legal status: Granted
Application number: CN202410317212.6A
Other languages: Chinese (zh)
Other versions: CN117931302B (en)
Inventors: 王德奎, 王超, 陈培, 王文潇
Current Assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410317212.6A
Publication of CN117931302A
Application granted
Publication of CN117931302B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application relates to the technical field of data storage, and in particular to a method, a device, equipment and a storage medium for saving and loading a parameter file, aiming at improving the speed at which parameter files are saved and loaded. The method comprises the following steps: under the condition that a model training task is executed in a first container cluster, determining a shared storage directory corresponding to the model training task; generating a local cache directory corresponding to each container in the first container cluster; establishing a mapping relation between the local cache directory and the shared storage directory, wherein the mapping relation is used for copying the parameter files stored in the local cache directory into the shared storage directory; and saving the parameter file generated by the container in the training process into the local cache directory, wherein the parameter file is used for saving parameters generated in the model training process.

Description

Parameter file saving and loading method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of data storage, and in particular to a method, a device, equipment and a storage medium for saving and loading a parameter file.
Background
At present, when model training is performed, a checkpoint (parameter file) is usually saved periodically, so that when model training is abnormally interrupted, the model parameters can be reloaded from the parameter file and training can be resumed, thereby avoiding the loss of earlier training results. In the related art, the parameter files are usually saved to shared storage, and the parameter files stored in the shared storage are loaded directly when the training task is started.
In the related art, because the storage performance of the shared storage is poor, the writing speed and the loading speed of the parameter file are generally slow, which wastes the computing power of the computer and further reduces the model training speed.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for saving and loading a parameter file, aiming at improving the speed of saving and loading the parameter file.
An embodiment of the present application provides a method for saving and loading a parameter file, where the method includes:
Under the condition that a model training task is executed in a first container cluster, determining a shared storage directory corresponding to the model training task;
Generating a local cache directory corresponding to each container in the first container cluster;
Establishing a mapping relation between the local cache directory and the shared storage directory, wherein the mapping relation is used for copying the parameter files stored in the local cache directory into the shared storage directory;
And saving the parameter file generated by the container in the training process into the local cache directory, wherein the parameter file is used for saving parameters generated in the model training process.
Optionally, the method further comprises:
Updating the file state of the parameter file to a saving completion state under the condition that the parameter file is stored in the local cache directory;
And storing the file state of the parameter file into a memory database.
Optionally, the method further comprises:
under the condition that the container executes the model training task, inquiring whether the parameter file with the completed caching exists in the local cache directory;
copying the parameter file into the shared storage directory under the condition that the cached parameter file exists;
recording the copy success information in the memory database.
Optionally, the method further comprises:
Determining the number of the parameter files which are completely copied in the local cache directory;
And cleaning the parameter files in the local cache directory under the condition that the number of the parameter files which are completed to be copied exceeds the preset reserved number of the caches.
Optionally, the method further comprises:
Recording that the parameter file is successfully cleaned in a memory database under the condition that the local cache directory is cleaned;
And under the condition that the cleaning of the local cache directory fails, recording the cleaning failure of the parameter file in the memory database.
Optionally, the cleaning the parameter file in the local cache directory includes:
determining the save time of each parameter file in the local cache directory;
sorting all the parameter files by save time, with the parameter file with the shortest save time placed first;
sequentially cleaning the last several parameter files in the sorted sequence;
and stopping cleaning under the condition that the number of the parameter files in the local cache directory does not exceed the reserved number of caches.
Optionally, the generating a local cache directory corresponding to the container includes:
applying for a preset memory capacity in a cache space of a node corresponding to the container;
And creating the local cache directory in the cache space, wherein the size of the local cache directory is the preset memory capacity.
Optionally, the method further comprises:
Performing exception monitoring on the local cache directory;
and terminating the model training task when abnormal information in the local cache directory is monitored.
Optionally, when abnormal information in the local cache directory is detected, terminating the model training task, including:
terminating the model training task when abnormal information of replication failure occurs in the local cache directory;
Terminating the model training task when abnormal cleaning failure information in the local cache directory is monitored;
And terminating the model training task when the condition that insufficient storage space information appears in the local cache directory is monitored.
Optionally, before generating the shared storage directory corresponding to the model training task, the method further includes:
Numbering each container in the first container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
and allocating a corresponding node number for each container.
Optionally, the method further comprises:
Each container in the first container cluster is respectively distributed to one host in the host cluster;
And recording the corresponding container number, the node number and the corresponding relation between the hosts corresponding to each container.
Optionally, the allocating each container in the first container cluster to one host in the host cluster includes:
Determining host clusters available to the first container cluster;
Each of the containers in the first container cluster is scheduled by a scheduler into one of the hosts in the host cluster.
Optionally, the method further comprises:
In the case of a training interruption in the process of executing the model training task, creating a second container cluster, wherein the number of containers in the second container cluster is the same as the number of containers in the first container cluster;
numbering each container in the second container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
Assigning a corresponding node number to each of the containers in the second cluster of containers;
and distributing each container in the second container cluster to the corresponding host according to the node numbers and the corresponding relation between hosts.
Optionally, the method further comprises:
And under the condition that the host corresponding to the node number cannot operate the container, distributing the container to an idle host.
Optionally, the method further comprises:
Under the condition that the model training task is started in the second container cluster, acquiring first parameter file information in the local cache directory and second parameter file information in the shared storage directory for each container;
Determining the parameter file of the latest version according to the first parameter file information and the second parameter file information;
And sending file directory information corresponding to the parameter file of the latest version to the container.
Optionally, the method further comprises:
And when the parameter file in the local cache directory is empty, sending file directory information corresponding to the shared storage directory to the container.
Optionally, the method further comprises:
Determining a directory address corresponding to the parameter file according to the file directory information;
Acquiring the parameter file from the directory address;
loading the parameter file into the container;
loading model parameters recorded in the parameter file into a model to be trained;
and executing the model training task on the basis of the model to be trained.
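As an illustrative sketch of the recovery path above (the helper and the field names are hypothetical, not from the patent), selecting the parameter file of the latest version across the local cache directory and the shared storage directory might look as follows:

def latest_checkpoint(cache_files, shared_files):
    # compare the first and second parameter file information and keep the newest version
    candidates = list(cache_files) + list(shared_files)
    if not candidates:
        return None
    # when the local cache is empty, only the shared storage entries remain as candidates
    return max(candidates, key=lambda ckpt: ckpt["version"])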
A second aspect of an embodiment of the present application provides a device for saving and loading a parameter file, where the device includes:
the shared storage directory creation module is used for generating a shared storage directory corresponding to a model training task under the condition that the model training task is executed in the first container cluster;
The local cache directory creation module is used for generating a local cache directory corresponding to each container in the first container cluster;
The mapping relation establishing module is used for establishing a mapping relation between the local cache directory and the shared storage directory, and the mapping relation is used for copying the parameter files stored in the local cache directory into the shared storage directory;
And the parameter file storage module is used for storing the parameter file generated in the training process of the container into the local cache directory, and the parameter file is used for storing parameters generated in the model training process.
Optionally, the apparatus further comprises:
A saving completion state recording module, configured to update a file state of the parameter file to a saving completion state when the parameter file is saved in the local cache directory;
And the file state storage module is used for storing the file state of the parameter file into the memory database.
Optionally, the apparatus further comprises:
The parameter file query module is used for querying whether the parameter file with the completed caching exists in the local cache directory or not under the condition that the container executes the model training task;
A parameter file copying module, configured to copy, in the case where the cached parameter file exists, the parameter file to the shared storage directory;
and the copy success information recording module is used for recording copy success information in the memory database.
Optionally, the apparatus further comprises:
a parameter file quantity determining module, configured to determine the quantity of the parameter files that have completed copying in the local cache directory;
and the parameter file cleaning module is used for cleaning the parameter files in the local cache directory under the condition that the number of the parameter files which are completed to be copied exceeds the preset reserved number of caches.
Optionally, the apparatus further comprises:
The file cleaning success recording module is used for recording the success of cleaning the parameter file in a memory database under the condition that the cleaning of the local cache directory is finished;
and the file cleaning failure recording module is used for recording the parameter file cleaning failure in the memory database under the condition that the cleaning of the local cache directory fails.
Optionally, the parameter file cleaning module includes:
A save time determination submodule, configured to determine a save time of each of the parameter files in the local cache directory;
A parameter file sorting sub-module, configured to sort all the parameter files sequentially according to the storage time, and arrange the parameter file with the shortest storage time in a first order;
A parameter file cleaning sub-module, configured to sequentially clean the last several parameter files in the sorted sequence;
And the cleaning finishing submodule is used for stopping cleaning under the condition that the number of the parameter files in the local cache directory does not exceed the reserved number of the caches.
Optionally, the local cache directory creation module includes:
The memory capacity application submodule is used for applying preset memory capacity in the cache space of the node corresponding to the container;
And the local cache directory creation submodule is used for creating the local cache directory in the cache space, and the size of the local cache directory is the preset memory capacity.
Optionally, the apparatus further comprises:
The abnormality monitoring sub-module is used for carrying out abnormality monitoring on the local cache directory;
and the model training task termination sub-module is used for terminating the model training task when abnormal information in the local cache directory is monitored.
Optionally, the model training task termination submodule includes:
The first monitoring submodule is used for terminating the model training task when the abnormal information of the copy failure occurs in the local cache directory;
the second monitoring submodule is used for terminating the model training task when abnormal cleaning failure information occurs in the local cache directory;
and the third monitoring submodule is used for terminating the model training task when the fact that the storage space shortage information appears in the local cache directory is monitored.
Optionally, the apparatus further comprises:
the first container numbering module is used for numbering each container in the first container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
And the first node number distribution module is used for distributing corresponding node numbers for each container.
Optionally, the apparatus further comprises:
The first container distribution module is used for distributing each container in the first container cluster to one host in the host cluster respectively;
and the relation recording module is used for recording the corresponding relation among the container numbers corresponding to each container, the node numbers and the host.
Optionally, the first container distribution module includes:
a host cluster determination submodule for determining host clusters available to the first container cluster;
a first container scheduling sub-module for scheduling each of the containers in the first container cluster to one of the hosts in the host cluster by a scheduler.
Optionally, the apparatus further comprises:
A second container cluster creation module, configured to create a second container cluster in the case of a training interruption in the process of executing the model training task, where the number of containers in the second container cluster is the same as the number of containers in the first container cluster;
the second container numbering module is used for numbering each container in the second container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
A second node number allocation module, configured to allocate a corresponding node number to each container in the second container cluster;
And the second container distribution module is used for distributing each container in the second container cluster to the corresponding host according to the node numbers and the corresponding relation between hosts.
Optionally, the second container distribution module includes:
A host determining submodule, configured to determine, for each container in the second container cluster, the host corresponding to the node number in the host cluster according to the node number and a correspondence between hosts;
and the second container scheduling sub-module is used for scheduling the containers into the host through a scheduler.
Optionally, the apparatus further comprises:
A third container allocation module, configured to determine an idle host in the host cluster if the host corresponding to the node number cannot run the container;
determine the host state of the idle host;
and schedule the container into the idle host through a scheduler when the host state of the idle host is a normal running state.
Optionally, the apparatus further comprises:
The file information acquisition module is used for acquiring, for each container, first parameter file information in the local cache directory and second parameter file information in the shared storage directory under the condition that the model training task is started in the second container cluster;
The parameter file version determining module is used for determining the parameter file of the latest version according to the first parameter file information and the second parameter file information;
And the first directory information sending module is used for sending the file directory information corresponding to the parameter file of the latest version to the container.
Optionally, the apparatus further comprises:
And the second directory information sending module is used for sending the file directory information corresponding to the shared storage directory to the container when the parameter file in the local cache directory is empty.
Optionally, the apparatus further comprises:
the directory address acquisition module is used for determining the directory address corresponding to the parameter file according to the file directory information;
A parameter file obtaining module, configured to obtain the parameter file from the directory address;
the parameter file loading module is used for loading the parameter file into the container;
The parameter loading module is used for loading the model parameters recorded in the parameter file into the model to be trained;
and the model training task execution module is used for executing the model training task on the basis of the model to be trained.
A third aspect of the embodiments of the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect of the present application.
A fourth aspect of the embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the application when the processor executes the computer program.
By adopting the parameter file saving and loading method provided by the application, under the condition that a model training task is executed in a first container cluster, a shared storage directory corresponding to the model training task is generated; a local cache directory corresponding to each container in the first container cluster is generated; a mapping relation between the local cache directory and the shared storage directory is established, the mapping relation being used for copying the parameter files stored in the local cache directory into the shared storage directory; and the parameter file generated by the container in the training process is saved into the local cache directory, the parameter file being used for saving parameters generated in the model training process. In the application, when the model training task is executed in the container cluster, a corresponding local cache directory is generated for each container in the cluster; the parameter file is saved into the local cache directory while the model training task is executed, and meanwhile the parameter file in the local cache directory is copied into the shared storage directory according to the mapping relation between the shared storage directory and the local cache directory, so that the time taken to save the parameter file is shortened. When the model training task is interrupted by a fault, the parameter file in the local cache directory can be loaded quickly, which significantly shortens the loading time of the parameter file, saves model training time, and improves the training efficiency of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for saving and loading a parameter file according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a workflow of a parameter file management component according to an embodiment of the present application;
FIG. 3 is a schematic diagram of training task creation according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a fault tolerant scheduling strategy according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training task restarting process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a device for saving and loading a parameter file according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of a method for saving and loading a parameter file according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
s11: and under the condition that the model training task is executed in the first container cluster, determining a shared storage directory corresponding to the model training task.
In this embodiment, the first container cluster is a cluster formed by a plurality of containers; each container encapsulates the program of a model training task together with a corresponding operating environment, so that one model training task can run independently. A model training task is a task that trains a neural network model, adjusting the parameters of the untrained neural network model toward their optimum. The shared storage directory is a storage directory that each container in the container cluster can access, and the storage space corresponding to the directory can store the parameter file generated by each container in the container cluster when performing the model training task.
In this embodiment, under the condition that the first container cluster executes the model training task, the shared storage directory corresponding to the model training task is determined. The shared storage directory corresponding to the model training task is preconfigured, and it is determined according to the configuration information when the first container cluster executes the model training task.
Illustratively, the shared storage directory is configured as /mnt/inais/job/1. The container cluster is orchestrated based on kubernetes (a container orchestration system).
S12: and generating a local cache directory corresponding to each container in the first container cluster.
In this embodiment, the local cache directory is a cache directory created in the storage space of the node where the container is located, and is used to store the parameter files generated by the model training task in each container.
In this embodiment, when a training task is executed in the container cluster, a local cache directory is generated for each container in the first container cluster. When the model training task is executed, the local cache directory is generated within the local cache space after the shared storage directory is determined. A different local cache directory is generated in each training period of the model training task, that is, a different parameter file is generated in each training period.
For example, in the first training period the local cache directory is /localpath/jobname/1, and in the second training period it is /localpath/jobname/2. The directory may be a mount directory of an SSD disk or the host memory directory /dev/shm, and the number of parameter files that may be stored in the local cache directory can be set at the same time.
In this embodiment, the generating a local cache directory corresponding to the container includes:
S12-1: and applying for preset memory capacity in the cache space of the node corresponding to the container.
In this embodiment, the node corresponding to the container is the computing device where the container is located.
In this embodiment, when a local cache directory corresponding to a container is generated, a preset memory capacity is first applied for in a node cache space corresponding to the container.
S12-2: and creating the local cache directory in the cache space, wherein the size of the local cache directory is the preset memory capacity.
In this embodiment, after a preset memory capacity is applied, a local cache directory is created in the cache space, where the capacity of the created local cache directory is the applied preset memory capacity.
In this embodiment, the preset memory capacity may be set according to the size of the actual memory, which is not limited herein.
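The following is a minimal sketch of steps S12-1 and S12-2 (the use of /dev/shm as the cache space and the helper name are assumptions made for illustration, not part of the patent):

import os
import shutil

def create_cache_dir(job_name, period, capacity_bytes, base="/dev/shm"):
    # apply for the preset memory capacity: check that it is available in the cache space
    free = shutil.disk_usage(base).free
    if free < capacity_bytes:
        raise RuntimeError("insufficient cache space: need %d bytes, have %d" % (capacity_bytes, free))
    # create one local cache directory per training period
    path = os.path.join(base, job_name, str(period))
    os.makedirs(path, exist_ok=True)
    return path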
S13: and establishing a mapping relation between the local cache directory and the shared storage directory, wherein the mapping relation is used for copying the parameter files stored in the local cache directory into the shared storage directory.
In this embodiment, after the local cache directory is established in each training period, a mapping relationship between the local cache directory and the shared storage directory is established, and according to the mapping relationship, the parameter file in the local cache directory may be copied and stored in the shared storage directory.
Illustratively, in the first training period, a mapping relationship is established between the shared storage directory/mnt/inais/job/1 and the local cache directory/localpath/jobname/1.
S14: and saving the parameter file generated by the container in the training process into the local cache directory, wherein the parameter file is used for saving parameters generated in the model training process.
In this embodiment, the parameter file is a file for storing model weight parameters; during the execution of the model training task, data such as activation values, gradients and loss values are also stored in the parameter file in addition to the model weight parameters.
In this embodiment, during the execution of the model training task, the container stores the generated parameter file in the local cache directory. A new local cache directory is generated in each training period, a new parameter file is generated in each training period, and each newly generated parameter file is stored in the local cache directory generated in that training period.
In this embodiment, when multi-machine training is performed in a container cluster based on a container orchestration system, a local cache directory is created on the node corresponding to each container, the parameter file generated in the training process is saved into the local cache directory, and a mapping relation between the local cache directory and the shared storage directory is established; the parameter file in the local cache directory can be copied into the shared storage directory according to the mapping relation, so that the saving speed and the loading speed of the parameter file are increased, and the efficiency of the model training task is further improved.
In another embodiment of the present application, the method further comprises:
and storing the mapping relation into a memory database.
In this embodiment, the memory database is a database in a local memory space on a node corresponding to the container.
In this embodiment, after the shared storage directory and the local cache directory are determined and the mapping relation between them is established, the mapping relation is stored in the memory database. When the parameter file in the local cache directory is migrated, the shared storage directory corresponding to the local cache directory is looked up in the memory database, so as to determine the address of the target storage directory to which the parameter file is migrated. A sketch of this bookkeeping is given after step S22 below.
In another embodiment of the present application, the method further comprises:
s21: and under the condition that the parameter file is stored in the local cache directory, updating the file state of the parameter file into a storage completion state.
In this embodiment, the saving completion state is the state of the parameter file after the parameter file is successfully saved in the local cache.
In this embodiment, when the parameter file is saved in the local cache, the file state of the parameter file is updated to the saving completion state.
S22: and storing the file state of the parameter file into a memory database.
In this embodiment, after the file state of the parameter file is updated to the save completion state, the file state of the parameter file is stored in the memory database.
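A minimal sketch of how the mapping relation and the file state might be recorded (assuming, purely for illustration, a Redis-like memory database; the key names below are hypothetical):

import redis  # assumption: the memory database is Redis-compatible

r = redis.Redis(host="localhost", port=6379)

def record_save_complete(cache_dir, share_dir, ckpt_name):
    # persist the mapping between the local cache directory and the shared storage directory
    r.hset("ckpt:mapping", cache_dir, share_dir)
    # update the file state of the parameter file to the saving completion state
    r.hset("ckpt:state", ckpt_name, "save_complete")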
In another embodiment of the present application, the method further comprises:
s31: and under the condition that the container executes the model training task, inquiring whether the parameter file with the completed caching exists in the local cache directory.
In this embodiment, under the condition that the container performs the model training task, whether a cached parameter file exists in the local cache directory is queried. The query is executed automatically by the background parameter file management component, and the query period can be preset; when querying, the component reads the file state information in the memory database and determines whether any parameter file is in the saving completion state.
For example, the query period may be set to query once at the end of each training period.
S32: copying the parameter file to the shared storage directory in the presence of the cached parameter file.
In this embodiment, when the cached parameter file exists in the local cache directory, the parameter file is copied to the shared storage directory.
In this embodiment, when a cached parameter file exists in a local cache directory, the parameter file management component queries a mapping relationship of the local cache directory in the memory database, and further obtains a corresponding shared storage directory, and copies the parameter file to the shared storage directory corresponding to the local cache directory.
S33: recording the copy success information in the memory database.
In this embodiment, after the parameter file in the local cache directory is copied to the shared storage directory, the copy success information is recorded in the memory database.
In this embodiment, when each training period is finished, the parameter files in the local cache directory are copied to the shared storage directory, and when the local storage fails and cannot be read, the latest parameter files can be read from the shared storage, so that normal running of the model training task is ensured.
In another embodiment of the present application, the method further comprises:
And under the condition that the copying of the parameter file fails, recording copying failure information in the memory database.
In this embodiment, in the case of a failure in copying the parameter file, copy failure information is recorded in the memory database.
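Combining S31 to S33 with the failure case above, the end-of-period copy might be sketched as follows (helper and key names are hypothetical; the same Redis-like memory database is assumed as before):

import os
import shutil

def copy_to_shared(r, cache_dir, ckpt_name):
    # query whether a parameter file with completed caching exists
    if r.hget("ckpt:state", ckpt_name) != b"save_complete":
        return
    # look up the shared storage directory mapped to this local cache directory
    share_dir = r.hget("ckpt:mapping", cache_dir).decode()
    try:
        shutil.copy2(os.path.join(cache_dir, ckpt_name), share_dir)
        r.hset("ckpt:state", ckpt_name, "copy_success")  # record copy success
    except OSError:
        r.hset("ckpt:state", ckpt_name, "copy_failed")   # record copy failure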
In another embodiment of the present application, the method further comprises:
s41: determining the number of the parameter files in the local cache directory, which have completed copying.
In this embodiment, each training period generates a parameter file that is stored in the local cache directory. Each time a parameter file is stored in the local cache directory, the parameter file management component copies it to the shared storage directory and records the successfully copied parameter file in the memory database; the component then determines the number of fully copied parameter files in the local cache directory according to the records in the memory database.
For example, the number of files copied in the local cache directory may be set to be queried once per training period.
S42: and cleaning the parameter files in the local cache directory under the condition that the number of the parameter files which are completed to be copied exceeds the preset reserved number of the caches.
In this embodiment, the preset reserved number of caches is the maximum value of the number of parameter files that can be stored in the preset local cache directory.
In this embodiment, when the number of copied parameter files exceeds the preset reserved number of caches, the copied parameter files occupy a large amount of storage space. The parameter file management component therefore cleans the parameter files in the local cache directory, starting from the file with the earliest save time, until the number of copied parameter files no longer exceeds the preset reserved number of caches.
In this embodiment, the cleaning the parameter file in the local cache directory includes:
s42-1: and determining the preservation time of each parameter file in the local cache directory.
In this embodiment, the parameter file management component determines the save time of each parameter file in the local cache directory in the save success record in the in-memory database.
S42-2: and sequencing all the parameter files in sequence according to the preservation time, and sequencing the parameter file with the shortest preservation time in a first order.
In this embodiment, in the local cache directory, all the parameter files are sorted by save time: the parameter file with the shortest save time is placed first, and the parameter file with the longest save time is placed last.
S42-3: and sequentially clearing a plurality of last parameter files in the sorted plurality of parameter files.
In this embodiment, the last several parameter files in the sorted multiple parameter files are cleaned up and deleted from the local cache directory.
S42-4: and stopping cleaning under the condition that the number of the parameter files in the local cache directory does not exceed the reserved number of the caches.
In this embodiment, when the number of parameter files in the local cache directory does not exceed the preset reserved number of caches, cleaning is stopped.
For example, if the preset reserved number of caches is 10 and the number of copied parameter files reaches 11, the 11 parameter files are sorted, and after sorting, parameter file 1, which has the longest save time, is cleaned from the end.
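The cleaning procedure of S42-1 to S42-4 can be sketched as follows (as a simplifying assumption, the save time is read from the file modification time here, whereas the patent records it in the memory database):

import os

def clean_cache(cache_dir, reserve):
    files = [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]
    # sort by save time, the parameter file with the shortest save time first
    files.sort(key=os.path.getmtime, reverse=True)
    # clean from the end of the sorted sequence until at most `reserve` files remain
    for path in files[reserve:]:
        os.remove(path)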
In this embodiment, the method further includes:
And under the condition that the cleaning of the local cache directory is finished, recording that the cleaning of the parameter file is successful in a memory database.
In this embodiment, when the local cache directory is cleaned, the successful cleaning of the parameter file is recorded in the memory database.
And under the condition that the cleaning of the local cache directory fails, recording the cleaning failure of the parameter file in the memory database.
In this embodiment, in the case of a local cache directory cleaning failure, a parameter file cleaning failure is recorded in the memory database.
In this embodiment, when cleaning of a parameter file fails, the parameter file management component may attempt to clean it again; if it still cannot be cleaned after the retry, the component may apply for more storage space in the local storage and send the cleaning-failure information to the foreground as a warning.
In another embodiment of the present application, the method further comprises:
S51: and carrying out anomaly monitoring on the local cache directory.
In this embodiment, the parameter file management component initiates anomaly monitoring of the local cache directory after the cache directory is generated.
S52: and terminating the model training task when abnormal information in the local cache directory is monitored.
In this embodiment, the exception information includes, but is not limited to, exception information such as failure in copying a parameter file, failure in cleaning a parameter file, and insufficient memory space.
In this embodiment, when abnormal information in the local cache directory is detected, the model training task is terminated, and after the model training task is terminated, the container needs to be re-created and training of the model needs to be started.
In this embodiment, when it is detected that abnormal information occurs in the local cache directory, terminating the model training task includes:
s52-1: and terminating the model training task when the abnormal information of the copy failure occurs in the local cache directory.
In this embodiment, when it is monitored that copy failure exception information occurs in the local cache directory, it is indicated that the parameter file in the local cache directory is not copied into the shared storage directory, and at this time, the model training task is terminated, and after the cause of the copy failure is ascertained, the training task is restarted.
S52-2: and terminating the model training task when abnormal clearing failure information occurs in the local cache directory.
In this embodiment, when abnormal information of cleaning failure occurs in the local cache directory is monitored, it is indicated that the parameter files in the local cache directory are not cleaned, at this time, the model training task is terminated, and the model training task is restarted after the parameter files are cleaned.
S52-3: and terminating the model training task when the condition that insufficient storage space information appears in the local cache directory is monitored.
In this embodiment, when insufficient-storage-space information in the local cache directory is detected, the storage space in the local cache directory is insufficient and no new parameter file can be stored; the model training task is then terminated and restarted after more cache space is applied for, or after the cache space is cleaned.
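The monitoring of S51 and S52 might be sketched as follows (the anomaly state values and the use of sys.exit to terminate the task are illustrative assumptions; the Redis-like memory database is assumed as before):

import sys

ANOMALIES = (b"copy_failed", b"clean_failed", b"no_space")

def check_local_cache(r):
    # scan the recorded file states for anomaly information
    for state in r.hvals("ckpt:state"):
        if state in ANOMALIES:
            # terminate the model training task on any recorded anomaly
            sys.exit("local cache anomaly: " + state.decode())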
For example, referring to Table 1, Table 1 is a parameter file directory and status table according to an embodiment of the present application.
TABLE 1
In this embodiment, before the training task is executed, a parameter file management component may be preloaded when writing the code of the training task, for saving and optimizing the parameter file. The parameter file management component has the following functions:
Configuration of the local cache directory is supported, as is configuration of the number of parameter files the local cache directory can hold. A local cache directory for holding the parameter file is generated in the local cache space according to the preconfigured shared storage directory, and a new local cache directory is generated every training period. A mapping relation between the local cache directory and the shared storage directory is then established, the file states of the parameter files are recorded and managed, and after the local cache directory is generated the parameter files in it are monitored for anomalies; when conditions such as copy failure or cleaning anomalies occur, the anomaly information is recorded and the training task is terminated.
When the parameter file is successfully stored in the local cache directory, the parameter file management component copies it asynchronously, that is, the file-copying process and the training process run asynchronously. When the file is copied, a check code is generated for the parameter file in the local cache directory to verify the integrity of the file; the check code may be an md5 check code or the like. The local parameter file is copied into the corresponding shared storage directory according to the established mapping relation between the shared storage directory and the local cache directory, and the copy information is recorded according to the actual situation.
The number of copied parameter files is counted, and when the number of copied parameter files in the local cache directory exceeds the preset reserved number of caches, the parameter files are cleaned, those with the earliest save time first; after cleaning is completed, the file state of the parameter file is recorded as cleaning success, and when cleaning fails, the file state is recorded as cleaning failure. At the end of the training task, the parameter file management component blocks the main process from terminating, ensuring that all parameter files have been copied and that only the specified number of parameter files remains in the local cache directory.
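The asynchronous copy with integrity verification described above might be sketched as follows (helper names are hypothetical; md5 is used as the check code, as the text suggests):

import hashlib
import shutil
import threading

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def async_copy(src, dst_dir):
    # copy the checkpoint off the training process and verify it with the check code
    def work():
        check_code = md5sum(src)
        dst = shutil.copy2(src, dst_dir)
        if md5sum(dst) != check_code:
            raise RuntimeError("checkpoint corrupted during copy: " + src)
    threading.Thread(target=work, daemon=True).start()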
Referring to fig. 2, fig. 2 is a workflow diagram of the parameter file management component according to an embodiment of the present application. As shown in fig. 2, when a training task starts, a shared storage directory is defined and a local cache directory is generated; the parameter file management component first determines whether anomaly information, such as insufficient storage space, exists in the generated local cache directory. When it is determined that the local cache directory has no anomaly information, a mapping relation between the shared storage directory and the local cache directory is established and stored in the memory database; when the local cache directory has anomaly information, the anomaly information is thrown and the training task is terminated. Meanwhile, the parameter files generated in the training process are saved into the local cache directory; once saving of a parameter file is complete, the parameter file management component updates the file state of the parameter file to the saving completion state. When each round of training ends, the parameter file management component queries the local cache for fully saved parameter files and copies them into the shared storage directory; when the copy succeeds, copy success information is recorded in the memory database, and when the copy fails, copy failure information is recorded. The component also queries the copy success information to determine whether the parameter files in the local cache exceed the preset reserved number of caches; if so, the parameter files in the local cache directory are cleaned, with cleaning success information recorded in the memory database when cleaning succeeds and cleaning failure information recorded when cleaning fails.
The source code of the parameter file management component is set up as follows: before the training framework is invoked to save the parameter file, configuration is added for automatically obtaining the local temporary cache directory according to the shared storage directory; after the training framework saves the parameter file, configuration is added for notifying the parameter file management component that the parameter file has been saved; and when the training task ends, configuration is added for cleaning the residual parameter file data. The data cleaning mainly comprises copying any uncopied parameter files into the shared storage directory and deleting redundant parameter files. The source code is as follows:
import torch  # training framework used to save the checkpoint
from ckpt import CkptMgt

# initialization load
ckptMgt = CkptMgt()
...
while True:
    ...
    # configure the checkpoint shared storage directory
    # (`step`, `fname` and `model` are defined by the surrounding training script)
    sharefile = "/mnt/inais/job/" + step
    # generate the local cache directory in which the checkpoint is saved
    cachefile = ckptMgt.getCacheFile(fname)
    ...
    # training framework saves the checkpoint
    try:
        torch.save(model.state_dict(), cachefile + "/pytorch_ckpt.bin")
    except Exception as e:
        print("pytorch failed to write ckpt.")
        raise e
    # notify the component that the current checkpoint file is stored in the local cache
    ckptMgt.complete(fname)
    ...
# block the main process, handling checkpoint replication
ckptMgt.finalize()
In another embodiment of the present application, before generating the shared storage directory corresponding to the model training task, the method further includes:
s61: numbering each container in the first container cluster according to a preset numbering rule to obtain a container number corresponding to each container.
In this embodiment, before configuring a training task, each container in the first container cluster is numbered according to a preset numbering rule.
Illustratively, at the time of training task creation, the created task name is Jobnormal, and the task comprises N workers (working units; one worker is one node of the distributed training task and can be regarded as a k8s pod running the training task). In a multi-machine training task, each worker is generally allocated one node_rank sequence number, which is used to calculate the rank sequence number of each training process in the worker; the rank sequence number is the global number within the distributed training task and is generally used, in a multi-machine task, for establishing communication, saving checkpoints and the like. Each worker corresponds to a container and applies for G GPUs (graphics processing units), and the container numbers are workernormal-0 to workernormal-(N-1).
S62: and allocating a corresponding node number for each container.
In this embodiment, the node number is the number of the node in the host cluster corresponding to the container.
Illustratively, a corresponding node number (node_rank) is allocated to the worker corresponding to each container, as follows:
workernormal-0 node_rank=0
workernormal-1 node_rank=1
……
workernormal-N-1 node_rank=N-1
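Under the convention commonly used by distributed training frameworks (an assumption here; the patent does not spell out the formula), the global rank of a training process follows from the node number and the local GPU index:

def global_rank(node_rank, local_rank, gpus_per_worker):
    # e.g. node_rank=1 with G=4 GPUs yields ranks 4 to 7, matching fig. 3
    return node_rank * gpus_per_worker + local_rank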
in this embodiment, the method further includes:
S63: each container in the first container cluster is respectively distributed to one host in the host cluster.
In this embodiment, after numbering each container, each container in the first container cluster is allocated to a corresponding host, and the container number is associated with the node number.
In this embodiment, the specific step of allocating each container in the first container cluster to one host in the host cluster includes:
S63-1: a host cluster available to the first container cluster is determined.
In this embodiment, the host cluster is a cluster formed by a plurality of hosts, where each host performs data transmission through a network.
In this embodiment, the host cluster available to the first container cluster is first determined, and it is determined which hosts in the host cluster are available.
S63-2: each of the containers in the first container cluster is dispatched by a dispatcher into one of the hosts in the host cluster.
In this embodiment, the scheduler is a tool in the container orchestration system for scheduling each container in the cluster of containers.
In this embodiment, each container in the first container cluster is scheduled to one host in the host cluster by the scheduler, and then a plurality of containers in the first container cluster are scheduled to the corresponding hosts, so that the first container cluster operates in the host cluster, where each container operates in one host.
S64: and recording the corresponding container number, the node number and the corresponding relation between the hosts corresponding to each container.
In this embodiment, after a corresponding container number is allocated to each container and a node number is allocated to a node where the container is located, a corresponding relationship between the container number corresponding to each container, the node number, and the host is recorded, where the corresponding relationship is used for fault-tolerant scheduling when the container cluster fails.
Illustratively, the correspondence is as follows:
workernormal-0 node_rank=0 node1
workernormal-1 node_rank=1 node2
……
workernormal-N-1 node_rank=N-1 node9
Referring to fig. 3, fig. 3 is a schematic diagram of training task creation according to an embodiment of the present application. As shown in fig. 3, the k8s cluster includes node 1 and node 2; container 1 runs on node 1 and container 2 runs on node 2. The container number corresponding to container 1 is worker0, the corresponding node number is node_rank=0, and the number of GPUs is 4; it stores 4 parameter files: parameter file 1 (rank 0), parameter file 2 (rank 1), parameter file 3 (rank 2) and parameter file 4 (rank 3). Similarly, the container number corresponding to container 2 is worker1, the corresponding node number is node_rank=1, the number of GPUs is 4, and it stores 4 parameter files: parameter file 5 (rank 4), parameter file 6 (rank 5), parameter file 7 (rank 6) and parameter file 8 (rank 7). Parameter files 1 to 8 are stored in the same storage space in the shared storage.
In this embodiment, parameter file management is performed by the parameter file management component. At the end of each round of a multi-machine training task, at least one parameter file is kept in the local cache of each worker, usually the most recently saved parameter file. Multi-machine distributed training is performed based on kubernetes (the container orchestration system) and the containers in the AI platform. When the multi-machine training task is executed, each node corresponds to a container serving as a different worker of the distributed task, each worker has a different node number, and the multiple GPUs in each worker can be allocated different ranks. Depending on the model script, the parameter file may be saved only by the GPU with rank number 0, by the GPUs with some of the ranks, or by all GPUs. Based on the parameter file management component, parameter files are saved per rank: multiple versions of the parameter file of a given rank can be kept in the shared storage directory, while the latest parameter file is kept in the local cache.
In this embodiment, the method further includes:
S81: in the event of a training interruption during execution of the model training task, a second cluster of containers is created, the number of containers in the second cluster of containers being the same as the number of containers in the first cluster of containers.
In this embodiment, the second container cluster is a newly created container cluster when a training task interruption occurs in the process of executing the training task.
In this embodiment, when a training interruption occurs during the execution of the model training task, for example because of a hardware exception, the task is rebuilt based on the task operation (job-operator) component, and a second container cluster with the same number of containers as the first container cluster is created. Task creation uses the same resource specification.
Illustratively, a new training task Jobfault is created.
S82: numbering each container in the second container cluster according to a preset numbering rule to obtain a container number corresponding to each container.
In this embodiment, each container in the second container cluster is numbered according to the preset numbering rule to obtain the container number corresponding to each container; the numbering rule used for the second container cluster is the same as that used for the first container cluster, so the number obtained for each container is also the same.
S83: and allocating a corresponding node number to each container in the second container cluster.
In this embodiment, after the second container cluster is created, a corresponding node number is assigned to each container in the second container cluster.
In this embodiment, when the corresponding node number is allocated to each container in the second container cluster, the node numbers are allocated in the same order as they were allocated to the first container cluster, as follows:
workerfault-0node_rank=0
workerfault-1 node_rank=1
……
workerfault-N-1 node_rank=N-1
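A minimal sketch of this numbering rule, assuming containers are numbered in creation order and the node number mirrors that order; the workerfault prefix follows the listing above and the function name is purely illustrative:

```python
def number_containers(prefix: str, count: int) -> list[dict]:
    """Assign container numbers and matching node numbers in creation order."""
    return [{"container": f"{prefix}-{i}", "node_rank": i} for i in range(count)]

# number_containers("workerfault", 3) yields workerfault-0..workerfault-2
# with node_rank 0..2, matching the numbering used for the first cluster.
```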
S84: and distributing each container in the second container cluster to the corresponding host according to the node numbers and the corresponding relation between hosts.
In this embodiment, after a corresponding node number is allocated to each container in the second container cluster, the node number used in the original task is taken as the main scheduling factor, and each container in the second container cluster is allocated to a host by referring to the recorded correspondence between each node number in the first container cluster and its host. A workerfault created after fault tolerance is preferentially scheduled to the host that ran the workernormal with the same node_rank (node number).
In this embodiment, the distributing each container in the second container cluster to the corresponding host according to the correspondence between the node numbers and the hosts includes:
S84-1: and for each container in the second container cluster, determining the host corresponding to the node number in the host cluster according to the corresponding relation between the node number and the host.
In this embodiment, the correspondence between node numbers and hosts is recorded in advance, and each container has been allocated a node number, so the host in the host cluster corresponding to each container's node number can be determined from this correspondence.
In this embodiment, the node number corresponding to each container in the second container cluster is the same as that of its counterpart in the first container cluster, so each container in the second container cluster corresponds to the same host as its counterpart in the first container cluster.
S84-2: the container is dispatched into the host by a dispatcher.
In this embodiment, after determining the host corresponding to the node number corresponding to each container in the second container cluster, the container is scheduled to the corresponding host by the scheduler.
In this embodiment, the method further includes:
S85: under the condition that the host corresponding to the node number cannot operate the container, determining an idle host in a host cluster;
in this embodiment, the idle host is a host that does not run a task or a host with a lower load.
In this embodiment, under the condition that the host corresponding to the node number cannot operate the container, it is indicated that the host corresponding to the node number has a fault or a higher load cannot operate more tasks, and at this time, based on the existing scheduling policy, the rest idle hosts are found in the host cluster.
S86: and determining the host state of the idle host.
In this embodiment, after an idle host is found, the host state of the idle host is determined. Host states are divided into a normal running state and an abnormal state: a host in the normal running state can transmit data and run tasks normally, while a host in the abnormal state may have failed and cannot run tasks normally.
S87: and when the host state of the idle host is a normal running state, scheduling the container into the idle host through a scheduler.
In this embodiment, when the host state of the idle host is the normal running state, the container is scheduled to the idle host by the scheduler.
In this embodiment, if the host corresponding to the node number has a problem and cannot run the container, a suitable host can be selected from the cluster's idle resources based on the existing scheduling algorithm.
Referring to fig. 4, fig. 4 is a schematic diagram of a fault-tolerant scheduling policy according to an embodiment of the present application. As shown in fig. 4, after the fault-tolerant scheduling task starts, a working unit (worker) is selected and its container number is obtained; the host corresponding to the node number is looked up in the cached information, and it is determined whether that host runs normally. If the host runs normally, it is further determined whether it can run the container corresponding to the working unit; if so, that host node is selected to run the container; otherwise, a node is selected in the cluster based on the other scheduling policies. This repeats until all the working units in the second container cluster have been scheduled.
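A minimal sketch of this policy, assuming the recorded node-number-to-host correspondence is available as a dictionary; the Host and Worker types and the health/capacity fields are illustrative stand-ins for the platform's real bookkeeping:

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    healthy: bool        # host runs normally
    has_capacity: bool   # host can run one more container

@dataclass
class Worker:
    name: str
    node_rank: int

def schedule_worker(worker: Worker, host_by_node_rank: dict[int, Host],
                    idle_hosts: list[Host]) -> Host:
    """Prefer the host that ran this node_rank before the fault; otherwise
    fall back to any healthy idle host via the existing scheduling policy."""
    preferred = host_by_node_rank.get(worker.node_rank)
    if preferred is not None and preferred.healthy and preferred.has_capacity:
        return preferred  # its local cache likely holds the latest parameter file
    for host in idle_hosts:
        if host.healthy:
            return host
    raise RuntimeError(f"no schedulable host for {worker.name}")
```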
In another embodiment of the present application, the method further comprises:
S91: and under the condition that the model training task is started in the second container cluster, acquiring first parameter file information in the local cache directory and second parameter file information in the shared storage directory for each container.
In this embodiment, the first parameter file information is information of a parameter file in a local cache directory of a node where the container is located, and the second parameter file information is information of a parameter file in the shared storage directory.
In this embodiment, under the condition that the second container cluster starts the model training task, for each container, the first parameter file information in the local cache directory and the second parameter file information in the shared storage directory are obtained.
S92: and determining the parameter file of the latest version according to the first parameter file information and the second parameter file information.
In this embodiment, the first parameter file information includes the storage times of the parameter files saved in the local cache directory, and the second parameter file information includes the storage times of the parameter files saved in the shared storage directory. The parameter file of the latest version is determined from these storage times: the parameter file whose storage time is closest to the current time is the latest version.
S93: and sending file directory information corresponding to the parameter file of the latest version to the container.
In this embodiment, after the parameter file of the latest version is determined, the file directory information corresponding to it is sent to the container. If both the local cache directory and the shared storage directory hold the latest version, the file directory information of the local cache directory is sent to the container preferentially.
In this embodiment, the method further includes:
S94: and when the parameter file in the local cache directory is empty, sending file directory information corresponding to the shared storage directory to the container.
In this embodiment, when the parameter file in the local cache directory is empty, the file directory information corresponding to the shared storage directory is sent to the container.
In this embodiment, because a failure restart has occurred, some host nodes may no longer be available, so a container in the second container cluster cannot always be allocated to its original host node. In that case, the node where the container now runs may hold no parameter file in its local cache directory, and the file directory information corresponding to the shared storage directory is sent to the container instead.
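A minimal sketch of this version selection, assuming file modification times stand in for the component's stored save-time metadata; the function and directory names are illustrative:

```python
import os

def pick_checkpoint_dir(local_dir: str, shared_dir: str) -> str:
    """Return the directory holding the newest parameter file, preferring the
    local cache, and falling back to shared storage when the cache is empty."""
    def newest(d: str):
        files = [os.path.join(d, f) for f in os.listdir(d)] if os.path.isdir(d) else []
        return max((os.path.getmtime(p) for p in files), default=None)
    local_t, shared_t = newest(local_dir), newest(shared_dir)
    if local_t is None:
        return shared_dir  # S94: local cache directory is empty
    if shared_t is not None and shared_t > local_t:
        return shared_dir  # shared storage holds a strictly newer version
    return local_dir       # local cache holds the latest version
```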
In this embodiment, the method further includes:
s95: and determining the directory address corresponding to the parameter file according to the file directory information.
In this embodiment, the directory address is the address of the directory where the parameter file is located.
In this embodiment, the directory address of the file directory corresponding to the parameter file is determined according to the file directory information. When the latest version of the parameter file is stored in the local cache directory, the directory address is that of the local cache directory; otherwise, it is that of the shared storage directory.
S96: and acquiring the parameter file from the directory address.
In this embodiment, after determining the directory address corresponding to the parameter file, the corresponding parameter file is obtained from the directory address.
S97: and loading the parameter file into the container.
In this embodiment, after the corresponding parameter file is obtained, the parameter file is loaded to the container for operation.
S98: and loading the model parameters recorded in the parameter file into the model to be trained.
In this embodiment, after the container finishes loading the parameter file, the model parameters recorded in the parameter file are loaded into the model to be trained. These parameters are those produced by the last completed training round before the first container cluster's training task was interrupted.
S99: and executing the model training task on the basis of the model to be trained.
In this embodiment, after the model parameters in the parameter file are loaded into the model to be trained, the model training task is continuously executed on the basis of the model to be trained.
In this embodiment, the model script calls an API (Application Programming Interface) of the parameter file management component, obtains the storage directory corresponding to the latest version of the parameter file according to the file directory information, obtains the parameter file, loads it into the container, and continues to execute the model training task.
Referring to fig. 5, fig. 5 is a schematic diagram of a training task restarting flow according to an embodiment of the present application. As shown in fig. 5, the running model script restarts and executes the model training task again. The model script obtains the directory information of the latest version of the parameter file by calling the application program interface of the parameter file management component, which queries the parameter file information in the local cache directory and in the shared storage. When the local cache directory is empty, the component returns the parameter file directory information in the shared storage to the model script; if the parameter file in the local cache directory is not the latest version, it likewise returns the parameter file directory information in the shared storage; if the parameter file in the local cache directory is the latest version, it returns the file directory information of the local cache directory. The model script then loads the parameter file according to the obtained file directory information, obtains the latest version of the parameter file, and continues to execute the training task.
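A minimal sketch of this restart flow from the model script's side, assuming a PyTorch-style training script and a checkpoint dictionary with a `model` key (both assumptions); `get_latest_checkpoint_dir` stands in for the parameter file management component's API, which is not named concretely here:

```python
import os
import torch  # assumption: the model script is PyTorch-based

def resume_training(model, get_latest_checkpoint_dir):
    """Load the newest parameter file from the directory returned by the
    parameter file management component, then continue training."""
    ckpt_dir = get_latest_checkpoint_dir()  # local cache dir or shared storage dir
    files = sorted(os.listdir(ckpt_dir))
    if not files:
        return model  # nothing to resume from; train from scratch
    state = torch.load(os.path.join(ckpt_dir, files[-1]), map_location="cpu")
    model.load_state_dict(state["model"])  # assumed checkpoint layout
    return model
```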
In the embodiment of the application, a parameter file saving mechanism is defined: a local cache directory is created on the host when a model training task is executed, the model script calls the API of the parameter file management component to obtain the address of the local cache directory for saving the parameter file, and the parameter files produced during training are saved into the local cache directory. This improves the saving efficiency of the parameter files, saves model training time, and accelerates model training.
Files saved in the local cache directory are automatically copied into the shared storage, so that the parameter files can still be obtained when the host node becomes abnormal, further safeguarding model training efficiency.
The number of parameter files kept in the local cache directory is preset, and the latest versions are retained: when the number of parameter files in the local cache directory exceeds the preset value, parameter files in the local cache directory that have already been copied are deleted. This saves the storage space of the local cache directory, reduces the load on the local storage space, and further improves training efficiency.
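A minimal sketch of such a cleanup, assuming the set of already-copied file names is tracked (for example via the copy records kept in the memory database); all names are illustrative:

```python
import os

def clean_local_cache(cache_dir: str, keep: int, copied: set[str]) -> None:
    """Keep only the newest `keep` parameter files; among the older ones,
    delete only files that were already copied to shared storage."""
    files = sorted((os.path.join(cache_dir, f) for f in os.listdir(cache_dir)),
                   key=os.path.getmtime, reverse=True)  # newest first
    for path in files[keep:]:
        if os.path.basename(path) in copied:  # never delete an uncopied file
            os.remove(path)
```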
After a training task is created and scheduled, the node number assigned to each working unit (worker) of the multi-machine task and the information of the node it runs on are saved. After the container cluster is rebuilt following a task error, the scheduler takes the node number of each working unit as the main scheduling factor and schedules the containers of the rebuilt cluster to the corresponding hosts. The rebuilt containers are therefore highly likely to be scheduled to the hosts that executed the task before, so the parameter files can be obtained directly from the local cache directory. This guarantees, to the greatest extent, that parameters are loaded normally and training continues after the model training task is restarted, improving model training efficiency.
In the embodiment of the application, the parameter file saving and loading mechanism is optimized on both the model-script side and the AI-platform side. On the saving side, the parameter files produced during model training are saved into the local cache directory based on the parameter file management component, which improves saving speed, and the parameter files are copied into the shared storage directory to ensure they are not lost. To guarantee sufficient cache space, the local cache directory is cleaned automatically so that only the specified number of parameter files is retained; this saving mechanism can save 10% of training time. On the loading side, when the training task is rebuilt for fault tolerance by the task operation component, the scheduling component preferentially schedules each container to the node it ran on before the fault, according to the previously saved correspondence between working units and node numbers. This increases the probability that a container loads the parameter file directly from the local cache directory, minimizes parameter file loading time, and improves model training efficiency.
Based on the same inventive concept, an embodiment of the application provides a parameter file saving and loading device. Referring to fig. 6, fig. 6 is a schematic diagram of a parameter file saving and loading device 600 according to an embodiment of the application. As shown in fig. 6, the apparatus includes:
the shared storage directory creating module 601 is configured to generate a shared storage directory corresponding to a model training task when the model training task is executed in the first container cluster;
A local cache directory creation module 602, configured to generate, for each container in the first container cluster, a local cache directory corresponding to the container;
A mapping relationship establishing module 603, configured to establish a mapping relationship between the local cache directory and the shared storage directory, where the mapping relationship is used to copy a parameter file stored in the local cache directory to the shared storage directory;
And the parameter file saving module 604 is configured to save a parameter file generated by the container in the training process to the local cache directory, where the parameter file is used to save parameters generated in the model training process.
Optionally, the apparatus further comprises:
A saving completion state recording module, configured to update a file state of the parameter file to a saving completion state when the parameter file is saved in the local cache directory;
And the file state storage module is used for storing the file state of the parameter file into the memory database.
Optionally, the apparatus further comprises:
The parameter file query module is used for querying whether the parameter file with the completed caching exists in the local cache directory or not under the condition that the container executes the model training task;
A parameter file copying module, configured to copy, in the case where the cached parameter file exists, the parameter file to the shared storage directory;
and the copy success information recording module is used for recording copy success information in the memory database.
Optionally, the apparatus further comprises:
a parameter file quantity determining module, configured to determine the quantity of the parameter files that have completed copying in the local cache directory;
and the parameter file cleaning module is used for cleaning the parameter files in the local cache directory under the condition that the number of the parameter files which are completed to be copied exceeds the preset reserved number of caches.
Optionally, the apparatus further comprises:
The file cleaning success recording module is used for recording the success of cleaning the parameter file in a memory database under the condition that the cleaning of the local cache directory is finished;
and the file cleaning failure recording module is used for recording the parameter file cleaning failure in the memory database under the condition that the cleaning of the local cache directory fails.
Optionally, the parameter file cleaning module includes:
A save time determination submodule, configured to determine a save time of each of the parameter files in the local cache directory;
A parameter file sorting sub-module, configured to sort all the parameter files sequentially according to the storage time, and arrange the parameter file with the shortest storage time in a first order;
A parameter file cleaning sub-module, configured to sequentially clean a plurality of last parameter files among the sorted plurality of parameter files;
And the cleaning finishing submodule is used for stopping cleaning under the condition that the number of the parameter files in the local cache directory does not exceed the reserved number of the caches.
Optionally, the local cache directory generation module includes:
The memory capacity application submodule is used for applying preset memory capacity in the cache space of the node corresponding to the container;
And the local cache directory creation submodule is used for creating the local cache directory in the cache space, and the size of the local cache directory is the preset memory capacity.
Optionally, the apparatus further comprises:
The abnormality monitoring sub-module is used for carrying out abnormality monitoring on the local cache directory;
and the model training task termination sub-module is used for terminating the model training task when abnormal information in the local cache directory is monitored.
Optionally, the model training task termination submodule includes:
The first monitoring submodule is used for terminating the model training task when the abnormal information of the copy failure occurs in the local cache directory;
the second monitoring submodule is used for terminating the model training task when abnormal cleaning failure information occurs in the local cache directory;
and the third monitoring submodule is used for terminating the model training task when the fact that the storage space shortage information appears in the local cache directory is monitored.
Optionally, the apparatus further comprises:
the first container numbering module is used for numbering each container in the first container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
And the first node number distribution module is used for distributing corresponding node numbers for each container.
Optionally, the apparatus further comprises:
The first container distribution module is used for distributing each container in the first container cluster to one host in the host cluster respectively;
and the relation recording module is used for recording the correspondence among the container number corresponding to each container, the node number, and the host.
Optionally, the first container dispensing module includes:
a host cluster determination submodule for determining host clusters available to the first container cluster;
a first container scheduling sub-module for scheduling each of the containers in the first container cluster to one of the hosts in the host cluster by a scheduler.
Optionally, the apparatus further comprises:
A second container cluster creation module, configured to create a second container cluster in the case of a training interruption in the process of executing the model training task, where the number of containers in the second container cluster is the same as the number of containers in the first container cluster;
the second container numbering module is used for numbering each container in the second container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
A second node number allocation module, configured to allocate a corresponding node number to each container in the second container cluster;
And the second container distribution module is used for distributing each container in the second container cluster to the corresponding host according to the node numbers and the corresponding relation between hosts.
Optionally, the second container dispensing module includes:
A host determining submodule, configured to determine, for each container in the second container cluster, the host corresponding to the node number in the host cluster according to the node number and a correspondence between hosts;
and the second container scheduling sub-module is used for scheduling the containers into the host through a scheduler.
Optionally, the apparatus further comprises:
A third container allocation module, configured to determine an idle host in a host cluster if the host corresponding to the node number cannot operate the container; determine a host state of the idle host; and, when the host state of the idle host is a normal running state, schedule the container into the idle host through a scheduler.
Optionally, the apparatus further comprises:
The file information acquisition module is used for acquiring, for each container, first parameter file information in the local cache directory and second parameter file information in the shared storage directory under the condition that the model training task is started in the second container cluster;
The parameter file version determining module is used for determining the parameter file of the latest version according to the first parameter file information and the second parameter file information;
And the first catalog information sending module is used for sending the file catalog information corresponding to the parameter file of the latest version to the container.
Optionally, the apparatus further comprises:
And the second catalog information sending module is used for sending the file catalog information corresponding to the shared storage catalog to the container when the parameter file in the local cache catalog is empty.
Optionally, the apparatus further comprises:
the catalog address acquisition module is used for determining the catalog address corresponding to the parameter file according to the file catalog information;
A parameter file obtaining module, configured to obtain the parameter file from the directory address;
the parameter file loading module is used for loading the parameter file into the container;
The parameter loading module is used for loading the model parameters recorded in the parameter file into the model to be trained;
and the model training task execution module is used for executing the model training task on the basis of the model to be trained.
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for saving and loading a parameter file according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, and referring to fig. 7, fig. 7 is a schematic diagram of an electronic device 700 according to an embodiment of the present application, as shown in fig. 7, including a memory 702, a processor 701, and a computer program stored in the memory and capable of running on the processor, where the processor executes the steps in the method for saving and loading a parameter file according to any one of the foregoing embodiments of the present application.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.
The method, the device, the equipment and the storage medium for saving and loading the parameter file provided by the application are described in detail, and specific examples are applied to the principle and the implementation mode of the application, and the description of the above examples is only used for helping to understand the method and the core idea of the application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present application, the present description should not be construed as limiting the present application in view of the above.

Claims (20)

1. A method for saving and loading a parameter file, the method comprising:
Under the condition that a model training task is executed in a first container cluster, determining a shared storage directory corresponding to the model training task;
Generating a local cache catalog corresponding to each container in the first container cluster;
Establishing a mapping relation between the local cache directory and the shared storage directory, wherein the mapping relation is used for copying the parameter files stored in the local cache directory into the shared storage directory;
And saving the parameter file generated by the container in the training process into the local cache directory, wherein the parameter file is used for saving parameters generated in the model training process.
2. The method according to claim 1, wherein the method further comprises:
Updating the file state of the parameter file to a storage completion state under the condition that the parameter file is stored in the local cache directory;
And storing the file state of the parameter file into a memory database.
3. The method according to claim 1, wherein the method further comprises:
under the condition that the container executes the model training task, inquiring whether the parameter file with the completed caching exists in the local cache directory;
copying the parameter file into the shared storage directory under the condition that the cached parameter file exists;
recording the copy success information in the memory database.
4. The method according to claim 1, wherein the method further comprises:
Determining the number of the parameter files which are completely copied in the local cache directory;
And cleaning the parameter files in the local cache directory under the condition that the number of the parameter files which are completed to be copied exceeds the preset reserved number of the caches.
5. The method according to claim 4, wherein the method further comprises:
Recording that the parameter file is successfully cleaned in a memory database under the condition that the local cache directory is cleaned;
And under the condition that the cleaning of the local cache directory fails, recording the cleaning failure of the parameter file in the memory database.
6. The method of claim 4, wherein the cleaning the parameter file in the local cache directory comprises:
determining the preservation time of each parameter file in the local cache directory;
sequencing all the parameter files in sequence according to the preservation time, and sequencing the parameter file with the shortest preservation time in a first order;
Sequentially cleaning a plurality of last parameter files in the sorted plurality of parameter files;
and stopping cleaning under the condition that the number of the parameter files in the local cache directory does not exceed the reserved number of the caches.
7. The method of claim 1, wherein the generating the local cache directory corresponding to the container comprises:
applying for a preset memory capacity in a cache space of a node corresponding to the container;
And creating the local cache directory in the cache space, wherein the size of the local cache directory is the preset memory capacity.
8. The method according to claim 1, wherein the method further comprises:
Performing exception monitoring on the local cache directory;
and terminating the model training task when abnormal information in the local cache directory is monitored.
9. The method of claim 8, wherein terminating the model training task upon detecting the occurrence of exception information in the local cache directory comprises:
terminating the model training task when abnormal information of replication failure occurs in the local cache directory;
Terminating the model training task when abnormal cleaning failure information in the local cache directory is monitored;
And terminating the model training task when the condition that insufficient storage space information appears in the local cache directory is monitored.
10. The method of claim 1, wherein prior to generating the shared memory directory for the model training task, the method further comprises:
Numbering each container in the first container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
Allocating a corresponding node number for each container;
Each container in the first container cluster is respectively distributed to one host in the host cluster;
And recording the correspondence among the container number corresponding to each container, the node number, and the host.
11. The method of claim 10, wherein the assigning each of the containers in the first container cluster to one host in the host cluster comprises:
Determining host clusters available to the first container cluster;
Each of the containers in the first container cluster is dispatched by a dispatcher into one of the hosts in the host cluster.
12. The method according to claim 1, wherein the method further comprises:
In the case of a training interruption in the process of executing the model training task, creating a second container cluster, wherein the number of containers in the second container cluster is the same as the number of containers in the first container cluster;
numbering each container in the second container cluster according to a preset numbering rule to obtain a container number corresponding to each container;
Assigning a corresponding node number to each of the containers in the second cluster of containers;
and distributing each container in the second container cluster to the corresponding host according to the correspondence between the node numbers and the hosts.
13. The method of claim 12, wherein the assigning each container in the second container cluster to a corresponding host according to the correspondence between the node numbers and hosts comprises:
For each container in the second container cluster, determining the host corresponding to the node number in the host cluster according to the corresponding relation between the node number and the host;
The container is dispatched into the host by a dispatcher.
14. The method according to claim 12, wherein the method further comprises:
Under the condition that the host corresponding to the node number cannot operate the container, determining an idle host in a host cluster;
determining a host state of the idle host;
And when the host state of the idle host is a normal running state, scheduling the container into the idle host through a scheduler.
15. The method according to claim 12, wherein the method further comprises:
Under the condition that the model training task is started in the second container cluster, acquiring first parameter file information in the local cache directory and second parameter file information in the shared storage directory for each container;
Determining the parameter file of the latest version according to the first parameter file information and the second parameter file information;
And sending file directory information corresponding to the parameter file of the latest version to the container.
16. The method of claim 15, wherein the method further comprises:
And when the parameter file in the local cache directory is empty, sending file directory information corresponding to the shared storage directory to the container.
17. The method of claim 15, wherein the method further comprises:
Determining a directory address corresponding to the parameter file according to the file directory information;
Acquiring the parameter file from the directory address;
loading the parameter file into the container;
loading model parameters recorded in the parameter file into a model to be trained;
and executing the model training task on the basis of the model to be trained.
18. A parameter file saving and loading device, the device comprising:
the shared storage catalog creation module is used for generating a shared storage catalog corresponding to a model training task under the condition that the model training task is executed in the first container cluster;
The local cache directory creation module is used for generating a local cache directory corresponding to each container in the first container cluster;
The mapping relation establishing module is used for establishing a mapping relation between the local cache directory and the shared storage directory, and the mapping relation is used for copying the parameter files stored in the local cache directory into the shared storage directory;
And the parameter file storage module is used for storing the parameter file generated in the training process of the container into the local cache directory, and the parameter file is used for storing parameters generated in the model training process.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1 to 17.
20. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 17 when executing the computer program.
GR01 Patent grant