CN113704299A - Model training method and device, storage medium and computer equipment - Google Patents

Model training method and device, storage medium and computer equipment

Info

Publication number
CN113704299A
CN113704299A (Application CN202110216012.8A)
Authority
CN
China
Prior art keywords
data
training data
cache
training
model training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110216012.8A
Other languages
Chinese (zh)
Inventor
查冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110216012.8A
Publication of CN113704299A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application discloses a model training method, a device, a storage medium and computer equipment, relating to fields such as machine learning in artificial intelligence and cache reading. The method includes: acquiring a data request, the data request including a training data identifier required by model training; querying a local cache for target training data corresponding to the training data identifier, the local cache being used for storing training data; when target training data corresponding to the training data identifier exist in the local cache, loading the target training data to a graphics processor so that the graphics processor performs model training based on the target training data; and when target training data corresponding to the training data identifier do not exist in the local cache, acquiring the target training data from a data cluster and loading the target training data to the graphics processor so that the graphics processor performs model training based on the target training data. The application can effectively improve model training efficiency.

Description

Model training method and device, storage medium and computer equipment
Technical Field
The application relates to the field of artificial intelligence and storage, in particular to a model training method, a model training device, a storage medium and computer equipment.
Background
With the rapid development of artificial intelligence, more and more problems in daily life can be solved by network models. Before a network model can be applied to solve such problems, it must be trained, and model training usually requires a large amount of training data for multiple rounds of training. In the prior art, before each round of training, the training data needed for that round must be obtained from a remote storage cluster. For example, a storage system commonly used in artificial intelligence training scenarios is ceph (a distributed file storage system), and the training data may be read from the remote ceph storage cluster through ceph-fuse (a user-mode file system client).
In the research and practice of the prior art, the inventors of the present application found that reading the training data from the remote storage cluster in every round of training is limited by network conditions and other factors and may take a long time, so the model training process of the prior art is time-consuming.
Disclosure of Invention
The embodiment of the application provides a model training method and device, a storage medium and computer equipment, which can effectively improve the model training efficiency.
The embodiment of the application provides a model training method, which comprises the following steps:
acquiring a data request, wherein the data request comprises a training data identifier required by model training;
inquiring target training data corresponding to the training data identifier from a local cache, wherein the local cache is used for storing training data;
when target training data corresponding to the training data identification exist in the local cache, loading the target training data to a graphics processor so that the graphics processor performs model training based on the target training data;
and when target training data corresponding to the training data identification does not exist in the local cache, acquiring the target training data from the data cluster, and loading the target training data to a graphics processor so that the graphics processor performs model training based on the target training data.
Accordingly, the present application provides a model training device comprising:
the acquisition module is used for acquiring a data request, wherein the data request comprises a training data identifier required by model training;
the query module is used for querying the target training data corresponding to the training data identifier from a local cache, and the local cache is used for storing training data;
the loading module is used for loading the target training data to a graphics processor when the target training data corresponding to the training data identification exists in the local cache, so that the graphics processor performs model training based on the target training data;
and the cluster acquisition module is used for acquiring the target training data from the data cluster and loading the target training data to the graphics processor when the target training data corresponding to the training data identification does not exist in the local cache, so that the graphics processor performs model training based on the target training data.
In some embodiments, the training data identifier includes a first data identifier and a second data identifier, and the loading module is specifically configured to:
and when first training data corresponding to the first data identification exists in the local cache, loading the first training data to a graphics processor, and loading second training data corresponding to the second data identification acquired from a data cluster to the graphics processor, so that the graphics processor performs model training based on the first training data and the second training data.
In some embodiments, the training data identifier includes a first data identifier and a second data identifier, and the loading module is specifically configured to:
when first training data corresponding to the first data identification exist in the local cache, loading the first training data to a graphics processor;
sending a second data request to the data cluster, the second data request comprising a second data identification;
loading second training data returned by the data cluster based on the second data request to a local memory, wherein the second training data corresponds to the second data identifier;
and loading the second training data from the local memory to a graphics processor.
In some embodiments, the model training apparatus further comprises:
and the asynchronous loading module is used for asynchronously writing the second training data returned by the data cluster based on the second data request into a local cache.
In some embodiments, the asynchronous loading module is specifically configured to:
and when the current used capacity of the local cache is smaller than the set cache capacity, asynchronously writing second training data returned by the data cluster based on the second data request into the local cache.
In some embodiments, the model training apparatus further comprises:
the parameter receiving module is used for receiving cache configuration parameter information;
the parameter storage module is used for storing the cache configuration parameter information after the cache configuration parameter information passes the verification;
and the cache establishing module is used for establishing a local cache based on the cache configuration parameter information.
In some embodiments, the parameter preservation module includes a detection submodule and a preservation submodule, wherein,
the detection submodule is used for detecting whether a model training starting message is received or not after the cache configuration parameter information passes the verification;
and the storage submodule is used for storing the cache configuration parameter information when a model training starting message is received.
In some embodiments, the save submodule is specifically configured to:
when a model training starting message is received, generating a parameter storage request;
sending the parameter storage request to a data system, and triggering the data system to store the cache configuration parameter information;
at this time, the cache establishment module is specifically configured to:
and triggering the data system to establish a local cache based on the cache configuration parameter information.
In some embodiments, the cache configuration parameter information includes a cache directory and a cache capacity, and the model training apparatus further includes:
and the checking module is used for determining that the cache configuration parameter information passes the check when the cache directory exists locally and the cache capacity does not exceed a locally available threshold value.
In some embodiments, the model training apparatus further comprises:
and the cache deleting module is used for deleting the local cache when a model training termination message is received.
In some embodiments, the cache deletion module is specifically configured to:
when a model training termination message is received, generating a cache destruction request;
and sending the cache destroying request to a data system, and triggering the data system to delete the local cache.
Correspondingly, the embodiment of the present application further provides a storage medium, where a computer program is stored, and the computer program is suitable for being loaded by a processor to execute any one of the model training methods provided in the embodiment of the present application.
Accordingly, embodiments of the present application further provide a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements any one of the model training methods provided in the embodiments of the present application when executing the computer program.
The method and the device can acquire a data request, wherein the data request comprises a training data identifier required by model training; query target training data corresponding to the training data identifier from a local cache, the local cache being used for storing training data; when the target training data corresponding to the training data identifier exist in the local cache, load the target training data to the graphics processor so that the graphics processor performs model training based on the target training data; and when the target training data corresponding to the training data identifier do not exist in the local cache, acquire the target training data from the data cluster and load the target training data to the graphics processor so that the graphics processor performs model training based on the target training data.
According to the method and the device, when a data request is received, the target training data required by model training can first be queried in the local cache of the computer equipment; when the target training data exist in the local cache, the target training data in the local cache can be loaded to the graphics processor for model training. Compared with the prior art, which must read the data from a remote storage cluster, this effectively shortens the time consumed in reading the training data and thus effectively improves model training efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a schematic diagram of a scenario of a model training system provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;
FIG. 3 is another schematic flow chart diagram of a model training method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a partial implementation of a model training method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of a model training method provided in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an implementation of a model training method according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the embodiments described in the present application are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision making.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behaviors to acquire new knowledge or skills and reorganize the existing knowledge structure so as to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
The model training method can be widely applied to the field of artificial intelligence, such as neural network model training, learning model training and the like in the field of machine learning, and also can be applied to the training process of models built for solving specific problems and the like in the fields of computer vision, voice technology, natural language processing and the like of artificial intelligence.
The model training method can be integrated in a model training system, and the model training system can be integrated in one or more computer devices. The computer devices may include terminals or servers, where a server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
Referring to fig. 1, a model training system may include a model training device. The model training device may obtain a data request, the data request including a training data identifier required for model training; query target training data corresponding to the training data identifier from a local cache, the local cache being used for storing training data; when the target training data corresponding to the training data identifier exist in the local cache, load the target training data to the graphics processor so that the graphics processor performs model training based on the target training data; and when the target training data corresponding to the training data identifier do not exist in the local cache, acquire the target training data from the data cluster and load the target training data to the graphics processor so that the graphics processor performs model training based on the target training data.
It should be noted that the scenario diagram of the model training system shown in fig. 1 is merely an example, and the model training system and the scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application.
The following are detailed below. In this embodiment, a detailed description will be given of a model training method, which may be integrated on a terminal or a server, as shown in fig. 2, where fig. 2 is a flowchart of the model training method provided in this embodiment of the present application. The model training method can comprise the following steps:
101. and acquiring a data request, wherein the data request comprises training data identification required by model training.
In the application, the model training may include a model training process in the field of artificial intelligence, and the model after training that can solve a specific problem is obtained through the model training, so as to achieve a specific purpose, for example, the model after training may include a speech recognition model, an image detection model, a semantic parsing model, an automatic driving model, an intelligent question-answering model, a machine translation model, or the like.
The data request may include a message requesting data required for model training, and therefore, the data request may include a training data identifier, the training data identifier may be a data identifier of the data required for model training, and the training data identifier may perform a unique identification function on the corresponding training data.
The data request may be issued by a training system controlling a model training process, which may be integrated on the same computer device as the model training method of the present application; the training system can also be independently integrated on other computer equipment, different from the computer equipment integrated by the model training method, and the like.
The manner of obtaining the data request may vary with the computer device on which the training system is integrated. For example, when the training system and the model training method of the present application are integrated on the same computer device, the data request can be obtained through information transmission inside the computer device; when they are integrated on different computer devices, the data request sent by the other computer device can be received through a wired or wireless connection and the corresponding communication technology, and so on.
For example, when a neural network model is trained through the model training device, the model training device may receive a data request S, and the data request S may include a training data identifier T of the data required for training the neural network model.
102. And inquiring target training data corresponding to the training data identifier from a local cache, wherein the local cache is used for storing training data.
The model training method can be integrated on the model training equipment, and the model training equipment can be computer equipment such as a terminal or a server.
The local cache may include, among other things, high-speed accessible memory on a computer device that incorporates the model training method, such as high-speed memory in the model training device.
The target training data may include data required for model training, the target training data corresponds to training data identifiers, and the target training data may include various forms, such as text, images, video, voice, and the like.
The amount of data required in the whole model training process is very large, so the training data is usually stored in a remote data cluster, such as a server. The model training process requires many rounds of training, and the data required in each round differs, so before each round of training the target training data must be obtained from the remote data cluster, which consumes a lot of time and makes the whole model training process time-consuming.
For example, the target training data M corresponding to the training data identifier T may be queried from a local cache, which is a high-speed memory on the model training device.
In some embodiments, the model training method further comprises:
receiving cache configuration parameter information; after the cache configuration parameter information passes the verification, storing the cache configuration parameter information; and establishing a local cache based on the cache configuration parameter information.
According to the method and the device, the local cache can be established in the computer equipment, so that the target training data can be stored and read at high speed in the local cache in the model training process, and the model training efficiency is effectively improved.
The cache configuration parameter information may include information related to the parameters of the local cache to be established; for example, it may include cache directory information, cache capacity information, and the like. The cache configuration parameter information may be manually input by a user, or may be determined automatically based on the target training data required by model training, for example according to the amount of target training data required by one round of model training, and so on.
After the cache configuration parameter information is received, it may be checked to determine whether it meets a setting requirement. The setting requirement may include a hard requirement imposed by the computer device itself, or an individual requirement of a specific model training process, and can be flexibly set, which is not limited here. When the cache configuration parameter information passes the check, it may be stored in a disk of the computer device for subsequent reading and use.
Before the local cache needs to be used, the stored cache configuration parameter information can be read, and the local cache is established on the computer equipment based on the cache configuration parameter information.
For example, receiving the cache configuration parameter information 1, after the cache configuration parameter information 1 passes verification, storing the cache configuration parameter information 1 in a disk of the model training device, and finally establishing a local cache in the model training device based on the cache configuration parameter information 1.
In some embodiments, the step of storing the cache configuration parameter information after the cache configuration parameter information passes the verification includes:
after the cache configuration parameter information passes the verification, detecting whether a model training starting message is received or not; and when the model training starting message is received, storing the cache configuration parameter information.
In computer equipment, a cache is a memory that enables high-speed data exchange, but it is used in many scenarios. To ensure effective use of the cache in the computer equipment, the local cache used for model training in the present application can be established at a specific time, so that it can be used soon after it is established and unnecessary occupation of the cache is avoided. For example, the local cache can be established at the beginning of model training or just before model training begins.
In addition, when special conditions such as power failure restart of computer equipment occur, the cache configuration parameter information can be read from the disk, and the local cache can be quickly reestablished.
The model training initiation message may indicate that model training is about to start. In different model training scenarios, the model training initiation message may have different contents or different sources; specifically, it may be sent based on an operation of the user, or automatically generated and triggered by the system.
In some scenarios, the model training initiation message may include a message with a specific role, such as a container creation instruction. Since such a message indicates that model training is about to begin, it may be designated as the model training initiation message.
Therefore, after the cache configuration parameter information passes the verification, when the model training start message is received, the cache configuration parameter information can be stored, and the local cache is established based on the cache configuration parameter information.
In some embodiments, the step "saving the cache configuration parameter information when receiving the model training initiation message" includes:
when a model training starting message is received, generating a parameter storage request; sending a parameter storage request to a data system, and triggering the data system to store cache configuration parameter information;
at this time, the step "establishing a local cache based on the cache configuration parameter information" may include:
the trigger data system establishes a local cache based on the cache configuration parameter information.
In the present application, the data system can implement operations related to the training data, and the data system can be integrated on the same computer device as the model training method. Specifically, when the model training initiation message is received, a parameter storage request is generated based on the model training initiation message and sent to the data system, and the data system can store the cache configuration parameter information in a disk of the computer device based on the parameter storage request.
The data system may also perform the establishment and deletion of the local cache, for example, the data system may establish the local cache based on the stored cache configuration parameter information.
For example, when the model training initiation message 1 is received, a parameter storage request 1 may be generated and sent to the data system S, and the data system S may store the cache configuration parameter information in the disk of the model training device based on the parameter storage request 1 and establish the local cache based on the cache configuration parameter information.
In some embodiments, the cache configuration parameter information includes a cache directory and a cache capacity, and the model training method further includes:
and when the cache directory exists locally and the cache capacity does not exceed the locally available threshold, determining that the cache configuration parameter information passes the verification.
In order to ensure that the local cache is successfully established, before the establishing operation is performed, the cache configuration parameter information may be checked, where the cache configuration parameter information may include a cache directory and a cache capacity, the cache directory may include address information of the local cache to be established in the computer device, and the cache capacity may include storage capacity information occupied by the local cache to be established in the computer device cache.
The local available threshold may be flexibly set according to an actual application scenario, for example, the local available threshold may include an upper limit of an available capacity of a computer device cache, or may also include an artificially set upper limit of a local cache capacity, and the like.
Specifically, the checking may include checking whether the cache directory already exists locally on the computer device and whether the cache capacity exceeds the locally available threshold; when the cache directory exists and the cache capacity does not exceed the threshold, it may be determined that the cache configuration parameter information passes the check.
For example, the cache configuration parameter information includes a cache directory 1 and a cache capacity 1, and when the cache directory 1 already exists locally and the cache capacity 1 does not exceed the locally available threshold Y, it may be determined that the cache configuration parameter information passes the check.
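The check described above can be expressed as a short validation routine. The following is a minimal sketch in Python, assuming a simple CacheConfig structure and a helper that derives the locally available threshold from the free space of the disk holding the cache directory; the names and the byte-based capacity unit are illustrative assumptions, not details taken from the patent.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class CacheConfig:
    cache_dir: str        # directory that will hold the local cache
    cache_capacity: int   # requested cache capacity in bytes


def local_available_threshold(cache_dir: str) -> int:
    """Assumed threshold: free space on the disk that holds the cache directory."""
    return shutil.disk_usage(os.path.dirname(cache_dir) or "/").free


def check_cache_config(cfg: CacheConfig) -> bool:
    """Pass the check only when the cache directory already exists locally and
    the requested capacity does not exceed the locally available threshold."""
    if not os.path.isdir(cfg.cache_dir):
        return False
    if cfg.cache_capacity > local_available_threshold(cfg.cache_dir):
        return False
    return True
```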
103. And when target training data corresponding to the training data identification exist in the local cache, loading the target training data to the graphics processor so that the graphics processor performs model training based on the target training data.
The target training data can be queried in the computer device (for example, a model training device) integrated with the model training method. When the target training data exist, the target training data in the local cache can be read out directly and loaded into the graphics processor of the computer device. The graphics processor inputs the target training data into the model and performs calculation to obtain an output result, and the parameters in the model are updated in combination with content such as the data labels carried by the target training data, thereby completing one round of model training.
In the present application, the target training data are first queried in the local cache of the computer device. When they exist there, only data flow inside the computer device is needed: the target training data in the local cache are loaded to the graphics processor and model training can proceed. Compared with obtaining the target training data from a server, this saves a large amount of time and thus effectively improves model training efficiency.
For example, when the target training data M exists in the local cache, the target training data M may be loaded into the local memory of the model training device, and then loaded into the graphics processor of the model training device, and the graphics processor trains the neural network model to be trained through the target training data M.
104. And when target training data corresponding to the training data identification does not exist in the local cache, acquiring the target training data from the data cluster, and loading the target training data to the graphics processor so that the graphics processor performs model training based on the target training data.
In the present application, when the target training data are not found in the local cache, a training data acquisition request containing the training data identifier can be sent to the data cluster, the target training data returned by the data cluster based on the acquisition request are received, and the target training data are loaded to the graphics processor so that model training is performed in the graphics processor with the target training data.
Specifically, the process of obtaining the target training data from the data cluster may also be implemented by the data system, for example, sending a training data obtaining request including a training data identifier to the data cluster through the data system, and receiving the target training data returned by the data cluster based on the data obtaining request.
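Putting steps 102 to 104 together, the cache-first lookup with a fall-back to the remote data cluster can be sketched as follows. This is a minimal illustration in Python; the read_from_cluster and load_to_gpu placeholders and the on-disk layout (one file per data identifier under the cache directory) are assumptions made for the example, not details taken from the patent.

```python
import os
from typing import Optional

CACHE_DIR = "/data/local_cache"  # hypothetical local cache directory


def query_local_cache(data_id: str) -> Optional[bytes]:
    """Step 102: look the training sample up in the local cache."""
    path = os.path.join(CACHE_DIR, data_id)
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return f.read()
    return None


def read_from_cluster(data_id: str) -> bytes:
    """Placeholder for reading one sample from the remote data cluster (e.g. via ceph-fuse)."""
    raise NotImplementedError


def load_to_gpu(sample: bytes) -> None:
    """Placeholder for handing the sample to the graphics processor for training."""
    raise NotImplementedError


def fetch_and_train(data_id: str) -> None:
    sample = query_local_cache(data_id)   # step 102: query the local cache
    if sample is not None:                # step 103: cache hit, load directly
        load_to_gpu(sample)
    else:                                 # step 104: cache miss, fall back to the cluster
        sample = read_from_cluster(data_id)
        load_to_gpu(sample)
```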
In some embodiments, the training data identifier includes a first data identifier and a second data identifier, and the step of loading the target training data to the graphics processor when the target training data corresponding to the training data identifier exists in the local cache, so that the graphics processor performs model training based on the target training data may include:
and when first training data corresponding to the first data identification exists in the local cache, loading the first training data to the graphics processor, and loading second training data corresponding to the second data identification acquired from the data cluster to the graphics processor, so that the graphics processor performs model training based on the first training data and the second training data.
The amount of target training data required to complete one round of model training is huge, so the data request may contain multiple training data identifiers, and the data stored in the local cache may cover only part of the target training data corresponding to those identifiers. In that case, the part of the target training data in the local cache can be loaded to the graphics processor, and the other part of the target training data acquired from the data cluster can be loaded to the graphics processor, so that the graphics processor performs model training based on the target training data.
For example, the training data identifier may include a first data identifier and a second data identifier, the training data identifier corresponds to target training data, the first data identifier corresponds to first training data, the second data identifier corresponds to second training data, and the target training data may include first training data and second training data.
In some embodiments, the step of "loading second training data corresponding to the second data identification obtained from the data cluster to the graphics processor" may include:
sending a second data request to the data cluster, wherein the second data request comprises a second data identifier; loading second training data returned by the data cluster based on the second data request to a local memory, wherein the second training data corresponds to a second data identifier; and loading the second training data from the local memory to the graphics processor.
For example, a second data request 1 is sent to the data server, where the second data request 1 includes a second data identifier 1, the data server returns second training data based on the second data request 1, loads the returned second training data to the local memory of the model training device, and then loads the second training data from the local memory of the model training device to the graphics processor.
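As a rough illustration of this partial-hit case, the following sketch continues the helpers assumed in the earlier example: the first training data found in the local cache are loaded directly, while a second data request is issued for the missing part, which is staged in local memory and then loaded to the graphics processor. The helper names are again assumptions for the example.

```python
def load_with_partial_hit(first_id: str, second_id: str) -> None:
    # First training data: assumed present in the local cache, load them straight to the GPU.
    first_data = query_local_cache(first_id)
    load_to_gpu(first_data)

    # Second training data: not cached, so send a second data request to the data cluster.
    second_data = read_from_cluster(second_id)  # returned data land in local memory first
    load_to_gpu(second_data)                    # then move them from memory to the GPU
```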
In some embodiments, the model training method further comprises:
and asynchronously writing second training data returned by the data cluster based on the second data request into the local cache.
In the multi-round training of a model, the target training data required in each round is a part of all the training data, and the complete set of training data does not change; for example, the training data identifiers required for each round may be determined by sampling the identifiers of all the training data, so that the target training data of each round make up a fixed proportion of all the training data and the number of training rounds is large. Therefore, starting from the first round of model training, the target training data (or part of them) acquired from the data cluster can be written into the local cache, so that when the target training data of a later round are acquired, part of them can be read from the local cache that already stores a portion of all the training data, and only the part that does not exist in the local cache needs to be acquired from the data cluster.
Specifically, the target training data (or part of them) acquired from the data cluster can be written into the local cache in an asynchronous manner while the computer device simultaneously performs model training based on the target training data; the two are processed in parallel, which effectively improves the efficiency of the whole model training process.
For example, the data server may return the second training data based on the second data request 1, and the second training data may be asynchronously written into the local cache of the model training device.
In some embodiments, the step of "asynchronously writing second training data returned by the data cluster based on the second data request to the local cache" may include:
and when the current used capacity of the local cache is smaller than the set cache capacity, asynchronously writing second training data returned by the data cluster based on the second data request into the local cache.
Because a set cache capacity is configured for the local cache, when the currently used capacity of the local cache is greater than the set cache capacity, the target training data or part of them (for example, the second training data in this embodiment) can no longer be written into the local cache; when the currently used capacity of the local cache is less than or equal to the set cache capacity, the target training data or part of them can be written into the local cache.
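A minimal sketch of this asynchronous, capacity-checked write is shown below, using a background thread so the training step is not blocked. The directory-size accounting, the thread-based approach, and the 10 GiB capacity value are assumptions introduced for illustration.

```python
import os
import threading

CACHE_DIR = "/data/local_cache"       # hypothetical cache directory, as in the earlier sketch
SET_CACHE_CAPACITY = 10 * 1024 ** 3   # assumed configured cache capacity: 10 GiB


def used_cache_capacity() -> int:
    """Currently used capacity: total size of the files under the cache directory."""
    total = 0
    for name in os.listdir(CACHE_DIR):
        total += os.path.getsize(os.path.join(CACHE_DIR, name))
    return total


def async_write_to_cache(data_id: str, sample: bytes) -> None:
    """Asynchronously write a sample returned by the data cluster into the local cache,
    but only while the used capacity is below the set cache capacity."""
    def _write() -> None:
        if used_cache_capacity() < SET_CACHE_CAPACITY:
            with open(os.path.join(CACHE_DIR, data_id), "wb") as f:
                f.write(sample)

    threading.Thread(target=_write, daemon=True).start()
```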
In some embodiments, the model training method further comprises:
and when the model training termination message is received, deleting the local cache.
In computer equipment, a cache is a memory that enables high-speed data exchange, but it is used in many scenarios. To ensure effective use of the cache in the computer equipment, the local cache used for model training in the present application can be deleted at a specific time, so that unnecessary occupation of the cache is avoided.
In the present application, the local cache is deleted when it no longer has any use value, for example when model training is terminated, and the termination of model training can be determined by a model training termination message.
The model training termination message may indicate that model training has been terminated. In different model training scenarios, the model training termination message may have different contents or different sources; for example, it may be sent manually or generated and sent automatically by the system, and it may correspond to training being stopped midway or to training being completed. In some scenarios, the model training termination message may include a message that implies the training has stopped midway or has been completed, for example a container destruction instruction, and when a container destruction instruction is received, the local cache may be deleted.
In some embodiments, the step "delete local cache when model training termination message is received" may comprise:
when a model training termination message is received, generating a cache destruction request;
and sending a cache destroy request to the data system, and triggering the data system to delete the local cache.
Specifically, the local cache may be deleted by the data system, for example, when the model training termination message is received, a cache destruction request may be generated, and the cache destruction request is sent to the data system, so that the data system deletes the local cache.
For example, when receiving the model training termination message 1, the cache destruction request 1 may be generated, and the cache destruction request 1 is sent to the data system S, and the data system S may delete the established local cache based on the cache destruction request 1.
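A minimal sketch of this cache destruction path is given below: on receiving the model training termination message, a cache destruction request is generated and handed to the data system, which removes the local cache directory. The request and class names are assumptions introduced for the example.

```python
import shutil
from dataclasses import dataclass


@dataclass
class CacheDestroyRequest:
    cache_dir: str  # directory of the local cache to be deleted


class DataSystem:
    def destroy_cache(self, request: CacheDestroyRequest) -> None:
        """Delete the local cache directory and everything stored in it."""
        shutil.rmtree(request.cache_dir, ignore_errors=True)


def on_model_training_terminated(data_system: DataSystem, cache_dir: str) -> None:
    # Generate the cache destruction request and trigger the data system to delete the cache.
    data_system.destroy_cache(CacheDestroyRequest(cache_dir=cache_dir))
```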
According to the method and the device, when a data request is received, the target training data required by model training can first be queried in the local cache of the computer equipment; when the target training data exist in the local cache, the target training data in the local cache can be loaded to the graphics processor for model training. Compared with the prior art, which must read the data from a remote storage cluster, this effectively shortens the time consumed in reading the training data and thus effectively improves model training efficiency.
The method described in the above embodiments is further illustrated in detail by way of example.
The present application will be described by taking as an example a model training method integrated in a model training device, where the model training device may be a computer device, as shown in fig. 3, and fig. 3 is a schematic flow chart of the model training method provided in the embodiment of the present application. The model training method can comprise the following steps:
201. the computer device receives cache configuration parameter information.
For example, the cache configuration parameter information may include a cache directory and a cache capacity customized by a user according to the training requirements of the actual model.
202. And when the cache configuration parameter information passes the verification and the model training starting message is received, the computer equipment stores the cache configuration parameter information.
For example, referring to fig. 4, the cache directory and the space size (that is, the cache capacity) may be configured as parameters (for example, specified by the user), and then the configuration parameters are checked. The check may include determining whether the cache directory already exists and whether the cache capacity exceeds the available free threshold of the local cache of the computer device. If the cache directory already exists and the cache capacity does not exceed the available free threshold, it may be determined that the cache configuration parameter information passes the check; if the cache directory does not exist or the cache capacity exceeds the available free threshold, it may be determined that the cache configuration parameter information fails the check, and prompt information needs to be returned to instruct the user to reconfigure the cache directory and the cache capacity.
The user can submit a model training task, and a task container (pod) is created for the task as needed (that is, the model training start message is generated). At this time, the cache configuration parameter information can be passed to the client of the storage system on the computer device (that is, the fuse client), and the fuse client stores the cache configuration parameter information in the local disk of the computer device (that is, the fuse client persists the passed-in parameters). After the cache configuration parameter information is stored in the local disk, when the client of the storage system is restarted or the like, it can directly establish the local cache according to the stored cache configuration parameter information, which effectively improves the stability of the model training process.
203. The computer device establishes a local cache based on the cache configuration parameter information.
For example, the client of the storage system establishes a local read-only cache in the computer device according to the cache directory and the cache capacity. Data stored in the local read-only cache cannot be modified, which ensures the consistency of the training data stored in the local read-only cache during the whole model training process and effectively guarantees that model training can proceed smoothly.
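The following is a rough sketch of creating such a cache directory from the stored configuration, assuming a POSIX file system; dropping write permission after each file is written is one simple way to approximate the read-only behavior described here, and is an illustrative choice rather than the patent's prescribed mechanism.

```python
import os
import stat


def create_local_cache(cache_dir: str) -> None:
    """Create the local cache directory specified by the cache configuration parameters."""
    os.makedirs(cache_dir, exist_ok=True)


def seal_cached_file(path: str) -> None:
    """After a training sample has been written into the cache, drop write permission
    so the cached copy cannot be modified during the rest of the training run."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```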
204. The computer equipment acquires a first-round data request, wherein the first-round data request comprises first-round data identification required by first-round model training.
205. The computer equipment obtains first-round training data corresponding to the first-round data identification from the data cluster, and loads the first-round training data to the graphics processor, so that the graphics processor performs first-round model training based on the first-round training data.
For example, the model training process of a model to be trained may include multiple rounds. Before the first round of model training, the first-round training data may be acquired from the storage system; the storage system may include a remote data cluster in which the training data are stored. The first-round training data corresponding to the first-round data identification are read from the remote data cluster, loaded into the memory of the computer device, and then loaded from the memory into the graphics processor to perform the first round of model training.
206. The computer device asynchronously writes the first round of training data into the local cache.
For example, the first-round training data are read from the remote data cluster of the storage system, and the read first-round training data can be asynchronously written into the local read-only cache of the computer device, so that subsequent training rounds can first search the local read-only cache. By writing asynchronously, the first-round model training and the process of writing the data into the local read-only cache can be processed in parallel, which effectively improves the efficiency of the whole model training process.
207. The computer device obtains a training data request, the training data request including training data identifiers required for model training, the training data identifiers including a first data identifier and a second data identifier.
208. When first training data corresponding to the first data identification exists in the local cache, the computer equipment loads the first training data to the graphics processor, and loads second training data corresponding to the second data identification acquired from the data cluster to the graphics processor, so that the graphics processor performs model training based on the first training data and the second training data.
After the first round of model training is completed, the local read-only cache contains the first-round training data. The first-round training data are part of all the training data required by the whole model training process, the complete set of training data required by the whole process does not change, and the training data required in each round are extracted from that set, so part of the training data required in each round are data that have already been used in earlier rounds.
Based on this, in each round of model training, including the first round, as long as the available capacity of the local read-only cache is not zero, the training data required for the current round and read from the remote data cluster can be written into the local read-only cache. In this way, in each round of model training, the target training data required for the current round can first be queried in the local read-only cache, and when only part of the target training data exist in the local read-only cache, the other part can be acquired from the remote data cluster, which effectively saves model training time and improves model training efficiency.
For example, referring to fig. 5, the process of obtaining the target training data may be as follows: first, the AI training data are read from the read-only cache on the local disk; when the data exist on the local disk, they can be loaded into the memory to run AI training; when the data do not exist on the local disk, the required AI training data can be read from the remote ceph data cluster, asynchronously written into the cache on the local disk, and loaded into the memory to run AI training.
209. The computer device deletes the local cache when receiving the model training termination message.
For example, when the whole model training process is completed or is terminated midway, the container (pod) is destroyed. At the moment the container is destroyed, an instruction to clean up the local read-only cache can be passed to the client of the storage system, and the client can delete the local read-only cache according to the instruction, thereby reclaiming the cache space of the computer equipment.
The whole model training process may refer to fig. 6. The data reading process may include: the AI training sends a data reading request to the client of the storage system (for example, ceph-fuse); the client of the storage system reads the training data from the remote storage cluster on the cloud (for example, the ceph cluster) and returns them for AI training; and the client of the storage system asynchronously caches the training data to the local disk (for example, an ssd). The local read-only cache can be created by the client of the storage system based on the cache directory and the space size when the pod is created, and the local read-only cache can be deleted (the training data and the local read-only cache are removed) by the client of the storage system when the pod is destroyed.
According to the method and the device, when a data request is received, the target training data required by model training can first be queried in the local cache of the computer equipment; when the target training data exist in the local cache, the target training data in the local cache can be loaded to the graphics processor for model training. Compared with the prior art, which must read the data from a remote storage cluster, this effectively shortens the time consumed in reading the training data and thus effectively improves model training efficiency.
In order to better implement the model training method provided by the embodiments of the present application, the embodiments of the present application further provide an apparatus based on the model training method. The meanings of the terms are the same as those in the model training method, and for specific implementation details, reference can be made to the description in the method embodiments.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application. The model training apparatus may include an obtaining module 301, a querying module 302, a loading module 303, and a cluster obtaining module 304, wherein,
an obtaining module 301, configured to obtain a data request, where the data request includes a training data identifier required by model training;
a query module 302, configured to query target training data corresponding to the training data identifier from a local cache, where the local cache is used for storing training data;
the loading module 303 is configured to load the target training data to the graphics processor when the target training data corresponding to the training data identifier exists in the local cache, so that the graphics processor performs model training based on the target training data.
The cluster obtaining module 304 is configured to, when target training data corresponding to the training data identifier does not exist in the local cache, obtain the target training data from the data cluster, and load the target training data to the graphics processor, so that the graphics processor performs model training based on the target training data.
In some embodiments, the training data identifier includes a first data identifier and a second data identifier, and the loading module is specifically configured to:
and when first training data corresponding to the first data identification exists in the local cache, loading the first training data to the graphics processor, and loading second training data corresponding to the second data identification acquired from the data cluster to the graphics processor, so that the graphics processor performs model training based on the first training data and the second training data.
In some embodiments, the training data identifier includes a first data identifier and a second data identifier, and the loading module is specifically configured to:
when first training data corresponding to the first data identification exist in the local cache, loading the first training data to the graphics processor;
sending a second data request to the data cluster, wherein the second data request comprises a second data identifier;
loading second training data returned by the data cluster based on the second data request to a local memory, wherein the second training data corresponds to a second data identifier;
and loading the second training data from the local memory to the graphics processor.
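The staging of the second training data through local memory before the GPU transfer might look like the sketch below; fetch_from_cluster is a hypothetical callable standing in for the second data request sent to the data cluster.

```python
# Sketch under assumptions: fetch_from_cluster stands in for the second data
# request sent to the data cluster; the returned bytes are staged in local
# (host) memory first and only then copied to the graphics processor.
def load_second_training_data(second_id: str, fetch_from_cluster, load_to_gpu):
    second_request = {"training_data_id": second_id}  # second data request
    blob = fetch_from_cluster(second_request)          # returned by the data cluster
    local_memory_buffer = bytearray(blob)               # staged in local memory
    return load_to_gpu(bytes(local_memory_buffer))      # then loaded to the GPU
```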
In some embodiments, the model training apparatus further comprises:
and the asynchronous loading module is used for asynchronously writing the second training data returned by the data cluster based on the second data request into the local cache.
In some embodiments, the asynchronous loading module is specifically configured to:
and when the current used capacity of the local cache is smaller than the set cache capacity, asynchronously writing second training data returned by the data cluster based on the second data request into the local cache.
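The capacity check that gates this asynchronous write-back can be expressed as a small function; the in-memory cache and byte-based accounting used here are assumptions for illustration.

```python
# Illustrative capacity check: only write the second training data back to the
# local cache when the currently used capacity is smaller than the set capacity.
import threading


def async_write_back(cache: dict, data_id: str, blob: bytes, set_capacity: int) -> None:
    def _write():
        used = sum(len(v) for v in cache.values())  # current used capacity
        if used < set_capacity:                      # smaller than the set cache capacity
            cache[data_id] = blob                    # keep the data for later epochs
        # otherwise skip silently: the cache is full and training is unaffected
    threading.Thread(target=_write, daemon=True).start()
```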
In some embodiments, the model training apparatus further comprises:
the parameter receiving module is used for receiving cache configuration parameter information;
the parameter storage module is used for storing the cache configuration parameter information after the cache configuration parameter information passes the verification;
and the cache establishing module is used for establishing a local cache based on the cache configuration parameter information.
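The receive / verify / save / establish sequence for the cache configuration parameters might be sketched as follows; on_cache_config_received and the parameter names are illustrative, and the verification predicate is passed in (a possible form of it is sketched with the check module below).

```python
# Hypothetical sketch of the sequence: receive the cache configuration
# parameters, verify them, save them, then establish the local cache; all
# names here are illustrative, and the verify predicate is passed in.
import os


def on_cache_config_received(params: dict, saved_configs: dict, verify) -> bool:
    # 1. receive the cache configuration parameter information
    if not verify(params):                           # 2. verification
        return False
    saved_configs["cache_config"] = params           # 3. save after verification
    os.makedirs(params["cache_dir"], exist_ok=True)  # 4. establish the local cache
    return True
```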
In some embodiments, the parameter preservation module includes a detection submodule and a preservation submodule, wherein,
the detection submodule is used for detecting whether a model training starting message is received or not after the cache configuration parameter information passes the verification;
and the storage submodule is used for storing the cache configuration parameter information when the model training starting message is received.
In some embodiments, the save submodule is specifically configured to:
when a model training starting message is received, generating a parameter storage request;
sending a parameter storage request to a data system, and triggering the data system to store cache configuration parameter information;
at this time, the cache establishment module is specifically configured to:
the trigger data system establishes a local cache based on the cache configuration parameter information.
In some embodiments, the cache configuration parameter information includes a cache directory and a cache capacity, and the model training apparatus further includes:
and the checking module is used for determining that the cache configuration parameter information passes the check when the cache directory exists locally and the cache capacity does not exceed a locally available threshold.
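This check rule can be written as a small predicate, for example as below; using shutil.disk_usage to obtain the locally available threshold is an assumption of this sketch.

```python
# Sketch of the check rule: pass only when the cache directory exists locally
# and the requested cache capacity does not exceed the locally available space.
import os
import shutil


def cache_config_passes_check(cache_dir: str, cache_capacity: int) -> bool:
    if not os.path.isdir(cache_dir):                       # directory must exist locally
        return False
    locally_available = shutil.disk_usage(cache_dir).free  # locally available threshold
    return cache_capacity <= locally_available
```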
In some embodiments, the model training apparatus further comprises:
and the cache deleting module is used for deleting the local cache when the model training termination message is received.
In some embodiments, the cache deletion module is specifically configured to:
when a model training termination message is received, generating a cache destruction request;
and sending a cache destroy request to the data system, and triggering the data system to delete the local cache.
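Symmetrically, the tear-down on a model training termination message might be sketched as follows; the message shape is an assumption, and the deletion performed here stands in for what the data system would be triggered to do.

```python
# Illustrative only: on a model training termination message, generate a cache
# destroy request; the deletion done here stands in for what the data system
# would be triggered to do.
import shutil


def on_training_termination(message: dict, cache_dir: str) -> bool:
    if message.get("type") != "model_training_termination":
        return False
    destroy_request = {"cache_dir": cache_dir}  # cache destroy request
    shutil.rmtree(destroy_request["cache_dir"], ignore_errors=True)
    return True
```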
In the present application, the obtaining module 301 may obtain a data request, where the data request includes a training data identifier required for model training; the query module 302 may query target training data corresponding to the training data identifier from a local cache, where the local cache is used for storing training data; the loading module 303 may load the target training data to the graphics processor when the target training data corresponding to the training data identifier exists in the local cache, so that the graphics processor performs model training based on the target training data; and the cluster obtaining module 304 may obtain the target training data from the data cluster when the target training data corresponding to the training data identifier does not exist in the local cache, and load the target training data to the graphics processor, so that the graphics processor performs model training based on the target training data.
According to the method and the apparatus of the present application, when a data request is received, the target training data required by model training can be queried from the local cache of the computer device; when the target training data exist in the local cache, the target training data in the local cache can be loaded to the graphics processor for model training. Compared with the prior art, in which the training data always need to be read from a remote storage cluster, this can effectively shorten the time consumed by reading the training data and thus effectively improve the model training efficiency.
In addition, an embodiment of the present application further provides a computer device, which may be a terminal or a server. Fig. 8 shows a schematic structural diagram of the computer device according to the embodiment of the present application. Specifically:
the computer device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 8 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user pages, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the computer device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The computer device further includes a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 via a power management system, so that functions such as charging, discharging, and power consumption management are implemented via the power management system. The power supply 403 may also include one or more of a direct-current or alternating-current power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
The computer device may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
acquiring a data request, wherein the data request includes a training data identifier required by model training; querying target training data corresponding to the training data identifier from a local cache, wherein the local cache is used for storing training data; when target training data corresponding to the training data identifier exist in the local cache, loading the target training data to the graphics processor, so that the graphics processor performs model training based on the target training data; and when target training data corresponding to the training data identifier do not exist in the local cache, acquiring the target training data from the data cluster, and loading the target training data to the graphics processor, so that the graphics processor performs model training based on the target training data.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the above embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.
To this end, the present application further provides a storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute the steps in any one of the model training methods provided in the present application. For example, the computer program may perform the steps of:
acquiring a data request, wherein the data request includes a training data identifier required by model training; querying target training data corresponding to the training data identifier from a local cache, wherein the local cache is used for storing training data; when target training data corresponding to the training data identifier exist in the local cache, loading the target training data to the graphics processor, so that the graphics processor performs model training based on the target training data; and when target training data corresponding to the training data identifier do not exist in the local cache, acquiring the target training data from the data cluster, and loading the target training data to the graphics processor, so that the graphics processor performs model training based on the target training data.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the storage medium can execute the steps in any model training method provided in the embodiments of the present application, the beneficial effects that can be achieved by any model training method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The model training method and apparatus, the storage medium, and the computer device provided in the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as a limitation of the present application.

Claims (14)

1. A method of model training, comprising:
acquiring a data request, wherein the data request comprises a training data identifier required by model training;
inquiring target training data corresponding to the training data identification from a local cache, wherein the local cache is used for storing training data;
when target training data corresponding to the training data identification exist in the local cache, loading the target training data to a graphics processor so that the graphics processor performs model training based on the target training data;
and when target training data corresponding to the training data identification does not exist in the local cache, acquiring the target training data from the data cluster, and loading the target training data to a graphics processor so that the graphics processor performs model training based on the target training data.
2. The method of claim 1, wherein the training data identifier comprises a first data identifier and a second data identifier, and wherein loading the target training data to a graphics processor when target training data corresponding to the training data identifier exists in the local cache, so that the graphics processor performs model training based on the target training data comprises:
and when first training data corresponding to the first data identification exists in the local cache, loading the first training data to a graphics processor, and loading second training data corresponding to the second data identification acquired from a data cluster to the graphics processor, so that the graphics processor performs model training based on the first training data and the second training data.
3. The method of claim 2, wherein loading second training data corresponding to the second data identification obtained from the data cluster to a graphics processor comprises:
sending a second data request to the data cluster, the second data request comprising a second data identification;
loading second training data returned by the data cluster based on the second data request to a local memory, wherein the second training data corresponds to the second data identifier;
and loading the second training data from the local memory to a graphics processor.
4. The method of claim 3, further comprising:
and asynchronously writing second training data returned by the data cluster based on the second data request into a local cache.
5. The method of claim 4, wherein asynchronously writing second training data returned by the data cluster based on the second data request to a local cache comprises:
and when the current used capacity of the local cache is smaller than the set cache capacity, asynchronously writing second training data returned by the data cluster based on the second data request into the local cache.
6. The method of claim 1, further comprising:
receiving cache configuration parameter information;
after the cache configuration parameter information passes the verification, storing the cache configuration parameter information;
and establishing a local cache based on the cache configuration parameter information.
7. The method according to claim 6, wherein the saving the cache configuration parameter information after the cache configuration parameter information passes the verification comprises:
after the cache configuration parameter information passes the verification, detecting whether a model training starting message is received or not;
and when a model training starting message is received, storing the cache configuration parameter information.
8. The method of claim 7, wherein saving the cache configuration parameter information when receiving a model training initiation message comprises:
when a model training starting message is received, generating a parameter storage request;
sending the parameter storage request to a data system, and triggering the data system to store the cache configuration parameter information;
establishing a local cache based on the cache configuration parameter information includes:
and triggering the data system to establish a local cache based on the cache configuration parameter information.
9. The method of claim 6, wherein the cache configuration parameter information comprises a cache directory and a cache capacity, and wherein the method further comprises:
and when the cache directory exists locally and the cache capacity does not exceed a locally available threshold, determining that the cache configuration parameter information passes the check.
10. The method of claim 1, further comprising:
and deleting the local cache when a model training termination message is received.
11. The method of claim 10, wherein deleting the local cache when a model training termination message is received comprises:
when a model training termination message is received, generating a cache destruction request;
and sending the cache destroying request to a data system, and triggering the data system to delete the local cache.
12. A model training apparatus, comprising:
the acquisition module is used for acquiring a data request, wherein the data request comprises a training data identifier required by model training;
the query module is used for querying the target training data corresponding to the training data identification from a local cache, wherein the local cache is used for storing training data;
the loading module is used for loading the target training data to a graphics processor when the target training data corresponding to the training data identification exists in the local cache, so that the graphics processor performs model training based on the target training data;
and the cluster acquisition module is used for acquiring the target training data from the data cluster and loading the target training data to the graphics processor when the target training data corresponding to the training data identification does not exist in the local cache, so that the graphics processor performs model training based on the target training data.
13. A storage medium, characterized in that it stores a plurality of computer programs adapted to be loaded by a processor for performing the steps of the method according to any one of claims 1 to 11.
14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method according to any of claims 1 to 11 are implemented when the computer program is executed by the processor.
CN202110216012.8A 2021-02-26 2021-02-26 Model training method and device, storage medium and computer equipment Pending CN113704299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110216012.8A CN113704299A (en) 2021-02-26 2021-02-26 Model training method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110216012.8A CN113704299A (en) 2021-02-26 2021-02-26 Model training method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN113704299A true CN113704299A (en) 2021-11-26

Family

ID=78647713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110216012.8A Pending CN113704299A (en) 2021-02-26 2021-02-26 Model training method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN113704299A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563499A (en) * 2021-12-02 2023-01-03 华为技术有限公司 Method, device and system for training model and computing node
WO2023098794A1 (en) * 2021-12-02 2023-06-08 华为技术有限公司 Training acceleration method and related device
CN117931302A (en) * 2024-03-20 2024-04-26 苏州元脑智能科技有限公司 Parameter file saving and loading method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113704299A (en) Model training method and device, storage medium and computer equipment
CN111553138B (en) Auxiliary writing method and device for standardizing content structure document
CN116431878A (en) Vector retrieval service method, device, equipment and storage medium thereof
CN116721007B (en) Task control method, system and device, electronic equipment and storage medium
CN113590304A (en) Business data processing method and device, computer equipment and storage medium
CN112307166B (en) Intelligent question-answering method and device, storage medium and computer equipment
CN113971455A (en) Distributed model training method and device, storage medium and computer equipment
CN116975336A (en) Image processing method, device, equipment and storage medium based on artificial intelligence
CN114610446B (en) Method, device and system for automatically injecting probe
CN116382798A (en) Method, system and equipment for establishing model service
CN113672522B (en) Test resource compression method and related equipment
CN113569581B (en) Intention recognition method, device, equipment and storage medium
CN114490432A (en) Memory processing method and device, electronic equipment and computer readable storage medium
CN113392220A (en) Knowledge graph generation method and device, computer equipment and storage medium
CN114615287B (en) File backup method and device, computer equipment and storage medium
CN110457392B (en) Copy reading and writing method and device
CN109947561B (en) Virtual scene processing method and device and storage medium
CN112433812B (en) Virtual machine cross-cluster migration method, system, equipment and computer medium
CN117093259B (en) Model configuration method and related equipment
CN115993929B (en) Storage device management method, storage device management device, electronic device and storage medium
CN116243915A (en) Data processing method, device, computer equipment and computer readable storage medium
US20240089222A1 (en) Intelligent, personalized, and dynamic chatbot conversation
CN110297598B (en) Data synchronization method and storage system
CN116016291A (en) Monitoring method and device of system cluster, storage medium, electronic equipment and product
CN115640060A (en) Application preloading method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination