WO2017107788A1 - Machine learning tool middleware and training method of machine learning - Google Patents

Machine learning tool middleware and training method of machine learning

Info

Publication number: WO2017107788A1
Authority: WO (WIPO, PCT)
Prior art keywords: training, unit, machine learning, data, middleware
Application number: PCT/CN2016/109370
Other languages: French (fr), Chinese (zh)
Inventors: 雷鸣 (Lei Ming), 鄢志杰 (Yan Zhijie)
Original assignee: 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited), 雷鸣, 鄢志杰
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by: 阿里巴巴集团控股有限公司, 雷鸣, 鄢志杰
Publication: WO2017107788A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning


Abstract

A machine learning tool middleware and a machine learning training method. The machine learning tool comprises at least one training unit, each of which is provided with middleware coupled to the machine learning tool. Each middleware comprises an underlying communication module and at least one of a data distribution module, a model parameter update module, a training parameter adjustment module, and a training stop determination module. Through the middleware, required data is distributed from a data storage device to a storage unit accessible to each training unit, so that each training unit reads the data from its storage unit for training; the model parameters of a training unit are updated and the training parameters of each training unit are adjusted through communication between corresponding middleware modules; and whether to stop training is determined according to the training information of all training units. The middleware takes over the processing required for large-scale parallel training and makes various machine learning tools easy to extend.

Description

Machine learning tool middleware and machine learning training method

Technical Field

The invention belongs to the field of machine learning technology, and in particular relates to machine learning tool middleware and a machine learning training method.

Background

Machine learning is a branch of artificial intelligence, and in many contexts has become nearly synonymous with it. Simply put, machine learning uses algorithm models that let machines learn patterns from large amounts of historical data, so as to intelligently recognize new samples or make predictions about the future. The general process of machine learning is to compute the parameters of a machine learning algorithm model from the input data, form the model from the computed parameters, and then use the model to recognize new samples or make predictions. In many practical applications the input data is so large that it must be processed by multiple computing devices simultaneously for the computation to finish in a reasonable time; the devices must therefore exchange model parameters with one another, and a parameter server collects, aggregates, and redistributes the exchanged parameters.

An existing large-scale machine learning platform is a closed training framework, built first of all on a shareable storage space. In addition, the formats supported for data files and for model files are limited, the training objectives and algorithms used during machine learning training must be chosen from a finite set of pre-implemented methods, and the parameter adjustment methods and stop conditions used during training are likewise pre-implemented.

In practice, however, different products or services often require different data, models, or training methods, implemented on different training tools, and the associated files and training methods can differ greatly. Building on an existing large-scale machine learning platform would require either replacing functionality entirely with what the platform already provides, or extending the platform to be compatible with the actual machine learning task. Doing so demands extensive comparative experiments and validation, and existing products must be modified to be compatible with the platform's data and model formats. Moreover, there is no guarantee that the platform's existing implementation can meet business requirements. It also requires a deep understanding of the platform's implementation, and a great deal of time spent implementing data formats, model formats, and training methods, which places high demands on users.

Summary of the Invention

The object of the present invention is to provide machine learning tool middleware and a machine learning training method, so that various machine learning tools can complete training without depending on a large-scale machine learning platform and without changing their specific models, data file parsing, core training methods, or training objectives.

To achieve the above object, the technical solution of the present invention is as follows:
Machine learning tool middleware for model training of a machine learning tool, the machine learning tool comprising at least one training unit, each training unit being provided with middleware coupled to the machine learning tool, the middleware comprising an underlying communication module and at least one of a data distribution module, a model parameter update module, a training parameter adjustment module, and a training stop determination module, wherein:

the underlying communication module is configured to implement communication between corresponding modules of different training units, as well as communication between the training units;

the data distribution module is configured to distribute required data from a data storage device to a storage unit accessible to the training unit, so that the training unit reads data from the storage unit for training;

the model parameter update module is configured to collect training information from other training units and update the model parameters of the present training unit;

the training parameter adjustment module is configured to collect training information from other training units and adjust the training parameters of the present training unit;

the training stop determination module is configured to collect training information from other training units to determine whether to stop training.
Further, the data storage device is configured to store all training data of the machine learning tool, and the data storage device is located on a main training unit of the machine learning tool.

Further, the data distribution module of the main training unit is configured to receive requests from the data distribution modules of other training units and to distribute data to them, and the data distribution module of each other training unit stores the received data in the local storage unit of its own training unit.

Data distribution is realized by providing the data distribution module: training data is distributed from the storage device of the main training unit to the local storage unit of each training unit, and the distribution is carried out inside the middleware without affecting the training process of the training units. No training unit needs to fetch data from a shared storage device on every training pass, which reduces the load on the storage device and removes the need for a shared large-scale storage platform.

Further, the model parameter update module collects training information from other training units, transmits the training information of the present training unit to the other training units, and updates the model parameters by averaging the model parameters of the training units.

Alternatively, the machine learning tool further includes a parameter server; the model parameter update module transmits the training information of the present training unit to the parameter server, and the parameter server updates the model parameters and sends them back.

Further, the underlying communication module is also configured to apply an interlock mechanism across the various communications when implementing communication between corresponding modules of different training units and communication between the training units, so that different modules cannot communicate at the same time: while one module is communicating, the other modules must wait for it to finish before communicating.
The present invention also proposes a machine learning training method for model training of a machine learning tool, the machine learning tool comprising at least one training unit, each training unit being provided with middleware coupled to the machine learning tool. The training units communicate through the middleware and complete model training by performing at least one of the following training operations through the middleware:

distributing required data from a data storage device to a storage unit accessible to each training unit, so that each training unit reads data from the storage unit for training;

collecting training information from other training units and updating the model parameters of the present training unit;

collecting training information from other training units and adjusting the training parameters of the present training unit;

collecting training information from other training units to determine whether to stop training.

The present invention thus proposes machine learning tool middleware and a machine learning training method. Data is distributed to the local storage unit of each training unit through the data distribution module of the middleware, so training no longer depends on a large-scale storage platform. The middleware takes over the processing required for large-scale parallel training (data distribution, model parameter updates, training parameter adjustment, training-stop synchronization, and communication between training units) without changing the specific models, data file parsing, core training methods, or training objectives, and therefore no longer relies on a large-scale machine learning platform. The invention makes various machine learning tools easy to extend, with little or no impact on the training behavior of a single training unit, while also supporting extension to various data file formats.
Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of the machine learning tool middleware of the present invention;

FIG. 2 is a schematic diagram of the correspondence between machine learning training and the middleware according to the present invention;

FIG. 3 is a flowchart of a machine learning training method according to an embodiment of the present invention.
Detailed Description

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments; the following embodiments do not constitute a limitation of the present invention.

Machine learning tools are widely used in the field of artificial intelligence; common machine learning tools include Caffe and Kaldi. A machine learning tool trains a machine learning model from known training data, then applies the model to analyze unknown data so as to learn new knowledge. The general idea of the present invention is to provide machine learning tool middleware that enables a machine learning tool to adapt to different training data file formats and that can be applied to any machine learning tool, thereby supporting the training of machine learning models with different machine learning tools, different training data, and different models or training methods.

As shown in FIG. 1, the machine learning tool middleware of this embodiment includes a data distribution module, a model parameter update module, a training parameter adjustment module, a training stop determination module, and an underlying communication module.
In practical applications, the machine learning tool of this embodiment is combined with the middleware by calling the middleware, and the middleware and the machine learning tool are then deployed together on one or more servers for training. During model training, the machine learning tool comprises at least one basic machine learning tool process, used for parallel processing of different training data or for parallel processing of different model partitions; this embodiment supports both forms of distributed parallel processing simultaneously. Each basic machine learning tool process is called a training unit; for example, a machine learning tool deployed on a server together with its coupled middleware forms one training unit that handles one machine learning tool process.

In FIG. 1, two training units (training unit 1 and training unit 2) are shown by way of example; the invention is not limited to any particular number of training units. Each training unit includes the machine learning tool and its corresponding middleware, and the training units are connected through the underlying communication module. Within a training unit, the data distribution module, model parameter update module, training parameter adjustment module, and training stop determination module are each connected to the machine learning tool and to the underlying communication module, and the underlying communication module is also connected to the machine learning tool. The connections described in this embodiment are interface calls between software programs and are not elaborated further here; a minimal sketch of this composition is given below.
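The following Python fragment illustrates this composition as a sketch only: all class and method names here (UnderlyingComm, Middleware, TrainingUnit, send, recv) are hypothetical, since the patent specifies interface calls between the modules rather than any particular implementation.

```python
# Hypothetical sketch of the middleware composition; all names are invented.
class UnderlyingComm:
    """Transport shared by all middleware modules and by the tool itself."""
    def send(self, peer, tag, payload): ...
    def recv(self, peer, tag): ...

class Middleware:
    def __init__(self, comm, data_dist=None, param_update=None,
                 param_adjust=None, stop_check=None):
        # Only the underlying communication module is mandatory; the other
        # four modules are optional and chosen per machine learning tool.
        self.comm = comm
        self.data_dist = data_dist
        self.param_update = param_update
        self.param_adjust = param_adjust
        self.stop_check = stop_check

class TrainingUnit:
    """One basic machine learning tool process plus its coupled middleware."""
    def __init__(self, tool, middleware):
        self.tool = tool              # e.g. a Caffe or Kaldi process handle
        self.middleware = middleware  # reached through interface calls
```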
The data distribution module is configured to distribute required data from the data storage device to a storage unit accessible to each training unit.

For a machine learning tool with multiple training units, all training data used in training is usually stored on the data storage device of one main training unit. The data distribution module of each training unit requests data from the corresponding data distribution module of the main training unit, and the data files are then transmitted over the network to a local storage unit for use by the local training unit. Typically each training unit has its own data storage unit: the training data is stored on the storage device of the main training unit and is distributed through the data distribution module to the storage unit local to each training unit, from which each training unit reads the training data for training. In this embodiment the storage device and the storage units are provided separately; preferably each storage unit is local to the training unit's server, though a training unit may also access other storage devices. The distribution of data is carried out by the middleware in the background and does not affect the actual training process of the training unit. Thus, when a training unit finishes processing the current data file, it can proceed directly to the next data file, which the middleware's data distribution module has already prepared, as in the sketch below.
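A minimal sketch of this background distribution and prefetching behavior, assuming a simple request/response transport; the class name, the `request_file` callable, and the `None` end-of-stream sentinel are all invented for illustration.

```python
import queue
import threading

class DataDistributionModule:
    """Fetches data files from the main unit's distribution module in the
    background, so the next file is already local when the tool needs it.
    A sketch only: `request_file` stands in for whatever transport the
    underlying communication module provides."""

    def __init__(self, request_file, prefetch=2):
        self.request_file = request_file        # callable: name -> bytes
        self.ready = queue.Queue(maxsize=prefetch)

    def start(self, file_names, local_dir):
        def worker():
            for name in file_names:
                data = self.request_file(name)  # ask the main training unit
                path = f"{local_dir}/{name}"
                with open(path, "wb") as f:     # store in local storage unit
                    f.write(data)
                self.ready.put(path)            # hand to the training unit
            self.ready.put(None)                # no more files
        threading.Thread(target=worker, daemon=True).start()

    def next_file(self):
        """Called by the training unit; blocks only if prefetch fell behind."""
        return self.ready.get()
```

Bounding the queue gives natural backpressure: the worker pauses once `prefetch` files are staged locally and resumes as the tool consumes them.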
The model parameter update module is used to update the model parameters across the training units. When a training unit has processed a number of batches of data and a multi-unit update is needed, the parameter update can be performed through the middleware's model parameter update module: the training information of the other training units is collected, and the training information of the present training unit is communicated to them. The training information here may be the model parameters themselves, or related quantities used when the model parameters are updated, such as gradients. The parameter update may be performed synchronously by the training units, asynchronously, or through a virtual parameter server. Specifically, the update method may average the model parameters across the training units (synchronous), or each training unit may send its gradients to a parameter server, which sends the latest model parameters back before the next training step (asynchronous). Both styles are sketched below.
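Both update styles can be sketched in a few lines of Python. This is schematic only: `allgather` is an assumed helper built on the underlying communication module, and the `ParameterServer` class with its `push_and_pull` method is a hypothetical stand-in for the (possibly virtual) parameter server.

```python
import numpy as np

def synchronous_average(local_params, allgather):
    """Synchronous update: every unit contributes its parameters and takes
    home the element-wise average (allgather is an assumed helper)."""
    all_params = allgather(local_params)   # list of arrays, one per unit
    return np.mean(all_params, axis=0)

class ParameterServer:
    """Asynchronous update: units push gradients, the server returns the
    latest parameters. The server may be 'virtual', i.e. realized by the
    middleware itself rather than by a dedicated machine."""
    def __init__(self, params, lr=0.01):
        self.params = params
        self.lr = lr

    def push_and_pull(self, gradient):
        self.params -= self.lr * gradient  # apply this unit's gradient
        return self.params.copy()          # send back the newest parameters
```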
The training parameter adjustment module is configured to adjust the training parameters of each training unit. It is similar to the model parameter update module: information such as the present training unit's training objective and learning rate is exchanged with the other training units, and the training parameters are then adjusted. In this way each adjustment is made uniformly on the basis of the training information of all training units, rather than that of a single training unit, which provides a better adjustment mechanism; one plausible rule is sketched below.
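The patent does not fix a particular adjustment rule, so the following is only one plausible example under stated assumptions (`allgather` is the same assumed helper as above): decay the learning rate when the objective averaged over all units stops improving.

```python
def adjust_learning_rate(lr, my_loss, allgather, history, decay=0.5):
    """Globally informed adjustment: each unit contributes its objective
    value, and the decision is taken on the average across all units
    rather than on any single unit's loss."""
    losses = allgather(my_loss)                 # objective from every unit
    global_loss = sum(losses) / len(losses)
    if history and global_loss >= history[-1]:  # no global improvement
        lr *= decay
    history.append(global_loss)
    return lr
```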
The training stop determination module is configured to determine whether to stop training based on the training information of all training units. Like the training parameter adjustment module, the training stop determination module bases its decision on the training information of all training units rather than a single training unit, which provides a better stopping mechanism, for example as in the sketch below.
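Again the concrete criterion is left open by the text; a minimal illustrative rule, assuming the same hypothetical `allgather` helper, is to stop only when every unit reports local convergence.

```python
def should_stop(local_converged, allgather):
    """Stop only when every training unit reports convergence, so the
    decision reflects the training information of all units rather than
    any single one (allgather is an assumed helper)."""
    votes = allgather(local_converged)   # one boolean per training unit
    return all(votes)
```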
The underlying communication module is configured to implement communication between corresponding modules of different training units, as well as communication between the training units themselves.

This module is mainly used to handle communication between corresponding modules of different training units. For example, communication between the data distribution modules of training unit 1 and training unit 2 is realized by calling the underlying communication module to distribute the data; the same holds for communication between the model parameter update modules of two training units, between their training parameter adjustment modules, and between their training stop determination modules.

It can also provide any other necessary communication between training units. For example, during training a unit can call the underlying communication module to continuously synchronize and aggregate the training performance of all training units, such as objective training metrics. Likewise, the training units can call the underlying communication module during training to exercise unified behavior control, such as agreeing when to perform actual training and when to run a specified test, as in the sketch below.
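A tiny sketch of such unified behavior control, assuming a hypothetical `barrier` primitive exposed by the underlying communication module: every unit evaluates at the same steps and waits for the others before starting.

```python
def synchronized_eval_point(step, eval_every, barrier):
    """Unified behavior control: all units test at the same steps, and a
    barrier (assumed to be provided by the underlying communication
    module) keeps them aligned before the test begins."""
    if step % eval_every == 0:
        barrier()        # wait until every unit reaches this point
        return True      # every unit now runs the specified test together
    return False
```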
At the same time, for risk-free communication, an interlock mechanism must be applied across the various communications to guarantee communication safety. Some underlying system communication implementations, such as the MPI communication protocol, cannot fully support free multi-threaded calls for communication; that is, some underlying communication protocols do not allow multiple modules to communicate at the same time. To protect communication safety, this embodiment adds an interlock mechanism to the underlying communication module, so that different modules cannot communicate simultaneously: while one module is communicating, the other modules must wait for it to finish before communicating.
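One standard way to realize such an interlock is a single lock around every communication call. The wrapper below is a minimal sketch: the class name and the wrapped `send`/`recv` interface are invented, and the lock granularity is deliberately coarse, matching the behavior described above.

```python
import threading

class InterlockedComm:
    """Interlock sketch: one process-wide lock serializes all communication
    calls, so that while one module is communicating the others wait for
    it to finish. This mirrors the restriction of transports (such as
    some MPI builds) that are not thread-safe."""

    def __init__(self, raw_comm):
        self.raw = raw_comm
        self._lock = threading.Lock()

    def send(self, *args, **kwargs):
        with self._lock:                 # other modules block here
            return self.raw.send(*args, **kwargs)

    def recv(self, *args, **kwargs):
        with self._lock:
            return self.raw.recv(*args, **kwargs)
```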
As shown in FIG. 2, a typical machine learning training process using the middleware of this embodiment runs as follows:

All training units start at the same time. The main training unit (which can access the model files and data files) transmits the model file to all other training units through the middleware's underlying communication module, and each training unit reads in the model file. Each training unit then requests training data through its middleware data distribution module from the data distribution module of the main training unit storing the training data; the main training unit's middleware data distribution module responds to the requests and distributes the training data to the local storage unit of each training unit. Each training unit reads in the data files prepared by the middleware data distribution module and performs training; meanwhile, the middleware data distribution module continues distributing data in the background, preparing the next batch of data files.

Parameter updates are performed through the middleware's model parameter update module: the training information of the other training units is collected, and the training information of the present training unit is communicated to them. After a training unit has processed each batch of data according to its own training objective and training method, it updates the model parameters through the middleware's model parameter update module; alternatively, each training unit's model parameter update module sends its gradients to the parameter server, which sends back the latest model parameters before the next training step.

The training parameter adjustment module exchanges information such as the training objective and learning rate of the present training unit with the other training units, and the training parameters are then adjusted through the middleware's training parameter adjustment module.

Similarly, the training stop determination module collects the training information of the other training units, communicates the training information of the present training unit to them, and determines whether to stop training based on the training information of all training units. As each batch of data is processed, the middleware's training stop determination module decides whether to stop. If it decides to stop, training ends and the learned model is output; otherwise the unit returns to reading training data and trains on the next batch, until the training process is complete.

All information and data exchanged between the above modules is transmitted through the underlying communication module. The whole per-unit loop is sketched below.
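Pulling the FIG. 2 walkthrough together, the per-unit loop might look like the following; every object and method name here is hypothetical glue reusing the sketches above, not the patent's API.

```python
def training_loop(unit, mw):
    """Schematic per-unit loop: prefetched data, per-batch training,
    globally informed parameter update, adjustment, and stop check."""
    params = unit.read_model_file()                  # model broadcast at startup
    for path in iter(mw.data_dist.next_file, None):  # files prefetched in background
        batch = unit.load(path)
        grads = unit.train_step(params, batch)       # tool's own objective/method
        params = mw.param_update.update(params, grads)    # sync average or server
        unit.lr = mw.param_adjust.adjust(unit.lr, unit.loss)
        if mw.stop_check.should_stop(unit.loss):     # decision over all units
            break
    unit.output_model(params)                        # output the learned model
```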
Through the above process, multiple training units performing a machine learning model task can continuously update the model parameters and training parameters according to their own training methods and algorithms, and process their own model and data format files, achieving the goal of large-scale parallel processing.

It should be noted that only the underlying communication module of the middleware in this embodiment is mandatory; the other modules can be selected and combined as needed for the specific machine learning tool.

For example, some machine learning tools have their own training parameter adjustment methods, so the user can choose not to use the training parameter adjustment module of the present invention and instead adopt the machine learning tool's own method, while still using the underlying communication module of the present invention to synchronize the training parameters across the machine learning programs and guarantee overall consistency. As another example, some machine learning tools do not allow new data files to be read dynamically at runtime, so the user can choose not to use the data distribution module of the present invention and instead distribute the data to each machine before training starts; during training, each training unit directly reads the training data already distributed to its own machine and begins training. Such a reduced configuration is sketched below.
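Reusing the hypothetical Middleware, UnderlyingComm, InterlockedComm, and should_stop sketches from above, such a reduced configuration might be assembled as follows; this is a sketch, not a prescribed setup.

```python
# Sketch: a tool with its own parameter-adjustment scheme and no support
# for dynamic data reads keeps only the modules it needs.
comm = InterlockedComm(raw_comm=UnderlyingComm())
mw = Middleware(
    comm=comm,
    data_dist=None,         # data pre-distributed to each machine beforehand
    param_update=None,      # or a plug-in such as the averaging sketch above
    param_adjust=None,      # tool's own method; `comm` still syncs the values
    stop_check=should_stop, # the stop rule sketched above (or a wrapper)
)
```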
As shown in FIG. 3, a machine learning training method according to an embodiment of the present invention is used for model training of a machine learning tool. The machine learning tool includes at least one training unit, each training unit is provided with middleware coupled to the machine learning tool, the training units communicate through the middleware, and the training units complete model training by performing at least one of the following training operations through the middleware:

distributing required data from a data storage device to a storage unit accessible to each training unit, so that each training unit reads data from the storage unit for training;

collecting training information from other training units and updating the model parameters of the present training unit;

collecting training information from other training units and adjusting the training parameters of the present training unit;

collecting training information from other training units to determine whether to stop training.

The above training operations are performed through the middleware and comprise data distribution, parameter updates, training parameter adjustment, and the decision to stop training. Each training unit requests training data through the middleware from the main training unit storing the training data; the main training unit's middleware responds to the request and distributes the training data to the local storage unit of each training unit. Each training unit reads in the data files prepared by the middleware and performs training, while the middleware distributes data in the background to prepare the next batch of data files.

During training, after a training unit has processed each batch of data according to its own training objective and training method, it updates the model parameters through the middleware: it collects the training information of the other training units and communicates the training information of the present training unit to them; or each training unit sends its gradients through the middleware to the parameter server, which sends back the latest model parameters before the next training step. The training unit exchanges information such as its training objective and learning rate with the other training units through the middleware, and then adjusts the training parameters through the middleware. Similarly, the training unit collects the training information of the other training units through the middleware, communicates its own training information to them, and determines whether to stop training based on the training information of all training units. As each batch of data is processed, the training unit decides through the middleware whether to stop training; if so, training ends and the learned model is output, otherwise it continues reading training data and trains on the next batch, until the training process is complete.

The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Those skilled in the art may make various corresponding changes and modifications according to the present invention without departing from its spirit and essence, and all such corresponding changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims (12)

  1. 一种机器学习工具中间件,用于机器学习工具的模型训练,所述机器学习工具包括至少一个训练单元,其特征在于,每个训练单元都设置有与机器学习工具结合的中间件,所述中间件包括底层通信模块,以及数据分发模块、模型参数更新模块、训练参数调整模块和训练停止判断模块中的至少一块,其中:A machine learning tool middleware for model training of a machine learning tool, the machine learning tool comprising at least one training unit, wherein each training unit is provided with a middleware combined with a machine learning tool, The middleware includes an underlying communication module, and at least one of a data distribution module, a model parameter update module, a training parameter adjustment module, and a training stop determination module, wherein:
    所述底层通信模块,用于实现训练单元之间对应模块之间的通信,以及训练单元之间的通信;The bottom layer communication module is configured to implement communication between corresponding modules between the training units, and communication between the training units;
    所述数据分发模块,用于从数据存储设备中分发需要的数据到训练单元能够访问的存储单元,以便训练单元从所述存储单元中读取数据进行训练;The data distribution module is configured to distribute required data from the data storage device to a storage unit accessible by the training unit, so that the training unit reads data from the storage unit for training;
    所述模型参数更新模块,用于收集其他训练单元的训练信息,更新本训练单元的模型参数;The model parameter update module is configured to collect training information of other training units, and update model parameters of the training unit;
    所述训练参数调整模块,用于收集其他训练单元的训练信息,对本训练单元的训练参数进行调整;The training parameter adjustment module is configured to collect training information of other training units, and adjust training parameters of the training unit;
    所述训练停止判断模块,用于收集其他训练单元的训练信息,来进行是否停止训练的判断。The training stop determination module is configured to collect training information of other training units to determine whether to stop training.
  2. The machine learning tool middleware according to claim 1, wherein the data storage device is configured to store all training data of the machine learning tool, and the data storage device is located on a main training unit of the machine learning tool.
  3. The machine learning tool middleware according to claim 2, wherein the data distribution module of the main training unit is configured to receive requests from the data distribution modules of the other training units and to distribute data to them, and the data distribution module of each of the other training units stores the received data in the local storage unit of its own training unit.
  4. The machine learning tool middleware according to claim 1, wherein the model parameter update module collects the training information of the other training units, transmits the training information of the present training unit to the other training units, and updates the model parameters by averaging the model parameters of all the training units.
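(Illustrative sketch, not part of the claims.) The averaging update of claim 4 reduces to an element-wise mean over the parameter vectors of all training units, assuming the peers' vectors have already been gathered by the communication module:

```python
# Illustrative only: element-wise averaging of parameter vectors.
def average_update(local_params, peer_params_list):
    """Average this unit's parameters with those collected from its peers."""
    all_params = [local_params] + peer_params_list
    n = len(all_params)
    return [sum(vals) / n for vals in zip(*all_params)]

# Example with three units' parameter vectors:
print(average_update([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]]))  # -> [3.0, 4.0]
```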
  5. The machine learning tool middleware according to claim 1, wherein the machine learning tool further comprises a parameter server, and the model parameter update module transmits the training information of the present training unit to the parameter server, which updates the model parameters and sends them back.
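(Illustrative sketch, not part of the claims.) The exchange of claim 5 can be pictured with an in-process stand-in for the parameter server; the class and method names are assumptions made for illustration:

```python
# Illustrative only: an in-process stand-in for the parameter server.
class ToyParameterServer:
    def __init__(self, params):
        self.params = params

    def push_and_pull(self, gradients, lr=0.1):
        # Apply the unit's gradients, then return the latest parameters.
        self.params = [p - lr * g for p, g in zip(self.params, gradients)]
        return self.params

server = ToyParameterServer([0.5, -0.5])
latest = server.push_and_pull([0.2, -0.2])   # unit sends gradients, gets model back
print(latest)                                # -> [0.48, -0.48]
```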
  6. The machine learning tool middleware according to claim 1, wherein the underlying communication module is further configured to add an interlock mechanism between the various communications when implementing communication between the corresponding modules of different training units and communication between the training units.
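(Illustrative sketch, not part of the claims.) One plausible reading of the claim 6 interlock is that the different kinds of middleware traffic are serialized so that concurrent communications cannot interleave; the lock-per-channel scheme below is an assumption, not the patented design:

```python
# Illustrative only: one lock per communication kind keeps concurrent
# middleware calls from interleaving on the same channel.
import threading

class InterlockedComm:
    def __init__(self):
        self._locks = {kind: threading.Lock()
                       for kind in ("data", "params", "control")}

    def send(self, kind, payload):
        with self._locks[kind]:
            pass  # the actual network send would happen here, serialized

comm = InterlockedComm()
comm.send("params", b"\x00\x01")
```

Serializing each channel this way means a parameter exchange cannot be corrupted by a simultaneous data transfer, which is one motivation for such an interlock.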
  7. A machine learning training method for model training of a machine learning tool, the machine learning tool comprising at least one training unit, wherein each training unit is provided with a middleware integrated with the machine learning tool, the training units communicate through the middleware, and the training units complete model training by performing, through the middleware, at least one of the following training operations:
    distributing the required data from a data storage device to storage units accessible to the respective training units, so that each training unit can read data from its storage unit for training;
    collecting the training information of the other training units and updating the model parameters of the present training unit;
    collecting the training information of the other training units and adjusting the training parameters of the present training unit;
    collecting the training information of the other training units to determine whether to stop training.
  8. The machine learning training method according to claim 7, wherein the data storage device is configured to store all training data of the machine learning tool, and the data storage device is located on a main training unit of the machine learning tool.
  9. The machine learning training method according to claim 8, wherein distributing the required data from the data storage device to the storage units accessible to the respective training units, so that each training unit can read data from its storage unit for training, comprises:
    the main training unit receiving, through its middleware, requests issued by the middleware of the other training units and distributing data to the middleware of the other training units;
    the middleware of each of the other training units storing the received data in the local storage unit of its own training unit.
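(Illustrative sketch, not part of the claims.) The request/response flow of claim 9 can be simulated in a single process; the queue standing in for the worker's local storage and the direct method call standing in for the network are assumptions made for the example:

```python
# Illustrative only: an in-process simulation of the claim 9 data flow.
import queue

class MainUnitMiddleware:
    def __init__(self, shards):
        self.shards = shards                  # all training data lives on the main unit

    def handle_request(self, unit_id):
        return self.shards.get(unit_id, [])

class WorkerMiddleware:
    def __init__(self, unit_id, main):
        self.unit_id, self.main = unit_id, main
        self.local_storage = queue.Queue()    # stands in for the local storage unit

    def fetch(self):
        for item in self.main.handle_request(self.unit_id):
            self.local_storage.put(item)

main = MainUnitMiddleware({1: ["batch-a", "batch-b"]})
worker = WorkerMiddleware(1, main)
worker.fetch()
print(worker.local_storage.qsize())          # -> 2 items staged locally
```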
  10. The machine learning training method according to claim 7, wherein collecting the training information of the other training units and updating the model parameters of the present training unit comprises:
    collecting the training information of the other training units, transmitting the training information of the present training unit to the other training units, and updating the model parameters by averaging the model parameters of all the training units.
  11. The machine learning training method according to claim 7, wherein the machine learning tool further comprises a parameter server, and collecting the training information of the other training units and updating the model parameters of the present training unit comprises:
    transmitting the training information of the present training unit to the parameter server, which updates the model parameters and sends them back.
  12. The machine learning training method according to claim 7, wherein, when the training units communicate through the middleware, the method further comprises:
    adding an interlock mechanism between the various communications.
PCT/CN2016/109370 2015-12-22 2016-12-12 Machine learning tool middleware and training method of machine learning WO2017107788A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510975227.2 2015-12-22
CN201510975227.2A CN106909529B (en) 2015-12-22 2015-12-22 Machine learning tool middleware and machine learning training method

Publications (1)

Publication Number Publication Date
WO2017107788A1 (en)

Family

ID=59089049

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/109370 WO2017107788A1 (en) 2015-12-22 2016-12-12 Machine learning tool middleware and training method of machine learning

Country Status (2)

Country Link
CN (1) CN106909529B (en)
WO (1) WO2017107788A1 (en)


Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN107977712A (en) * 2017-12-20 2018-05-01 四川九洲电器集团有限责任公司 Network type machine learning system
CN109255234B (en) * 2018-08-15 2023-03-24 腾讯科技(深圳)有限公司 Processing method, device, medium and electronic equipment of machine learning model
CN109460826A (en) * 2018-10-31 2019-03-12 北京字节跳动网络技术有限公司 For distributing the method, apparatus and model modification system of data
CN110414187B (en) * 2019-07-03 2021-09-17 北京百度网讯科技有限公司 System and method for model safety delivery automation
CN112884159A (en) * 2019-11-30 2021-06-01 华为技术有限公司 Model updating system, model updating method and related equipment
CN115859990B (en) * 2023-02-17 2023-05-09 智慧眼科技股份有限公司 Information extraction method, device, equipment and medium based on meta learning


Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20100318516A1 (en) * 2009-06-10 2010-12-16 Google Inc. Productive distribution for result optimization within a hierarchical architecture
CN104217022A (en) * 2014-09-25 2014-12-17 天津大学 Distributive big data classifying system and method based on alternating direction method of multipliers

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20150019214A1 (en) * 2013-07-10 2015-01-15 Tencent Technology (Shenzhen) Company Limited Method and device for parallel processing in model training
CN105184367A (en) * 2014-06-09 2015-12-23 讯飞智元信息科技有限公司 Model parameter training method and system for depth neural network
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device
CN104714852A (en) * 2015-03-17 2015-06-17 华中科技大学 Parameter synchronization optimization method and system suitable for distributed machine learning
CN104980518A (en) * 2015-06-26 2015-10-14 深圳市腾讯计算机系统有限公司 Method, device and system of multi-learning subject parallel training model

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN109343895A (en) * 2018-09-18 2019-02-15 郑州云海信息技术有限公司 A kind of processing method of operational order, device and computer readable storage medium
CN109343895B (en) * 2018-09-18 2021-05-04 郑州云海信息技术有限公司 Method and device for processing operation instruction and computer readable storage medium

Also Published As

Publication number Publication date
CN106909529A (en) 2017-06-30
CN106909529B (en) 2020-12-01

Legal Events

Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16877599; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16877599; Country of ref document: EP; Kind code of ref document: A1)