WO2017107788A1 - Machine learning tool middleware and training method of machine learning - Google Patents

Machine learning tool middleware and training method of machine learning

Info

Publication number: WO2017107788A1
Authority: WO (WIPO, PCT)
Prior art keywords: training, unit, machine learning, data, middleware
Application number: PCT/CN2016/109370
Other languages: French (fr), Chinese (zh)
Inventors: 雷鸣 (Lei Ming), 鄢志杰 (Yan Zhijie)
Original assignee: 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited), 雷鸣, 鄢志杰
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by: 阿里巴巴集团控股有限公司, 雷鸣, 鄢志杰
Publication: WO2017107788A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning


Abstract

A machine learning tool middleware and a machine learning training method. The machine learning tool comprises at least one training unit, each of which is provided with middleware coupled to the machine learning tool. Each middleware comprises an underlying communication module and at least one of a data distribution module, a model parameter update module, a training parameter adjustment module, and a training stop determination module. Through the middleware, required data is distributed from a data storage device to a storage unit accessible to each training unit, so that each training unit reads the data from its storage unit for training; the model parameters of a training unit are updated and the training parameters of each training unit are adjusted through communication between corresponding middleware modules; and whether to stop training is determined according to the training information of all training units. The middleware takes over the processing required for large-scale parallel training and makes various machine learning tools easy to extend.

Description

Machine learning tool middleware and machine learning training method

Technical Field

The invention belongs to the field of machine learning technology, and in particular relates to machine learning tool middleware and a machine learning training method.

Background

Machine learning is a branch of artificial intelligence, and in many contexts has become nearly synonymous with it. Simply put, machine learning uses algorithm models that let machines learn patterns from large amounts of historical data, so as to intelligently recognize new samples or make predictions about the future. The general process of machine learning is to compute the parameters of a machine learning algorithm model from the input data, form the model from the computed parameters, and then use the model to recognize new samples or make predictions. In many practical applications the input data is so large that it must be processed by multiple computing devices simultaneously for the computation to finish in a reasonable time; the devices must therefore exchange model parameters with one another, and a parameter server collects, aggregates, and redistributes the exchanged parameters.

An existing large-scale machine learning platform is a closed training framework, built first of all on a shareable storage space. In addition, the formats supported for data files and for model files are limited, the training objectives and algorithms used during machine learning training must be chosen from a finite set of pre-implemented methods, and the parameter adjustment methods and stop conditions used during training are likewise pre-implemented.

In practice, however, different products or services often require different data, models, or training methods, implemented on different training tools, and the associated files and training methods can differ greatly. Building on an existing large-scale machine learning platform would require either replacing functionality entirely with what the platform already provides, or extending the platform to be compatible with the actual machine learning task. Doing so demands extensive comparative experiments and validation, and existing products must be modified to be compatible with the platform's data and model formats. Moreover, there is no guarantee that the platform's existing implementation can meet business requirements. It also requires a deep understanding of the platform's implementation, and a great deal of time spent implementing data formats, model formats, and training methods, which places high demands on users.

Summary of the Invention

The object of the present invention is to provide machine learning tool middleware and a machine learning training method, so that various machine learning tools can complete training without depending on a large-scale machine learning platform and without changing their specific models, data file parsing, core training methods, or training objectives.

To achieve the above object, the technical solution of the present invention is as follows:
Machine learning tool middleware for model training of a machine learning tool, the machine learning tool comprising at least one training unit, each training unit being provided with middleware coupled to the machine learning tool, the middleware comprising an underlying communication module and at least one of a data distribution module, a model parameter update module, a training parameter adjustment module, and a training stop determination module, wherein:

the underlying communication module is configured to implement communication between corresponding modules of different training units, as well as communication between the training units;

the data distribution module is configured to distribute required data from a data storage device to a storage unit accessible to the training unit, so that the training unit reads data from the storage unit for training;

the model parameter update module is configured to collect training information from other training units and update the model parameters of the present training unit;

the training parameter adjustment module is configured to collect training information from other training units and adjust the training parameters of the present training unit;

the training stop determination module is configured to collect training information from other training units to determine whether to stop training.
Further, the data storage device is configured to store all training data of the machine learning tool, and the data storage device is located on a main training unit of the machine learning tool.

Further, the data distribution module of the main training unit is configured to receive requests from the data distribution modules of other training units and to distribute data to them, and the data distribution module of each other training unit stores the received data in the local storage unit of its own training unit.

Data distribution is realized by providing the data distribution module: training data is distributed from the storage device of the main training unit to the local storage unit of each training unit, and the distribution is carried out inside the middleware without affecting the training process of the training units. No training unit needs to fetch data from a shared storage device on every training pass, which reduces the load on the storage device and removes the need for a shared large-scale storage platform.

Further, the model parameter update module collects training information from other training units, transmits the training information of the present training unit to the other training units, and updates the model parameters by averaging the model parameters of the training units.

Alternatively, the machine learning tool further includes a parameter server; the model parameter update module transmits the training information of the present training unit to the parameter server, and the parameter server updates the model parameters and sends them back.

Further, the underlying communication module is also configured to apply an interlock mechanism across the various communications when implementing communication between corresponding modules of different training units and communication between the training units, so that different modules cannot communicate at the same time: while one module is communicating, the other modules must wait for it to finish before communicating.
The present invention also proposes a machine learning training method for model training of a machine learning tool, the machine learning tool comprising at least one training unit, each training unit being provided with middleware coupled to the machine learning tool. The training units communicate through the middleware and complete model training by performing at least one of the following training operations through the middleware:

distributing required data from a data storage device to a storage unit accessible to each training unit, so that each training unit reads data from the storage unit for training;

collecting training information from other training units and updating the model parameters of the present training unit;

collecting training information from other training units and adjusting the training parameters of the present training unit;

collecting training information from other training units to determine whether to stop training.

The present invention thus proposes machine learning tool middleware and a machine learning training method. Data is distributed to the local storage unit of each training unit through the data distribution module of the middleware, so training no longer depends on a large-scale storage platform. The middleware takes over the processing required for large-scale parallel training (data distribution, model parameter updates, training parameter adjustment, training-stop synchronization, and communication between training units) without changing the specific models, data file parsing, core training methods, or training objectives, and therefore no longer relies on a large-scale machine learning platform. The invention makes various machine learning tools easy to extend, with little or no impact on the training behavior of a single training unit, while also supporting extension to various data file formats.
Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of the machine learning tool middleware of the present invention;

FIG. 2 is a schematic diagram of the correspondence between machine learning training and the middleware according to the present invention;

FIG. 3 is a flowchart of a machine learning training method according to an embodiment of the present invention.
Detailed Description

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments; the following embodiments do not constitute a limitation of the present invention.

Machine learning tools are widely used in the field of artificial intelligence; common machine learning tools include Caffe and Kaldi. A machine learning tool trains a machine learning model from known training data, then applies the model to analyze unknown data so as to learn new knowledge. The general idea of the present invention is to provide machine learning tool middleware that enables a machine learning tool to adapt to different training data file formats and that can be applied to any machine learning tool, thereby supporting the training of machine learning models with different machine learning tools, different training data, and different models or training methods.

As shown in FIG. 1, the machine learning tool middleware of this embodiment includes a data distribution module, a model parameter update module, a training parameter adjustment module, a training stop determination module, and an underlying communication module.
In practical applications, the machine learning tool of this embodiment is combined with the middleware by calling the middleware, and the middleware and the machine learning tool are then deployed together on one or more servers for training. During model training, the machine learning tool comprises at least one basic machine learning tool process, used for parallel processing of different training data or for parallel processing of different model partitions; this embodiment supports both forms of distributed parallel processing simultaneously. Each basic machine learning tool process is called a training unit; for example, a machine learning tool deployed on a server together with its coupled middleware forms one training unit that handles one machine learning tool process.

In FIG. 1, two training units (training unit 1 and training unit 2) are shown by way of example; the invention is not limited to any particular number of training units. Each training unit includes the machine learning tool and its corresponding middleware, and the training units are connected through the underlying communication module. Within a training unit, the data distribution module, model parameter update module, training parameter adjustment module, and training stop determination module are each connected to the machine learning tool and to the underlying communication module, and the underlying communication module is also connected to the machine learning tool. The connections described in this embodiment are interface calls between software programs and are not elaborated further here; a minimal sketch of this composition is given below.
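The following Python fragment illustrates this composition as a sketch only: all class and method names here (UnderlyingComm, Middleware, TrainingUnit, send, recv) are hypothetical, since the patent specifies interface calls between the modules rather than any particular implementation.

```python
# Hypothetical sketch of the middleware composition; all names are invented.
class UnderlyingComm:
    """Transport shared by all middleware modules and by the tool itself."""
    def send(self, peer, tag, payload): ...
    def recv(self, peer, tag): ...

class Middleware:
    def __init__(self, comm, data_dist=None, param_update=None,
                 param_adjust=None, stop_check=None):
        # Only the underlying communication module is mandatory; the other
        # four modules are optional and chosen per machine learning tool.
        self.comm = comm
        self.data_dist = data_dist
        self.param_update = param_update
        self.param_adjust = param_adjust
        self.stop_check = stop_check

class TrainingUnit:
    """One basic machine learning tool process plus its coupled middleware."""
    def __init__(self, tool, middleware):
        self.tool = tool              # e.g. a Caffe or Kaldi process handle
        self.middleware = middleware  # reached through interface calls
```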
The data distribution module is configured to distribute required data from the data storage device to a storage unit accessible to each training unit.

For a machine learning tool with multiple training units, all training data used in training is usually stored on the data storage device of one main training unit. The data distribution module of each training unit requests data from the corresponding data distribution module of the main training unit, and the data files are then transmitted over the network to a local storage unit for use by the local training unit. Typically each training unit has its own data storage unit: the training data is stored on the storage device of the main training unit and is distributed through the data distribution module to the storage unit local to each training unit, from which each training unit reads the training data for training. In this embodiment the storage device and the storage units are provided separately; preferably each storage unit is local to the training unit's server, though a training unit may also access other storage devices. The distribution of data is carried out by the middleware in the background and does not affect the actual training process of the training unit. Thus, when a training unit finishes processing the current data file, it can proceed directly to the next data file, which the middleware's data distribution module has already prepared, as in the sketch below.
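A minimal sketch of this background distribution and prefetching behavior, assuming a simple request/response transport; the class name, the `request_file` callable, and the `None` end-of-stream sentinel are all invented for illustration.

```python
import queue
import threading

class DataDistributionModule:
    """Fetches data files from the main unit's distribution module in the
    background, so the next file is already local when the tool needs it.
    A sketch only: `request_file` stands in for whatever transport the
    underlying communication module provides."""

    def __init__(self, request_file, prefetch=2):
        self.request_file = request_file        # callable: name -> bytes
        self.ready = queue.Queue(maxsize=prefetch)

    def start(self, file_names, local_dir):
        def worker():
            for name in file_names:
                data = self.request_file(name)  # ask the main training unit
                path = f"{local_dir}/{name}"
                with open(path, "wb") as f:     # store in local storage unit
                    f.write(data)
                self.ready.put(path)            # hand to the training unit
            self.ready.put(None)                # no more files
        threading.Thread(target=worker, daemon=True).start()

    def next_file(self):
        """Called by the training unit; blocks only if prefetch fell behind."""
        return self.ready.get()
```

Bounding the queue gives natural backpressure: the worker pauses once `prefetch` files are staged locally and resumes as the tool consumes them.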
The model parameter update module is used to update the model parameters across the training units. When a training unit has processed a number of batches of data and a multi-unit update is needed, the parameter update can be performed through the middleware's model parameter update module: the training information of the other training units is collected, and the training information of the present training unit is communicated to them. The training information here may be the model parameters themselves, or related quantities used when the model parameters are updated, such as gradients. The parameter update may be performed synchronously by the training units, asynchronously, or through a virtual parameter server. Specifically, the update method may average the model parameters across the training units (synchronous), or each training unit may send its gradients to a parameter server, which sends the latest model parameters back before the next training step (asynchronous). Both styles are sketched below.
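Both update styles can be sketched in a few lines of Python. This is schematic only: `allgather` is an assumed helper built on the underlying communication module, and the `ParameterServer` class with its `push_and_pull` method is a hypothetical stand-in for the (possibly virtual) parameter server.

```python
import numpy as np

def synchronous_average(local_params, allgather):
    """Synchronous update: every unit contributes its parameters and takes
    home the element-wise average (allgather is an assumed helper)."""
    all_params = allgather(local_params)   # list of arrays, one per unit
    return np.mean(all_params, axis=0)

class ParameterServer:
    """Asynchronous update: units push gradients, the server returns the
    latest parameters. The server may be 'virtual', i.e. realized by the
    middleware itself rather than by a dedicated machine."""
    def __init__(self, params, lr=0.01):
        self.params = params
        self.lr = lr

    def push_and_pull(self, gradient):
        self.params -= self.lr * gradient  # apply this unit's gradient
        return self.params.copy()          # send back the newest parameters
```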
The training parameter adjustment module is configured to adjust the training parameters of each training unit. It is similar to the model parameter update module: information such as the present training unit's training objective and learning rate is exchanged with the other training units, and the training parameters are then adjusted. In this way each adjustment is made uniformly on the basis of the training information of all training units, rather than that of a single training unit, which provides a better adjustment mechanism; one plausible rule is sketched below.
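The patent does not fix a particular adjustment rule, so the following is only one plausible example under stated assumptions (`allgather` is the same assumed helper as above): decay the learning rate when the objective averaged over all units stops improving.

```python
def adjust_learning_rate(lr, my_loss, allgather, history, decay=0.5):
    """Globally informed adjustment: each unit contributes its objective
    value, and the decision is taken on the average across all units
    rather than on any single unit's loss."""
    losses = allgather(my_loss)                 # objective from every unit
    global_loss = sum(losses) / len(losses)
    if history and global_loss >= history[-1]:  # no global improvement
        lr *= decay
    history.append(global_loss)
    return lr
```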
The training stop determination module is configured to determine whether to stop training based on the training information of all training units. Like the training parameter adjustment module, the training stop determination module bases its decision on the training information of all training units rather than a single training unit, which provides a better stopping mechanism, for example as in the sketch below.
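Again the concrete criterion is left open by the text; a minimal illustrative rule, assuming the same hypothetical `allgather` helper, is to stop only when every unit reports local convergence.

```python
def should_stop(local_converged, allgather):
    """Stop only when every training unit reports convergence, so the
    decision reflects the training information of all units rather than
    any single one (allgather is an assumed helper)."""
    votes = allgather(local_converged)   # one boolean per training unit
    return all(votes)
```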
The underlying communication module is configured to implement communication between corresponding modules of different training units, as well as communication between the training units themselves.

This module is mainly used to handle communication between corresponding modules of different training units. For example, communication between the data distribution modules of training unit 1 and training unit 2 is realized by calling the underlying communication module to distribute the data; the same holds for communication between the model parameter update modules of two training units, between their training parameter adjustment modules, and between their training stop determination modules.

It can also provide any other necessary communication between training units. For example, during training a unit can call the underlying communication module to continuously synchronize and aggregate the training performance of all training units, such as objective training metrics. Likewise, the training units can call the underlying communication module during training to exercise unified behavior control, such as agreeing when to perform actual training and when to run a specified test, as in the sketch below.
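A tiny sketch of such unified behavior control, assuming a hypothetical `barrier` primitive exposed by the underlying communication module: every unit evaluates at the same steps and waits for the others before starting.

```python
def synchronized_eval_point(step, eval_every, barrier):
    """Unified behavior control: all units test at the same steps, and a
    barrier (assumed to be provided by the underlying communication
    module) keeps them aligned before the test begins."""
    if step % eval_every == 0:
        barrier()        # wait until every unit reaches this point
        return True      # every unit now runs the specified test together
    return False
```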
At the same time, for risk-free communication, an interlock mechanism must be applied across the various communications to guarantee communication safety. Some underlying system communication implementations, such as the MPI communication protocol, cannot fully support free multi-threaded calls for communication; that is, some underlying communication protocols do not allow multiple modules to communicate at the same time. To protect communication safety, this embodiment adds an interlock mechanism to the underlying communication module, so that different modules cannot communicate simultaneously: while one module is communicating, the other modules must wait for it to finish before communicating.
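One standard way to realize such an interlock is a single lock around every communication call. The wrapper below is a minimal sketch: the class name and the wrapped `send`/`recv` interface are invented, and the lock granularity is deliberately coarse, matching the behavior described above.

```python
import threading

class InterlockedComm:
    """Interlock sketch: one process-wide lock serializes all communication
    calls, so that while one module is communicating the others wait for
    it to finish. This mirrors the restriction of transports (such as
    some MPI builds) that are not thread-safe."""

    def __init__(self, raw_comm):
        self.raw = raw_comm
        self._lock = threading.Lock()

    def send(self, *args, **kwargs):
        with self._lock:                 # other modules block here
            return self.raw.send(*args, **kwargs)

    def recv(self, *args, **kwargs):
        with self._lock:
            return self.raw.recv(*args, **kwargs)
```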
As shown in FIG. 2, a typical machine learning training process using the middleware of this embodiment runs as follows:

All training units start at the same time. The main training unit (which can access the model files and data files) transmits the model file to all other training units through the middleware's underlying communication module, and each training unit reads in the model file. Each training unit then requests training data through its middleware data distribution module from the data distribution module of the main training unit storing the training data; the main training unit's middleware data distribution module responds to the requests and distributes the training data to the local storage unit of each training unit. Each training unit reads in the data files prepared by the middleware data distribution module and performs training; meanwhile, the middleware data distribution module continues distributing data in the background, preparing the next batch of data files.

Parameter updates are performed through the middleware's model parameter update module: the training information of the other training units is collected, and the training information of the present training unit is communicated to them. After a training unit has processed each batch of data according to its own training objective and training method, it updates the model parameters through the middleware's model parameter update module; alternatively, each training unit's model parameter update module sends its gradients to the parameter server, which sends back the latest model parameters before the next training step.

The training parameter adjustment module exchanges information such as the training objective and learning rate of the present training unit with the other training units, and the training parameters are then adjusted through the middleware's training parameter adjustment module.

Similarly, the training stop determination module collects the training information of the other training units, communicates the training information of the present training unit to them, and determines whether to stop training based on the training information of all training units. As each batch of data is processed, the middleware's training stop determination module decides whether to stop. If it decides to stop, training ends and the learned model is output; otherwise the unit returns to reading training data and trains on the next batch, until the training process is complete.

All information and data exchanged between the above modules is transmitted through the underlying communication module. The whole per-unit loop is sketched below.
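Pulling the FIG. 2 walkthrough together, the per-unit loop might look like the following; every object and method name here is hypothetical glue reusing the sketches above, not the patent's API.

```python
def training_loop(unit, mw):
    """Schematic per-unit loop: prefetched data, per-batch training,
    globally informed parameter update, adjustment, and stop check."""
    params = unit.read_model_file()                  # model broadcast at startup
    for path in iter(mw.data_dist.next_file, None):  # files prefetched in background
        batch = unit.load(path)
        grads = unit.train_step(params, batch)       # tool's own objective/method
        params = mw.param_update.update(params, grads)    # sync average or server
        unit.lr = mw.param_adjust.adjust(unit.lr, unit.loss)
        if mw.stop_check.should_stop(unit.loss):     # decision over all units
            break
    unit.output_model(params)                        # output the learned model
```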
Through the above process, multiple training units performing a machine learning model task can continuously update the model parameters and training parameters according to their own training methods and algorithms, and process their own model and data format files, achieving the goal of large-scale parallel processing.

It should be noted that only the underlying communication module of the middleware in this embodiment is mandatory; the other modules can be selected and combined as needed for the specific machine learning tool.

For example, some machine learning tools have their own training parameter adjustment methods, so the user can choose not to use the training parameter adjustment module of the present invention and instead adopt the machine learning tool's own method, while still using the underlying communication module of the present invention to synchronize the training parameters across the machine learning programs and guarantee overall consistency. As another example, some machine learning tools do not allow new data files to be read dynamically at runtime, so the user can choose not to use the data distribution module of the present invention and instead distribute the data to each machine before training starts; during training, each training unit directly reads the training data already distributed to its own machine and begins training. Such a reduced configuration is sketched below.
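Reusing the hypothetical Middleware, UnderlyingComm, InterlockedComm, and should_stop sketches from above, such a reduced configuration might be assembled as follows; this is a sketch, not a prescribed setup.

```python
# Sketch: a tool with its own parameter-adjustment scheme and no support
# for dynamic data reads keeps only the modules it needs.
comm = InterlockedComm(raw_comm=UnderlyingComm())
mw = Middleware(
    comm=comm,
    data_dist=None,         # data pre-distributed to each machine beforehand
    param_update=None,      # or a plug-in such as the averaging sketch above
    param_adjust=None,      # tool's own method; `comm` still syncs the values
    stop_check=should_stop, # the stop rule sketched above (or a wrapper)
)
```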
As shown in FIG. 3, a machine learning training method according to an embodiment of the present invention is used for model training of a machine learning tool. The machine learning tool includes at least one training unit, each training unit is provided with middleware coupled to the machine learning tool, the training units communicate through the middleware, and the training units complete model training by performing at least one of the following training operations through the middleware:

distributing required data from a data storage device to a storage unit accessible to each training unit, so that each training unit reads data from the storage unit for training;

collecting training information from other training units and updating the model parameters of the present training unit;

collecting training information from other training units and adjusting the training parameters of the present training unit;

collecting training information from other training units to determine whether to stop training.

The above training operations are performed through the middleware and comprise data distribution, parameter updates, training parameter adjustment, and the decision to stop training. Each training unit requests training data through the middleware from the main training unit storing the training data; the main training unit's middleware responds to the request and distributes the training data to the local storage unit of each training unit. Each training unit reads in the data files prepared by the middleware and performs training, while the middleware distributes data in the background to prepare the next batch of data files.

During training, after a training unit has processed each batch of data according to its own training objective and training method, it updates the model parameters through the middleware: it collects the training information of the other training units and communicates the training information of the present training unit to them; or each training unit sends its gradients through the middleware to the parameter server, which sends back the latest model parameters before the next training step. The training unit exchanges information such as its training objective and learning rate with the other training units through the middleware, and then adjusts the training parameters through the middleware. Similarly, the training unit collects the training information of the other training units through the middleware, communicates its own training information to them, and determines whether to stop training based on the training information of all training units. As each batch of data is processed, the training unit decides through the middleware whether to stop training; if so, training ends and the learned model is output, otherwise it continues reading training data and trains on the next batch, until the training process is complete.

The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Those skilled in the art may make various corresponding changes and modifications according to the present invention without departing from its spirit and essence, and all such corresponding changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims (12)

  1. 一种机器学习工具中间件,用于机器学习工具的模型训练,所述机器学习工具包括至少一个训练单元,其特征在于,每个训练单元都设置有与机器学习工具结合的中间件,所述中间件包括底层通信模块,以及数据分发模块、模型参数更新模块、训练参数调整模块和训练停止判断模块中的至少一块,其中:A machine learning tool middleware for model training of a machine learning tool, the machine learning tool comprising at least one training unit, wherein each training unit is provided with a middleware combined with a machine learning tool, The middleware includes an underlying communication module, and at least one of a data distribution module, a model parameter update module, a training parameter adjustment module, and a training stop determination module, wherein:
    所述底层通信模块,用于实现训练单元之间对应模块之间的通信,以及训练单元之间的通信;The bottom layer communication module is configured to implement communication between corresponding modules between the training units, and communication between the training units;
    所述数据分发模块,用于从数据存储设备中分发需要的数据到训练单元能够访问的存储单元,以便训练单元从所述存储单元中读取数据进行训练;The data distribution module is configured to distribute required data from the data storage device to a storage unit accessible by the training unit, so that the training unit reads data from the storage unit for training;
    所述模型参数更新模块,用于收集其他训练单元的训练信息,更新本训练单元的模型参数;The model parameter update module is configured to collect training information of other training units, and update model parameters of the training unit;
    所述训练参数调整模块,用于收集其他训练单元的训练信息,对本训练单元的训练参数进行调整;The training parameter adjustment module is configured to collect training information of other training units, and adjust training parameters of the training unit;
    所述训练停止判断模块,用于收集其他训练单元的训练信息,来进行是否停止训练的判断。The training stop determination module is configured to collect training information of other training units to determine whether to stop training.
  2. The machine learning tool middleware according to claim 1, wherein the data storage device is configured to store all training data of the machine learning tool, and the data storage device is located on a main training unit of the machine learning tool.
  3. The machine learning tool middleware according to claim 2, wherein the data distribution module of the main training unit is configured to receive requests from the data distribution modules of the other training units and to distribute data to them, and the data distribution module of each of the other training units stores the received data in the local storage unit of its own training unit.
  4. The machine learning tool middleware according to claim 1, wherein the model parameter update module collects the training information of the other training units, transmits the training information of the present training unit to the other training units, and updates the model parameters by averaging the model parameters of all the training units.
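(Illustrative sketch, not part of the claims.) The averaging update of claim 4 reduces to an element-wise mean over the parameter vectors of all training units, assuming the peers' vectors have already been gathered by the communication module:

```python
# Illustrative only: element-wise averaging of parameter vectors.
def average_update(local_params, peer_params_list):
    """Average this unit's parameters with those collected from its peers."""
    all_params = [local_params] + peer_params_list
    n = len(all_params)
    return [sum(vals) / n for vals in zip(*all_params)]

# Example with three units' parameter vectors:
print(average_update([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]]))  # -> [3.0, 4.0]
```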
  5. The machine learning tool middleware according to claim 1, wherein the machine learning tool further comprises a parameter server, and the model parameter update module transmits the training information of the present training unit to the parameter server, which updates the model parameters and sends them back.
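(Illustrative sketch, not part of the claims.) The exchange of claim 5 can be pictured with an in-process stand-in for the parameter server; the class and method names are assumptions made for illustration:

```python
# Illustrative only: an in-process stand-in for the parameter server.
class ToyParameterServer:
    def __init__(self, params):
        self.params = params

    def push_and_pull(self, gradients, lr=0.1):
        # Apply the unit's gradients, then return the latest parameters.
        self.params = [p - lr * g for p, g in zip(self.params, gradients)]
        return self.params

server = ToyParameterServer([0.5, -0.5])
latest = server.push_and_pull([0.2, -0.2])   # unit sends gradients, gets model back
print(latest)                                # -> [0.48, -0.48]
```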
  6. The machine learning tool middleware according to claim 1, wherein the underlying communication module is further configured to add an interlock mechanism between the various communications when implementing communication between the corresponding modules of different training units and communication between the training units.
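(Illustrative sketch, not part of the claims.) One plausible reading of the claim 6 interlock is that the different kinds of middleware traffic are serialized so that concurrent communications cannot interleave; the lock-per-channel scheme below is an assumption, not the patented design:

```python
# Illustrative only: one lock per communication kind keeps concurrent
# middleware calls from interleaving on the same channel.
import threading

class InterlockedComm:
    def __init__(self):
        self._locks = {kind: threading.Lock()
                       for kind in ("data", "params", "control")}

    def send(self, kind, payload):
        with self._locks[kind]:
            pass  # the actual network send would happen here, serialized

comm = InterlockedComm()
comm.send("params", b"\x00\x01")
```

Serializing each channel this way means a parameter exchange cannot be corrupted by a simultaneous data transfer, which is one motivation for such an interlock.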
  7. A machine learning training method for model training of a machine learning tool, the machine learning tool comprising at least one training unit, wherein each training unit is provided with a middleware integrated with the machine learning tool, the training units communicate through the middleware, and the training units complete model training by performing, through the middleware, at least one of the following training operations:
    distributing the required data from a data storage device to storage units accessible to the respective training units, so that each training unit can read data from its storage unit for training;
    collecting the training information of the other training units and updating the model parameters of the present training unit;
    collecting the training information of the other training units and adjusting the training parameters of the present training unit;
    collecting the training information of the other training units to determine whether to stop training.
  8. The machine learning training method according to claim 7, wherein the data storage device is configured to store all training data of the machine learning tool, and the data storage device is located on a main training unit of the machine learning tool.
  9. The machine learning training method according to claim 8, wherein distributing the required data from the data storage device to the storage units accessible to the respective training units, so that each training unit can read data from its storage unit for training, comprises:
    the main training unit receiving, through its middleware, requests issued by the middleware of the other training units and distributing data to the middleware of the other training units;
    the middleware of each of the other training units storing the received data in the local storage unit of its own training unit.
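(Illustrative sketch, not part of the claims.) The request/response flow of claim 9 can be simulated in a single process; the queue standing in for the worker's local storage and the direct method call standing in for the network are assumptions made for the example:

```python
# Illustrative only: an in-process simulation of the claim 9 data flow.
import queue

class MainUnitMiddleware:
    def __init__(self, shards):
        self.shards = shards                  # all training data lives on the main unit

    def handle_request(self, unit_id):
        return self.shards.get(unit_id, [])

class WorkerMiddleware:
    def __init__(self, unit_id, main):
        self.unit_id, self.main = unit_id, main
        self.local_storage = queue.Queue()    # stands in for the local storage unit

    def fetch(self):
        for item in self.main.handle_request(self.unit_id):
            self.local_storage.put(item)

main = MainUnitMiddleware({1: ["batch-a", "batch-b"]})
worker = WorkerMiddleware(1, main)
worker.fetch()
print(worker.local_storage.qsize())          # -> 2 items staged locally
```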
  10. The machine learning training method according to claim 7, wherein collecting the training information of the other training units and updating the model parameters of the present training unit comprises:
    collecting the training information of the other training units, transmitting the training information of the present training unit to the other training units, and updating the model parameters by averaging the model parameters of all the training units.
  11. The machine learning training method according to claim 7, wherein the machine learning tool further comprises a parameter server, and collecting the training information of the other training units and updating the model parameters of the present training unit comprises:
    transmitting the training information of the present training unit to the parameter server, which updates the model parameters and sends them back.
  12. The machine learning training method according to claim 7, wherein, when the training units communicate through the middleware, the method further comprises:
    adding an interlock mechanism between the various communications.
PCT/CN2016/109370 2015-12-22 2016-12-12 Machine learning tool middleware and training method of machine learning WO2017107788A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510975227.2 2015-12-22
CN201510975227.2A CN106909529B (en) 2015-12-22 2015-12-22 Machine learning tool middleware and machine learning training method

Publications (1)

Publication Number Publication Date
WO2017107788A1 (en)

Family

ID=59089049

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/109370 WO2017107788A1 (en) 2015-12-22 2016-12-12 Machine learning tool middleware and training method of machine learning

Country Status (2)

Country Link
CN (1) CN106909529B (en)
WO (1) WO2017107788A1 (en)


Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN107977712A (en) * 2017-12-20 2018-05-01 四川九洲电器集团有限责任公司 Network type machine learning system
CN109255234B (en) * 2018-08-15 2023-03-24 腾讯科技(深圳)有限公司 Processing method, device, medium and electronic equipment of machine learning model
CN109460826A (en) * 2018-10-31 2019-03-12 北京字节跳动网络技术有限公司 For distributing the method, apparatus and model modification system of data
CN110414187B (en) * 2019-07-03 2021-09-17 北京百度网讯科技有限公司 System and method for model safety delivery automation
CN112884159A (en) * 2019-11-30 2021-06-01 华为技术有限公司 Model updating system, model updating method and related equipment
CN115859990B (en) * 2023-02-17 2023-05-09 智慧眼科技股份有限公司 Information extraction method, device, equipment and medium based on meta learning


Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20100318516A1 (en) * 2009-06-10 2010-12-16 Google Inc. Productive distribution for result optimization within a hierarchical architecture
CN104217022A (en) * 2014-09-25 2014-12-17 天津大学 Distributive big data classifying system and method based on alternating direction method of multipliers

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20150019214A1 (en) * 2013-07-10 2015-01-15 Tencent Technology (Shenzhen) Company Limited Method and device for parallel processing in model training
CN105184367A (en) * 2014-06-09 2015-12-23 讯飞智元信息科技有限公司 Model parameter training method and system for depth neural network
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device
CN104714852A (en) * 2015-03-17 2015-06-17 华中科技大学 Parameter synchronization optimization method and system suitable for distributed machine learning
CN104980518A (en) * 2015-06-26 2015-10-14 深圳市腾讯计算机系统有限公司 Method, device and system of multi-learning subject parallel training model

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN109343895A (en) * 2018-09-18 2019-02-15 郑州云海信息技术有限公司 A kind of processing method of operational order, device and computer readable storage medium
CN109343895B (en) * 2018-09-18 2021-05-04 郑州云海信息技术有限公司 Method and device for processing operation instruction and computer readable storage medium

Also Published As

Publication number Publication date
CN106909529A (en) 2017-06-30
CN106909529B (en) 2020-12-01

Legal Events

Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16877599; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16877599; Country of ref document: EP; Kind code of ref document: A1)