CN110472747B - Distributed system for executing multi-machine learning task and method thereof - Google Patents


Info

Publication number: CN110472747B (application number CN201910759163.0A)
Authority: CN (China)
Prior art keywords: machine learning, tasks, computing device, parameter server, learning tasks
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110472747A (application publication)
Inventors: 郑淇木, 焦英翔, 石光川
Current and original assignee: 4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd
Related application: CN202210960020.8A (published as CN115345318A)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

A distributed system for performing multiple machine learning tasks and a method thereof are provided. The distributed system includes: a plurality of computing devices configured to respectively acquire different portions of a specified data set and to collectively perform a plurality of machine learning tasks, wherein each computing device is configured to execute the plurality of machine learning tasks in parallel based on the portion of the data it has acquired, and wherein the plurality of machine learning tasks are a plurality of model training tasks or a plurality of model prediction tasks. The distributed system and its method can effectively shorten the time required to complete multiple machine learning tasks.

Description

Distributed system for executing multi-machine learning task and method thereof
Technical Field
The present invention relates generally to the field of artificial intelligence, and more particularly, to a distributed system for performing multi-machine learning tasks and a method thereof.
Background
The performance of a machine learning training task is typically determined by the values of a large number (e.g., tens) of configuration parameters (i.e., hyper-parameters). In scenarios such as automatic machine learning, evaluating how training tasks perform under different configurations often requires trying many different values and combinations of the configuration parameters, and then training and evaluating a machine learning model for each configuration.
In the process of searching for the optimal machine learning model, the number of training tasks to be executed grows exponentially with the number of configuration parameters and the number of possible values of each parameter. For example, for a training task with only 10 configuration parameters, if each parameter has 3 possible values, the 10 parameters yield 3^10 = 59049 possible combinations, corresponding to 59049 machine learning training tasks.
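The arithmetic can be reproduced with a short sketch; the parameter names and candidate values below are purely illustrative and not taken from the patent.

```python
from itertools import product

# Hypothetical search space: 10 configuration parameters, 3 candidate values each.
# The parameter names and values are illustrative only.
search_space = {f"param_{i}": [0.1, 0.5, 1.0] for i in range(10)}

# Every combination of values corresponds to one machine learning training task.
configs = list(product(*search_space.values()))
print(len(configs))  # 3 ** 10 = 59049
```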
In the prior art, training tasks under different configurations are usually executed independently of one another, which leads to a significant task-execution-efficiency problem: automatic machine learning and other tasks that require training multiple machine learning models cannot be completed within a reasonable time.
Disclosure of Invention
Exemplary embodiments of the present invention provide a distributed system for performing multiple machine learning tasks and a method thereof, which can solve the problem in the prior art that multiple machine learning tasks cannot be completed within a reasonable time.
According to an exemplary embodiment of the present invention, there is provided a distributed system for performing multiple machine learning tasks, comprising: a plurality of computing devices configured to respectively acquire different portions of a specified data set and to collectively perform a plurality of machine learning tasks, wherein each computing device is configured to execute the plurality of machine learning tasks in parallel based on the portion of the data it has acquired, and wherein the plurality of machine learning tasks are a plurality of model training tasks or a plurality of model prediction tasks.
Optionally, each computing device is configured to: on the one hand, request data in the specified data set from a data source, preprocess the requested data, and store it locally; on the other hand, read the locally stored data and execute the plurality of machine learning tasks based on the read data.
Optionally, each computing device is configured to: for each piece of data read, execute in parallel those of the plurality of machine learning tasks that need to use that piece of data.
Optionally, each computing device is configured to: for each piece of data read, execute in parallel, using vectorization instructions, those of the plurality of machine learning tasks that need to use that piece of data.
Optionally, the distributed system further comprises: a parameter server configured to maintain parameters of a plurality of machine learning models involved in the plurality of machine learning tasks, wherein the parameters of the machine learning models take the form of key-value pairs, and wherein the parameter server is configured to: merge parameters of the plurality of machine learning models that share the same key into the form of a single key corresponding to multiple values and store them, or merge the parameters in this way and then compress the merged result according to a first compression mode before storing it.
Optionally, when the plurality of machine learning tasks are the plurality of model training tasks, each computing device is configured to: provide the merged results of training the plurality of machine learning models to the parameter server, or provide those merged results compressed in a second compression mode to the parameter server, so that the parameter server updates the parameters of the plurality of machine learning models, wherein the results are merged in the form of a single key corresponding to multiple values.
Optionally, the parameter server is configured to: provide each computing device with the merged parameters of the plurality of machine learning models that the device requires, or provide each computing device with those merged parameters compressed in a third compression mode.
Optionally, when the plurality of machine learning tasks are the plurality of model training tasks, the parameter server is configured to: while each computing device executes the plurality of machine learning tasks, save intermediate computation results that are produced when the computing device trains one machine learning model and that can also be used for other machine learning models, so that those intermediate results are reused for the other machine learning models.
Optionally, the hyper-parameters of the one machine learning model and of the other machine learning models differ only in the number of training rounds, where the number of training rounds of the one machine learning model is greater than that of the other machine learning models, and the parameter server is configured to: take the parameters of the one machine learning model obtained when, during its training by the computing device, the number of completed training rounds reaches the number of training rounds of another machine learning model as the parameters of that other machine learning model.
Optionally, each computing device is configured to: set the network configuration used by the plurality of machine learning tasks; and/or the parameter server is configured to: set the network configuration for the plurality of machine learning tasks.
Optionally, each computing device is configured to: perform network transmission for the plurality of machine learning tasks using a zero-copy technique; and/or set the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks; and/or the parameter server is configured to: perform network transmission for the plurality of machine learning tasks using a zero-copy technique; and/or set the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks.
Optionally, each computing device is configured to: configure the memory used by the plurality of machine learning tasks; and/or the parameter server is configured to: configure memory for the plurality of machine learning tasks.
Optionally, each computing device is configured to: bind the plurality of machine learning tasks to a set of central processing units (CPUs) so that the set of CPUs executes the plurality of machine learning tasks using memory close to them; and/or configure a memory management unit for the plurality of machine learning tasks so that the operating system and CPUs of the computing device manage the memory used by the plurality of machine learning tasks in the configured memory management unit; and/or the parameter server is configured to: configure a memory management unit for the plurality of machine learning tasks so that the operating system and CPUs of the parameter server manage the memory used by the tasks related to the plurality of machine learning tasks in the configured memory management unit; and/or bind the tasks related to the plurality of machine learning tasks to a set of central processing units (CPUs) so that the set of CPUs executes those tasks using memory close to them.
According to another exemplary embodiment of the present invention, a method for performing a multi-machine learning task using a distributed system is provided, wherein the distributed system comprises a plurality of computing devices, wherein the method comprises: the plurality of computing devices respectively acquire different partial data of the designated data set; the plurality of computing devices collectively execute a plurality of machine learning tasks based on the acquired partial data, wherein each computing device executes the plurality of machine learning tasks in parallel based on the partial data acquired by itself, wherein the plurality of machine learning tasks are a plurality of model training tasks or a plurality of model prediction tasks.
Optionally, the step of the plurality of computing devices respectively acquiring different partial data of the designated data set comprises: each computing device requesting data in the specified dataset from a data source; each computing device preprocesses the requested data and stores the preprocessed data locally, wherein the step of executing the plurality of machine learning tasks in parallel by each computing device based on the partial data acquired by each computing device comprises the following steps: each computing device reads the locally stored data and performs the plurality of machine learning tasks based on the read data.
Optionally, the step of each computing device executing the plurality of machine learning tasks based on the read data comprises: each computing device executes, in parallel, based on each piece of data read, a machine learning task of the plurality of machine learning tasks that requires use of the piece of data.
Optionally, the step of each computing device executing the plurality of machine learning tasks based on the read data comprises: each computing device executes, in parallel, based on each piece of data read, a machine learning task that requires use of the piece of data among the plurality of machine learning tasks using vectorization instructions.
Optionally, the distributed system further includes a parameter server, wherein the method further includes: the parameter server maintains parameters of a plurality of machine learning models involved in the plurality of machine learning tasks, wherein the parameters of the machine learning models take the form of key-value pairs, and the parameter server merges parameters of the plurality of machine learning models that share the same key into the form of a single key corresponding to multiple values and stores them, or the parameter server merges the parameters in this way and compresses the merged result according to a first compression mode before storing it.
Optionally, the method further comprises: when the plurality of machine learning tasks are the plurality of model training tasks, each computing device provides the merged results of training the plurality of machine learning models to the parameter server, or each computing device provides the merged results of training the plurality of machine learning models compressed in the second compression manner to the parameter server to cause the parameter server to update the parameters of the plurality of machine learning models, wherein the results are merged in a form that a single key corresponds to a plurality of values.
Optionally, the method further comprises: the parameter server provides each computing device with the merged parameters of the plurality of machine learning models that the device requires, or the parameter server provides each computing device with those merged parameters compressed in a third compression mode.
Optionally, the method further comprises: when the plurality of machine learning tasks are the plurality of model training tasks, the parameter server stores intermediate calculation results which are generated when the computing device trains one machine learning model and can be used for other machine learning models in the process that each computing device executes the plurality of machine learning tasks, so that the intermediate calculation results are used for the other machine learning models.
Optionally, only the number of training rounds in the hyper-parameters corresponding to the one machine learning model and the other machine learning models is different, where the number of training rounds corresponding to the one machine learning model is greater than the number of training rounds corresponding to the other machine learning models, and the parameter server takes the parameter of the one machine learning model, obtained when the number of training rounds reaches the number of training rounds corresponding to the other machine learning models in the process of training the one machine learning model, as the parameter of the other machine learning models.
Optionally, the method further comprises: each computing device setting a network configuration used by the plurality of machine learning tasks; and/or the parameter server sets a network configuration for the plurality of machine learning tasks.
Optionally, the step of each computing device setting the network configuration used by the plurality of machine learning tasks comprises: each computing device using a zero-copy technique for network transmission for the plurality of machine learning tasks; and/or setting the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks; wherein the step of the parameter server setting the network configuration for the plurality of machine learning tasks comprises: the parameter server using a zero-copy technique for network transmission for the plurality of machine learning tasks; and/or setting the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks.
Optionally, the method further comprises: each computing device configuring memory used by the plurality of machine learning tasks; and/or the parameter server configures memory for the plurality of machine learning tasks.
Optionally, the step of configuring, by each computing device, memory used by the plurality of machine learning tasks comprises: each computing device binding the plurality of machine learning tasks with a set of Central Processing Units (CPUs) such that the set of CPUs use memory proximate thereto to execute the plurality of machine learning tasks; and/or configuring a memory management unit for the plurality of machine learning tasks, so that an operating system and a CPU of the computing device manage memories used by the plurality of machine learning tasks in the configured memory management unit; wherein the step of the parameter server configuring the memory for the plurality of machine learning tasks comprises: the parameter server configures memory management units for the plurality of machine learning tasks, so that an operating system and a CPU of the parameter server manage memories used by tasks related to the plurality of machine learning tasks by the configured memory management units; and/or the parameter server binds tasks related to the plurality of machine learning tasks with a set of Central Processing Units (CPUs) so that the set of CPUs use memory adjacent thereto to execute the tasks related to the plurality of machine learning tasks.
According to the distributed system and method for executing multiple machine learning tasks disclosed in the exemplary embodiments of the present invention, the time required to complete multiple machine learning tasks can be effectively shortened, so that they can be completed within a reasonable time.
Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
Drawings
The above and other objects and features of exemplary embodiments of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate exemplary embodiments, wherein:
FIG. 1 illustrates a block diagram of a distributed system for performing multi-machine learning tasks, according to an exemplary embodiment of the invention;
FIG. 2 illustrates an example of a distributed system performing a multi-machine learning task in accordance with an illustrative embodiment of the present invention;
FIG. 3 illustrates a block diagram of a distributed system for performing multi-machine learning tasks, according to another exemplary embodiment of the present invention;
FIG. 4 illustrates an example of a parameter server storing parameters for a plurality of machine learning models, according to an exemplary embodiment of the invention;
FIG. 5 illustrates an example of transmission of parameters of a multi-machine learning model according to an exemplary embodiment of the present invention;
FIG. 6 illustrates an example of parallel execution of a multi-machine learning task in accordance with an illustrative embodiment of the present invention;
FIG. 7 illustrates a flowchart of a method for performing a multi-machine learning task using a distributed system, according to an exemplary embodiment of the invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.
FIG. 1 illustrates a block diagram of a distributed system for performing multi-machine learning tasks, according to an exemplary embodiment of the invention. As shown in FIG. 1, a distributed system for performing multi-machine learning tasks according to an exemplary embodiment of the invention includes a plurality of computing devices 1000 (e.g., 1000-1, 1000-2, …, 1000-n (where n is an integer greater than 1)).
Specifically, a plurality of computing devices 1000 are configured to respectively acquire different partial data of a specified data set and collectively perform a plurality of machine learning tasks, wherein each computing device 1000 is configured to: the plurality of machine learning tasks are performed in parallel based on the partial data acquired by itself. In other words, different computing devices 1000 collectively execute the same plurality of machine learning tasks for different data, and the same computing device 1000 executes the plurality of machine learning tasks in parallel.
Here, the plurality of machine learning tasks are a plurality of model training tasks or a plurality of model prediction tasks. By executing a model training task, the computing device 1000 may update the parameters of the corresponding machine learning model; a model prediction task is a task of performing prediction using a machine learning model, and by executing a model prediction task the computing device 1000 may obtain prediction results using the corresponding machine learning model.
As an example, the plurality of model training tasks may be: multiple model training tasks that use the same machine learning algorithm but differ in training configuration (e.g., hyper-parameter configuration); alternatively, the plurality of model training tasks may use different machine learning algorithms, which may be different machine learning algorithms belonging to the same type (e.g., different machine learning algorithms belonging to the same neural network type but different in specific structure (e.g., depth of neural network, etc.)), or different machine learning algorithms belonging to different types. For example, the types of machine learning algorithms may include, but are not limited to: linear regression algorithm, neural network algorithm, FM algorithm. In other words, the plurality of machine learning models respectively trained by the plurality of model training tasks may be machine learning models of the same type and the same structure, or may be machine learning models of the same type and different structures, or may be machine learning models of different types.
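As a non-authoritative illustration of how such a set of training tasks might be described in code, the following Python sketch represents several tasks that share one algorithm but differ in hyper-parameter configuration; the class and field names are assumptions, not part of the disclosed system.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingTask:
    """One model training task: an algorithm plus its hyper-parameter configuration."""
    model_id: str
    algorithm: str                      # e.g. "linear_regression", "neural_network", "fm"
    hyper_params: dict = field(default_factory=dict)

# Three tasks that share one algorithm but differ in configuration; all of them
# will be trained on the same specified data set by the computing devices.
tasks = [
    TrainingTask("A", "neural_network", {"learning_rate": 0.1,  "rounds": 30}),
    TrainingTask("B", "neural_network", {"learning_rate": 0.01, "rounds": 20}),
    TrainingTask("C", "neural_network", {"learning_rate": 0.01, "rounds": 10}),
]
```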
Each computing device 1000 is configured to obtain a portion of the data of the specified data set, the data obtained by different computing devices 1000 do not intersect, and the collection of data obtained by different computing devices 1000 is exactly the specified data set. As an example, each computing device 1000 may be configured to: on one hand, the data in the specified data set is requested from a data source, the requested data is preprocessed and then stored locally, on the other hand, the locally stored data is read, and the plurality of machine learning tasks are executed based on the read data.
In the prior art, each computing device typically independently executes one machine learning task, and when multiple computing devices execute multiple machine learning tasks simultaneously and the multiple machine learning tasks share the same data set, each computing device needs to read the entire data set separately, that is, the entire data set will be read multiple times. According to an exemplary embodiment of the present invention, each computing device 1000 only needs to read a part of the data set, not all, and each piece of data in the data set is only read once and is not read repeatedly, which greatly saves the time for the computing device 1000 to read data from the data source and to perform the subsequent preprocessing on the read data.
As an example, in each computing device 1000, the task of requesting data from the data source and the plurality of machine learning tasks may be performed by two threads (or two groups of threads), respectively, i.e., the task of requesting data from the data source uses a different thread than the plurality of machine learning tasks.
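A minimal Python sketch of this two-thread-group arrangement is given below; `fetch_block`, `preprocess`, and the per-task `consume` method are hypothetical placeholders for the data-source request, the preprocessing step, and the task-side update described above, not APIs defined by the patent.

```python
import queue
import threading

def run_device(data_source, tasks, preprocess):
    """One computing device: one thread requests and preprocesses data from the data
    source, while a second thread reads the locally stored data and runs every
    machine learning task that needs each piece of data."""
    local_store = queue.Queue(maxsize=1024)   # locally saved, preprocessed data

    def fetch():
        while (block := data_source.fetch_block()) is not None:   # hypothetical API
            local_store.put(preprocess(block))
        local_store.put(None)                                     # signal: data set exhausted

    def compute():
        while (piece := local_store.get()) is not None:
            for task in tasks:             # every task that needs this piece of data
                task.consume(piece)        # hypothetical per-task update step

    threads = [threading.Thread(target=fetch), threading.Thread(target=compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```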
As an example, a data source, upon receiving a request from each computing device 1000, may allocate data in a specified data set (e.g., may allocate one piece of data or one data block containing multiple pieces of data at a time) until all of the data in the specified data set is allocated. Thus, each piece of data in the given data set is read by only one computing device 1000, i.e., each piece of data is read only once. For example, each computing device 1000 may acquire data in a designated data set in a competing manner, with computing devices 1000 having greater processing power acquiring more data.
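A possible data-source counterpart to the hypothetical `fetch_block` call used above, handing out each block exactly once so that faster devices naturally obtain more data, might look like the following sketch (illustrative only):

```python
import threading

class DataSource:
    """Hands out each block of the specified data set exactly once, so no piece of
    data is read twice; devices that request more often (i.e. have more processing
    power) naturally obtain more blocks."""
    def __init__(self, blocks):
        self._blocks = iter(blocks)
        self._lock = threading.Lock()

    def fetch_block(self):
        with self._lock:
            return next(self._blocks, None)   # None once the whole data set is allocated
```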
Fig. 2 illustrates an example of a distributed system performing multiple machine learning tasks according to an exemplary embodiment of the present invention. As shown in Fig. 2, when the distributed system according to an exemplary embodiment of the present invention includes 4 computing devices 1000 (i.e., 1000-1, 1000-2, 1000-3, 1000-4), the 4 computing devices 1000 respectively obtain a portion of the data set D from a data source (e.g., a data warehouse); for example, each of the 4 computing devices 1000 obtains 1/4 of the data set D, and the union of the data obtained by the 4 computing devices 1000 is the entire data set D, i.e., D1 + D2 + D3 + D4 = D. The 4 computing devices 1000 collectively perform a plurality of model training tasks (i.e., training machine learning model A, training machine learning model B, and training machine learning model C) based on the data acquired from the data source; specifically, each computing device 1000 trains model A, model B, and model C in parallel based on the 1/4 of data set D that it acquired. In other words, all 4 computing devices 1000 participate in the training of each model (e.g., model A) based on the data they acquired, and each computing device 1000 trains the multiple machine learning models in parallel based on the data it acquired.
As an example, the locally stored data used by the plurality of machine learning tasks may be identical or partially identical.
As an example, each computing device 1000 may be configured to: for each piece of data read, execute in parallel those of the plurality of machine learning tasks that need to use that piece of data. When the locally saved data used by the plurality of machine learning tasks is identical, the plurality of machine learning tasks are executed in parallel for each piece of data read. For example, when the plurality of machine learning tasks are a plurality of model training tasks and the model training tasks need to train their respective machine learning models using the same data set, the plurality of machine learning models may be trained in parallel for each piece of data read. According to the exemplary embodiment of the present invention, on the one hand, a piece of data read from local storage once can be used for multiple machine learning tasks, avoiding repeated reads and improving data reading efficiency; on the other hand, executing multiple machine learning tasks in parallel on the same data effectively shortens the execution time of the multiple machine learning tasks.
As an example, each computing device 1000 may be configured to: for each piece of data read, execute in parallel, using vectorization instructions, those of the plurality of machine learning tasks that need to use that piece of data. For example, when the plurality of machine learning tasks are a plurality of model training tasks, the same vectorization instruction may be used to calculate the update amounts of the plurality of machine learning models from each piece of data read, thereby reducing both the amount of computation and the computation time needed to obtain the update amounts of the plurality of machine learning models.
Here, a vectorization (SIMD, single instruction multiple data) instruction refers to an instruction provided by the central processing unit that can operate on multiple pieces of data with a single instruction, completing more computation in the same time than an ordinary instruction.
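The following sketch illustrates the idea using NumPy, whose array operations are typically executed with SIMD-capable kernels; it assumes, purely for illustration, a set of linear models with squared loss, so that the update amounts of all models are obtained from one piece of data by a single vectorized expression rather than one model at a time.

```python
import numpy as np

def batched_updates(x, y, W, lr=0.01):
    """Update amounts for several models computed from one piece of data.

    x : feature vector, shape (d,)
    y : label (scalar)
    W : parameters of n models stacked row-wise, shape (n, d)

    One vectorized expression yields the update amounts of all n models, instead
    of issuing separate instructions per model."""
    preds = W @ x                   # predictions of all n models at once
    errors = preds - y              # per-model error, shape (n,)
    grads = np.outer(errors, x)     # per-model gradients, shape (n, d)
    return -lr * grads              # update amounts for all n models

# Example: update 3 models over 5 features from a single piece of data.
W = np.zeros((3, 5))
W += batched_updates(np.ones(5), 1.0, W)
```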
Furthermore, it should be understood that executing in parallel, for each piece of data read, those of the plurality of machine learning tasks that need to use that piece of data may also be implemented in other suitable ways to speed up execution. As an example, multiple threads or multiple groups of threads (or hardware hyper-threads) may be used to execute in parallel the machine learning tasks that need to use the piece of data, i.e., different machine learning tasks use different threads. In addition, a processor with higher parallelism in the computing device 1000 (e.g., a graphics processing unit (GPU) or a field programmable gate array (FPGA)) may be used to execute in parallel the machine learning tasks that need to use the piece of data, so as to achieve a better parallel execution effect and further speed up the execution of the multiple machine learning tasks.
FIG. 3 illustrates a block diagram of a distributed system for performing multi-machine learning tasks, according to another exemplary embodiment of the present invention. As shown in fig. 3, a distributed system for performing a multi-machine learning task according to another exemplary embodiment of the present invention may be composed of a plurality of computing devices 1000 and a parameter server 2000.
In particular, the parameter server 2000 is configured to maintain parameters of a plurality of machine learning models involved in the plurality of machine learning tasks, wherein the parameters of the machine learning models are in the form of key-value pairs (key-values).
It should be noted that the computing device 1000 and/or the parameter server 2000 are each defined by the processing they perform or the functions they implement, and may denote either a physical entity or a virtual entity. For example, the computing device 1000 may denote an actual computing machine or a logical entity deployed on a computing machine; likewise, the parameter server 2000 may denote an actual computing machine or one or more logical entities deployed on the same computing machine as the computing device 1000 and/or on a different computing machine. By way of example, the parameter server 2000 may be deployed on a single computing machine, or it may be deployed on multiple computing machines simultaneously.
As an example, the parameter server 2000 may be configured to: multiple key-value pairs having the same key among the parameters of the plurality of machine learning models may be stored in a form in which a single key corresponds to a plurality of values, so as to avoid storing a large amount of duplicated information in the parameter server 2000.
Further, as an example, the parameter server 2000 may be configured to: merge parameters of the plurality of machine learning models that share the same key into the form of a single key corresponding to multiple values, and compress the merged result according to a first compression mode before storing it. That is, the parameters of the plurality of machine learning models are compressed again after the same-key merging; in other words, the repeated information in the parameters of the plurality of machine learning models is merged and the non-repeated information is compressed, further reducing the storage overhead of the parameter server 2000.
Fig. 4 illustrates an example in which a parameter server stores parameters of a plurality of machine learning models according to an exemplary embodiment of the present invention. As shown in Fig. 4, each machine learning model corresponds to a set of key-value pairs in which the keys are distinct and each key corresponds to its own value. For example, the set of key-value pairs corresponding to machine learning model 1 includes at least the keys k1, k2, k3, ..., km, corresponding respectively to the values vm1,1, vm1,2, vm1,3, ..., vm1,m; machine learning model 2 corresponds to another set of key-value pairs, which includes at least the keys k1, k2, k3, ..., km, corresponding respectively to the values vm2,1, vm2,2, vm2,3, ..., vm2,m, where m is an integer greater than 1. It can be seen that at least some key-value pairs in the two sets have identical keys. According to an exemplary embodiment of the present invention, when saving the parameters of multiple machine learning models, the parameter server 2000 merges the key-value pairs of different machine learning models that have the same key into the form of a single key corresponding to multiple values and stores them, for example as key k1 corresponding to the values vm1,1, vm2,1, vm3,1, ..., vmn,1. On this basis, the merged parameters can be further compressed, for example using the h compression function, thereby avoiding a linear growth in storage overhead when the parameters of multiple machine learning models are stored at the same time.
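A minimal sketch of such a merged-and-compressed storage layout is shown below; zlib and pickle stand in for the unspecified 'h' compression function and serialization, and are assumptions rather than the patent's implementation.

```python
import pickle
import zlib

class ParameterStore:
    """Key -> compressed list of per-model values, so keys shared by several models
    are stored only once (zlib/pickle stand in for the 'h' compression function)."""
    def __init__(self):
        self._table = {}

    def put(self, key, values_per_model):
        # values_per_model: [v_m1, v_m2, ..., v_mn] for this key
        self._table[key] = zlib.compress(pickle.dumps(values_per_model))

    def get(self, key):
        return pickle.loads(zlib.decompress(self._table[key]))

store = ParameterStore()
store.put("k1", [0.12, 0.13, 0.11])   # one key, values of models 1..3
print(store.get("k1"))
```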
As an example, the parameter server 2000 may be configured to provide each computing device 1000 with the parameters of the plurality of machine learning models that the device needs to read in order to execute the plurality of model training tasks, so that each computing device 1000 trains the plurality of machine learning models using the read parameters, and the parameter server 2000 updates the parameters of the plurality of machine learning models according to the results (e.g., model update amounts) obtained by each computing device 1000 from training the plurality of machine learning models. Alternatively, the parameter server 2000 may provide each computing device 1000 with the parameters of the plurality of machine learning models that the device needs to read in order to execute the plurality of model prediction tasks, so that each computing device 1000 performs prediction with the plurality of machine learning models using the read parameters.
As an example, when the plurality of machine learning tasks are the plurality of model training tasks, each computing device 1000 may be configured to: provide the results of training the plurality of machine learning models to the parameter server 2000 in the form of a single key corresponding to multiple values, so that the parameter server 2000 updates the parameters of the plurality of machine learning models. On this basis, the same-key-merged training results may be further compressed according to a second compression mode before being provided to the parameter server 2000; that is, the merged and compressed training results may be provided to the parameter server 2000. This effectively avoids transmitting repeated information, reduces the amount of data that needs to be transmitted between the computing device 1000 and the parameter server 2000, and effectively reduces the network overhead between them.
As an example, the parameter server 2000 may be configured to: provide each computing device 1000 with the key-value pairs that share the same key among the parameters of the plurality of machine learning models required by that device, in the form of a single key corresponding to multiple values. In addition, the merged parameters of the plurality of machine learning models required by each computing device may also be compressed in a third compression mode before being provided to that computing device.
It should be understood that the first, second, and third compression modes may be the same as or different from one another. For example, the first compression mode and the third compression mode may be the same or different. When they are different and the parameter server 2000 sends a computing device 1000 the parameters of the plurality of machine learning models it requires, the parameters stored in the parameter server 2000 in the first compression mode may first be decompressed, then recompressed in the third compression mode and sent to the computing device 1000; alternatively, the parameter server 2000 may compress the parameters stored in the first compression mode again in the third compression mode and then send them to the computing device 1000, thereby reducing the network overhead between the computing device 1000 and the parameter server 2000.
Fig. 5 illustrates an example of transmission of parameters of multiple machine learning models according to an exemplary embodiment of the present invention. As shown in Fig. 5, when the parameter server 2000 provides the computing device 1000 with parameters of a plurality of machine learning models (e.g., the key-value pairs for key k1), it may provide them in the form of a single key corresponding to multiple values, e.g., k1: [vm1,1, vm2,1, vm3,1, ..., vmn,1], and may further compress the same-key-merged parameters before providing them, e.g., f(k1, [vm1,1, vm2,1, vm3,1, ..., vmn,1]); that is, the computing device 1000 is provided with the merged and compressed parameters of the plurality of machine learning models that it requires. As shown in Fig. 5, the f function is a compression function; it should be understood that the h function and the f function may be the same compression function or different compression functions. According to the exemplary embodiment of the present invention, by merging the repeated information in the data to be transmitted and compressing the non-repeated information, the network transmission overhead between the computing device 1000 and the parameter server 2000 is effectively reduced and the transmission cost is lowered.
FIG. 6 illustrates an example of parallel execution of multiple machine learning tasks according to an exemplary embodiment of the present invention. In the prior art, a separate instruction has to be used for each machine learning model to obtain its model update amount (i.e., the model training result), whereas according to an exemplary embodiment of the present invention the same vectorization instruction may be used to obtain the model update amounts of multiple machine learning models at once. Further, as an example, the computing device 1000 may provide the merged and compressed model training results (e.g., the update amounts for at least one parameter of the plurality of machine learning models, such as the parameter corresponding to key k1) to the parameter server 2000. When the parameter server 2000 receives the merged and compressed model training results uploaded by the computing device 1000, it may decompress the received model training results, decompress the parameters of the plurality of machine learning models it has stored, update the decompressed parameters k1: [vm1,1, vm2,1, vm3,1, ..., vmn,1] using the decompressed update amounts k1: [Δm1,1, Δm2,1, Δm3,1, ..., Δmn,1], which are in the form of a single key corresponding to multiple values, and then compress and store the updated parameters of the plurality of machine learning models.
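The upload-and-update cycle described above can be sketched as follows, again using zlib/pickle as illustrative stand-ins for the compression and serialization actually used; the table layout matches the merged single-key-to-multiple-values form.

```python
import pickle
import zlib

def apply_updates(table, compressed_updates):
    """table: key -> compressed list of per-model parameter values.
    compressed_updates: compressed {key: [delta_m1, ..., delta_mn]} from a device."""
    updates = pickle.loads(zlib.decompress(compressed_updates))
    for key, deltas in updates.items():
        values = pickle.loads(zlib.decompress(table[key]))        # decompress stored parameters
        merged = [v + d for v, d in zip(values, deltas)]          # apply each model's delta
        table[key] = zlib.compress(pickle.dumps(merged))          # recompress and save

table = {"k1": zlib.compress(pickle.dumps([0.12, 0.13, 0.11]))}
payload = zlib.compress(pickle.dumps({"k1": [0.01, -0.02, 0.00]}))
apply_updates(table, payload)
```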
As an example, when the plurality of machine learning tasks are the plurality of model training tasks, the parameter server 2000 may be configured to: while each computing device 1000 executes the plurality of machine learning tasks, save intermediate computation results that are produced when the computing device 1000 trains one machine learning model and that can also be used for other machine learning models, so that those intermediate results are available to the other machine learning models. This prevents multiple machine learning tasks from repeatedly computing reusable information and improves the usage efficiency of computing resources.
As an example, the hyper-parameters of the one machine learning model and of the other machine learning models may differ only in the number of training rounds, where the number of training rounds of the one machine learning model is greater than that of the other machine learning models. In this case the parameter server 2000 may be configured to: take the parameters of the one machine learning model obtained when, during its training by the computing device 1000, the number of completed training rounds reaches the number of training rounds corresponding to another machine learning model as the parameters of that other machine learning model. For example, if the hyper-parameters corresponding to model training task 1, model training task 2, and model training task 3 differ only in the number of training rounds, with model training task 1 training for 30 rounds, model training task 2 for 20 rounds, and model training task 3 for 10 rounds, then the parameters of the machine learning model obtained when model training task 1 reaches 10 training rounds can be used as the parameters of the machine learning model corresponding to model training task 3, and the parameters obtained when model training task 1 reaches 20 training rounds can be used as the parameters of the machine learning model corresponding to model training task 2.
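A sketch of this round-count reuse, assuming a hypothetical `train_one_round` step and a dictionary of per-task round counts (both placeholders, not the patent's API), could look like this:

```python
import copy

def train_with_shared_rounds(initial_params, train_one_round, rounds_per_task):
    """rounds_per_task: e.g. {"task_3": 10, "task_2": 20, "task_1": 30}.
    Only the task with the most rounds is actually trained; the parameters reached
    at each smaller round count are saved as that task's final model parameters."""
    params = copy.deepcopy(initial_params)
    snapshots = {}
    for round_no in range(1, max(rounds_per_task.values()) + 1):
        params = train_one_round(params)          # hypothetical single-round training step
        for task, rounds in rounds_per_task.items():
            if rounds == round_no:
                snapshots[task] = copy.deepcopy(params)
    return snapshots
```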
Considering that in the exemplary embodiment of the present invention information related to multiple machine learning models needs to be transmitted simultaneously between the computing device 1000 and the parameter server 2000, the volume of a single transmitted message is significantly larger than when each machine learning task is executed independently. Therefore, according to an exemplary embodiment of the present invention, the network configuration can be adapted to the task type to meet the network transmission requirements specific to the parallel execution of multiple machine learning tasks. Specifically, the network transmission parameters can be adjusted automatically for the plurality of machine learning tasks so that the network transmits large messages more efficiently, thereby improving network utilization and task completion efficiency.
As an example, each computing device 1000 may be configured to: setting a network configuration used by the plurality of machine learning tasks.
As an example, each computing device 1000 may be configured to: use a zero-copy technique for network transmission for the plurality of machine learning tasks. The zero-copy technique allows the network hardware (e.g., a network card) to transmit data in memory directly, without first copying the data from memory into the network hardware's cache for transmission. This transmission mode provides a better acceleration effect in multi-machine-learning-task scenarios where individual data packets are large.
As an example, each computing device 1000 may automatically use a zero-copy technique when receiving data from or sending data to the parameter server 2000 for the plurality of machine learning tasks.
As another example, each computing device 1000 may be configured to: set the size of the maximum transmission unit (MTU) used in network transmission for the plurality of machine learning tasks. By setting a larger MTU, a large data packet can be split at the network layer into a smaller number of larger network transmission packets, so that the network can transmit data at a higher rate.
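By way of illustration only, a computing device might apply such settings as in the sketch below: os.sendfile is one zero-copy primitive available from Python, and the standard `ip link` command raises the MTU. The interface name and the value 9000 are examples; raising the MTU requires root privileges and a network that supports jumbo frames, and neither call is prescribed by the patent.

```python
import os
import socket
import subprocess

def send_file_zero_copy(sock: socket.socket, path: str) -> None:
    """Send a locally stored file over an established socket without copying it
    through user-space buffers (os.sendfile is one zero-copy primitive)."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            offset += os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)

def set_mtu(interface: str = "eth0", mtu: int = 9000) -> None:
    """Raise the maximum transmission unit so large parameter messages are split
    into fewer, larger packets (illustrative values; requires root)."""
    subprocess.run(["ip", "link", "set", "dev", interface, "mtu", str(mtu)], check=True)
```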
As an example, the parameter server 2000 may be configured to: setting a network configuration for the plurality of machine learning tasks.
As an example, the parameter server 2000 may be configured to: network transmission using a zero-copy technique for the plurality of machine learning tasks; and/or setting a size of a maximum transmission unit in network transmission for the plurality of machine learning tasks.
Considering that in the exemplary embodiment of the present invention information related to multiple machine learning models needs to be stored simultaneously during the execution of multiple machine learning tasks, the memory required for a single allocation is larger than when the machine learning tasks are executed independently of one another, even though the total amount of information requires less memory thanks to the merging and compression of model information. Therefore, according to an exemplary embodiment of the present invention, memory can be configured adaptively according to the task type to meet the memory management requirements specific to the parallel execution of multiple machine learning tasks. Specifically, the memory allocation parameters can be adjusted automatically for the plurality of machine learning tasks to improve memory allocation efficiency and memory usage efficiency.
As an example, each computing device 1000 may be configured to: configure the memory used by the plurality of machine learning tasks.
Under a multi-socket central processing unit (CPU) architecture, the speed at which a CPU allocates and accesses memory depends on the physical location of that memory. In the scenario of parallel execution of multiple machine learning tasks, the amount of data accessed by the program in a single access is higher, and so is the bandwidth demand of cross-CPU memory access.
As an example, each computing device 1000 may be configured to: bind the plurality of machine learning tasks to a set of its CPUs so that the set of CPUs executes the plurality of machine learning tasks using memory close to them. This realizes near allocation and access of memory and improves memory allocation and access efficiency. For example, binding the plurality of machine learning tasks to a set of CPUs (i.e., to a particular NUMA node of the computing device 1000) may be accomplished using a non-uniform memory access (NUMA) architecture.
As another example, each computing device 1000 may be configured to: configure a memory management unit for the plurality of machine learning tasks, so that the operating system and CPU of the computing device 1000 manage the memory used by the plurality of machine learning tasks in the configured memory management unit. By managing the memory used by the plurality of machine learning tasks with a larger memory management unit (i.e., a larger page size), the memory allocator can allocate a small number of large memory blocks more efficiently, and the operating system and CPU have fewer pages to manage, so management efficiency is higher.
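A hedged sketch of both measures on Linux is given below: the `numactl` command binds a training process to one NUMA node, and writing to the transparent-hugepage sysfs entry asks the kernel to use a larger page unit. The node number, command line, and "always" policy are illustrative, both operations normally require appropriate privileges, and neither is mandated by the patent.

```python
import subprocess

def launch_pinned(training_cmd, node: int = 0):
    """Run the multi-model training process bound to one NUMA node, so its CPUs
    allocate and access memory that is physically close to them."""
    return subprocess.Popen(
        ["numactl", f"--cpunodebind={node}", f"--membind={node}", *training_cmd]
    )

def enable_huge_pages():
    """Ask the kernel to back large allocations with huge pages, so the operating
    system and CPU manage memory in larger units (fewer pages to manage)."""
    with open("/sys/kernel/mm/transparent_hugepage/enabled", "w") as f:
        f.write("always")

# Illustrative usage (requires numactl; the hugepage setting requires root):
# launch_pinned(["python", "train_multi_model.py"])
```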
As an example, the parameter server 2000 may be configured to: configuring memory for the plurality of machine learning tasks.
As an example, the parameter server 2000 may be configured to: binding tasks related to the plurality of machine learning tasks with a set of Central Processing Units (CPUs) such that the set of CPUs use memory proximate thereto to perform tasks related to the plurality of machine learning tasks.
As an example, the parameter server 2000 may be configured to: a memory management unit is configured for the plurality of machine learning tasks, so that the operating system and the CPU of the parameter server 2000 manage the memory used by the tasks related to the plurality of machine learning tasks in the configured memory management unit.
FIG. 7 illustrates a flowchart of a method for performing a multi-machine learning task using a distributed system, according to an exemplary embodiment of the invention. The distributed system includes a plurality of computing devices.
As shown in fig. 7, in step S10, the plurality of computing devices respectively acquire different partial data of the specified data set.
By way of example, each computing device may request data from the specified data set from a data source and pre-process the requested data for storage locally.
At step S20, the plurality of computing devices collectively execute a plurality of machine learning tasks based on the acquired partial data, wherein each computing device executes the plurality of machine learning tasks in parallel based on the partial data acquired by itself, wherein the plurality of machine learning tasks are a plurality of model training tasks or a plurality of model prediction tasks.
As an example, each computing device may read locally-saved data and perform the plurality of machine learning tasks based on the read data.
As an example, step S10 may be executed first, and then step S20 is executed, that is, the computing device starts executing the plurality of machine learning tasks after storing all the partial data of the corresponding designated data set locally; as another example, steps S10 and S20 may be performed simultaneously, i.e., the computing device may obtain data for local storage while performing multiple machine learning tasks based on the locally stored data.
As an example, each computing device may execute, in parallel, based on each piece of data read, a machine learning task of the plurality of machine learning tasks that requires use of the piece of data.
As an example, each computing device may execute, in parallel, based on each piece of data read, using vectorization instructions, a machine learning task of the plurality of machine learning tasks that requires use of the piece of data.
As an example, the distributed system may further include a parameter server, and the method of performing multiple machine learning tasks using a distributed system according to an exemplary embodiment of the present invention may further include: the parameter server maintains parameters of a plurality of machine learning models involved in the plurality of machine learning tasks, wherein the parameters of the machine learning models take the form of key-value pairs, and the parameter server merges parameters of the plurality of machine learning models that share the same key into the form of a single key corresponding to multiple values and stores them, or the parameter server merges the parameters in this way and compresses the merged result according to a first compression mode before storing it.
As an example, the method of performing a multi-machine learning task using a distributed system according to an exemplary embodiment of the present invention may further include: when the plurality of machine learning tasks are the plurality of model training tasks, each computing device provides the merged results of training the plurality of machine learning models to the parameter server, or each computing device provides the merged results of training the plurality of machine learning models compressed in the second compression manner to the parameter server to cause the parameter server to update the parameters of the plurality of machine learning models, wherein the results are merged in a form that a single key corresponds to a plurality of values.
As an example, the method of performing multiple machine learning tasks using a distributed system according to an exemplary embodiment of the present invention may further include: the parameter server provides each computing device with the merged parameters of the plurality of machine learning models that the device requires, or the parameter server provides each computing device with those merged parameters compressed in a third compression mode.
As an example, the method of performing a multi-machine learning task using a distributed system according to an exemplary embodiment of the present invention may further include: when the plurality of machine learning tasks are the plurality of model training tasks, the parameter server stores intermediate calculation results which are generated when the computing device trains one machine learning model and can be used for other machine learning models in the process that each computing device executes the plurality of machine learning tasks, so that the intermediate calculation results are used for the other machine learning models.
As an example, only the number of training rounds in the hyper-parameters corresponding to the one machine learning model and the other machine learning models is different, where the number of training rounds corresponding to the one machine learning model is greater than the number of training rounds corresponding to the other machine learning models, and the parameter server may use, as the parameter of the other machine learning model, a parameter of the one machine learning model obtained when the number of training rounds reaches the number of training rounds corresponding to the other machine learning model in the process of training the one machine learning model by the computing device.
As an example, the method of performing a multi-machine learning task using a distributed system according to an exemplary embodiment of the present invention may further include: each computing device setting a network configuration used by the plurality of machine learning tasks; and/or the parameter server sets a network configuration for the plurality of machine learning tasks.
As an example, each computing device may use a zero-copy technique for network transmission for the plurality of machine learning tasks, and/or may set the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks.
As an example, the parameter server may use a zero-copy technique for network transmission for the plurality of machine learning tasks, and/or may set the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks.
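Zero-copy transmission and the maximum transmission unit are ordinary operating-system facilities rather than anything specific to this system; purely as an assumed illustration, a Linux node might apply them as below, using os.sendfile so the payload bypasses user-space buffers and the standard ip command to raise the interface MTU. The interface name eth0, the MTU value 9000, and the file-based transfer are assumptions for this sketch.

    import os
    import socket
    import subprocess

    def set_mtu(interface="eth0", mtu=9000):
        # Raise the maximum transmission unit; needs root and a network path
        # (NIC and switches) that actually supports jumbo frames.
        subprocess.run(["ip", "link", "set", "dev", interface, "mtu", str(mtu)], check=True)

    def send_file_zero_copy(path, host, port):
        # os.sendfile lets the kernel move data from the file descriptor to the
        # socket directly, without copying it through user-space buffers.
        with open(path, "rb") as f, socket.create_connection((host, port)) as conn:
            offset = 0
            remaining = os.fstat(f.fileno()).st_size
            while remaining > 0:
                sent = os.sendfile(conn.fileno(), f.fileno(), offset, remaining)
                offset += sent
                remaining -= sent

    # set_mtu("eth0", 9000)                           # run once per node, as root
    # send_file_zero_copy("/tmp/params.bin", "parameter-server.local", 9000)  # hypothetical paths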
As an example, the method of performing a multi-machine learning task using a distributed system according to an exemplary embodiment of the present invention may further include: each computing device configures the memory used by the plurality of machine learning tasks; and/or the parameter server configures memory for the plurality of machine learning tasks.
As an example, each computing device may bind the plurality of machine learning tasks to a set of Central Processing Units (CPUs) so that the set of CPUs uses the memory proximate to it to execute the plurality of machine learning tasks; and/or may configure a memory management unit for the plurality of machine learning tasks, so that the operating system and CPUs of the computing device manage the memory used by the plurality of machine learning tasks in units of the configured memory management unit.
As an example, the parameter server may configure a memory management unit for the plurality of machine learning tasks, so that the operating system and CPUs of the parameter server manage the memory used by the tasks related to the plurality of machine learning tasks in units of the configured memory management unit; and/or may bind the tasks related to the plurality of machine learning tasks to a set of Central Processing Units (CPUs) so that the set of CPUs uses the memory proximate to it to perform those tasks.
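For readers unfamiliar with CPU binding and NUMA-local memory, the sketch below shows one plausible (and assumed) way to apply the idea on Linux: os.sched_setaffinity pins the current process to a chosen CPU set, and launching a worker under numactl keeps both its CPUs and its allocations on the same NUMA node, i.e. the memory "proximate" to those CPUs. Larger page sizes, one reading of the "memory management unit" above, are normally enabled through the kernel's huge-page settings rather than application code. The CPU list, numactl options, and worker command are illustrative only.

    import os
    import subprocess

    def bind_tasks_to_cpus(cpu_ids=frozenset({0, 1, 2, 3})):
        # Pin the current process (and the machine learning tasks it runs)
        # to one group of CPUs so the scheduler keeps them there.
        os.sched_setaffinity(0, cpu_ids)

    def launch_numa_local_worker(command):
        # Run a worker so that both its CPUs and its memory come from NUMA
        # node 0, keeping memory accesses local to the bound CPUs.
        return subprocess.Popen(["numactl", "--cpunodebind=0", "--membind=0"] + list(command))

    bind_tasks_to_cpus()
    # launch_numa_local_worker(["python", "worker.py"])  # hypothetical worker entry point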
It should be understood that the steps involved in the above method may be performed by the computing device 1000 and/or the parameter server 2000 in the distributed system described above; the operations involved in these steps have been described in detail with reference to Figs. 1 to 6 and are not repeated here.
It should be understood that the parameter server and the computing devices in the distributed system according to the exemplary embodiment of the present invention, as well as the devices or units constituting them, may each be configured as software, hardware, firmware, or any combination thereof to perform specific functions. For example, these components may correspond to application-specific integrated circuits, to pure software code, or to modules combining software and hardware. When they are implemented in software, firmware, middleware, or microcode, the program code or code segments that perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and executing that code. Furthermore, one or more functions implemented by these components may also be performed collectively by components in a physical device (for example, a computing machine).
It should be noted that the distributed system according to the exemplary embodiment of the present invention may rely entirely on the execution of a computer program to realize the corresponding functions; that is, the respective components correspond to respective steps in the functional architecture of the computer program, so that the entire system is invoked through a dedicated software package (for example, a lib library) to realize the corresponding functions.
While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims (20)

1. A distributed system for performing multi-machine learning tasks, comprising:
a plurality of computing devices configured to respectively acquire different partial data of a specified data set and collectively perform a plurality of machine learning tasks;
wherein each computing device is configured to: executing the plurality of machine learning tasks in parallel based on the partial data acquired by itself, wherein the plurality of machine learning tasks are a plurality of model training tasks or a plurality of model prediction tasks;
and, each computing device is further configured to: on one hand, requesting data in the specified data set from a data source, preprocessing the requested data and storing the preprocessed data locally, and on the other hand, reading the locally stored data and executing the plurality of machine learning tasks based on the read data; wherein, for each piece of the read locally stored data, each computing device uses vectorization instructions to execute, in parallel, the machine learning tasks, among the plurality of machine learning tasks, that need to use that piece of data.
2. The distributed system of claim 1, further comprising:
a parameter server configured to maintain parameters of a plurality of machine learning models involved in the plurality of machine learning tasks, wherein the parameters of the machine learning models are in the form of key-value pairs,
wherein the parameter server is configured to: merging parameters of the plurality of machine learning models that share the same key into a form in which a single key corresponds to a plurality of values and then storing the merged result, or performing that same-key merging and then compressing the merged result according to a first compression manner before storing it.
3. The distributed system of claim 2, wherein when the plurality of machine learning tasks are the plurality of model training tasks,
each computing device is configured to: providing the merged results of training the plurality of machine learning models to a parameter server, or providing the merged results of training the plurality of machine learning models compressed in a second compression manner to the parameter server to cause the parameter server to update the parameters of the plurality of machine learning models,
wherein the results are merged in a form where a single key corresponds to multiple values.
4. The distributed system of claim 2,
the parameter server is configured to: providing each computing device with the merged parameters of the plurality of machine learning models required by that computing device, or providing each computing device with those merged parameters compressed in the third compression manner.
5. The distributed system of claim 2, wherein, when the plurality of machine learning tasks are the plurality of model training tasks,
the parameter server is configured to: during the process in which each computing device executes the plurality of machine learning tasks, saving intermediate computation results that are generated when the computing device trains one machine learning model and that can be used for other machine learning models, so that the intermediate computation results are used for the other machine learning models.
6. The distributed system of claim 5, wherein the hyper-parameters corresponding to the one machine learning model and the other machine learning models differ in only the number of training rounds, wherein the number of training rounds corresponding to the one machine learning model is greater than the number of training rounds corresponding to the other machine learning models,
wherein the parameter server is configured to: taking, as the parameters of the other machine learning models, the parameters of the one machine learning model obtained when, during the computing device's training of the one machine learning model, the number of training rounds reaches the number of training rounds corresponding to the other machine learning models.
7. The distributed system of claim 2,
each computing device is configured to: setting a network configuration used by the plurality of machine learning tasks;
and/or, the parameter server is configured to: setting a network configuration for the plurality of machine learning tasks.
8. The distributed system of claim 7, wherein each computing device is configured to: using a zero-copy technique for network transmission for the plurality of machine learning tasks; and/or setting the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks;
and/or, the parameter server is configured to: using a zero-copy technique for network transmission for the plurality of machine learning tasks; and/or setting the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks.
9. The distributed system of claim 2,
each computing device is configured to: configuring a memory used by the plurality of machine learning tasks;
and/or, the parameter server is configured to: configuring memory for the plurality of machine learning tasks.
10. The distributed system of claim 9, wherein each computing device is configured to: binding the plurality of machine learning tasks to a set of Central Processing Units (CPUs) such that the set of CPUs uses memory proximate thereto to execute the plurality of machine learning tasks; and/or configuring a memory management unit for the plurality of machine learning tasks, so that the operating system and CPUs of the computing device manage the memory used by the plurality of machine learning tasks in units of the configured memory management unit;
and/or, the parameter server is configured to: configuring a memory management unit for the plurality of machine learning tasks, so that the operating system and CPUs of the parameter server manage the memory used by the tasks related to the plurality of machine learning tasks in units of the configured memory management unit; and/or binding the tasks related to the plurality of machine learning tasks to a set of Central Processing Units (CPUs) such that the set of CPUs uses memory proximate thereto to perform those tasks.
11. A method for performing a multi-machine learning task using a distributed system, wherein the distributed system comprises a plurality of computing devices, wherein the method comprises:
the plurality of computing devices respectively acquire different partial data of the designated data set;
the plurality of computing devices collectively performing a plurality of machine learning tasks based on the acquired partial data, wherein each computing device performs the plurality of machine learning tasks in parallel based on the partial data acquired by itself,
and the step of the plurality of computing devices respectively acquiring different partial data of the specified data set comprises: each computing device requesting data in the specified data set from a data source; and each computing device preprocessing the requested data and storing the preprocessed data locally;
and the plurality of machine learning tasks are a plurality of model training tasks or a plurality of model prediction tasks;
and the step of each computing device executing the plurality of machine learning tasks in parallel based on the partial data acquired by itself comprises: each computing device reading the locally stored data and executing the plurality of machine learning tasks based on the read data, wherein, for each piece of the read locally stored data, each computing device uses vectorization instructions to execute, in parallel, the machine learning tasks, among the plurality of machine learning tasks, that need to use that piece of data.
12. The method of claim 11, wherein the distributed system further comprises a parameter server, wherein the method further comprises:
a parameter server maintains parameters of a plurality of machine learning models involved in the plurality of machine learning tasks, wherein the parameters of the machine learning models are in the form of key-value pairs,
the parameter server merges parameters of the plurality of machine learning models that share the same key into a form in which a single key corresponds to a plurality of values and then stores the merged result, or the parameter server performs that same-key merging and then compresses the merged result according to a first compression manner before storing it.
13. The method of claim 12, further comprising:
when the plurality of machine learning tasks are the plurality of model training tasks, each computing device provides the merged results of training the plurality of machine learning models to the parameter server, or each computing device provides the merged results of training the plurality of machine learning models compressed according to the second compression method to the parameter server, so that the parameter server updates the parameters of the plurality of machine learning models,
wherein the results are merged in a form where a single key corresponds to multiple values.
14. The method of claim 12, further comprising:
the parameter server merges the parameters of the plurality of machine learning models required by each computing device and provides the merged parameters to that computing device, or compresses the merged parameters according to the third compression manner and then provides them to that computing device.
15. The method of claim 12, further comprising:
when the plurality of machine learning tasks are the plurality of model training tasks, the parameter server stores intermediate calculation results which are generated when the computing device trains one machine learning model and can be used for other machine learning models in the process that each computing device executes the plurality of machine learning tasks, so that the intermediate calculation results are used for the other machine learning models.
16. The method of claim 15, wherein the hyper-parameters of the one machine learning model and the other machine learning models differ in only a number of training rounds, wherein the number of training rounds for the one machine learning model is greater than the number of training rounds for the other machine learning models,
and the parameter server takes, as the parameters of the other machine learning models, the parameters of the one machine learning model obtained when, during the computing device's training of the one machine learning model, the number of training rounds reaches the number of training rounds corresponding to the other machine learning models.
17. The method of claim 12, further comprising:
each computing device setting a network configuration used by the plurality of machine learning tasks;
and/or the parameter server sets a network configuration for the plurality of machine learning tasks.
18. The method of claim 17, wherein the step of each computing device setting a network configuration used by the plurality of machine learning tasks comprises: each computing device using a zero-copy technique for network transmission for the plurality of machine learning tasks; and/or setting the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks;
wherein the step of the parameter server setting a network configuration for the plurality of machine learning tasks comprises: the parameter server using a zero-copy technique for network transmission for the plurality of machine learning tasks; and/or setting the size of the maximum transmission unit used in network transmission for the plurality of machine learning tasks.
19. The method of claim 12, further comprising:
each computing device configuring memory used by the plurality of machine learning tasks;
and/or the parameter server configures memory for the plurality of machine learning tasks.
20. The method of claim 19, wherein configuring, by each computing device, memory used by the plurality of machine learning tasks comprises: each computing device binding the plurality of machine learning tasks with a set of Central Processing Units (CPUs) such that the set of CPUs use memory proximate thereto to execute the plurality of machine learning tasks; and/or configuring a memory management unit for the plurality of machine learning tasks, so that an operating system and a CPU of the computing device manage memories used by the plurality of machine learning tasks in the configured memory management unit;
wherein the step of the parameter server configuring the memory for the plurality of machine learning tasks comprises: the parameter server configures memory management units for the plurality of machine learning tasks, so that an operating system and a CPU of the parameter server manage memories used by tasks related to the plurality of machine learning tasks by the configured memory management units; and/or binding tasks related to the plurality of machine learning tasks with a set of Central Processing Units (CPUs) such that the set of CPUs use memory proximate thereto to perform tasks related to the plurality of machine learning tasks.
CN201910759163.0A 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof Active CN110472747B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210960020.8A CN115345318A (en) 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof
CN201910759163.0A CN110472747B (en) 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910759163.0A CN110472747B (en) 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210960020.8A Division CN115345318A (en) 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof

Publications (2)

Publication Number Publication Date
CN110472747A CN110472747A (en) 2019-11-19
CN110472747B true CN110472747B (en) 2022-07-05

Family

ID=68510974

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210960020.8A Pending CN115345318A (en) 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof
CN201910759163.0A Active CN110472747B (en) 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202210960020.8A Pending CN115345318A (en) 2019-08-16 2019-08-16 Distributed system for executing multi-machine learning task and method thereof

Country Status (1)

Country Link
CN (2) CN115345318A (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988366A (en) * 2019-12-12 2021-06-18 中科寒武纪科技股份有限公司 Parameter server, master client, and weight parameter processing method and system
CN111339553A (en) * 2020-02-14 2020-06-26 云从科技集团股份有限公司 Task processing method, system, device and medium
CN111340240A (en) * 2020-03-25 2020-06-26 第四范式(北京)技术有限公司 Method and device for realizing automatic machine learning
CN111680799B (en) 2020-04-08 2024-02-20 北京字节跳动网络技术有限公司 Method and device for processing model parameters
CN113741868B (en) * 2020-05-29 2024-05-24 腾讯科技(深圳)有限公司 Service computing task processing method, device, computer equipment and storage medium
CN114385256B (en) * 2020-10-22 2024-06-11 华为云计算技术有限公司 Configuration method and configuration device of system parameters
CN112257874A (en) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 Machine learning method, device and system of distributed machine learning system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650786A (en) * 2016-11-14 2017-05-10 沈阳工业大学 Image recognition method based on multi-column convolutional neural network fuzzy evaluation
CN107169513A * 2017-05-05 2017-09-15 第四范式(北京)技术有限公司 Distributed machine learning system for controlling the order of data use and method thereof

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102508643A (en) * 2011-11-16 2012-06-20 刘大可 Multicore-parallel digital signal processor and method for operating parallel instruction sets
CN105550374A (en) * 2016-01-29 2016-05-04 湖南大学 Random forest parallelization machine studying method for big data in Spark cloud service environment
CN107025205B (en) * 2016-01-30 2021-06-22 华为技术有限公司 Method and equipment for training model in distributed system
CN106527968A (en) * 2016-09-21 2017-03-22 苏州市广播电视总台 File through technology-based file transmission method
CN111079942B (en) * 2017-08-30 2023-03-24 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
CN109447274B (en) * 2017-08-30 2021-02-09 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
US20190102675A1 (en) * 2017-09-29 2019-04-04 Coupa Software Incorporated Generating and training machine learning systems using stored training datasets
US10789240B2 (en) * 2017-11-06 2020-09-29 Google Llc Duplicative data detection
US11348018B2 (en) * 2017-12-19 2022-05-31 Aspen Technology, Inc. Computer system and method for building and deploying models predicting plant asset failure

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650786A (en) * 2016-11-14 2017-05-10 沈阳工业大学 Image recognition method based on multi-column convolutional neural network fuzzy evaluation
CN107169513A * 2017-05-05 2017-09-15 第四范式(北京)技术有限公司 Distributed machine learning system for controlling the order of data use and method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Parallel machine learning using concurrency control; Xinghao Pan; Technical Reports; 20170808; pp. 1-164 *
Research on clustering and backbone network extraction methods for learning classifier systems under an unsupervised learning framework; 钱利强 (Qian Liqiang); China Excellent Master's and Doctoral Dissertations Full-text Database (Doctoral), Information Science and Technology Series; 20170315 (No. 03); pp. I140-26 *

Also Published As

Publication number Publication date
CN115345318A (en) 2022-11-15
CN110472747A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
CN110472747B (en) Distributed system for executing multi-machine learning task and method thereof
CN110262901B (en) Data processing method and data processing system
US9367359B2 (en) Optimized resource management for map/reduce computing
US10268741B2 (en) Multi-nodal compression techniques for an in-memory database
JP6898778B2 (en) Machine learning system and machine learning method
US9836248B2 (en) In-memory data compression complementary to host data compression
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
US10628261B2 (en) Checkpoint and restart
US20110072439A1 (en) Decoding device, recording medium, and decoding method for coded data
CN116360972A (en) Resource management method, device and resource management platform
US10102098B2 (en) Method and system for recommending application parameter setting and system specification setting in distributed computation
CN111831330A (en) Heterogeneous computing system device interaction scheme for federated learning
CN107204998B (en) Method and device for processing data
CN114118433A (en) Recommendation method and device for configuration parameters of equipment
CN113037800B (en) Job scheduling method and job scheduling device
CN110222410B (en) Electromagnetic environment simulation method based on Hadoop MapReduce
CN112256653B (en) Data sampling method and device
CN110059024A (en) A kind of memory space data caching method and device
CN116680063A (en) Task scheduling method, device, computing system, electronic equipment and storage medium
CN108347341A (en) A kind of acceleration capability adjustment method and device for adjusting virtual machine acceleration capability
Peñaranda et al. Exploring the use of data compression for accelerating machine learning in the edge with remote virtual graphics processing units
CN105138289A (en) Storage management method and device for computation module
Fuentes-Alventosa et al. Cuvle: Variable-length encoding on cuda
CN112749111A (en) Method, computing device and computer system for accessing data
US11368521B1 (en) Utilizing reinforcement learning for serverless function tuning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant