CN113255931B - Method and device for adjusting configuration parameters in model training process


Info

Publication number
CN113255931B
Authority
CN
China
Prior art keywords
node
adjusting
working node
adjusted
index
Prior art date
Legal status
Active
Application number
CN202110599044.0A
Other languages
Chinese (zh)
Other versions
CN113255931A (en)
Inventor
朱海洋
周俊
陈为
严凡
钱中昊
毛科添
Current Assignee
Zhongda Group Co ltd
Zhejiang University ZJU
Original Assignee
Zhongda Group Co ltd
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhongda Group Co ltd, Zhejiang University ZJU filed Critical Zhongda Group Co ltd
Priority to CN202110599044.0A
Publication of CN113255931A
Application granted
Publication of CN113255931B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061: Partitioning or combining of resources
    • G06F 9/5072: Grid computing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network

Abstract

This specification provides a method and a device for adjusting configuration parameters during model training. According to the method, at least one node metric and at least one system metric of a working node are collected while the model is being trained; a first configuration parameter to be adjusted is determined based on the node metrics; a first adjustment policy corresponding to that parameter is determined; and the first configuration parameter is adjusted according to the first adjustment policy. The system metrics are sent to a target device, which determines a second configuration parameter to be adjusted and a second adjustment policy corresponding to it; the working node receives the second adjustment policy returned by the target device and adjusts the second configuration parameter accordingly. In this way, configuration parameters are adaptively adjusted during model training, improving both the efficiency and the performance of training.

Description

Method and device for adjusting configuration parameters in model training process
Technical Field
One or more embodiments of the present disclosure relate to the field of machine learning technologies, and in particular, to a method and an apparatus for adjusting configuration parameters during a model training process.
Background
Currently, training a model with a distributed machine learning system requires a large number of parameters to be configured in advance, and these parameters strongly affect the efficiency and performance of training. However, once a training task has started running, parameters that turn out to be poorly chosen cannot be changed while the model is being trained.
Disclosure of Invention
In order to solve one of the above technical problems, one or more embodiments of the present disclosure provide a method and an apparatus for adjusting configuration parameters during a model training process.
According to a first aspect, there is provided a method for adjusting configuration parameters during model training, applied to a working node in a distributed machine learning system, the method comprising:
collecting, during model training, at least one node metric and at least one system metric of the working node, where a node metric characterizes the current training state of the working node and a system metric affects the current performance of the system;
determining, based on the node metrics, a first configuration parameter to be adjusted;
determining a first adjustment policy corresponding to the first configuration parameter;
adjusting the first configuration parameter according to the first adjustment policy;
sending the system metrics to a target device, so that the target device determines a second configuration parameter to be adjusted and a second adjustment policy corresponding to it;
receiving the second adjustment policy returned by the target device; and
adjusting the second configuration parameter according to the second adjustment policy.
Optionally, determining the first configuration parameter to be adjusted based on the node metrics includes:
judging, for each of the at least one node metric, whether the local configuration parameter corresponding to that metric needs to be adjusted; and
taking the local configuration parameters that need adjustment as the first configuration parameters to be adjusted.
Optionally, for any node metric, whether its corresponding local configuration parameter needs to be adjusted is determined as follows:
obtaining the normal value range preset for the metric; and
if the value of the metric does not fall within the normal range, determining that the local configuration parameter corresponding to the metric needs to be adjusted.
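The range check above can be sketched as follows; the metric names, the ranges, and the metric-to-parameter mapping are illustrative assumptions, not values taken from the patent.

```python
# Hypothetical normal value ranges preset for two node metrics.
NORMAL_RANGES = {
    "cpu_utilization": (0.0, 0.85),       # flag when CPU usage exceeds 85%
    "gradient_snr": (0.1, float("inf")),  # flag when gradient SNR drops below 0.1
}

# Mapping from a node metric to the local configuration parameter it affects.
METRIC_TO_PARAM = {
    "cpu_utilization": "data_processing_mode",
    "gradient_snr": "batch_size",
}

def params_needing_adjustment(node_metrics: dict) -> list:
    """Return the local configuration parameters whose metrics fall outside the normal range."""
    flagged = []
    for name, value in node_metrics.items():
        low, high = NORMAL_RANGES[name]
        if not (low <= value <= high):
            flagged.append(METRIC_TO_PARAM[name])
    return flagged
```

A metric inside its range triggers nothing; only out-of-range metrics contribute their associated parameter to the adjustment list.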
Optionally, the at least one node metric includes any one or more of the following:
the CPU utilization of the working node;
the memory utilization of the working node;
the network utilization of the working node;
the disk utilization of the working node;
the signal-to-noise ratio of the gradients computed by the working node while training the model;
the variance of the gradients computed by the working node while training the model; and
the change in loss computed by the working node while training the model.
Optionally, the at least one system metric includes any one or more of the following:
the resource utilization of the working node;
the computational performance of the working node;
the communication performance of the working node; and
the running speed of the working node.
According to a second aspect, there is provided an apparatus for adjusting configuration parameters during model training, applied to a working node in a distributed machine learning system, the apparatus comprising:
a collection module, configured to collect, during model training, at least one node metric and at least one system metric of the working node, where a node metric characterizes the current training state of the working node and a system metric affects the current performance of the system;
a determining module, configured to determine, based on the node metrics, a first configuration parameter to be adjusted;
a policy module, configured to determine a first adjustment policy corresponding to the first configuration parameter;
a first adjusting module, configured to adjust the first configuration parameter according to the first adjustment policy;
a sending module, configured to send the system metrics to a target device;
a receiving module, configured to receive a second adjustment policy returned by the target device; and
a second adjusting module, configured to adjust, according to the second adjustment policy, the second configuration parameter determined by the target device.
Optionally, the determining module is configured to:
judge, for each of the at least one node metric, whether the local configuration parameter corresponding to that metric needs to be adjusted; and
take the local configuration parameters that need adjustment as the first configuration parameters to be adjusted.
Optionally, the node metrics include any one or more of the following:
the CPU utilization of the working node;
the memory utilization of the working node;
the network utilization of the working node;
the disk utilization of the working node;
the signal-to-noise ratio of the gradients computed by the working node while training the model;
the variance of the gradients computed by the working node while training the model; and
the change in loss computed by the working node while training the model.
Optionally, the system metrics include any one or more of the following:
the resource utilization of the working node;
the computational performance of the working node;
the communication performance of the working node; and
the running speed of the working node.
According to a third aspect, there is provided an apparatus for adjusting configuration parameters during model training, applied in a distributed machine learning system comprising a plurality of working nodes and a service node. The apparatus comprises a plurality of monitoring modules, a plurality of communication modules, a plurality of first adjusting modules, and a second adjusting module; each working node is deployed with one monitoring module, one communication module, and one first adjusting module, and the service node is deployed with the second adjusting module.
For any working node:
the monitoring module deployed on the working node collects, during model training, at least one node metric and at least one system metric of the working node, passes the node metrics to the first adjusting module on the working node, and passes the system metrics to the communication module on the working node; a node metric characterizes the current training state of the working node, and a system metric affects the current performance of the system;
the communication module deployed on the working node sends the system metrics received from the monitoring module to the second adjusting module on the service node, receives the second adjustment policy returned by the second adjusting module, and passes it to the first adjusting module on the working node;
the first adjusting module deployed on the working node determines, based on the node metrics received from the monitoring module, a first configuration parameter to be adjusted, determines a first adjustment policy corresponding to it, adjusts the first configuration parameter according to that policy, and adjusts the second configuration parameter according to the second adjustment policy received from the communication module;
the second adjusting module receives the system metrics sent by each communication module, determines, based on those metrics, a second and a third configuration parameter to be adjusted, determines a second adjustment policy corresponding to the second configuration parameter and a third adjustment policy corresponding to the third, returns the second adjustment policy to each communication module, and adjusts the third configuration parameter according to the third adjustment policy.
According to a fourth aspect, there is provided a computer readable storage medium, storing a computer program which, when executed by a processor, implements the method of any of the first aspects above.
According to a fifth aspect, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the first aspects when executing the program.
The technical solutions provided by the embodiments of this specification can have the following beneficial effects:
In the method and device provided by the embodiments of this specification, each working node in the distributed machine learning system collects its own node metrics and system metrics. Based on its node metrics, each working node determines its first configuration parameter to be adjusted and the corresponding first adjustment policy, and adjusts that parameter accordingly. Based on the system metrics of all working nodes, the service node determines a second and a third configuration parameter to be adjusted, along with their corresponding second and third adjustment policies. Each working node adjusts its second configuration parameter according to the second adjustment policy, and the service node adjusts the third configuration parameter according to the third adjustment policy. In this way, configuration parameters are adaptively adjusted during model training, improving both the efficiency and the performance of training.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram illustrating a scenario in which configuration parameters are adjusted during a model training process according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of adjusting configuration parameters during model training according to an exemplary embodiment;
FIG. 3 is a block diagram of an apparatus for adjusting configuration parameters during model training, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatuses and methods consistent with some aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
Fig. 1 is a schematic diagram illustrating a scenario for adjusting configuration parameters during a model training process according to an exemplary embodiment, where a system architecture of the scenario is a distributed machine learning system architecture.
The distributed machine learning system shown in fig. 1 includes one service node and a plurality of working nodes. A working node obtains the current parameters from the service node and selects a portion of the training data from its locally stored training set. It then computes a local gradient based on those parameters and the selected data, and uploads the local gradient to the service node. The service node receives the local gradients uploaded by the working nodes and updates the model parameters by combining them. The model is trained through many iterations of this process.
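The pull-compute-push iteration above can be sketched as a toy parameter-server loop. The class names, the linear-regression objective, and all hyperparameters are illustrative assumptions used to keep the example self-contained; they are not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

class ServiceNode:
    """Toy parameter server: holds model parameters and averages worker gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr
    def pull(self):
        return self.params.copy()
    def push(self, gradients):
        # Combine the local gradients by averaging, then apply one SGD step.
        self.params -= self.lr * np.mean(gradients, axis=0)

class WorkerNode:
    """Toy worker: samples a mini-batch locally and computes a local gradient."""
    def __init__(self, X, y):
        self.X, self.y = X, y
    def local_gradient(self, params, batch_size=4):
        idx = rng.choice(len(self.X), batch_size, replace=False)
        Xb, yb = self.X[idx], self.y[idx]
        residual = Xb @ params - yb          # linear-regression residual
        return Xb.T @ residual / batch_size  # gradient of 0.5 * mean squared error

# One training iteration: workers pull parameters, compute local gradients,
# and the service node aggregates them and updates the model.
X = rng.normal(size=(32, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
server = ServiceNode(dim=3)
workers = [WorkerNode(X, y) for _ in range(4)]
for _ in range(200):
    grads = [w.local_gradient(server.pull()) for w in workers]
    server.push(grads)
```

With noiseless targets the aggregated updates contract the error, so the server's parameters approach the generating weights after a few hundred iterations.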
In this embodiment, each working node is deployed with a monitoring module, a communication module, and a first adjusting module, and the service node is deployed with a second adjusting module. For each working node, at a preset interval during model training (for example, every 10 minutes or every hour), the monitoring module deployed on the working node collects at least one node metric (a metric characterizing the current training state of the working node) and at least one system metric (a metric affecting the current performance of the distributed machine learning system). It passes the node metrics to the first adjusting module on the working node and the system metrics to the communication module on the working node.
On the one hand, the first adjusting module on the working node determines, based on the node metrics, a first configuration parameter to be adjusted, which may be a local configuration parameter of the working node. It then determines a first adjustment policy corresponding to that parameter and adjusts the parameter according to the policy.
On the other hand, the communication module on the working node sends the system metrics received from the monitoring module to the second adjusting module on the service node. The second adjusting module receives the system metrics from the communication modules of all working nodes and determines, based on them, a second configuration parameter to be adjusted, which may be a local configuration parameter of a working node or a system-wide configuration parameter. It then determines a second adjustment policy corresponding to that parameter and returns the policy to the communication module of each working node; the second adjustment policies of different working nodes may be the same or different. Each communication module receives the second adjustment policy and passes it to the first adjusting module of its working node, which adjusts the second configuration parameter according to the policy.
In yet another aspect, the second adjusting module on the service node may further determine a third configuration parameter to be adjusted, which may be a system-wide configuration parameter, determine a third adjustment policy corresponding to it, and adjust the parameter according to that policy.
In the scheme for adjusting configuration parameters provided in this embodiment, each working node in the distributed machine learning system collects its own node metrics and system metrics. Based on its node metrics, each working node determines its first configuration parameter to be adjusted and the corresponding first adjustment policy, and adjusts that parameter accordingly. Based on the system metrics of all working nodes, the service node determines a second and a third configuration parameter to be adjusted, along with their corresponding second and third adjustment policies. Each working node adjusts its second configuration parameter according to the second adjustment policy, and the service node adjusts the third configuration parameter according to the third adjustment policy. In this way, configuration parameters are adaptively adjusted during model training, improving both the efficiency and the performance of training.
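The division of labor among the four modules can be sketched end to end. Every class, field, threshold, and policy below is an assumption for illustration; the patent specifies the module roles and data flow, not these concrete rules.

```python
class Monitor:
    """Collects the node metrics and system metrics of its working node."""
    def collect(self, worker):
        # In a real system these would be sampled from the OS and training loop.
        return worker.node_metrics, worker.system_metrics

class FirstAdjuster:
    """Decides local (first) adjustments and applies centrally decided (second) ones."""
    def adjust_local(self, worker, node_metrics):
        if node_metrics.get("cpu_utilization", 0.0) > 0.85:  # illustrative rule
            worker.config["data_processing_mode"] = "dropout"
    def apply(self, worker, policy):
        worker.config.update(policy)

class SecondAdjuster:
    """Runs on the service node; sees every worker's system metrics."""
    def decide(self, all_system_metrics):
        # Illustrative rule: slow workers get a smaller batch size.
        return {
            wid: ({"batch_size": 32} if m["speed"] < 0.5 else {})
            for wid, m in all_system_metrics.items()
        }

class Worker:
    def __init__(self, wid, node_metrics, system_metrics):
        self.wid = wid
        self.node_metrics = node_metrics
        self.system_metrics = system_metrics
        self.config = {"batch_size": 64, "data_processing_mode": "normal"}

workers = [
    Worker("w0", {"cpu_utilization": 0.9}, {"speed": 0.4}),
    Worker("w1", {"cpu_utilization": 0.3}, {"speed": 0.9}),
]
monitor, first, second = Monitor(), FirstAdjuster(), SecondAdjuster()

gathered = {}
for w in workers:
    node_m, sys_m = monitor.collect(w)
    first.adjust_local(w, node_m)      # first policy, decided locally
    gathered[w.wid] = sys_m            # communication module sends metrics upstream
policies = second.decide(gathered)     # second policies, decided centrally
for w in workers:
    first.apply(w, policies[w.wid])    # second policy applied by the first adjuster
```

Node metrics never leave the worker, while system metrics flow to the service node; this mirrors the split between local and centrally decided adjustments described above.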
The embodiments provided in the present specification will be described in detail with reference to specific examples.
Fig. 2 is a flow diagram illustrating a method for adjusting configuration parameters during model training according to an exemplary embodiment. The method may be applied to a working node in a distributed machine learning system, where the working node may be any device, platform, or server with computing and processing capabilities. The method comprises the following steps:
in step 201, at least one node metric and at least one system metric of the working node are collected during model training.
In this embodiment, the node metric index of the working node is an index representing the current training state of the working node. The system metric index of the working node is an index which influences the current performance of the distributed machine learning system.
The at least one node metric may include any one or more of the following: the CPU utilization of the working node; its memory utilization; its network utilization; its disk utilization; the signal-to-noise ratio of the gradients computed by the working node while training the model; the variance of those gradients; and the change in loss computed during training.
The at least one system metric may include any one or more of the following: the resource utilization of the working node; its computational performance; its communication performance; and its running speed.
In step 202, a first configuration parameter to be adjusted is determined based on the at least one node metric.
In this embodiment, for each of the at least one node metric, it may be determined whether the local configuration parameter corresponding to that metric needs to be adjusted, and the local configuration parameters that need adjustment are taken as the first configuration parameters to be adjusted. A local configuration parameter of a working node is a configuration parameter set for that node independently of other working nodes; local configuration parameters may include, but are not limited to, the sample batch size, the learning rate, the floating-point precision, and the weight-decay coefficient.
In one implementation, whether the local configuration parameter corresponding to a node metric needs to be adjusted may be determined as follows: obtain the normal value range preset for the metric; if the value of the metric does not fall within that range, the corresponding local configuration parameter needs to be adjusted. The local configuration parameter corresponding to a node metric is a local configuration parameter that affects that metric.
For example, suppose the node metric is the signal-to-noise ratio of the gradients computed by the working node while training the model. A normal value range of "greater than or equal to m" can be preset for this ratio; if the collected value is smaller than m, it falls outside the normal range, and the local configuration parameter corresponding to the gradient signal-to-noise ratio, which may be the sample batch size, needs to be adjusted. The gradient signal-to-noise ratio can be calculated by the following formula:
B_noise = ( (1/B) * Σ_{b=1..B} ‖G_b − G‖² ) / ‖G‖²
where B_noise represents the gradient signal-to-noise ratio, G_b represents the gradient of the b-th sample in the current batch, B the number of samples in the current batch, and G the average of the accumulated gradients of the batches before the current batch.
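The formula itself is only present as an image in the source, so the computation below is a plausible reconstruction consistent with the symbol definitions above: the per-sample gradient noise around the running average gradient, divided by the magnitude of that average. The function name and arguments are assumed.

```python
import numpy as np

def gradient_snr(sample_grads: np.ndarray, g_prev: np.ndarray) -> float:
    """B_noise = mean_b ||G_b - G||^2 / ||G||^2.

    sample_grads: (B, d) array of per-sample gradients G_b in the current batch.
    g_prev:       (d,) running average G of the gradients of previous batches.
    """
    noise = np.mean(np.sum((sample_grads - g_prev) ** 2, axis=1))
    signal = np.sum(g_prev ** 2)
    return float(noise / signal)
```

A large value means the per-sample gradients scatter widely around the average gradient, which is the regime where increasing the batch size helps.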
In another implementation, whether the local configuration parameter corresponding to a node metric needs to be adjusted may be determined as follows: determine the preset reference value and the actual value of the local configuration parameter corresponding to the metric; if they differ, the parameter needs to be adjusted.
In step 203, a first adjustment policy corresponding to the first configuration parameter is determined, and the parameter is adjusted according to that policy.
In this embodiment, the first adjustment policy may be determined based on the node metrics of the working node, and a policy may be set in advance for each first configuration parameter.
For example, suppose the node metric is the CPU utilization of the working node and the corresponding first configuration parameter is the data-processing mode. It can be preset that data is processed in the normal manner while the CPU utilization does not exceed a threshold, and processed with dropout once it does. If the CPU utilization actually exceeds the threshold, the corresponding first adjustment policy is to switch the data-processing mode from the normal manner to dropout.
In step 204, the system metrics are sent to the target device, a second adjustment policy returned by the target device is received, and the second configuration parameter is adjusted according to that policy.
In this embodiment, the working node may send its system metrics to a target device, which may be the service node or another designated working node. The target device determines a second configuration parameter to be adjusted and a second adjustment policy corresponding to it. The working node receives the second adjustment policy returned by the target device and adjusts the second configuration parameter accordingly.
Specifically, the target device may determine the second configuration parameter based on some of the system metrics sent by the working nodes. For example, for a system metric sent by any working node, the target device obtains the normal value range preset for that metric; if the value does not fall within the range, the local configuration parameter of that working node corresponding to the metric needs to be adjusted. That local configuration parameter is taken as the second configuration parameter of the working node, and the corresponding second adjustment policy is returned to the node. The local configuration parameter corresponding to a system metric is a local configuration parameter that affects that metric.
In addition, the target device may determine a third configuration parameter to be adjusted based on some of the system metrics sent by the working nodes, determine a third adjustment policy corresponding to it, and adjust the parameter according to that policy. For example, the same system metric from each working node may be aggregated into an overall performance index of the system; if the value of that index does not fall within its normal range, the system configuration parameter corresponding to it, i.e., a system configuration parameter that affects the overall performance index, needs to be adjusted. That parameter is taken as the third configuration parameter and adjusted according to the third adjustment policy.
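The aggregation step above can be sketched briefly; using the mean as the overall index and the particular metric name and range are illustrative assumptions.

```python
def overall_index(system_metrics_per_worker: dict, metric: str) -> float:
    """Aggregate the same system metric across workers into one overall index (here: the mean)."""
    values = [m[metric] for m in system_metrics_per_worker.values()]
    return sum(values) / len(values)

def system_param_needs_adjustment(index: float, normal_range=(0.5, 1.0)) -> bool:
    """Flag the system configuration parameter when the overall index leaves its normal range."""
    low, high = normal_range
    return not (low <= index <= high)

metrics = {"w0": {"speed": 0.4}, "w1": {"speed": 0.8}, "w2": {"speed": 0.3}}
idx = overall_index(metrics, "speed")
```

The per-worker metric values feed a single system-level decision, in contrast to the per-node checks, which act on each worker's own configuration.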
For this embodiment, a specific application scenario may be as follows. For each working node in the distributed machine learning system, the CPU utilization of the working node is collected as a node metric index; if the CPU utilization is greater than a preset CPU utilization, it is determined that the data processing mode needs to be adjusted, namely from a normal mode to a dropout mode.

The memory utilization of the working node is collected as a node metric index; if the memory utilization is greater than a preset memory utilization, it is determined that the floating-point precision needs to be adjusted, namely from a double-precision floating-point type to a single-precision floating-point type.

The network utilization of the working node is collected as a node metric index; if the network utilization is greater than a preset network utilization, it is determined that the data transmission mode needs to be adjusted, namely from full data transmission to randomly discarding part of the data.

The disk utilization of the working node is collected as a node metric index; if the disk utilization is greater than a preset disk utilization, it is determined that parameters affecting the amount of I/O (input/output) need to be adjusted. For example, the frequency of outputting logs is reduced to lower the disk utilization of the working node.

The gradient signal-to-noise ratio calculated by the working node when training the model is collected as a node metric index; if the gradient signal-to-noise ratio is greater than a preset signal-to-noise ratio, it is determined that the sample batch size needs to be adjusted, namely increased.

The loss variation calculated by the working node when training the model is collected as a node metric index; if the loss variation is greater than a preset loss variation, it is determined that the learning rate needs to be adjusted, namely increased.
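The per-worker rules above amount to a table of thresholds and actions. The sketch below illustrates that mapping; all threshold values and parameter names are hypothetical assumptions for illustration, not values specified by the patent.

```python
# Hypothetical thresholds for each node metric index.
THRESHOLDS = {
    "cpu_util": 0.90,    # -> switch data processing to dropout mode
    "mem_util": 0.85,    # -> drop double precision to single precision
    "net_util": 0.80,    # -> randomly discard part of the transmitted data
    "disk_util": 0.90,   # -> reduce log-output frequency
    "grad_snr": 5.0,     # -> increase sample batch size
    "loss_delta": 0.5,   # -> increase learning rate
}

# Local configuration parameter and adjustment strategy per metric.
ACTIONS = {
    "cpu_util": ("data_processing_mode", "dropout"),
    "mem_util": ("float_precision", "float32"),
    "net_util": ("transmission_mode", "random_partial_drop"),
    "disk_util": ("log_frequency", "reduce"),
    "grad_snr": ("batch_size", "increase"),
    "loss_delta": ("learning_rate", "increase"),
}

def first_adjustments(node_metrics: dict) -> dict:
    """Map out-of-range node metric indexes to local configuration adjustments."""
    plan = {}
    for metric, value in node_metrics.items():
        if value > THRESHOLDS[metric]:
            param, strategy = ACTIONS[metric]
            plan[param] = strategy
    return plan

# CPU is overloaded and the gradient SNR is high; memory is fine.
print(first_adjustments({"cpu_util": 0.95, "mem_util": 0.60, "grad_snr": 7.0}))
```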
For each working node in the distributed machine learning system, the resource utilization of the working node is collected as a system metric index. The service node aggregates the resource utilization of all working nodes; if the system resource utilization is greater than a preset resource utilization, it is determined that the number of working nodes needs to be adjusted, namely increased.

For each working node, the computing performance of the working node is collected as a system metric index. The service node aggregates the computing performance of all working nodes to obtain the overall computing performance of the system; if the overall computing performance is less than a preset computing performance, it is determined that the task execution subject needs to be adjusted, namely that the task is migrated to a GPU.

For each working node, the communication performance of the working node is collected as a system metric index. The service node aggregates the communication performance of all working nodes to obtain the overall communication performance of the system; if the overall communication performance is less than a preset communication performance, it is determined that the communication structure of the system needs to be adjusted, namely from a tree topology to a ring topology.

For each working node, the running speed of the working node is collected as a system metric index, and the service node compares the running speeds of all working nodes. If the running speed of working node A is less than a preset running speed, it is determined that the sample batch size of working node A needs to be adjusted, namely reduced.
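The service-node side can be sketched as an aggregate-then-decide step: combine one system metric index across workers, compare against a preset bound, and emit system-level adjustments. The sketch below is illustrative; aggregation modes, presets, and parameter names are assumptions, not taken from the patent.

```python
def aggregate_and_decide(per_worker: dict, preset: float, mode: str = "mean"):
    """Aggregate one system metric across workers and return (system_value, adjustments).

    per_worker maps worker id -> metric value; `preset` is the normal bound.
    mode="mean" models overall utilization; mode="min" models straggler detection.
    """
    values = list(per_worker.values())
    system_value = sum(values) / len(values) if mode == "mean" else min(values)
    adjustments = []
    if mode == "mean" and system_value > preset:
        # e.g. overall resource utilization too high -> add working nodes
        adjustments.append(("worker_count", "increase"))
    if mode == "min":
        # e.g. reduce the sample batch size of workers slower than the preset speed
        for wid, value in per_worker.items():
            if value < preset:
                adjustments.append((f"batch_size[{wid}]", "decrease"))
    return system_value, adjustments

# Straggler example: worker "A" runs slower than the preset speed of 100.
_, plan = aggregate_and_decide({"A": 80.0, "B": 120.0}, preset=100.0, mode="min")
print(plan)
```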
This embodiment is not limited to the above application scenarios, and may also be applied to business scenarios such as integrated intelligent supply-chain services for bulk commodities. In one example, a business scenario may bring benefits to a collaborative supply-chain network such as reduced transportation costs, improved supplier delivery performance, and minimized supplier risk. In another example, a business scenario may provide better situational intelligence across the entire supply-chain operation, reduce inventory and operating costs, and shorten response times to customers. In yet another example, a business scenario may improve the accuracy of production planning and plant scheduling in a collaborative supply-chain network and reduce supply-chain delays for components used in customized products.
In the method for adjusting configuration parameters in the model training process provided in the above embodiments of the present specification, each working node in the distributed machine learning system acquires its own node metric index and system metric index; each working node determines, based on its own node metric index, its first to-be-adjusted configuration parameter and the corresponding first adjustment strategy, and adjusts the first to-be-adjusted configuration parameter according to that strategy. The service node determines a second to-be-adjusted configuration parameter based on the system metric indexes of the working nodes and determines a second adjustment strategy corresponding to it, and each working node adjusts the second to-be-adjusted configuration parameter according to the second adjustment strategy. In this way, configuration parameters are adaptively adjusted during model training, improving the efficiency and performance of model training.
It should be noted that although in the above-described embodiment of fig. 2 the operations of the method are described in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowchart may be executed in a different order. For example, step 203 may be performed before step 204, step 204 may be performed before step 203, or steps 203 and 204 may be performed simultaneously. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
In accordance with the foregoing embodiments of the method for adjusting configuration parameters during model training, the present specification also provides embodiments of an apparatus for adjusting configuration parameters during model training.
As shown in fig. 3, fig. 3 is a block diagram of an apparatus for adjusting configuration parameters during model training according to an exemplary embodiment. The apparatus is deployed in a working node of a distributed machine learning system and may include: an acquisition module 301, a determining module 302, a policy module 303, a first adjusting module 304, a sending module 305, a receiving module 306, and a second adjusting module 307.
The acquisition module 301 is configured to acquire at least one node metric index and at least one system metric index of the working node during model training. The node metric index is an index representing the current training state of the working node, and the system metric index is an index affecting the current performance of the system.

The determining module 302 is configured to determine a first to-be-adjusted configuration parameter based on the node metric index.

The policy module 303 is configured to determine a first adjustment policy corresponding to the first to-be-adjusted configuration parameter.

The first adjusting module 304 is configured to adjust the first to-be-adjusted configuration parameter according to the first adjustment policy.

The sending module 305 is configured to send the system metric index to the target device.

The receiving module 306 is configured to receive the second adjustment policy returned by the target device.

The second adjusting module 307 is configured to adjust the second to-be-adjusted configuration parameter according to the second adjustment policy.
In some implementations, the determining module 302 is configured to: determine, based on each node metric index in the at least one node metric index, whether a local configuration parameter corresponding to the node metric index needs to be adjusted, and determine the local configuration parameter that needs to be adjusted as the first to-be-adjusted configuration parameter.
In other embodiments, the node metric index may include any one or more of: the CPU utilization of the working node; the memory utilization of the working node; the network utilization of the working node; the disk utilization of the working node; the gradient signal-to-noise ratio calculated by the working node when training the model; the variance of the gradient calculated by the working node when training the model; and the loss variation calculated by the working node when training the model.
In other embodiments, the system metrics may include any one or more of: the utilization rate of the working node on resources; the computational performance of the working node; communication performance of the working node; and the operating speed of the working node.
It should be understood that the above-mentioned apparatus may be preset in the working node, and may also be loaded into the working node by means of downloading or the like. The corresponding modules in the device can be matched with the modules in the working nodes to realize the scheme of adjusting the configuration parameters in the model training process.
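The cooperation of modules 301 through 307 described above can be sketched as a minimal worker-side loop: local metrics drive local adjustments, while system metrics are sent to the target device, whose returned policy is then applied. This is an illustrative sketch only; the class names, thresholds, and placeholder metric values are invented and not specified by the patent.

```python
class WorkerAdjuster:
    """Toy worker combining acquisition (301), determination/policy/first
    adjustment (302-304), sending (305), receiving (306), and second
    adjustment (307) into one step."""

    def __init__(self, target_device):
        self.target = target_device  # e.g. the service node
        self.applied = []            # record of adjustments, for inspection

    def collect(self):
        """Acquisition module (301): gather node and system metric indexes."""
        node_metrics = {"cpu_util": 0.95}        # placeholder values
        system_metrics = {"running_speed": 80.0}
        return node_metrics, system_metrics

    def apply(self, policy):
        """Adjusting modules (304/307): record (stand-in for applying) a policy."""
        self.applied.append(policy)

    def step(self):
        node_metrics, system_metrics = self.collect()
        # Determine (302), choose policy (303), apply first adjustment (304).
        if node_metrics["cpu_util"] > 0.90:
            self.apply({"data_processing_mode": "dropout"})
        # Send metrics (305), receive policy (306), apply second adjustment (307).
        policy = self.target.decide(system_metrics)
        if policy:
            self.apply(policy)
        return self.applied

class FakeServiceNode:
    """Stand-in target device returning a second adjustment policy."""
    def decide(self, system_metrics):
        return {"batch_size": "decrease"} if system_metrics["running_speed"] < 100 else {}

print(WorkerAdjuster(FakeServiceNode()).step())
```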
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of one or more embodiments of the present specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. Software modules may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments, objects, technical solutions and advantages of the present application are described in further detail, it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present application, and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present application should be included in the scope of the present application.

Claims (10)

1. A method of adjusting configuration parameters during model training, the method being applied to a work node in a distributed machine learning system, the method comprising:
in the model training process, collecting at least one node metric index and at least one system metric index of the working node; the node metric index is an index representing the current training state of the working node; the system metric index is an index affecting the current performance of the system;
determining a first to-be-adjusted configuration parameter to be adjusted based on the node metric index;
determining a first adjusting strategy corresponding to the first to-be-adjusted configuration parameter;
adjusting the first to-be-adjusted configuration parameter according to the first adjustment strategy; and
sending the system metric index to a target device, so that the target device determines a second to-be-adjusted configuration parameter and a second adjustment strategy corresponding to the second to-be-adjusted configuration parameter;
receiving the second adjustment strategy returned by the target device;
adjusting the second to-be-adjusted configuration parameter according to the second adjustment strategy;
wherein the node metric index comprises at least: the CPU utilization rate of the working node, the memory utilization rate of the working node, the network utilization rate of the working node and the disk utilization rate of the working node.
2. The method of claim 1, wherein the determining a first to-be-adjusted configuration parameter to be adjusted based on the node metric comprises:
determining, based on each node metric index in the at least one node metric index, whether a local configuration parameter corresponding to the node metric index needs to be adjusted;
and determining the local configuration parameter that needs to be adjusted as the first to-be-adjusted configuration parameter.
3. The method according to claim 2, wherein for any node metric index, whether the local configuration parameter corresponding to the node metric index needs to be adjusted is determined by:
acquiring a normal value range preset for the value of the node metric index;
and if the value of the node metric index does not fall within the normal value range, determining that the local configuration parameter corresponding to the node metric index needs to be adjusted.
4. The method of claim 1, wherein the node metric further comprises any one or more of:
the gradient signal-to-noise ratio calculated by the working node when training the model;
the variance of the gradient calculated by the working node when training the model; and
the loss variation calculated by the working node when training the model.
5. The method of claim 1, wherein the at least one system metric comprises any one or more of:
the utilization rate of the working node on resources;
the computational performance of the worker node;
communication performance of the working node; and
the operating speed of the working node.
6. An apparatus for adjusting configuration parameters in a model training process, the apparatus being applied to a work node in a distributed machine learning system, the apparatus comprising:
the acquisition module is used for acquiring at least one node metric index and at least one system metric index of the working node in the model training process; the node metric index is an index representing the current training state of the working node; the system metric index is an index affecting the current performance of the system;
a determining module, configured to determine a first to-be-adjusted configuration parameter to be adjusted based on the node metric indicator;
the strategy module is used for determining a first adjustment strategy corresponding to the first to-be-adjusted configuration parameter;
the first adjusting module is used for adjusting the first to-be-adjusted configuration parameter according to the first adjustment strategy;
the sending module is used for sending the system measurement index to the target equipment;
the receiving module is used for receiving a second adjustment strategy returned by the target equipment;
the second adjusting module is used for adjusting a second to-be-adjusted configuration parameter according to the second adjustment strategy;
wherein the node metric index comprises at least: the CPU utilization rate of the working node, the memory utilization rate of the working node, the network utilization rate of the working node and the disk utilization rate of the working node.
7. The apparatus of claim 6, wherein the determining module is configured to:
determining, based on each node metric index in the at least one node metric index, whether a local configuration parameter corresponding to the node metric index needs to be adjusted;
and determining the local configuration parameter that needs to be adjusted as the first to-be-adjusted configuration parameter.
8. The apparatus of claim 6, wherein the node metric further comprises any one or more of:
the gradient signal-to-noise ratio calculated by the working node when training the model;
the variance of the gradient calculated by the working node when training the model; and
the loss variation calculated by the working node when training the model.
9. The apparatus of claim 6, wherein the system metric comprises any one or more of:
the utilization rate of the working node on resources;
the computational performance of the worker node;
communication performance of the working node; and
the operating speed of the working node.
10. An apparatus for adjusting configuration parameters in a model training process, the apparatus being applied to a distributed machine learning system, the system comprising a plurality of working nodes and a service node; the device comprises a plurality of monitoring modules, a plurality of communication modules, a plurality of first adjusting modules and a second adjusting module; a monitoring module, a communication module and a first adjusting module are respectively deployed on each working node, and a second adjusting module is deployed on the service node;
wherein, for any working node:
the monitoring module is deployed on the working node and used for acquiring at least one node metric index and at least one system metric index of the working node in the model training process; transmitting the node metric index to a first adjusting module on the working node, and transmitting the system metric index to a communication module on the working node; the node metric index is an index representing the current training state of the working node; the node metric indicators include at least: the CPU utilization rate of the working node, the memory utilization rate of the working node, the network utilization rate of the working node and the disk utilization rate of the working node; the system metric index is an index affecting the current performance of the system;
the communication module is deployed on the working node and used for sending the system metric index transmitted by the monitoring module on the working node to the second adjusting module on the service node, receiving a second adjusting strategy returned by the second adjusting module and transmitting the second adjusting strategy to the first adjusting module on the working node;
the first adjusting module is deployed on the working node and is used for determining a first to-be-adjusted configuration parameter based on the node metric index transmitted by the monitoring module on the working node, determining a first adjustment strategy corresponding to the first to-be-adjusted configuration parameter, adjusting the first to-be-adjusted configuration parameter according to the first adjustment strategy, and adjusting a second to-be-adjusted configuration parameter according to the second adjustment strategy transmitted by the communication module on the working node;
the second adjusting module is configured to receive the system metric indexes transmitted by the communication modules, determine a second to-be-adjusted configuration parameter and a third to-be-adjusted configuration parameter based on the system metric indexes, respectively determine a second adjustment strategy corresponding to the second to-be-adjusted configuration parameter and a third adjustment strategy corresponding to the third to-be-adjusted configuration parameter, return the second adjustment strategy to each communication module, and adjust the third to-be-adjusted configuration parameter according to the third adjustment strategy.
CN202110599044.0A 2021-05-31 2021-05-31 Method and device for adjusting configuration parameters in model training process Active CN113255931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110599044.0A CN113255931B (en) 2021-05-31 2021-05-31 Method and device for adjusting configuration parameters in model training process


Publications (2)

Publication Number Publication Date
CN113255931A CN113255931A (en) 2021-08-13
CN113255931B (en) 2021-10-01

Family

ID=77183810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110599044.0A Active CN113255931B (en) 2021-05-31 2021-05-31 Method and device for adjusting configuration parameters in model training process

Country Status (1)

Country Link
CN (1) CN113255931B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443506B (en) * 2022-04-07 2022-06-10 浙江大学 Method and device for testing artificial intelligence model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2547140A1 (en) * 2011-07-11 2013-01-16 Koninklijke Philips Electronics N.V. Method for configuring a node
CN106297774B (en) * 2015-05-29 2019-07-09 中国科学院声学研究所 A kind of the distributed parallel training method and system of neural network acoustic model
CN112508067A (en) * 2020-11-26 2021-03-16 腾讯科技(深圳)有限公司 Distributed machine learning system, model training method, node device, and medium
CN112561078B (en) * 2020-12-18 2021-12-28 北京百度网讯科技有限公司 Distributed model training method and related device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant