CN111562985B - Resource management method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN111562985B (Application CN202010388242.8A)
- Authority
- CN
- China
- Prior art keywords
- data processing
- processing nodes
- parameter servers
- iterative training
- training operation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computer And Data Communications (AREA)
Abstract
The disclosure relates to a resource management method and device, an electronic device, and a storage medium, applied to a distributed computing system comprising a plurality of data processing nodes and a plurality of parameter servers. The method includes the following steps: determining a first duration consumed by the plurality of data processing nodes to perform at least one round of iterative training of a neural network model, and a second duration consumed by data transmission operations between the plurality of data processing nodes and the plurality of parameter servers; and adjusting the number of the data processing nodes and/or the parameter servers according to the first duration and the second duration.
Description
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a resource management method and device, an electronic device and a storage medium.
Background
In a distributed computer system, transactions are often processed cooperatively by multiple units.
During such cooperative processing, some units sit in a waiting state, leaving resources in the system idle; in other words, resources are wasted.
Disclosure of Invention
The present disclosure proposes a resource management solution.
According to an aspect of the present disclosure, there is provided a resource management method applied to a distributed computing system including a plurality of data processing nodes and a plurality of parameter servers, the method comprising:
determining a first time length consumed by the plurality of data processing nodes for performing at least one round of iterative training operation of the neural network model and a second time length consumed by the plurality of data processing nodes for performing data transmission operation with the plurality of parameter servers;
and adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration.
In a possible implementation manner, the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration includes:
and adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers under the condition that the difference value between the first time length and the second time length is larger than a set threshold value.
In a possible implementation manner, the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers when the difference value between the first duration and the second duration is greater than a set threshold includes:
reducing the number of the plurality of parameter servers when a first difference value obtained by subtracting the second duration from the first duration is greater than a first threshold; and/or
increasing the number of the plurality of parameter servers when a second difference value obtained by subtracting the first duration from the second duration is greater than a second threshold.
In one possible implementation, the number of the plurality of parameter servers after adjustment is positively correlated with a difference value between the first time period and the second time period, and is negatively correlated with the first time period or the second time period.
In a possible implementation manner, the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration includes:
normalizing the difference value of the first time length and the second time length to obtain a normalized difference value;
and adjusting the number of the plurality of parameter servers according to the normalized difference value and a preset weight value, wherein the preset weight value is smaller than a first numerical value.
In one possible implementation, the plurality of data processing nodes perform transmission of data related to the iterative training operation between the data processing nodes and the plurality of parameter servers during performance of the iterative training operation.
In one possible implementation manner, during execution of an iterative training operation, the plurality of data processing nodes perform transmission of data related to the iterative training operation between the data processing nodes and the plurality of parameter servers, including:
during the time that the plurality of data processing nodes execute the current iterative training operation, the plurality of data processing nodes acquire data used by the next iterative training operation from the plurality of parameter servers; and/or,
and during the period that the plurality of data processing nodes execute the current iterative training operation, transmitting the data obtained by the previous iterative training operation to the plurality of parameter servers.
In one possible implementation manner, the sending the data obtained by the previous iterative training operation to the plurality of parameter servers includes:
and each data processing node divides the data obtained by the previous iterative training operation into a plurality of parts and transmits one part to each of the plurality of parameter servers.
According to an aspect of the present disclosure, there is provided a resource management apparatus for use in a distributed computing system comprising a plurality of data processing nodes and a plurality of parameter servers, the apparatus comprising:
A determining unit, configured to determine a first duration consumed by the plurality of data processing nodes to perform at least one round of iterative training operation of the neural network model, and a second duration consumed by the plurality of data processing nodes to perform data transmission operation with the plurality of parameter servers;
and the adjusting unit is used for adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration.
In a possible implementation manner, the adjusting unit is configured to adjust the number of the plurality of data processing nodes and/or the plurality of parameter servers when a difference value between the first duration and the second duration is greater than a set threshold.
In a possible implementation manner, the adjusting unit is configured to reduce the number of the plurality of parameter servers when a first difference obtained by subtracting the second duration from the first duration is greater than a first threshold; and/or to increase the number of the plurality of parameter servers when a second difference obtained by subtracting the first duration from the second duration is greater than a second threshold.
In one possible implementation, the number of the plurality of parameter servers after adjustment is positively correlated with a difference value between the first time period and the second time period, and is negatively correlated with the first time period or the second time period.
In one possible implementation, the adjustment unit includes a first adjustment subunit and a second adjustment subunit, where:
the first adjustment subunit is configured to normalize a difference value between the first duration and the second duration to obtain a normalized difference value;
the second adjustment subunit is configured to adjust the number of the plurality of parameter servers according to the normalized difference value and a preset weight value, where the preset weight value is smaller than a first numerical value.
In one possible implementation, the plurality of data processing nodes perform transmission of data related to the iterative training operation between the data processing nodes and the plurality of parameter servers during performance of the iterative training operation.
In one possible implementation manner, during execution of an iterative training operation, the plurality of data processing nodes perform transmission of data related to the iterative training operation between the data processing nodes and the plurality of parameter servers, including:
During the time that the plurality of data processing nodes execute the current iterative training operation, the plurality of data processing nodes acquire data used by the next iterative training operation from the plurality of parameter servers; and/or,
and during the period that the plurality of data processing nodes execute the current iterative training operation, transmitting the data obtained by the previous iterative training operation to the plurality of parameter servers.
In one possible implementation manner, the sending the data obtained by the previous iterative training operation to the plurality of parameter servers includes:
and each data processing node divides the data obtained by the previous iterative training operation into a plurality of parts and transmits one part to each of the plurality of parameter servers.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiments of the disclosure, the number of data processing nodes and/or parameter servers is adjusted according to the first duration consumed by the data processing nodes to perform at least one round of iterative training of the neural network model and the second duration consumed by data transmission between the data processing nodes and the parameter servers, so that the difference between the durations of the iterative training operation and the data transmission operation falls within a set range. When the iterative training operation and the data transmission operation are executed in parallel, this reduces the time that data processing nodes or parameter servers sit idle, reducing resource waste and improving resource utilization.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
FIG. 1 illustrates a schematic diagram of a distributed computer system according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a resource management method according to an embodiment of the present disclosure;
FIG. 3 illustrates a timing diagram of an iterative training operation and data transfer operation in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a timing diagram of an iterative training operation and data transfer operation in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a timing diagram of an iterative training operation and data transfer operation in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of a resource management device, according to an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an electronic device, according to an embodiment of the present disclosure;
FIG. 8 illustrates a block diagram of an electronic device, according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
The disclosure provides a resource management method, which can be applied to a distributed computer system, and please refer to fig. 1, which is a schematic structural diagram of the distributed computer system provided in the disclosure, wherein the distributed computer system includes a plurality of data processing nodes and a plurality of Parameter Servers (PS).
The data processing nodes may be deployed in the form of virtual machines and/or physical machines, and the parameter servers may also be deployed in the form of virtual machines and/or physical machines.
Optionally, the data processing nodes and the parameter servers in the distributed computer system may be deployed in the form of virtual machines. Virtual machines may be implemented by a virtualization technology that partitions a physical machine into multiple logically isolated environments, where each virtual machine independently completes its own tasks without interfering with the others. The present disclosure is not limited to the particular virtualization technique used.
The distributed computer system provided by the disclosure can be used to execute distributed machine learning tasks. In this system, a data processing node performs iterative training operations of a neural network model, and a parameter server stores data related to the iterative training operations and can also provide that data to the data processing nodes. The data processing nodes and parameter servers communicate through a network to perform data transmission operations.
In the distributed computer system, the plurality of parameter servers can realize distributed storage of data, and data transmission between the parameter servers and the data processing nodes can be realized through a point-to-point hypermedia transmission protocol. A data processing node can divide the data to be stored into a plurality of parts and send one part to each parameter server; when reading, the data processing node can read one part from each of the plurality of parameter servers and combine the read parts into the complete data.
Distributed storage can break through the network bandwidth limit of a single parameter server, and a plurality of data processing nodes can transmit data with a plurality of parameter servers simultaneously, enabling rapid data transmission. When data transmission efficiency is not limited by the data processing nodes, the more parameter servers there are, the higher the data transmission efficiency can be.
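The sharded transfer described above can be illustrated with a short sketch. This is a minimal illustration only, assuming hypothetical `send`/`recv` methods on a parameter-server handle; the disclosure does not specify a concrete transport API:

```python
import numpy as np

def shard_and_push(data: np.ndarray, servers: list) -> None:
    # Split one tensor into len(servers) roughly equal parts and push one
    # part to each parameter server, so that no single server's network
    # bandwidth becomes a bottleneck.
    shards = np.array_split(data, len(servers))
    for server, shard in zip(servers, shards):
        server.send(shard)  # hypothetical point-to-point transfer call

def fetch_and_merge(servers: list) -> np.ndarray:
    # Read one shard from each parameter server and reassemble the
    # complete tensor, mirroring the read path described above.
    shards = [server.recv() for server in servers]  # hypothetical call
    return np.concatenate(shards)
```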
Fig. 2 shows a flowchart of a resource management method according to an embodiment of the present disclosure, as shown in fig. 2, including:
in step S11, a first duration consumed by the plurality of data processing nodes to perform at least one round of iterative training of the neural network model is determined, together with a second duration consumed by data transmission operations between the data processing nodes and the plurality of parameter servers.
In the distributed computer system provided by the present disclosure, the iterative training performed by a data processing node may be periodic, that is, multiple rounds of iterative training are performed on the neural network model. In a single iteration period, the neural network model is trained according to preset steps, for example, forward propagation, or forward propagation followed by gradient computation. As an example, a sample image may be input into the neural network model for processing to obtain a processing result, from which the network loss and gradients are then derived. A data processing node may transmit the data obtained by an iterative training operation to the parameter servers and perform the next round of iterative training using data obtained from the parameter servers.
The first duration may be the time the data processing node spends performing at least one round of iterative training, and may include the time spent by each round, for example, the time spent on forward propagation in each round, or on forward propagation plus gradient computation. Optionally, the waiting time between two adjacent rounds of iterative training may be excluded, although the embodiments of the disclosure are not limited in this respect.
In an embodiment of the present disclosure, the data transmitted by the data transmission operation includes data related to an iterative training operation. During the iterative training operation performed by the plurality of data processing nodes, the data processing nodes may be in communication with the plurality of parameter servers for data related to the iterative training operation.
The second duration consumed by the data processing node in performing the data transmission operation may be a duration consumed in performing transmission of data related to at least one round of iterative training operation, for example, a duration consumed in acquiring data used in the at least one round and/or the next round of iterative training operation from a plurality of parameter servers, or a duration consumed in transmitting processing results generated in the at least one round and/or the previous round of iterative training operation.
It should be noted that a data processing node may perform data upload and download operations simultaneously, that is, multiple data transmission operations may run at the same time. In this case, the second duration may be taken as the duration of the most time-consuming data transmission operation.
For example, in the case where the time period spent transmitting the data obtained by the iterative training operation is greater than the time period spent acquiring the data used by the iterative training operation, the second time period is the time period spent transmitting the data obtained by the at least one round and/or the previous iterative training operation.
In some embodiments, the first duration may be an average value of durations spent in multiple iterative training operations, or a maximum value or a minimum value of durations spent in multiple iterative training operations, or a duration spent in a certain iterative training operation, where multiple iterative training operations may be performed by multiple data processing nodes in the system respectively, or may be multiple iterative training operations by the same data processing node. Accordingly, the second duration may also be an average value, a maximum value, or a minimum value of durations consumed by a plurality of data transmission operations, or be a duration consumed by a certain data transmission operation, which is not limited in the embodiments of the present disclosure.
In practical applications, the start and end times of the iterative training operation and the data transmission operation may be recorded and the first and second durations determined from them, or the time spent by each operation may be recorded with a timer. Of course, the first and second durations may also be recorded in other manners, which are not described here.
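As a concrete illustration of this timing step, the following sketch measures the two durations with a monotonic clock; the callables stand in for the actual training and transfer routines, which the disclosure does not name:

```python
import time
from typing import Callable, Tuple

def measure_durations(train_step: Callable[[], None],
                      upload: Callable[[], None],
                      download: Callable[[], None]) -> Tuple[float, float]:
    # First duration: one round of iterative training.
    start = time.monotonic()
    train_step()
    t1 = time.monotonic() - start

    # Time the download and upload transfers separately.
    start = time.monotonic()
    download()
    t_down = time.monotonic() - start

    start = time.monotonic()
    upload()
    t_up = time.monotonic() - start

    # When upload and download overlap in practice, the second duration
    # is taken as the most time-consuming transfer, as described above.
    t2 = max(t_up, t_down)
    return t1, t2
```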
In step S12, the number of the plurality of data processing nodes and/or the plurality of parameter servers is adjusted according to the first duration and the second duration.
In the embodiment of the disclosure, the number of the plurality of data processing nodes and/or the plurality of parameter servers is adjusted so that the difference value of the time duration consumed by the iterative training operation and the data transmission operation is within a set range.
The set range may be set manually according to experience. When the difference value between the first duration and the second duration is not within the set range, the time that the data processing nodes or the parameter servers spend in an idle waiting state exceeds the tolerable value. In this case, the number of data processing nodes and/or parameter servers is adjusted so that the difference between the durations of the iterative training operation and the data transmission operation falls within the set range. When the two operations are performed in parallel, this reduces the time that resources sit idle, reducing resource waste and improving resource utilization.
The difference value is used to represent the degree of difference between the first duration and the second duration; the larger the difference value, the greater the degree of difference. The difference value may specifically be the result of subtracting one duration from the other, or the ratio of the two durations; the present disclosure does not limit the specific form of the difference value.
The adjustment of the number of data processing nodes and/or parameter servers may be achieved by related techniques. For example, idle data processing nodes may be disabled to reduce the number of data processing nodes; a parameter server may be disabled after its contents are backed up to other parameter servers, reducing the number of parameter servers; a parameter server or data processing node may be repurposed for other tasks; or unused data processing nodes or parameter servers may be enabled to increase their number.
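A sketch of such an adjustment step follows, against an entirely hypothetical cluster-management interface (`idle_parameter_servers`, `backup_to`, `disable`, and `enable_parameter_server` are illustrative names, not an API from the disclosure):

```python
def scale_parameter_servers(cluster, delta: int) -> None:
    if delta < 0:
        # Back up each idle server's stored shards to the remaining
        # servers before disabling it, as described above.
        for ps in cluster.idle_parameter_servers()[:-delta]:
            ps.backup_to(cluster.remaining_parameter_servers(exclude=ps))
            cluster.disable(ps)
    else:
        # Bring previously unused servers into service.
        for _ in range(delta):
            cluster.enable_parameter_server()
```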
In a possible implementation manner, in the resource management method provided by the present disclosure, the plurality of data processing nodes may perform transmission of data related to the iterative training operation with the plurality of parameter servers during execution of the iterative training operation. Specifically: during execution of the current iterative training operation, the plurality of data processing nodes acquire the data used by the next iterative training operation from the plurality of parameter servers, and/or send the data obtained by the previous iterative training operation to the plurality of parameter servers.
During the current iterative training operation of the neural network model, the data processing node can acquire the data used by the next iterative training operation without waiting until the execution of the current iterative training operation is completed, so that the next iterative training operation can be started as soon as possible, the waiting time of the data processing node is reduced, and the utilization rate of resources is improved.
In addition, a data processing node can send the data obtained by the previous iterative training operation to the plurality of parameter servers during the current iterative training operation, rather than waiting for that data to finish sending before starting the next round of iterative training. This reduces the waiting time of the data processing node and improves resource utilization.
The data obtained by an iterative training operation may be gradients or weight parameters of the neural network model, and likewise the data obtained from the parameter servers for use in an iterative training operation may be gradients or weight parameters of the neural network model; the present disclosure does not limit either.
Referring to fig. 3, a timing chart of an iterative training operation and a data transmission operation is provided for the present disclosure, where the abscissa is time, the ordinate is a processing period i, and in an ideal case, neither the data processing node nor the parameter server has idle waiting, and the time consumed by the data transmission operation and the iterative training operation is the same.
For example, referring to FIG. 3, in the time period t_{j-1} ~ t_j, the transmission of the processing result of the (i-1)-th round of iterative training and/or of the data to be processed by the (i+1)-th round (the data transmission operation) finishes just as the i-th round of iterative training finishes, so the idle time of both the data processing nodes and the parameter servers is close to 0 and high resource utilization is achieved.
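The overlap shown in FIG. 3 can be sketched with a two-worker thread pool; `train_step`, `push_previous`, and `prefetch_next` are placeholders for the routines described above, not names from the disclosure:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def pipelined_training(train_step: Callable[[int], None],
                       push_previous: Callable[[int], None],
                       prefetch_next: Callable[[int], None],
                       num_rounds: int) -> None:
    # Overlap round i's computation with sending round i-1's results and
    # fetching round i+1's inputs, as in the timing diagram of FIG. 3.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for i in range(num_rounds):
            upload = pool.submit(push_previous, i)
            download = pool.submit(prefetch_next, i)
            train_step(i)
            # In the ideal case both transfers finish no later than the
            # computation, so these waits return immediately.
            upload.result()
            download.result()
```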
In a possible implementation manner, the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration includes: and adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers under the condition that the difference value between the first time length and the second time length is larger than a set threshold value.
As described above, the difference value is used to represent the degree of difference between the first duration and the second duration, and the difference value is greater than the set threshold, which indicates that the degree of difference is too great, and the duration of the data processing node or the parameter server in the idle waiting state exceeds the tolerable value. Therefore, when the difference value is larger than the set threshold value, the number of the data processing nodes and/or the parameter servers is adjusted so as to reduce the waiting time of the data processing nodes and improve the utilization rate of resources.
In addition, the number of the plurality of data processing nodes and/or the plurality of parameter servers may be adjusted periodically, for example, the number of the plurality of data processing nodes and/or the plurality of parameter servers may be adjusted every 10 minutes according to the first duration and the second duration.
In a possible implementation manner, the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers when the difference value between the first duration and the second duration is greater than a set threshold includes: reducing the number of the plurality of parameter servers and/or increasing the number of the data processing nodes when a first difference value obtained by subtracting the second duration from the first duration is greater than a first threshold; and/or increasing the number of the plurality of parameter servers when a second difference value obtained by subtracting the first duration from the second duration is greater than a second threshold.
When the first duration is longer than the second duration, the iterative training operation consumes more time than the data transmission operation, so the parameter servers experience idle waiting. A first difference value between the first duration and the second duration that is greater than the first threshold indicates that the parameter servers' idle waiting time exceeds the tolerable value. The specific value of the first threshold can be set manually according to experience, and the present disclosure does not limit it.
As shown in FIG. 4, the iterative training operation takes longer than the data transmission operation. In this case, the number of data processing nodes may be increased, which increases the amount of data to be transmitted, makes use of the parameter servers' idle waiting time, and improves resource utilization. Alternatively, the number of parameter servers in the distributed computer system may be reduced, that is, the idle waiting parameter servers are removed from the system, improving resource utilization.
When the second duration is longer than the first duration, the iterative training operation consumes less time than the data transmission operation, so the data processing nodes may experience idle waiting. A second difference value between the second duration and the first duration that is greater than the second threshold indicates that the data processing nodes' idle waiting time exceeds the tolerable value. The specific value of the second threshold can be set manually according to experience, and the present disclosure does not limit it.
As shown in FIG. 5, the data transmission operation takes longer than the iterative training operation. In this case, the number of parameter servers may be increased, which increases the bandwidth available to the data transmission operation, transmits data more quickly, shortens the data transmission operation, reduces the data processing nodes' idle waiting time, and improves resource utilization. Alternatively, the number of data processing nodes in the distributed computer system may be reduced, that is, idle waiting data processing nodes are removed from the system, improving resource utilization.
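The two threshold rules above amount to a small decision function. A minimal sketch, with the sign convention that a negative return value removes parameter servers (or adds data processing nodes) and a positive one adds parameter servers (or removes data processing nodes):

```python
def adjustment_direction(t1: float, t2: float,
                         first_threshold: float,
                         second_threshold: float) -> int:
    # Training dominates: the parameter servers sit idle, so scale them
    # down (or scale the data processing nodes up).
    if t1 - t2 > first_threshold:
        return -1
    # Transmission dominates: the data processing nodes sit idle, so
    # scale the parameter servers up (or the data processing nodes down).
    if t2 - t1 > second_threshold:
        return +1
    # Difference within tolerance: leave the cluster as it is.
    return 0
```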
In one possible implementation, the number of the plurality of parameter servers that are adjusted is positively correlated with a difference value between the first duration and the second duration, and negatively correlated with the first duration or the second duration.
The number of the plurality of parameter servers that is adjusted may be the number of parameter servers that is increased or decreased in the process of adjusting the number of parameter servers.
For example, in reducing the number of the plurality of parameter servers, the number of the reduced parameter servers is positively correlated with the first difference, i.e., the larger the first difference, the greater the number of the reduced parameter servers. In addition, the reduced number of parameter servers is also inversely related to the first duration or the second duration.
In increasing the number of the plurality of parameter servers, the number of the increased parameter servers is positively correlated with the second difference, i.e., the larger the second difference, the greater the number of the increased parameter servers. In addition, the number of parameter servers that are increased is also inversely related to the first duration or the second duration.
In the embodiments of the disclosure, the adjusted number of parameter servers is positively correlated with the difference value between the first duration and the second duration, so that the adjustment is large when the degree of difference is large and small when it is small; the number of parameter servers can thus be adjusted accurately according to the degree of difference between the first duration and the second duration.
The adjusted number of parameter servers is also inversely related to the first or second duration, since the difference value can be normalized by the first or second duration; this makes the approach applicable to durations of different orders of magnitude and gives it strong generality.
In a possible implementation manner, the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration includes: normalizing the difference value of the first time length and the second time length to obtain a normalized difference value; and adjusting the number of the plurality of parameter servers according to the normalized difference value and a preset weight value, wherein the preset weight value is smaller than a first numerical value.
As described above, the adjusted number of parameter servers is inversely related to the first duration or the second duration; in the normalization process, the difference value may specifically be divided by the first duration or the second duration. This eliminates the influence of durations of different scales on the adjustment of the number of parameter servers, makes the approach applicable to durations of different orders of magnitude, and gives it strong generality.
The preset weight value determines the magnitude of each adjustment to the number of parameter servers. The larger the preset weight value, the more parameter servers are adjusted each time; the smaller it is, the fewer are adjusted. The preset weight value can be set manually as needed. In practical applications, the preset weight value is smaller than the first numerical value, so that the number of parameter servers is adjusted in small steps each time, avoiding overshoot caused by an overly large single adjustment.
For ease of understanding, the process of adjusting the number of parameter servers is described below by a specific expression. Formula (1) defines the number n1 of parameter servers to be reduced in one possible implementation of the present disclosure:

n1 = [W1 · (t1 − t2) / t1]    (1)

where t1 represents the first duration, t2 represents the second duration, W1 represents a weight coefficient, and [·] represents rounding to an integer.

The weight coefficient W1 can be set empirically; the larger W1 is, the larger the number n1 of parameter servers that are removed. By setting W1, the magnitude of each reduction in the number of parameter servers can be adjusted.
In one possible implementation, n1 data processing nodes may instead be added, or the adjustment may be split based on n1, i.e., a certain number of data processing nodes are added while a certain number of parameter servers are removed, so that the overall effect of reducing the idle time to within the allowed range is achieved.
Refer to formula (2), which defines the number n2 of parameter servers to be added in one possible implementation of the present disclosure:

n2 = [W2 · (t2 − t1) / t2]    (2)

where t1 represents the first duration, t2 represents the second duration, W2 represents a weight coefficient, and [·] represents rounding to an integer.

The weight coefficient W2 can be set empirically; the larger W2 is, the larger the number n2 of parameter servers that are added. By setting W2, the magnitude of each increase in the number of parameter servers can be adjusted.
In one possible implementation, n2 data processing nodes may instead be removed, or the adjustment may be split based on n2, i.e., a certain number of parameter servers are added while a certain number of data processing nodes are removed, so that the overall effect of reducing the idle time to within the allowed range is achieved.
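Formulas (1) and (2) can be put together in a short sketch. The normalizing denominators (t1 for formula (1), t2 for formula (2)) are an assumption reconstructed from the surrounding text, which states only that the difference is divided by the first or the second duration:

```python
import math

def servers_to_remove(t1: float, t2: float, w1: float) -> int:
    # Formula (1): n1 = [W1 * (t1 - t2) / t1], where [.] rounds to an
    # integer; the division by t1 is an assumed normalization.
    return max(0, math.floor(w1 * (t1 - t2) / t1))

def servers_to_add(t1: float, t2: float, w2: float) -> int:
    # Formula (2): n2 = [W2 * (t2 - t1) / t2], likewise with an assumed
    # normalization by t2.
    return max(0, math.floor(w2 * (t2 - t1) / t2))
```

For example, with t1 = 8 s, t2 = 10 s, and W2 = 5, formula (2) yields n2 = [5 × 0.2] = 1 additional parameter server.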
In one possible implementation manner, the sending the data obtained by the previous iterative training operation to the plurality of parameter servers includes: each data processing node divides the data obtained by the previous iterative training operation into a plurality of parts and transmits one part to each of the plurality of parameter servers.
As described above, the plurality of parameter servers can implement distributed storage of data, the data sent by a data processing node to different parameter servers may differ, and distributed storage breaks through the network bandwidth limit of a single parameter server; resource utilization can therefore be improved by adjusting the number of data processing nodes and/or parameter servers.
In one possible implementation, the neural network model may be used in image processing, natural language processing, identity authentication, and the like.
In a possible implementation manner, the resource management method may be performed by an electronic device such as a terminal device or a server. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like, and the method may be implemented by a processor invoking computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server.
In one possible implementation, the resource management method may be implemented by a management node in a distributed computing system.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from their principles and logic; owing to space limitations, this is not elaborated further in the present disclosure. It will be appreciated by those skilled in the art that in the above methods of the embodiments, the specific order of execution of the steps should be determined by their function and possible inherent logic.
In addition, the disclosure further provides a resource management apparatus, an electronic device, a computer-readable storage medium, and a program, all of which may be used to implement any of the resource management methods provided in the disclosure; for the corresponding technical solutions and descriptions, refer to the method sections, which are not repeated here.
Fig. 6 shows a block diagram of a resource management apparatus according to an embodiment of the present disclosure, the apparatus being applied to a distributed computing system including a plurality of data processing nodes and a plurality of parameter servers, the apparatus 20 comprising:
a determining unit 201, configured to determine a first duration consumed by the plurality of data processing nodes to perform at least one iteration training operation of the neural network model, and a second duration consumed by the plurality of data processing nodes to perform data transmission operations with the plurality of parameter servers;
An adjusting unit 202, configured to adjust the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration.
In a possible implementation manner, the adjusting unit 202 is configured to adjust the number of the plurality of data processing nodes and/or the plurality of parameter servers when a difference value between the first duration and the second duration is greater than a set threshold.
In a possible implementation manner, the adjusting unit 202 is configured to reduce the number of the plurality of parameter servers when a first difference obtained by subtracting the second duration from the first duration is greater than a first threshold; and/or to increase the number of the plurality of parameter servers when a second difference obtained by subtracting the first duration from the second duration is greater than a second threshold.
In one possible implementation, the number of the plurality of parameter servers after adjustment is positively correlated with a difference value between the first time period and the second time period, and is negatively correlated with the first time period or the second time period.
In one possible implementation, the adjusting unit 202 includes a first adjusting subunit and a second adjusting subunit, where:
The first adjustment subunit is configured to normalize a difference value between the first duration and the second duration to obtain a normalized difference value;
the second adjustment subunit is configured to adjust the number of the plurality of parameter servers according to the normalized difference value and a preset weight value, where the preset weight value is smaller than a first numerical value.
In one possible implementation, the plurality of data processing nodes perform transmission of data related to the iterative training operation between the data processing nodes and the plurality of parameter servers during performance of the iterative training operation.
In one possible implementation manner, during execution of an iterative training operation, the plurality of data processing nodes perform transmission of data related to the iterative training operation between the data processing nodes and the plurality of parameter servers, including:
during the time that the plurality of data processing nodes execute the current iterative training operation, the plurality of data processing nodes acquire data used by the next iterative training operation from the plurality of parameter servers; and/or,
and during the period that the plurality of data processing nodes execute the current iterative training operation, transmitting the data obtained by the previous iterative training operation to the plurality of parameter servers.
In one possible implementation manner, the sending the data obtained by the previous iterative training operation to the plurality of parameter servers includes:
and each data processing node divides the data obtained by the previous iterative training operation into a plurality of parts and transmits one part to each of the plurality of parameter servers.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a non-volatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the resource management method provided in any of the embodiments above.
The disclosed embodiments also provide another computer program product for storing computer readable instructions that, when executed, cause a computer to perform the operations of the resource management method provided in any of the above embodiments.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 7 illustrates a block diagram of an electronic device 800, according to an embodiment of the disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 7, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.
Fig. 8 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server. Referring to fig. 8, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, for example, Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disk (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to respective computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field-Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (9)
1. A method of resource management for a distributed computing system, the distributed computing system comprising a plurality of data processing nodes and a plurality of parameter servers, the method comprising:
determining a first duration consumed by the plurality of data processing nodes to perform at least one round of an iterative training operation of the neural network model, and a second duration consumed by the plurality of data processing nodes to perform a data transmission operation with the plurality of parameter servers;
adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration;
wherein, while the plurality of data processing nodes perform a current iterative training operation, the plurality of data processing nodes acquire, from the plurality of parameter servers, data to be used by a next iterative training operation; and/or send, to the plurality of parameter servers, data obtained by a previous iterative training operation while the current iterative training operation is performed.
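Read as an algorithm, claim 1 measures how long one training round takes (the first duration), how long the exchange with the parameter servers takes (the second duration), and overlaps the two. The Python sketch below illustrates this under stated assumptions: `node.run_iteration()` and `node.exchange(parameter_servers)` are hypothetical helpers standing in for one round of training and for the push/prefetch against the parameter servers; the patent does not prescribe this API.

```python
import threading
import time

def timed(fn, *args):
    """Run fn(*args) and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def pipelined_round(node, parameter_servers):
    """One training round with communication overlapped, as in claim 1.

    While the current iteration trains, a background thread sends the
    previous round's results to the parameter servers and prefetches the
    data used by the next round, so the round's wall-clock cost tends
    toward max(first_duration, second_duration) rather than their sum.
    """
    durations = {}

    def exchange():
        # Hypothetical helper: push previous gradients, pull next parameters.
        durations["second"] = timed(node.exchange, parameter_servers)

    comm = threading.Thread(target=exchange)
    comm.start()
    durations["first"] = timed(node.run_iteration)  # current training operation
    comm.join()
    return durations["first"], durations["second"]
```

With this overlap in place, the scaling rules of the dependent claims aim to keep the two durations close, since whichever is longer sets the effective round time.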
2. The method according to claim 1, wherein the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration comprises:
adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers in a case where a difference between the first duration and the second duration is greater than a set threshold.
3. The method according to claim 2, wherein the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers in a case where the difference between the first duration and the second duration is greater than the set threshold comprises:
reducing the number of the plurality of parameter servers when a first difference, obtained by subtracting the second duration from the first duration, is greater than a first threshold; and/or
increasing the number of the plurality of parameter servers when a second difference, obtained by subtracting the first duration from the second duration, is greater than a second threshold.
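Read as pseudocode, claim 3 is a two-sided threshold rule. A minimal sketch follows; the step size of one server per decision and the floor of one server are illustrative assumptions — the claim fixes neither.

```python
def adjust_server_count(num_servers, first_duration, second_duration,
                        first_threshold, second_threshold, min_servers=1):
    """Two-sided threshold rule of claim 3 (step size 1 is an assumption).

    Training slower than transmission by more than first_threshold:
    servers have idle capacity, so release one. Transmission slower than
    training by more than second_threshold: servers are the bottleneck,
    so add one. Otherwise leave the pool unchanged.
    """
    if first_duration - second_duration > first_threshold:
        return max(min_servers, num_servers - 1)
    if second_duration - first_duration > second_threshold:
        return num_servers + 1
    return num_servers
```

For example, `adjust_server_count(8, 2.4, 0.9, 1.0, 1.0)` returns 7: training exceeds transmission by 1.5 s, beyond the first threshold, so one server is released.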
4. The method according to any one of claims 1-3, wherein the number of parameter servers adjusted is positively correlated with the difference between the first duration and the second duration, and negatively correlated with the first duration or the second duration;
wherein the adjusting the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration comprises:
normalizing the difference between the first duration and the second duration to obtain a normalized difference;
adjusting the number of the plurality of parameter servers according to the normalized difference and a preset weight, wherein the preset weight is smaller than a first value;
and wherein the normalizing the difference between the first duration and the second duration to obtain the normalized difference comprises:
dividing the difference by the first duration or the second duration to obtain the normalized difference.
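Claim 4's normalization can be sketched as follows. Treating the weighted, normalized gap as a fraction of the current pool size is one interpretation, not the claimed formula itself; the claim only requires the adjustment to grow with the gap, shrink as the durations grow, and be damped by a preset weight below a first value (0.5 here is an illustrative choice).

```python
import math

def normalized_adjustment(num_servers, first_duration, second_duration,
                          weight=0.5, min_servers=1):
    """Normalized scaling step per claim 4.

    Dividing the gap by the first duration makes the step dimensionless:
    the same absolute gap yields a smaller step when rounds are long
    (the claimed negative correlation with the duration), and the preset
    weight damps oscillation between successive adjustments.
    """
    normalized = (first_duration - second_duration) / first_duration
    step = math.ceil(abs(normalized) * weight * num_servers)
    if normalized > 0:   # training dominates: release servers
        return max(min_servers, num_servers - step)
    if normalized < 0:   # transmission dominates: add servers
        return num_servers + step
    return num_servers
```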
5. The method of claim 1, wherein data related to the iterative training operation is transmitted between the plurality of data processing nodes and the plurality of parameter servers while the plurality of data processing nodes perform the iterative training operation.
6. The method of claim 1, wherein the sending the data obtained by the previous iterative training operation to the plurality of parameter servers comprises:
dividing, by each data processing node, the data obtained by the previous iterative training operation into a plurality of parts, and sending one of the plurality of parts to each of the plurality of parameter servers.
7. A resource management apparatus for use in a distributed computing system, the distributed computing system comprising a plurality of data processing nodes and a plurality of parameter servers, the apparatus comprising:
a determining unit, configured to determine a first duration consumed by the plurality of data processing nodes to perform at least one round of an iterative training operation of the neural network model, and a second duration consumed by the plurality of data processing nodes to perform a data transmission operation with the plurality of parameter servers;
an adjusting unit, configured to adjust the number of the plurality of data processing nodes and/or the plurality of parameter servers according to the first duration and the second duration;
wherein, while the plurality of data processing nodes perform a current iterative training operation, the plurality of data processing nodes acquire, from the plurality of parameter servers, data to be used by a next iterative training operation; and/or send, to the plurality of parameter servers, data obtained by a previous iterative training operation while the current iterative training operation is performed.
8. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 6.
9. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010388242.8A CN111562985B (en) | 2020-05-09 | 2020-05-09 | Resource management method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010388242.8A CN111562985B (en) | 2020-05-09 | 2020-05-09 | Resource management method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111562985A (en) | 2020-08-21 |
CN111562985B (en) | 2024-03-22 |
Family
ID=72072098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010388242.8A (Active, published as CN111562985B) | Resource management method and device, electronic equipment and storage medium | 2020-05-09 | 2020-05-09
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111562985B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112994981B (en) * | 2021-03-03 | 2022-05-10 | Shanghai Mininglamp Artificial Intelligence (Group) Co., Ltd. | Method and device for adjusting time delay data, electronic equipment and storage medium |
CN113094171B (en) * | 2021-03-31 | 2024-07-26 | Beijing Dajia Internet Information Technology Co., Ltd. | Data processing method, device, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020087696A1 (en) * | 2000-12-28 | 2002-07-04 | Byrnes Philippe C. | Automatic management system for communications networks
US10977552B2 (en) * | 2017-09-20 | 2021-04-13 | International Business Machines Corporation | ISA-based compression in distributed training of neural networks |
- 2020-05-09: application CN202010388242.8A filed in China (CN); published as CN111562985B; legal status: Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582433A (en) * | 2017-09-29 | 2019-04-05 | Tencent Technology (Shenzhen) Co., Ltd. | Resource scheduling method and device, cloud computing system and storage medium |
CN110764922A (en) * | 2018-07-25 | 2020-02-07 | ZTE Corporation | Data processing method, single board and computer storage medium |
CN109857518A (en) * | 2019-01-08 | 2019-06-07 | Ping An Technology (Shenzhen) Co., Ltd. | Method and device for allocating network resources |
CN109885393A (en) * | 2019-01-10 | 2019-06-14 | Huawei Technologies Co., Ltd. | Read-write request processing method and device, electronic device and storage medium |
CN109885389A (en) * | 2019-02-19 | 2019-06-14 | Shandong Inspur Cloud Information Technology Co., Ltd. | Container-based parallel deep learning scheduling and training method and system |
CN110705629A (en) * | 2019-09-27 | 2020-01-17 | Beijing SenseTime Technology Development Co., Ltd. | Data processing method and related device |
Non-Patent Citations (3)
Title |
---|
Tobias Distler. Extensible distributed coordination. EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems. 2015, full text. *
Ding Changsong; Wang Zhiying; Hu Zhigang. Market-mechanism-based adaptive resource pricing strategy in distributed environments. Journal on Communications. 2016, (02), full text. *
Wan Kao; Luo Xuefeng; Jiang Yong; Xu Ke. Flow-oriented scheduling algorithm in software-defined networking systems. Chinese Journal of Computers. 2016, (06), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN111562985A (en) | 2020-08-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112001321B (en) | Network training method, pedestrian re-identification method, device, electronic equipment and storage medium | |
US11457479B2 (en) | Method and apparatus for configuring random access occasion, method and apparatus for random access | |
CN112328398B (en) | Task processing method and device, electronic equipment and storage medium | |
CN111562985B (en) | Resource management method and device, electronic equipment and storage medium | |
CN109165738B (en) | Neural network model optimization method and device, electronic device and storage medium | |
CN108073986B (en) | Neural network model training method and device and electronic equipment | |
CN113065591B (en) | Target detection method and device, electronic equipment and storage medium | |
EP3917116B1 (en) | Detecting maximum transmission unit value | |
CN112668707B (en) | Operation method, device and related product | |
CN110633715B (en) | Image processing method, network training method and device and electronic equipment | |
CN111985635A (en) | Method, device and medium for accelerating neural network inference processing | |
CN109376013B (en) | Load balancing method and device | |
CN109447258B (en) | Neural network model optimization method and device, electronic device and storage medium | |
CN112102300B (en) | Counting method and device, electronic equipment and storage medium | |
CN110750961A (en) | File format conversion method and device, computer equipment and storage medium | |
CN109992754B (en) | Document processing method and device | |
CN110311692B (en) | User equipment, control method and storage medium | |
CN109189822B (en) | Data processing method and device | |
CN111290843A (en) | Process management method and device | |
CN114003542B (en) | Signal conditioner, signal conditioning method, electronic device, and storage medium | |
CN109491655A (en) | A kind of incoming event processing method and processing device | |
CN112732098B (en) | Input method and related device | |
CN109982127B (en) | Bullet screen speed control method and device | |
WO2019153236A1 (en) | Method, apparatus and system for establishing connection between terminal and core network to be accessed | |
CN110413525B (en) | Safety testing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||