WO2017167044A1 - Distributed cluster training method and device - Google Patents

Distributed cluster training method and device

Info

Publication number
WO2017167044A1
Authority
WO
WIPO (PCT)
Prior art keywords
weight
server
sample data
training
gradient
Prior art date
Application number
PCT/CN2017/077246
Other languages
English (en)
French (fr)
Inventor
周俊
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Priority to JP2018549518A priority Critical patent/JP6949045B2/ja
Publication of WO2017167044A1 publication Critical patent/WO2017167044A1/zh
Priority to US16/141,886 priority patent/US11636379B2/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3824 Operand accessing
    • G06F 9/3826 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F 9/3828 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/504 Resource capping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/505 Clust

Definitions

  • the present application relates to the field of machine learning technology, and in particular, to a distributed cluster training method and a distributed cluster training device.
  • Many target models based on big data, such as a model that predicts a user's preference for a product, need to use corresponding sample data to train the weights in the target model.
  • the above target models all require the use of machine learning training.
  • Machine learning training generally includes stand-alone training and cluster training.
  • Cluster training first divides the training samples among the machines according to certain rules (the data on each machine is different); each machine calculates a gradient, and then a reduce technique is used to summarize the gradients and update the weights. The process is repeated until convergence.
  • cluster training has become a standard in the industry due to the huge amount of data.
  • the prior art executes machine learning tasks in a distributed cluster.
  • Training on a cluster can use more training data and achieve better prediction results. However, because the gradients must be aggregated after every round of computation, the traffic is huge and frequent, which may saturate the network in the cluster, affecting the switches and even the use of the entire cluster.
  • embodiments of the present application have been made in order to provide a distributed cluster training method and a corresponding distributed cluster training apparatus that overcome the above problems or at least partially solve the above problems.
  • a distributed cluster training method including:
  • reading a sample set, where the sample set includes at least one piece of sample data;
  • before receiving the aggregation instruction, substituting the sample data and the current weight into the target model training function for iterative training to obtain a first gradient; the aggregation instruction is issued by the scheduling server when the cluster system environment meets a threshold condition; where, if there are multiple rounds of iterative training before the aggregation instruction is received, a first weight is generated based on the first gradient obtained in the previous round as the current weight of the following round of iterative training;
  • the first gradient is sent to the aggregation server; the aggregation server summarizes each first gradient and calculates a second weight;
  • the second weight sent by the aggregation server is received to update the current weight.
  • the application also discloses a distributed cluster training device, comprising:
  • a sample reading module for reading a sample set; the sample set includes at least one piece of sample data;
  • an iterative training module configured to, before the aggregation instruction is received, perform iterative training by substituting the sample data and the current weight into the target model training function to obtain a first gradient; the aggregation instruction is issued by the scheduling server when the cluster system environment meets a threshold condition; where, if there are multiple rounds of iterative training before the aggregation instruction is received, a first weight is generated based on the first gradient obtained in the previous round as the current weight of the following round of iterative training;
  • a result sending module configured to send the first gradient to the aggregation server if the aggregation instruction is received; the aggregation server summarizes each first gradient and calculates a second weight;
  • an update module configured to receive a second weight sent by the aggregation server to update the current weight.
  • In the embodiments of the present application, a training server may use the sample set it has read and, before receiving the aggregation instruction, continuously iterate the training of the first gradient using the sample data in the sample set and the current weight; at the same time, the scheduling server may monitor whether the cluster system environment meets a threshold condition. When the scheduling server detects that the cluster system environment meets the threshold condition, it may send the aggregation instruction to each training server, and each training server sends the first gradient obtained from training to the aggregation server; the aggregation server summarizes the first gradients and calculates a second weight, and sends the second weight to each training server to update its current weight before the training servers have finished training on their sample data.
  • In this way, because the scheduling server monitors the system environment and controls when the aggregation instruction is issued, a training server sends the first gradient to the aggregation server only after receiving the aggregation instruction.
  • the training results will not be sent to the server at the end of each round of training in the whole process, which reduces the network traffic, reduces the impact on the switch, and avoids affecting the use of the entire cluster.
  • FIG. 1 is a flow chart of steps of an embodiment of a distributed cluster training method of the present application
  • FIG. 2 is a flow chart of steps of another embodiment of a distributed cluster training method of the present application.
  • FIG. 3 is a flow chart of steps of another embodiment of a distributed cluster training method of the present application.
  • FIG. 4 is a flow chart of steps of another embodiment of a distributed cluster training method of the present application.
  • FIG. 5 is a structural block diagram of an embodiment of a distributed cluster training device of the present application.
  • FIG. 6 is a structural block diagram of an embodiment of a distributed cluster training system of the present application.
  • In the embodiments of the present application, a training server may use the sample set it has read and, before receiving the aggregation instruction, continuously iterate the training of the first gradient using the sample data in the sample set and the current weight; at the same time, the system can monitor whether the cluster system environment meets a threshold condition, and the threshold condition can prevent the network traffic of the cluster system environment from becoming saturated.
  • When the system detects that the cluster system environment meets the threshold condition, the aggregation instruction can be sent to each training server, and each training server sends the first gradient obtained from training to the aggregation server; the aggregation server summarizes the first gradients and calculates a second weight, and sends the second weight to each training server to update its current weight before the training servers have finished training on their sample data.
  • In this way, because the system monitors the system environment and controls when the aggregation instruction is issued, a training server sends the first gradient to the aggregation server only after receiving the aggregation instruction.
  • The training results are therefore not sent to the server at the end of every round of training, reducing network traffic, reducing the impact on the switch, and avoiding affecting the use of the entire cluster.
  • Referring to FIG. 1, a flow chart of the steps of an embodiment of a distributed cluster training method of the present application is shown, which may specifically include the following steps:
  • Step 110: read a sample set; the sample set includes at least one piece of sample data.
  • the entire cluster may include multiple training servers, at least one scheduling server, and at least one aggregation server.
  • A training server can obtain the sample set it is responsible for and perform iterative training on it to obtain the first gradient.
  • the scheduling server can monitor the cluster system environment of the entire system, and decide whether to issue the aggregation instruction to the training server according to the cluster system environment.
  • the aggregation server may receive the first gradient sent by each training server and calculate a second weight.
  • communication data between the training server, the scheduling server, and the aggregation server are transmitted through switches in the cluster.
  • the scheduling server of the embodiment of the present application can send the acquisition parameters of the sample set that each training server needs to obtain to each training server. Then, for a training server, after receiving the acquisition parameter, it can read the required sample set from the specified location according to the acquisition parameter. For example, a batch of transaction log data specified by the parameter is obtained from the transaction log server as a sample set.
  • the embodiment of the present application may also obtain a corresponding sample set from other servers, which may be set according to requirements, and is not limited by the embodiment of the present application.
  • Step 120 Before receiving the aggregation instruction, using the sample data and the current weight, substituting the target model training function for iterative training to obtain a first gradient; and the aggregation instruction is issued by the scheduling server when the cluster system environment meets the threshold condition; Wherein, if there are multiple rounds of iterative training before receiving the aggregation instruction, the first weight is generated based on the first gradient obtained by the previous training as the current weight of the subsequent round of iterative training;
  • For a training server A, after it reads the sample set, in the initial state each current weight of the target model is a second weight X0 preset according to experience.
  • the sample data can be extracted one by one from the sample set, and the target model is input for training, and the first gradient belonging to the training server A is trained.
  • the training server A can read the sample data for iterative training until the aggregation instruction is received.
  • each training server can read all the samples of its training to the local and then train.
  • For example, the first round uses the sample data M1 and the current weight X0, substitutes them into the target model training function, and trains the first gradient ∇F(X0); ∇F(X0) is then used to calculate the weight X1, which serves as the current weight of the second round of training.
  • Then the sample data M2 and the current weight X1 are substituted into the target model training function to train the first gradient ∇F(X1); and so on, until the aggregation instruction is received.
  • the target model training function may be the aforementioned loss function F(X)
  • The loss function F(X) can be set according to the actual situation; the prior art describes this process in detail, so it is not repeated here. The second round is similar. Suppose the training server trains up to the third round and obtains the first gradient ∇F(X2); at this point, on receiving the aggregation instruction sent by the scheduling server, it can send the first gradient ∇F(X2) directly to the aggregation server through the switch.
  • After the last aggregation, the training server records the training round of the first gradient.
  • When the scheduling server sends the aggregation instruction, it controls which round's first gradient the training servers send.
  • the scheduling server may control each training server to perform N rounds of training before sending the aggregation instruction, and N is an integer greater than zero.
  • the training server is notified to perform only three rounds of training before receiving the aggregation instruction, and if the three rounds of training are finished, it waits for the instruction of the scheduling server.
  • N can be limited, and the value of N can be set according to the training precision error of the actual demand.
  • the training accuracy error of the actual demand can be set based on the experience of historical training results.
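  • To make the per-round flow concrete, the following is a minimal Python sketch of the training-server side; the gradient function grad_f for the loss F(X), the learning rate alpha, and the aggregation_instruction_received() poll are illustrative assumptions, not names from the patent, and the toy usage at the end is for demonstration only.

```python
import numpy as np

def local_training(sample_batches, x_current, grad_f, alpha=0.1, max_rounds=3,
                   aggregation_instruction_received=lambda: False):
    """Iterate locally until the aggregation instruction arrives or the N-round cap is hit.

    sample_batches: per-round sample data (M1, M2, ...).
    x_current:      current weight vector (X0 in the first round).
    grad_f:         callable grad_f(samples, x) returning the first gradient of F at x.
    alpha:          learning rate used to derive the first weight for the next round.
    max_rounds:     the N-round limit the scheduling server may impose.
    Returns the latest first gradient and the locally updated weight.
    """
    first_gradient = None
    for round_idx, samples in enumerate(sample_batches):
        if round_idx >= max_rounds or aggregation_instruction_received():
            break
        first_gradient = grad_f(samples, x_current)      # e.g. grad F(X0) in round 1
        x_current = x_current - alpha * first_gradient   # first weight X1, X2, ...
    return first_gradient, x_current

# Toy usage: linear model with squared loss; samples are (features, label) pairs.
def grad_f(samples, x):
    return sum(2.0 * (np.dot(f, x) - y) * f for f, y in samples) / len(samples)

batches = [[(np.array([1.0, 2.0]), 1.0)], [(np.array([0.5, 1.5]), 0.0)]]
gradient, weight = local_training(batches, np.array([0.0, 0.0]), grad_f)
```

  • On receipt of the aggregation instruction, the training server would send the returned first gradient to the aggregation server.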
  • The aggregation instruction sent by the scheduling server to each training server includes a specified round; each training server then sends the first gradient obtained in the corresponding round of training to the aggregation server.
  • While the training servers perform iterative training, the scheduling server monitors the cluster system environment.
  • When the cluster system environment meets the threshold condition, the scheduling server sends an aggregation instruction to each training server.
  • The threshold condition can keep the training servers from sending too frequently and causing network congestion, for example a threshold condition such as network utilization below 30%.
  • the aggregation instruction is issued by the scheduling server when the cluster system environment meets the threshold condition, including:
  • the aggregation instruction is issued by the scheduling server when the cluster network utilization of the entire cluster meets the first threshold condition.
  • The scheduling server can monitor the cluster network utilization of the entire cluster, for example by obtaining the packet sending and receiving volume of each server's network card. A network card itself has a maximum traffic limit, such as 100M, so the sending and receiving volumes of all the network cards can be counted and divided by the total traffic limit of all NICs to obtain the cluster network utilization.
  • In this case, the first threshold condition includes: the cluster network utilization is lower than the first threshold. For example, if the first threshold is set to 30%, then when the cluster network utilization monitored by the scheduling server is lower than 30%, the aggregation instruction can be sent to each training server.
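  • As an illustration only (the 100 Mb per-NIC capacity and the traffic-sample layout are assumptions, not values fixed by the patent), the utilization check might look like this minimal sketch:

```python
def cluster_network_utilization(nic_stats, capacity_per_nic=100e6):
    """nic_stats: list of (bytes_sent, bytes_received) per network card over the window."""
    total_traffic = sum(sent + received for sent, received in nic_stats)
    return total_traffic / (capacity_per_nic * len(nic_stats))

def first_threshold_met(nic_stats, first_threshold=0.30):
    # The aggregation instruction may be issued only when utilization is below the threshold.
    return cluster_network_utilization(nic_stats) < first_threshold
```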
  • the aggregation instruction is issued by the scheduling server when the cluster system environment meets the threshold condition, including:
  • the aggregation instruction is issued by the scheduling server when the cluster failure rate of the entire cluster meets the second threshold condition.
  • each server in the entire cluster may be faulty.
  • The fault of each server may be monitored, and the number of faulty servers divided by the number of servers in the entire cluster to obtain a cluster failure rate.
  • Alternatively, only the number of failed training servers (a first number) may be monitored; this first number may be divided by the number of servers in the entire cluster, or by the number of all training servers, to obtain the cluster failure rate.
  • The second threshold condition includes: the cluster failure rate is lower than the second threshold. For example, if the second threshold is set to 5%, then when the cluster failure rate is lower than 5%, the scheduling server can issue an aggregation instruction to each training server.
  • The above-mentioned server failures include the server crashing and not responding, and the server's response delay exceeding a certain time.
  • the scheduling server may periodically send a test command to each server, and if the server does not respond within the specified time, the server may be considered to be faulty.
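  • A minimal sketch of this failure check follows; treating a server as failed when it has not answered the periodic test command within a timeout is the idea stated above, while the 30-second timeout and the data layout are assumptions for illustration.

```python
import time

def cluster_failure_rate(last_response_times, timeout_seconds=30.0, now=None):
    """last_response_times: mapping server_id -> timestamp of its last reply to the
    periodic test command.  A server counts as failed if it has not replied within
    timeout_seconds (crashed, or response delay too long)."""
    now = time.time() if now is None else now
    failed = sum(1 for t in last_response_times.values() if now - t > timeout_seconds)
    return failed / len(last_response_times)

def second_threshold_met(last_response_times, second_threshold=0.05):
    # Aggregation may proceed only when the failure rate is below the second threshold.
    return cluster_failure_rate(last_response_times) < second_threshold
```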
  • Of course, before the scheduling server issues the aggregation instruction, it may also monitor the training situation of each training server; for example, only after it detects that each training server has completed at least one round of training since the last aggregation instruction was sent will it issue the aggregation instruction, provided the foregoing threshold conditions are met.
  • Step 130 If the aggregation instruction is received, send the first gradient to the aggregation server; in step 140, the aggregation server summarizes each first gradient and calculates a second weight;
  • the first gradient of the latest update may be sent to the aggregation server.
  • Since the aggregation instruction contains the training round, each training server sends the first gradient of the same round to the aggregation server.
  • each training server may send its first gradient to the corresponding aggregation server according to a preset correspondence with the aggregation server.
  • Each aggregation server summarizes the portion of the first gradients it receives; the aggregation servers then further summarize the summarized first gradients into one aggregation server, which performs the final summarization and calculates the second weight based on the finally summarized first gradients.
  • For the aggregation server, after receiving the first gradients from all training servers, it can summarize the first gradients and then calculate the second weight according to the summarized result.
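  • A minimal sketch of this summarization step is given below; averaging the gradients and applying an SGD-style update X_t = X_{t-1} - alpha * g are assumptions consistent with the update rule quoted in the background section, since this embodiment does not fix the exact formula.

```python
import numpy as np

def aggregate_and_update(first_gradients, current_weight, alpha=0.1):
    """Summarize the first gradients reported by the training servers and derive
    the second weight to be sent back to the training servers."""
    g = np.mean(np.stack(first_gradients), axis=0)   # summarize (here: average)
    return current_weight - alpha * g                # assumed SGD-style second weight
```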
  • the aggregation server can judge whether each training server is trained, and if not trained, send the second weight to each training server.
  • It can be understood that, in practice, when sending its first gradient, each training server may also send a first identifier indicating whether training on all the sample data of its sample set has been completed; for example, a first identifier of 'no' indicates training is not finished, and 'yes' indicates training is finished.
  • the aggregation server can then determine, based on the identifier, whether the training server has trained all sample data of the sample set.
  • the aggregation server may determine, by other means, whether each training server has trained all the sample data of the sample set, which is not limited in the embodiment of the present application.
  • Step 150 Receive a second weight sent by the aggregation server to update the current weight.
  • the training server can receive the second weight sent by the aggregation server before the training of the sample data ends. Then the training server can update the current weight by the second weight, and then read the subsequent sample data for the next round of training. Of course, if the sample data has been read locally, the next round of sample data can be read locally for the next round of training.
  • In the embodiments of the present application, a training server may use the sample set it has read and, before receiving the aggregation instruction, continuously iterate the training of the first gradient using the sample data in the sample set and the current weight; at the same time, the system can monitor whether the cluster system environment meets a threshold condition, and the threshold condition can prevent the network traffic of the cluster system environment from becoming saturated.
  • When the system detects that the cluster system environment meets the threshold condition, the aggregation instruction can be sent to each training server, and each training server sends the first gradient obtained from training to the aggregation server; the aggregation server summarizes the first gradients and calculates the second weight, and sends the second weight to each training server to update its current weight before the training servers have finished training on their sample data.
  • In this way, because the system monitors the system environment and controls when the aggregation instruction is issued, a training server sends the first gradient to the aggregation server only after receiving the aggregation instruction.
  • The training results are therefore not sent to the server at the end of every round of training, which reduces network traffic, reduces the impact on the switch, and avoids affecting the use of the entire cluster.
  • Referring to FIG. 2, a flow chart of the steps of another embodiment of the distributed cluster training method of the present application is shown, which may specifically include the following steps:
  • Step 210: read a sample set; the sample set includes at least one piece of sample data; the sample data includes time information.
  • In addition to conventional data such as the user ID, user transaction behavior, favoriting behavior data and browsing behavior data, an extra column of data is added to the sample data; this column records the time at which the piece of sample data was produced, for example the transaction data of the most recent day or the transaction data of the last two days.
  • Step 220 Calculate a third weight of the sample data by using time information of each piece of sample data
  • the present application can calculate the third weight of the sample data by using the time information of each piece of sample data. The third weight indicates that the closer the time information of the sample data is to the current time, the higher the weight, and conversely, the lower the weight.
  • the step of calculating the third weight of the sample data by using time information of each piece of sample data includes:
  • Sub-step 221 the time information of each piece of sample data is substituted into the negative exponential parameter of the exponential function, and the third weight is calculated.
  • the time information can be converted into digital information from the current time.
  • For example, a time information value of 1 for sample data N1 indicates that sample data N1 is 1 day from the current time, and a value of 3 for sample data N2 indicates that N2 is 3 days from the current time.
  • the time information is converted into digital information, and other methods may be used, which are not limited in the embodiment of the present application.
  • the base of the exponential function may be set to a natural number e, or may be set to other numbers greater than 1.
  • the natural number e is employed.
  • The application can calculate the third weight as e^(-x), where x is the time information; for example, for N1 the third weight is e^(-1), and so on.
  • The base of the exponential function can also be another base, such as 2, in which case the exponential function becomes 2^(-x).
  • Step 230 When the third weight is less than the third threshold, discard the corresponding sample data.
  • For example, if the third threshold is set to 0.001, then when the third weight is less than this threshold, the sample data is too far from the current time and has little influence on the user's interests and intentions, and can be discarded. This reduces the amount of computation and saves system resources.
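  • A minimal sketch of this filtering step follows; the "days_ago" field name is an illustrative assumption, and the first coefficient (the sum of the retained third weights, used in step 250 below) is also computed here for convenience.

```python
import math

def third_weight(days_ago):
    # Third weight e^(-x), where x is the sample's age converted to days.
    return math.exp(-days_ago)

def filter_samples(samples, third_threshold=0.001):
    """Drop samples whose third weight is below the threshold and return both the
    retained samples and the first coefficient (sum of the retained third weights)."""
    kept, first_coefficient = [], 0.0
    for sample in samples:                      # sample: {"days_ago": ..., ...}
        w = third_weight(sample["days_ago"])
        if w < third_threshold:
            continue                            # too old to reflect current interests
        kept.append(dict(sample, third_weight=w))
        first_coefficient += w
    return kept, first_coefficient
```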
  • Step 240 before receiving the aggregation instruction, using the sample data and the current weight, substituting the target model training function for iterative training to obtain a first gradient; the aggregation instruction is issued by the scheduling server when the cluster system environment meets the threshold condition; Wherein, if there are multiple rounds of iterative training before receiving the aggregation instruction, the first weight is generated based on the first gradient obtained by the previous training as the current weight of the subsequent round of iterative training;
  • Step 250 if the aggregation instruction is received, sending the first gradient to the aggregation server, and transmitting the first coefficient obtained by summing the third weight of each sample data to the aggregation server;
  • the training server may calculate a third weight of each piece of sample data, and then may summarize the third weights of each retained sample data to obtain a first coefficient.
  • Step 260 The aggregation server performs weighting calculation according to each first gradient and a first coefficient corresponding to each first gradient to obtain a second gradient.
  • Step 270 the aggregation server calculates a second weight according to the second gradient.
  • For example, training server A sends a first gradient ∇F(X1)A with a first coefficient of 0.8; training server B sends a first gradient ∇F(X1)B with a first coefficient of 0.7; and training server C sends a first gradient ∇F(X1)C with a first coefficient of 0.5. Then the second gradient is (0.8∇F(X1)A + 0.7∇F(X1)B + 0.5∇F(X1)C).
  • The second weight is then calculated from the second gradient.
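  • The weighted combination above can be written out directly; the gradient values in this sketch are made-up placeholders, while the coefficients 0.8, 0.7 and 0.5 are taken from the example in the text.

```python
import numpy as np

# Placeholder first gradients for training servers A, B and C (values are illustrative).
grad_a = np.array([0.20, -0.10])
grad_b = np.array([0.30, 0.00])
grad_c = np.array([0.10, 0.40])

# First coefficients reported with each gradient, as in the example above.
coeff_a, coeff_b, coeff_c = 0.8, 0.7, 0.5

# Second gradient: 0.8*gradF(X1)_A + 0.7*gradF(X1)_B + 0.5*gradF(X1)_C
second_gradient = coeff_a * grad_a + coeff_b * grad_b + coeff_c * grad_c
```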
  • the second weight can then be sent to each of the untrained training servers as described in the first embodiment.
  • Step 280 Receive a second weight sent by the aggregation server to update the current weight.
  • In the embodiments of the present application, a training server may use the sample set it has read and, before receiving the aggregation instruction, continuously iterate the training of the first gradient using the sample data in the sample set and the current weight; at the same time, the system can monitor whether the cluster system environment meets a threshold condition, and the threshold condition can prevent the network traffic of the cluster system environment from becoming saturated.
  • When the system detects that the cluster system environment meets the threshold condition, the aggregation instruction can be sent to each training server, and each training server sends the first gradient obtained from training to the aggregation server; the aggregation server summarizes the first gradients and calculates the second weight, and sends the second weight to each training server to update its current weight before the training servers have finished training on their sample data.
  • In this way, because the system monitors the system environment and controls when the aggregation instruction is issued, a training server sends the first gradient to the aggregation server only after receiving the aggregation instruction.
  • The training results are therefore not sent to the server at the end of every round of training, which reduces network traffic, reduces the impact on the switch, and avoids affecting the use of the entire cluster.
  • In addition, the embodiments of the present application can automatically increase the weight of new data, reduce the weight of old data, and discard some of the old data, so that the target model better fits the user's current behavior, and the amount of computation can be reduced.
  • Referring to FIG. 3, a flow chart of the steps of another embodiment of the distributed cluster training method of the present application is shown, which may specifically include the following steps:
  • Step 310: read a sample set; the sample set includes at least one piece of sample data; the sample data includes time information.
  • Step 312 merging each sample data in the sample set
  • Step 314 recording the merged quantity of the sample data on the merged sample data.
  • Sample data with the same content may be merged within the same time period. For example, user A bought item A at 10:00 am on 2015-12-12 and bought item A again at 3:00 pm on 2015-12-31; the two pieces of sample data can then be merged into a record that user A purchased item A as of 2015-12-31, with a merge count of 2.
  • Step 316 using the time information of each sample data, calculating a weight reduction coefficient
  • The time information of each piece of sample data may be used to calculate the weight reduction coefficient; the closer the sample data is to the current time, the higher the weight reduction coefficient, and conversely, the lower it is.
  • the step of calculating the weight reduction coefficient by using the time information of each piece of sample data includes:
  • Sub-step C11 the time information of each sample data is substituted into the negative exponential parameter of the exponential function, and the weight reduction coefficient is calculated.
  • the time information can be converted into digital information from the current time.
  • For example, a time information value of 1 for sample data N1 indicates that sample data N1 is 1 day from the current time, and a value of 3 for sample data N2 indicates that N2 is 3 days from the current time.
  • the time information is converted into digital information, and other methods may be used, which are not limited in the embodiment of the present application.
  • The application can calculate the weight reduction coefficient as e^(-x), where x is the time information; for example, for N1 the weight reduction coefficient is e^(-1), and so on.
  • The base of the exponential function can also be another base, such as 2, in which case the exponential function becomes 2^(-x).
  • Step 318 calculating a product of the weight reduction coefficient and the number of merges to obtain a third weight.
  • In this case, the sample data in the sample set is the merged sample data, and the merge count of the sample data may be multiplied by the weight reduction coefficient to obtain the third weight.
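  • A minimal sketch of the merge-and-weight step follows; the (user, item, days_ago) record layout is an illustrative assumption, and merging on identical (user, item, days_ago) keys stands in for whatever "same content in the same time period" rule is actually used.

```python
import math
from collections import defaultdict

def merge_and_weight(records, third_threshold=0.001):
    """Merge records with identical content in the same time period, count the merges,
    and weight each merged record by weight-reduction coefficient * merge count."""
    merged = defaultdict(int)
    for user, item, days_ago in records:
        merged[(user, item, days_ago)] += 1          # same content, same period
    weighted = []
    for (user, item, days_ago), count in merged.items():
        third_w = math.exp(-days_ago) * count        # e^(-x) * merge count
        if third_w >= third_threshold:               # otherwise discard the old record
            weighted.append({"user": user, "item": item,
                             "merge_count": count, "third_weight": third_w})
    return weighted
```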
  • steps 316-318 can be the preferred steps of step 220 in the second embodiment.
  • Step 320 When the third weight is less than the third threshold, discard the corresponding sample data.
  • Step 322 Before receiving the aggregation instruction, using the sample data and the current weight, substituting the target model training function for iterative training to obtain a first gradient; and the aggregation instruction is issued by the scheduling server when the cluster system environment meets the threshold condition; Wherein, if there are multiple rounds of iterative training before receiving the aggregation instruction, the first weight is generated based on the first gradient obtained by the previous training as the current weight of the subsequent round of iterative training;
  • Step 324 If the aggregation instruction is received, send the first gradient to the aggregation server, and send the first coefficient obtained by summing the third weight of each sample data to the aggregation server;
  • Step 326 the aggregation server performs weighting calculation according to each first gradient and a first coefficient corresponding to each first gradient to obtain a second gradient;
  • Step 328 Receive a second weight sent by the aggregation server to update the current weight.
  • the scheduling server since the scheduling server monitors the system environment, it controls when the aggregation instruction is issued. Accordingly, the training server sends the first gradient to the aggregation server after receiving the aggregation instruction. The training results will not be sent to the server at the end of each round of training in the whole process, which reduces the network traffic, reduces the impact on the switch, and avoids affecting the use of the entire cluster.
  • the embodiment of the present application merges the sample data, reduces the number of samples trained, and can improve the training speed.
  • In addition, the embodiments of the present application can, according to the timeliness of the data, automatically increase the weight of new data, reduce the weight of old data, and discard some of the old data, so that the target model better fits the user's current behavior, and the amount of computation can be reduced.
  • Referring to FIG. 4, a flow chart of the steps of another embodiment of the distributed cluster training method of the present application is shown, which may specifically include the following steps:
  • Step 410 The training server reads the sample set; the sample set includes at least one piece of sample data; and the sample data includes time information;
  • Step 412 The training server merges the sample data in the sample set
  • Step 414 The training server records the merged quantity of the sample data on the merged sample data.
  • Step 416 The training server calculates the weight reduction coefficient by using the time information of each piece of sample data.
  • Step 418 the training server calculates a product of the weight reduction coefficient and the number of merges to obtain a third weight.
  • steps 416-418 may be the preferred steps of step 220 of embodiment two.
  • Step 420 The training server discards the corresponding sample data when the third weight is less than the third threshold.
  • Step 422: before receiving the aggregation instruction, the training server substitutes the sample data and the current weight into the target model training function for iterative training to obtain a first gradient; where, if there are multiple rounds of iterative training before the aggregation instruction is received, a first weight is generated based on the first gradient obtained in the previous round as the current weight of the following round of iterative training.
  • Step 424: the scheduling server issues an aggregation instruction when the cluster system environment meets the threshold condition.
  • The scheduling server sends the aggregation instruction to the various training servers.
  • Step 426 the training server sends the first gradient to the aggregation server if the aggregation instruction is received, and sends the first coefficient obtained by summing the third weights of the respective sample data to the aggregation server;
  • Step 428 the aggregation server performs weighting calculation according to each first gradient and the first coefficient corresponding to each first gradient to obtain a second gradient;
  • Step 430 The aggregation server calculates a second weight according to the second gradient.
  • step 432 the aggregation server backs up the newly obtained second weight and sends the new second weight to each training server.
  • the second weight may be backed up.
  • The aggregation server backing up the newly obtained second weight includes:
  • Step D11 the aggregation server determines whether the amount of change between the newly obtained second weight and the second weight of the previous backup exceeds a change threshold
  • step D12 if the change threshold is exceeded, the newly obtained second weight is backed up.
  • The aggregation server obtains the new second weight and computes the amount of change against at least one previously backed-up second weight. For example, if the amount of change between the new second weight and the last backed-up second weight is less than the change threshold, such as 5%, the new second weight is not backed up; if it is greater than or equal to the threshold, the second weight is backed up. This reduces the amount of backups.
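  • A minimal sketch of this check is given below; the 5% threshold matches the example above, while measuring the "amount of change" as a relative L2 norm is an assumption, since the patent does not fix the distance measure.

```python
import numpy as np

def maybe_backup(new_second_weight, last_backup, change_threshold=0.05):
    """Back up the new second weight only when its change relative to the previous
    backup reaches the change threshold; otherwise keep the old backup."""
    denom = max(np.linalg.norm(last_backup), 1e-12)
    change = np.linalg.norm(new_second_weight - last_backup) / denom
    return new_second_weight if change >= change_threshold else last_backup
```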
  • In that case the target model on the external business server also need not be updated, which avoids unnecessary operations on the business server's target model, such as testing.
  • The scheduling server can notify the aggregation server to send the latest backed-up second weight to the training server, so that the training server can use the latest second weight as its initial current weight and continue training from the previous sample, improving training efficiency.
  • Of course, training may also be restarted from the first sample, with the current weight set to the latest backed-up second weight.
  • the aggregation server sends the latest second weight to each training server.
  • step 434 the training server receives the second weight sent by the aggregation server to update the current weight.
  • the method further includes:
  • the aggregation server substitutes the second weight into the target model and outputs it to the service server.
  • The backed-up second weight may be directly substituted into the target model and output to the service server, so that the business side can use the target model directly.
  • Lazy communication mechanism: according to the cluster environment and the iteration situation, it is automatically determined whether all machines need to perform a weight aggregation operation, avoiding aggregation after every round of training, which could saturate the network.
  • Weight backup mechanism: the weights are automatically backed up according to rules. If some machines have problems, the previous weights can be pulled back from the backup and training can continue, so training does not have to start over, improving training efficiency.
  • Data segmentation mechanism: according to the timeliness of the data, the weight of new data is automatically increased, the weight of old data is reduced, and some of the old data is discarded.
  • FIG. 5 a structural block diagram of an embodiment of a distributed cluster training apparatus of the present application is shown, which may specifically include the following modules:
  • a sample reading module 510, configured to read a sample set; the sample set includes at least one piece of sample data;
  • an iterative training module 520, configured to, before the aggregation instruction is received, perform iterative training by substituting the sample data and the current weight into the target model training function to obtain a first gradient; the aggregation instruction is issued by the scheduling server when the cluster system environment meets a threshold condition; where, if there are multiple rounds of iterative training before the aggregation instruction is received, a first weight is generated based on the first gradient obtained in the previous round as the current weight of the following round of iterative training;
  • a result sending module 530 configured to send the first gradient to the aggregation server if the aggregation instruction is received; the aggregation server summarizes each first gradient and calculates a second weight;
  • the update module 540 is configured to receive a second weight sent by the aggregation server to update the current weight.
  • the aggregation instruction is issued by the scheduling server when the cluster system environment meets the threshold condition, including:
  • the aggregation instruction is issued by the scheduling server when the cluster network utilization of the entire cluster meets the first threshold condition, and/or is issued by the scheduling server when the cluster failure rate of the entire cluster meets the second threshold condition.
  • the first threshold condition includes: the cluster network utilization is lower than the first threshold;
  • the second threshold condition includes: the cluster failure rate is lower than the second threshold.
  • the method further includes:
  • a third weight calculation module configured to calculate a third weight of the sample data by using time information of each piece of sample data
  • a sample discarding module configured to discard corresponding sample data when the third weight is less than a third threshold.
  • the third weight calculation module includes:
  • the index calculation module is configured to substitute the time information of each sample data into a negative index parameter of the exponential function to calculate a third weight.
  • Before the third weight calculation module, the device further includes:
  • a merging module for merging sample data in a sample set
  • the merge record module is configured to record the merged quantity of the sample data for the merged sample data.
  • the third weight calculation module includes:
  • a weight reduction coefficient calculation module configured to calculate a weight reduction coefficient by using time information of each sample data
  • the first calculation module is configured to calculate a product of the weight reduction coefficient and the number of merges to obtain a third weight.
  • the result sending module is further configured to, if receiving the aggregation instruction, send the first coefficient obtained by summing the third weights of the respective sample data to the aggregation server;
  • the aggregation server includes: a first weighting summary module, configured to perform a weighting calculation to obtain a second gradient according to each first gradient and a first coefficient corresponding to each first gradient;
  • a second weight calculation module configured to calculate a second weight according to the second gradient.
  • the aggregation server further includes:
  • a backup module for backing up the newly obtained second weight.
  • the backup module includes:
  • a change calculation module configured to determine, by the aggregation server, whether a change amount between the newly obtained second weight and the second weight of the previous backup exceeds a change threshold
  • the first backup module is configured to back up the newly obtained second weight if the change threshold is exceeded.
  • the method further includes:
  • an output module configured to substitute the second weight into the target model and output to the service server.
  • Lazy communication mechanism: according to the cluster environment and the iteration situation, it is automatically determined whether all machines need to perform a weight aggregation operation, avoiding aggregation after every round of training, which could saturate the network.
  • Weight backup mechanism: the weights are automatically backed up according to rules. If some machines have problems, the previous weights can be pulled back from the backup and training can continue, so training does not have to start over, improving training efficiency.
  • Data segmentation mechanism: according to the timeliness of the data, the weight of new data is automatically increased, the weight of old data is reduced, and some of the old data is discarded.
  • the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.
  • Referring to FIG. 6, a structural block diagram of an embodiment of a distributed cluster training system of the present application is shown, which may specifically include:
  • the scheduling server 610, the aggregation server 620, and the plurality of training servers 630 are included.
  • the scheduling server 610 includes:
  • The cluster monitoring module 611 is configured to monitor whether the cluster system environment meets a threshold condition, and if so, send an aggregation instruction to each training server 630.
  • Specifically, the cluster monitoring module 611 is configured to issue the aggregation instruction when the cluster network utilization of the entire cluster meets the first threshold condition, and/or to issue the aggregation instruction when the cluster failure rate of the entire cluster meets the second threshold condition.
  • the first threshold condition includes: the cluster network utilization is lower than the first threshold;
  • the second threshold condition includes: the cluster failure rate is lower than the second threshold.
  • the training server 630 includes:
  • a sample reading module 631, configured to read a sample set; the sample set includes at least one piece of sample data;
  • an iterative training module 632, configured to, before the aggregation instruction is received, perform iterative training by substituting the sample data and the current weight into the target model training function to obtain a first gradient; where, if there are multiple rounds of iterative training before the aggregation instruction is received, a first weight is generated based on the first gradient obtained in the previous round as the current weight of the following round of iterative training;
  • a result sending module 633 configured to send the first gradient to the aggregation server if the aggregation instruction is received
  • An update module 634 configured to receive a second weight to update the current weight
  • the method further includes:
  • a third weight calculation module configured to calculate a third weight of the sample data by using time information of each piece of sample data
  • a sample discarding module configured to discard corresponding sample data when the third weight is less than a third threshold.
  • the third weight calculation module includes:
  • the index calculation module is configured to substitute the time information of each sample data into a negative index parameter of the exponential function to calculate a third weight.
  • Before the third weight calculation module, the training server further includes:
  • a merging module for merging sample data in a sample set
  • the merge record module is configured to record the merged quantity of the sample data for the merged sample data.
  • the third weight calculation module includes:
  • a weight reduction coefficient calculation module configured to calculate a weight reduction coefficient by using time information of each sample data
  • the first calculation module is configured to calculate a product of the weight reduction coefficient and the number of merges to obtain a third weight.
  • the result sending module 633 is further configured to: if the aggregation instruction is received, send the first coefficient obtained by summing the third weights of the respective sample data to the aggregation server.
  • the aggregation server 620 includes:
  • a collection calculation module 621 configured to summarize each first gradient and calculate a second weight
  • the second weight sending module 622 is configured to send the latest second weight to each training server.
  • the aggregation server includes:
  • a first weighting and summarizing module configured to perform a weighting calculation according to each first gradient and a first coefficient corresponding to each first gradient to obtain a second gradient
  • a second weight calculation module configured to calculate a second weight according to the second gradient.
  • the aggregation server further includes:
  • a backup module for backing up the newly obtained second weight.
  • the backup module includes:
  • a change calculation module configured to determine, by the aggregation server, whether a change amount between the newly obtained second weight and the second weight of the previous backup exceeds a change threshold
  • the first backup module is configured to back up the newly obtained second weight if the change threshold is exceeded.
  • the method further includes:
  • an output module configured to substitute the second weight into the target model and output to the service server.
  • embodiments of the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both persistent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory. (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, Magnetic tape cartridges, magnetic tape storage or other magnetic storage devices or any other non-transportable media can be used to store information that can be accessed by a computing device.
  • As defined herein, computer readable media does not include transitory computer readable media, such as modulated data signals and carrier waves.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present application provide a distributed cluster training method and device, relating to the field of machine learning technology. The method includes: reading a sample set, the sample set including at least one piece of sample data; before an aggregation instruction is received, substituting the sample data and the current weight into a target model training function for iterative training to obtain a first gradient, and, if there are multiple rounds of iterative training, generating a first weight based on the first gradient obtained in the previous round as the current weight of the following round of iterative training; if the aggregation instruction is received, sending the first gradient to an aggregation server, the aggregation instruction being issued by a scheduling server when the cluster system environment meets a threshold condition; the aggregation server summarizing the first gradients and calculating a second weight; and receiving the second weight sent by the aggregation server to update the current weight. The present application reduces network traffic, reduces the impact on the switches, and avoids affecting the use of the entire cluster.

Description

Distributed cluster training method and device
This application claims priority to Chinese Patent Application No. 201610180393.8, entitled "Distributed cluster training method and device" and filed on March 26, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of machine learning technology, and in particular to a distributed cluster training method and a distributed cluster training device.
Background
With the application of big data, many target models based on big data, such as a target model that predicts a user's preference for a product, need corresponding sample data to train the weights in the target model. The weights can be understood as the parameters of the target model; for example, in a simple model y = a*x1 + b*x2 + c*x3, a, b and c are the weights, x1, x2 and x3 are the inputs, and y is the output. All of the above target models need to be trained by machine learning.
Machine learning training generally includes stand-alone training and cluster training. Stand-alone training uses all samples to compute the gradient of F(X) (F is the loss function, X is the weight), ∇F(X_{t-1}), and then updates the weight: X_t = X_{t-1} − α∇F(X_{t-1}); this iteration is repeated until convergence. Cluster training first divides the training samples among the machines according to certain rules (the data on each machine is different); each machine computes a gradient, and then a reduce technique is used to summarize the gradients and update the weights. The process is repeated until convergence. In fact, because of today's huge data volumes, cluster training has become standard in industry.
When training on a single machine with a very large amount of sample data, the data may be too large to load into memory, making training impossible. Training on a single machine has no communication (network) cost, but it cannot support big data (for example, the browsing log data of all users over the last two weeks).
Because of these problems of stand-alone training, the prior art executes machine learning tasks in a distributed cluster. The existing cluster training scheme is: (1) split the data set T into N parts according to certain rules, obtaining T = {T1, T2, ..., Tn}; (2) each training server obtains one part of the data, denoted Tx; (3) each training server uses its data to compute the corresponding gradient ∇F_{Tx}; (4) the gradients are summarized to obtain the total gradient: total gradient = Σ_{i=1}^{n} ∇F_{Ti}; (5) the weights are updated according to the rules (similar to the weight update method of stand-alone training), and the new weights are sent to all machines; (6) it is determined whether training is finished; if not, return to step (3).
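As an illustration of this prior-art flow only (not the claimed method), a minimal Python sketch is given below; the gradient function grad_f, the learning rate alpha, and the fixed round count are assumptions standing in for the unspecified update rule and convergence test.

```python
import numpy as np

def prior_art_cluster_training(parts, x, grad_f, alpha=0.1, num_rounds=10):
    """parts: the data set T split into {T1, ..., Tn}, one part per training server.
    grad_f(part, x) stands for the gradient each server computes on its own part."""
    for _ in range(num_rounds):                       # (6) repeat until training ends
        grads = [grad_f(part, x) for part in parts]   # (3) per-server gradients
        total_gradient = np.sum(grads, axis=0)        # (4) reduce: summarize gradients
        x = x - alpha * total_gradient                # (5) update and broadcast weights
    return x
```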
Training on a cluster can use more training data and achieve better prediction results. However, because the gradients must be summarized after every round of gradient computation, the traffic is huge and frequent, which may cause the network traffic in the cluster to become saturated, affecting the switches and even the use of the entire cluster.
Summary
In view of the above problems, embodiments of the present application are proposed in order to provide a distributed cluster training method and a corresponding distributed cluster training device that overcome the above problems or at least partially solve them.
In order to solve the above problems, the present application discloses a distributed cluster training method, including:
reading a sample set, the sample set including at least one piece of sample data;
before an aggregation instruction is received, substituting the sample data and the current weight into a target model training function for iterative training to obtain a first gradient, the aggregation instruction being issued by a scheduling server when the cluster system environment meets a threshold condition, wherein, if there are multiple rounds of iterative training before the aggregation instruction is received, a first weight is generated based on the first gradient obtained in the previous round as the current weight of the following round of iterative training;
if the aggregation instruction is received, sending the first gradient to an aggregation server, the aggregation server summarizing the first gradients and calculating a second weight; and
receiving the second weight sent by the aggregation server to update the current weight.
The present application also discloses a distributed cluster training device, including:
a sample reading module, configured to read a sample set, the sample set including at least one piece of sample data;
an iterative training module, configured to, before an aggregation instruction is received, substitute the sample data and the current weight into a target model training function for iterative training to obtain a first gradient, the aggregation instruction being issued by a scheduling server when the cluster system environment meets a threshold condition, wherein, if there are multiple rounds of iterative training before the aggregation instruction is received, a first weight is generated based on the first gradient obtained in the previous round as the current weight of the following round of iterative training;
a result sending module, configured to send the first gradient to an aggregation server if the aggregation instruction is received, the aggregation server summarizing the first gradients and calculating a second weight; and
an update module, configured to receive the second weight sent by the aggregation server to update the current weight.
Embodiments of the present application include the following advantages:
In the embodiments of the present application, a training server may use the sample set it has read and, before receiving the aggregation instruction, continuously iterate the training of the first gradient using the sample data in the sample set and the current weight. Meanwhile, the scheduling server may monitor whether the cluster system environment meets a threshold condition; when it detects that the cluster system environment meets the threshold condition, it may send the aggregation instruction to each training server, and each training server then sends the first gradient obtained from training to the aggregation server. The aggregation server summarizes the first gradients and calculates a second weight and, before the training servers have finished training on their sample data, sends the second weight to each training server to update its current weight. In this way, because the scheduling server monitors the system environment and controls when the aggregation instruction is issued, a training server sends the first gradient to the aggregation server only after receiving the aggregation instruction. The training results are not sent to the server at the end of every round of training, which reduces network traffic, reduces the impact on the switches, and avoids affecting the use of the entire cluster.
Brief Description of the Drawings
FIG. 1 is a flow chart of the steps of an embodiment of a distributed cluster training method of the present application;
FIG. 2 is a flow chart of the steps of another embodiment of a distributed cluster training method of the present application;
FIG. 3 is a flow chart of the steps of another embodiment of a distributed cluster training method of the present application;
FIG. 4 is a flow chart of the steps of another embodiment of a distributed cluster training method of the present application;
FIG. 5 is a structural block diagram of an embodiment of a distributed cluster training device of the present application;
FIG. 6 is a structural block diagram of an embodiment of a distributed cluster training system of the present application.
Detailed Description
In order to make the above objects, features and advantages of the present application more apparent and understandable, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
One of the core ideas of the embodiments of the present application is as follows. In the prior art, when a target model is trained in a cluster, the gradients obtained by the training servers are aggregated directly after every round of training, resulting in huge and frequent traffic, which may saturate the network in the cluster and thereby affect the switches and even the use of the entire cluster. In the embodiments of the present application, a training server may use the sample set it has read and, before receiving the aggregation instruction, continuously iterate the training of the first gradient using the sample data in the sample set and the current weight. Meanwhile, the system may monitor whether the cluster system environment meets a threshold condition; the threshold condition can prevent the network traffic of the cluster from becoming saturated. When the system detects that the cluster system environment meets the threshold condition, it may send the aggregation instruction to each training server, and each training server then sends the first gradient obtained from training to the aggregation server. The aggregation server summarizes the first gradients and calculates a second weight and, before the training servers have finished training on their sample data, sends the second weight to each training server to update its current weight. In this way, because the system monitors the system environment and controls when the aggregation instruction is issued, a training server sends the first gradient to the aggregation server only after receiving the aggregation instruction. The training results are not sent to the server at the end of every round of training, which reduces network traffic, reduces the impact on the switches, and avoids affecting the use of the entire cluster.
实施例一
参照图1,示出了本申请的一种分布式集群方法实施例的步骤流程图,具体可以包括如下步骤:
步骤110,读取样本集;所示样本集包括至少一条样本数据;
在本申请实施例中,整个集群可以包括多台训练服务器、至少一台调度服务器、至少一台汇集服务器。该训练服务器可以获取其负责的样本集进行迭代训练,以得到第一梯度,调度服务器可以监控整个系统的集群系统环境情况,并根据集群系统环境决定是否发出汇集指令到训练服务器。汇集服务器可以接收各个训练服务器发送的第一梯度并计算第二权重。
在本申请实施例中,训练服务器、调度服务器、汇集服务器之间的通信数据都通过集群中的交换机传输。
可以理解,本申请实施例的调度服务器可以将各训练服务器需要获取的样本集的获取参数发送给各个训练服务器。那么,对于一台训练服务器来说,其收到获取参数后,可以根据该获取参数,从指定位置读取其需要的样本集。比如从交易日志服务器获取该参数规定的一批交易日志数据作为样本集。当然,本申请实施例还可以从其他服务器获取相应样本集,可以按照需求设定,本申请实施例不对其加以限制。
步骤120,在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
对于一个训练服务器A而言,其读取到样本集后,在初始情况下,目标模型的各个当前权重是根据经验预设的第二权重X0。此时可以从样本集中按序逐个提取样本数据,输入目标模型进行训练,训练属于该训练服务器A的第一梯度。
该训练服务器A,在未接收到汇集指令之前,可以一直读取样本数据进行迭代训练。当然在实际应用中,各个训练服务器可以将其训练的样本全部读到本地,然后进行训练。比如第一轮利用样本数据M1和当前权重X0,代入目标模型训练函数,训练第一梯度▽F(X0),然后利用▽F(X0)计算权重X1该X1以作为第二轮训练的当前权重;然后利用样本数据M2和当前权重X1,代入目标模型训练函数,训练第一梯度▽F(X1);以此类推,直到接收到汇集指令。其中Xi(i=1、2、3……)是一个多维向量,其中每个维度对应目 标模型中的一个参数。其中所述目标模型训练函数可以为前述的损失函数F(X)
在实际应用中,以上述过程为例,在第一轮中,将第一个样本数据代入损失函数F(X),其中,X为当前权重,然后计算F(X)的梯度▽F(X0),然后根据公式Xt=Xt-1-α▽F(Xt-1)更新得到第一梯度▽F(X1)。其中,损失函数F(X)可以根据实际情况设定,在先技术对此有详细过程,在此不在赘叙。对于第二轮类似。假设训练服务器训练到第三轮,得到第一梯度▽F(X2),此时,接收到调度服务器发送的汇集指令,则可以直接将第一梯度▽F(X2)通过交换机发送至汇集服务器中。
在本申请实施例中,训练服务器在上一次汇集之后,会记录第一梯度的训练轮次。调度服务器发送汇集指令时,则会控制训练服务器发送哪一轮的第一梯度。调度服务器可以在发送汇集指令之前控制各训练服务器执行N轮训练,N为大于0的整数。比如通知训练服务器在接收汇集指令之前,只进行三轮训练,如果三轮训练结束,则等待调度服务器的指令。当然,实际应用中,可以对N做限制,也可根据实际需求的训练精度误差设置N的值。该实际需求的训练精度误差可以根据历史训练结果的经验设定。
在本申请实施例中,调度服务器向各训练服务器发送的汇集指令中,包含了指定轮次。则各训练服务器将相应轮次训练得到的第一梯度发送到汇集服务器中。
在本申请实施例中,在各个训练服务器进行迭代训练的过程中,调度服务器会监控集群系统环境,当集群系统环境符合阈值条件时,调度服务器向各个训练服务器发出汇集指令。该阈值条件可以限定训练服务器发送第一梯度的频率不至于太快而导致网络拥堵,如网络利用率低于30%等阈值条件。
在本申请另一优选的实施例中,所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出,包括:
所述汇集指令由调度服务器在整个集群的集群网络利用率符合第一阈值条件时发出。
在本申请实施例中,调度服务器可以监控整个集群的集群网络利用率,比如获取每个服务器的网卡的发包量和收包量,而网卡本身有一个最大流量限制,比如100M,那么统计各个网卡的发包量和收包量,再除以所有网卡的总流量限制,则可以得到集群网络利用率。当然,也可以计算每个服务器的网卡的利用率,然后把各个网卡的利用率进行加权平均得到集群网络利用率。在该种情况下,所述第一阈值条件包括:集群网络利用率低于第一阈值;比如第一阈值设置为30%,那么当调度服务器监控得到的集群网络利用率低于30%,则可以向各个训练服务器发送汇集指令。
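下面给出按上述思路计算集群网络利用率并判断第一阈值条件的一个极简示意(其中nic_traffic、nic_capacity等数据结构以及30%的阈值均为假设的示例):

```python
def cluster_network_utilization(nic_traffic, nic_capacity):
    """集群网络利用率的极简示意:nic_traffic 为 {服务器: (发包量, 收包量)},
    nic_capacity 为 {服务器: 网卡最大流量},数据结构均为假设。"""
    used = sum(tx + rx for tx, rx in nic_traffic.values())
    total = sum(nic_capacity.values())
    return used / total

def meets_first_threshold(nic_traffic, nic_capacity, first_threshold=0.30):
    # 第一阈值条件:集群网络利用率低于第一阈值(如30%)时,调度服务器可发出汇集指令
    return cluster_network_utilization(nic_traffic, nic_capacity) < first_threshold
```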
在本申请另一优选的实施例中,所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出,包括:
所述汇集指令由调度服务器在整个集群的集群故障率符合第二阈值条件时发出。
在本申请实施例中,整个集群中的各个服务器可能出现故障,那么本申请实施例可以监控各个服务器的故障,然后根据出现故障的服务器个数除以整个集群的服务器的个数得到集群故障率。当然,在本申请实施例中,可以只监控训练服务器出现故障的第一个数,然后将该第一个数除以整个集群的服务器个数得到集群故障率;当然,该第一个数也可以除以所有训练服务器的个数得到集群故障率。该种情况下,所述第二阈值条件包括:集群故障率低于第二阈值。比如第二阈值设置为5%,那么集群故障率低于5%时,调度服务器可以向各个训练服务器发出汇集指令。
需要说明的是,前述服务器的故障,包括服务器本身崩溃没响应、服务器响应延迟超过一定时间。本申请实施例中,调度服务器可以定期向各个服务器发送测试命令,如果服务器未在规定时间内响应,则可认为该服务器出现故障。
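下面给出按上述思路统计集群故障率并判断第二阈值条件的一个极简示意(其中send_test_command等接口、超时时长与5%的阈值均为假设,真实实现依赖具体的心跳或RPC机制):

```python
def probe(server, send_test_command, timeout=2.0):
    """假设的探测逻辑:调度服务器定期向服务器发送测试命令,
    未在规定时间内响应(或崩溃无响应)即视为该服务器出现故障。"""
    try:
        return send_test_command(server, timeout=timeout)   # 返回 True 表示按时响应
    except TimeoutError:
        return False

def cluster_fault_rate(servers, send_test_command):
    # 出现故障的服务器个数除以服务器总数,得到集群故障率
    failed = sum(1 for s in servers if not probe(s, send_test_command))
    return failed / len(servers)

def meets_second_threshold(servers, send_test_command, second_threshold=0.05):
    # 第二阈值条件:集群故障率低于第二阈值(如5%)
    return cluster_fault_rate(servers, send_test_command) < second_threshold
```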
当然,本申请实施例中,调度服务器发出汇集指令之前,还可以监控各训练服务器的训练情况,比如监控到自上次发送汇集指令之后,各个训练服务器均完成了至少一轮训练,才会在符合前述阈值条件时发出汇集指令。
步骤130,如果接收到汇集指令,则将所述第一梯度发送至汇集服务器;
步骤140,所述汇集服务器汇总各第一梯度并计算第二权重;
在本申请实施例中,对于一个训练服务器,如果接收到汇集指令,则可以将最新更新的第一梯度发送至汇集服务器。
由于汇集指令中有训练轮次,各个训练服务器则将相同轮次的第一梯度发送至汇集服务器。
在本申请实施例中,如果有多台汇集服务器,则各训练服务器可以根据预先设定的与汇集服务器的对应关系,将其第一梯度发送至相应的汇集服务器中。各个汇集服务器对接收到的部分第一梯度进行汇总,然后各个汇集服务器再将汇总后的第一梯度再汇总至一个汇集服务器中,然后该汇集服务器进行最后的汇总,然后基于最后汇总的第一梯度计算第二权重。
而对于汇集服务器,在接收到所有训练服务器的第一梯度后,则可以对第一梯度进行汇总,然后根据汇总的结果计算第二权重。
此时,汇集服务器可以判断各个训练服务器是否训练完毕,如果未训练完毕,则将第二权重发送到各个训练服务器。
可以理解,在实际应用中,各个训练服务器在发送其第一梯度时,可以发送是否对样本集的所有样本数据训练完毕的第一标识,如第一标识为no表示未训练完,第一标识为yes表示训练完。汇集服务器则可以根据该标识判断训练服务器是否训练完该样本集的所有样本数据。当然,实际应用中,汇集服务器还可以通过其他方式去确定各个训练服务器是否训练完其样本集的所有样本数据,本申请实施例不对其加以限制。
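下面给出汇集服务器汇总第一梯度、计算并下发第二权重的一个极简示意(其中send_fn、第一标识的取值形式、学习率等均为假设,并非本申请的限定实现):

```python
import numpy as np

def aggregate_and_dispatch(first_gradients, finished_flags, second_weight, send_fn, lr=0.01):
    """汇集服务器侧的极简示意(接口均为假设):汇总各训练服务器同一轮次的第一梯度,
    计算新的第二权重;若仍有训练服务器的第一标识为 "no"(未训练完其样本集),
    则通过 send_fn 将第二权重下发给各训练服务器。"""
    total_gradient = np.sum(np.asarray(first_gradients, dtype=float), axis=0)   # 汇总各第一梯度
    new_second_weight = np.asarray(second_weight, dtype=float) - lr * total_gradient
    if any(flag == "no" for flag in finished_flags):
        send_fn(new_second_weight)      # 在各训练服务器尚未训练结束前下发第二权重
    return new_second_weight
```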
步骤150,接收汇集服务器发送的第二权重以更新当前权重。
对于训练服务器而言,其在对所述样本数据训练结束前,可以接收到汇集服务器发送的第二权重。那么训练服务器则可以用该第二权重更新当前权重,然后读取后续的样本数据进行下一轮训练。当然,如果样本数据已经读取到了本地,则可以从本地读取后续的样本数据进行下一轮训练。
本申请实施例中,训练服务器可以利用其读取的样本集,在接收到汇集指令之前,利用该样本集中的样本数据和当前权重,不断迭代训练第一梯度;同时,系统可以监控集群系统环境是否符合阈值条件,该阈值条件可以避免集群系统环境出现网络流量爆满,当系统监控到集群系统环境符合阈值条件时,则可以发送汇集指令给各个训练服务器,各个训练服务器则将训练得到的第一梯度发送到汇集服务器;该汇集服务器汇总各第一梯度并计算第二权重,在各个训练服务器还未对其样本数据训练结束前,将该第二权重发送到各个训练服务器,更新其当前权重。如此,由于系统监控集群系统环境,控制何时发出汇集指令,相应的,训练服务器则在收到汇集指令后才将第一梯度发送到汇集服务器,不会在整个过程中每轮训练结束都将训练结果发送至服务器,降低了网络通信量,降低了对交换机的影响,避免影响整个集群的使用。
实施例二
参照图2,示出了本申请的另一种分布式集群训练方法实施例的步骤流程图,具体可以包括如下步骤:
步骤210,读取样本集;所述样本集包括至少一条样本数据;所述样本数据包括时间信息;
在本申请实施例中,在样本数据中除了传统的数据,比如用户ID、用户交易行为、收藏行为数据、浏览行为数据等数据,还额外增加了一列数据,该列数据记录了该条样本数据产生的时间。比如最近一天的交易记录数据、最近两天的交易数据。
步骤220,利用每条样本数据的时间信息,计算所述样本数据的第三权重;
在本申请实施例中,距离当前越近的样本数据,越能反映用户真实的兴趣与意图,采用该样本数据训练出来的模型更精准。而本申请可以利用每条样本数据的时间信息,计算所述样本数据的第三权重。该第三权重表示:样本数据的时间信息距离当前时间越近,其权重越高;反之,其权重越低。
在本申请另一优选的实施例中,所述利用每条样本数据的时间信息,计算所述样本数据的第三权重的步骤包括:
子步骤221,将每条样本数据的时间信息,代入指数函数的负的指数参数,计算第三权重。
在本申请实施例中,可以将时间信息与当前时间的距离转换为数字信息,比如样本数据N1的时间信息为1,表示样本数据N1距离当前时间的距离为1天;样本数据N2的时间信息为3,表示样本数据N2距离当前时间的距离为3天。当然,将时间信息转换为数字信息,也可以采用其他方式,本申请实施例不对其加以限制。
在本申请实施例中,指数函数的底数可以设置为自然常数e,也可以设置为其他大于1的数,优选地采用自然常数e。那么本申请可以利用e^(-x)计算第三权重,其中x为时间信息,比如对于N1,其第三权重为e^(-1),其他情况以此类推。当然该指数函数的底数也可以为其他底数,比如2,那么指数函数变为2^(-x)。
步骤230,当所述第三权重小于第三阈值,则丢弃相应的样本数据。
比如设置第三阈值为0.001,那么当第三权重小于该第三阈值,则说明该样本数据离当前时间太远,对用户的兴趣和意图影响不大,可以将其丢弃,从而可以降低计算量,节省系统资源。
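下面给出按时间信息计算第三权重并丢弃过旧样本数据的一个极简示意(输入的组织形式、以自然常数e为底、0.001的第三阈值均为假设的示例):

```python
import math

def third_weight(days_ago, base=math.e):
    # 将时间信息 x(距当前的天数)代入指数函数的负的指数参数:第三权重 = base^(-x)
    return base ** (-days_ago)

def drop_stale_samples(samples_with_age, third_threshold=0.001):
    """samples_with_age 为 (样本数据, 距当前天数) 的列表(假设的输入形式);
    第三权重小于第三阈值的样本数据被丢弃,其余样本连同第三权重一并保留。"""
    kept = []
    for data, days_ago in samples_with_age:
        w = third_weight(days_ago)
        if w >= third_threshold:
            kept.append((data, w))
    return kept
```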
步骤240,在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
步骤250,如果接收到汇集指令,则将所述第一梯度发送至汇集服务器,以及将各个样本数据的第三权重进行汇总得到的第一系数发送至汇集服务器;
在本申请实施例中,训练服务器在对样本集的数据进行训练之前,可以计算各条样本数据的第三权重,然后可以对各个保留下来的样本数据的第三权重进行汇总,得到第一系数。
步骤260,所述汇集服务器根据各第一梯度及与各第一梯度相应的第一系数,进行加权计算得到第二梯度;
步骤270,所述汇集服务器根据第二梯度计算第二权重。
比如训练服务器A发送第一梯度∇F(X1)A,其第一系数为0.8;训练服务器B发送第一梯度∇F(X1)B,其第一系数为0.7;训练服务器C发送第一梯度∇F(X1)C,其第一系数为0.5。那么第二梯度为
0.8·∇F(X1)A + 0.7·∇F(X1)B + 0.5·∇F(X1)C
然后再根据第二梯度计算第二权重。
然后可以按照实施例一中的描述,将第二权重发送至各个未训练完毕的训练服务器。
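下面给出根据各第一梯度及相应第一系数加权得到第二梯度的一个极简示意(其中的梯度数值仅为假设的示例):

```python
import numpy as np

def second_gradient(first_gradients, first_coefficients):
    """第二梯度的极简示意:按各第一梯度相应的第一系数加权求和。"""
    return sum(c * np.asarray(g, dtype=float)
               for c, g in zip(first_coefficients, first_gradients))

# 用法示意:系数 0.8、0.7、0.5 分别对应训练服务器 A、B、C 上报的第一梯度
grad_A, grad_B, grad_C = np.array([0.2, -0.1]), np.array([0.3, 0.0]), np.array([-0.1, 0.4])
print(second_gradient([grad_A, grad_B, grad_C], [0.8, 0.7, 0.5]))
```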
步骤280,接收汇集服务器发送的第二权重以更新当前权重。
本申请实施例中,训练服务器可以利用其读取的样本集,在接收到汇集指令之前,利用该样本集中的样本数据和当前权重,不断迭代训练第一梯度;同时,系统可以监控集群系统环境是否符合阈值条件,该阈值条件可以避免集群系统环境出现网络流量爆满,当系统监控到集群系统环境符合阈值条件时,则可以发送汇集指令给各个训练服务器,各个训练服务器则将训练得到的第一梯度发送到汇集服务器;该汇集服务器汇总各第一梯度并计算第二权重,在各个训练服务器还未对其样本数据训练结束前,将该第二权重发送到各个训练服务器,更新其当前权重。如此,由于系统监控集群系统环境,控制何时发出汇集指令,相应的,训练服务器则在收到汇集指令后才将第一梯度发送到汇集服务器,不会在整个过程中每轮训练结束都将训练结果发送至服务器,降低了网络通信量,降低了对交换机的影响,避免影响整个集群的使用。
另外,本申请实施例可以根据数据的时效性,自动对新的数据加大权重,对老数据降权,并丢弃部分旧数据,从而使目标模型更契合用户当前的行为习惯,并且可以降低计算量。
实施例三
参照图3,示出了本申请的另一种分布式集群训练方法实施例的步骤流程图,具体可以包括如下步骤:
步骤310,读取样本集;所述样本集包括至少一条样本数据;所述样本数据包括时间信息;
步骤312,对样本集中的各样本数据进行归并;
步骤314,对归并后的样本数据,记录所述样本数据的归并数量。
在本申请实施例中,对于相同内容的样本数据,可以按照相同时间段进行归并。比如对于用户A,其在2015-12-31的上午10点买了商品A,又在2015-12-31的下午3点买了商品A。那么这两条样本数据则可以归并,得到用户A在2015-12-31购买了商品A,归并数量为2。
在实际中,对于样本数据,还可以在其中添加一列归并数量列,把归并数量填入该列。
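下面给出样本数据归并并记录归并数量的一个极简示意(样本字段的组织方式、按天归并的粒度均为假设的示例):

```python
from collections import defaultdict

def merge_samples(raw_samples):
    """样本归并的极简示意:对相同内容、相同时间段(此处以同一天为例)的样本数据归并,
    并附加一列记录归并数量;raw_samples 的字段组织方式为假设。"""
    counter = defaultdict(int)
    for user_id, item_id, action, day in raw_samples:
        counter[(user_id, item_id, action, day)] += 1
    # 返回归并后的样本数据,末尾附加“归并数量”一列
    return [key + (count,) for key, count in counter.items()]

# 例如用户A在2015-12-31上午10点和下午3点各购买一次商品A,归并后数量为2:
raw = [("用户A", "商品A", "购买", "2015-12-31"),
       ("用户A", "商品A", "购买", "2015-12-31")]
print(merge_samples(raw))   # [('用户A', '商品A', '购买', '2015-12-31', 2)]
```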
步骤316,利用每条样本数据的时间信息,计算降权系数;
在本申请实施例中,可以利用每条样本数据的时间信息,计算降权系数,距离当前时间越近其降权系数越高,反之,其降权系数越低。
在本申请另一优选的实施例中,所述利用每条样本数据的时间信息,计算降权系数的步骤包括:
子步骤C11,将每条样本数据的时间信息,代入指数函数的负的指数参数,计算降权系数。
在本申请实施例中,可以将时间信息与当前时间的距离转换为数字信息,比如样本数据N1的时间信息为1,表示样本数据N1距离当前时间的距离为1天;样本数据N2的时间信息为3,表示样本数据N2距离当前时间的距离为3天。当然,将时间信息转换为数字信息,也可以采用其他方式,本申请实施例不对其加以限制。
那么本申请可以利用e^(-x)计算降权系数,其中x为时间信息,比如对于N1,其降权系数为e^(-1),其他情况以此类推。当然该指数函数的底数也可以为其他底数,比如2,那么指数函数变为2^(-x)。
步骤318,计算所述降权系数与归并数量之积,得到第三权重。
由于本申请实施例中对样本数据进行了归并,样本集中的样本数据即为归并后的样本数据,那么可以将该样本数据的归并数量乘以其降权系数,得到第三权重。
可以理解,步骤316-318可以为实施例二中步骤220优选的步骤。
步骤320,当所述第三权重小于第三阈值,则丢弃相应的样本数据。
步骤322,在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
步骤324,如果接收到汇集指令,则将所述第一梯度发送至汇集服务器,以及将各个样本数据的第三权重进行汇总得到的第一系数发送至汇集服务器;
步骤326,所述汇集服务器根据各第一梯度及与各第一梯度相应的第一系数,进行加权计算得到第二梯度;
步骤328,接收汇集服务器发送的第二权重以更新当前权重。
本申请实施例由于调度服务器监控系统环境,控制何时发出汇集指令,相应的,训练服务器则在收到汇集指令后才将第一梯度发送到汇集服务器。不会在整个过程中每轮训练结束都将训练结果发送至服务器,降低了网络通信量,降低对交换机的影响,避免影响整个集群的使用。
另外,本申请实施例对样本数据进行了归并,减少了训练的样本数量,能够提高训练速度。
再者,本申请实施例可以根据数据的时效性,自动对新的数据加大权重,对老数据降权,并丢弃部分旧数据,从而使目标模型更契合用户当前的行为习惯,并且可以降低计算量。
实施例四
参照图4,示出了本申请的另一种分布式集群训练方法实施例的步骤流程图,具体可以包括如下步骤:
步骤410,训练服务器读取样本集;所述样本集包括至少一条样本数据;所述样本数据包括时间信息;
步骤412,训练服务器对样本集中的各样本数据进行归并;
步骤414,训练服务器对归并后的样本数据,记录所述样本数据的归并数量。
步骤416,训练服务器利用每条样本数据的时间信息,计算降权系数;
步骤418,训练服务器计算所述降权系数与归并数量之积,得到第三权重。
可以理解,步骤416-418可以为实施例二中步骤220优选的步骤。
步骤420,训练服务器在所述第三权重小于第三阈值时,丢弃相应的样本数据。
步骤422,训练服务器在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
步骤424,调度服务器在集群系统环境符合阈值条件时发出汇集指令;
调度服务器将汇集指令发送给各个训练服务器。
步骤426,训练服务器如果接收到汇集指令,则将所述第一梯度发送至汇集服务器,以及将各个样本数据的第三权重进行汇总得到的第一系数发送至汇集服务器;
步骤428,汇集服务器根据各第一梯度及与各第一梯度相应的第一系数,进行加权计算得到第二梯度;
步骤430,汇集服务器根据第二梯度计算第二权重;
步骤432,汇集服务器将新得到的第二权重进行备份,并将新的第二权重发送至各训练服务器。
在本申请实施例中,汇集服务器在得到新的第二权重之后,可以将该第二权重进行备份。
在本申请另一优选的实施例中,所述汇集服务器将新得到的第二权重进行备份包括:
步骤D11,所述汇集服务器判断新得到的第二权重与至少前一次备份的第二权重之间的变化量是否超过变化阈值;
步骤D12,如果超过变化阈值,则对所述新得到的第二权重进行备份。
在本申请实施例中,汇集服务器得到新的第二权重后,会与之前备份的至少一次的第二权重进行变化量的计算。比如判断与之前备份的最近一次的第二权重之间的变化量是否小于变化阈值(如5%):如果小于5%,则抛弃新的第二权重;如果大于等于变化阈值,则备份该第二权重,如此可以减少备份量,使得在步骤C13中,可以不用更新输出给外部业务服务器的目标模型,避免无谓地影响业务服务器对该目标模型的使用,比如测试。
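下面给出该权重备份机制的一个极简示意(变化量的具体度量方式在本申请中未作限定,此处以相对L2范数、5%的变化阈值为假设的示例):

```python
import numpy as np

def maybe_backup(new_second_weight, backups, change_threshold=0.05):
    """权重备份的极简示意:仅当新的第二权重相对最近一次备份的变化量
    不小于变化阈值时才进行备份,否则抛弃本次的新权重,以减少备份量。"""
    new_second_weight = np.asarray(new_second_weight, dtype=float)
    if not backups:
        backups.append(new_second_weight.copy())   # 首次直接备份
        return True
    last = backups[-1]
    change = np.linalg.norm(new_second_weight - last) / (np.linalg.norm(last) + 1e-12)
    if change >= change_threshold:
        backups.append(new_second_weight.copy())
        return True
    return False    # 变化量小于阈值,不进行备份
```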
可以理解,因为进行了权重备份,如果某个时刻整个训练失败,则在重新训练时,调度服务器可以通知汇集服务器,将备份的最新的第二权重发送给训练服务器,使训练服务器可以将该最新的第二权重作为初始的当前权重,结合之前的样本继续进行训练,提高训练的效率。
当然,本申请实施例中,训练失败之后,也可以从第一个样本开始进行训练,但是其当前权重为备份的最新的第二权重。
汇集服务器将最新的第二权重发送至各训练服务器。
步骤434,训练服务器接收汇集服务器发送的第二权重以更新当前权重。
在本申请另一优选的实施例中,所述汇集服务器将新得到的第二权重进行备份之后,还包括:
子步骤C13,汇集服务器将所述第二权重代入目标模型,并输出至业务服务器。
在本申请实施例中,对于备份的第二权重,可以直接将其代入目标模型,输出给业务服务器,使业务方可以直接利用该目标模型进行使用。
本申请具备如下几个方面的优点:
(1)懒惰通信机制:根据集群环境以及迭代情况,自动判断是否需要所有机器进行权重汇总动作,从而避免每轮训练都汇集一次而可能导致网络占满的情况。
(2)权重备份机制:根据规则,自动备份权重,一旦某些机器出现问题,可以从备份拉回以前的权重,继续进行训练,从而不用从头再次训练,提高训练效率。
(3)数据切分装置:根据数据的时效性,自动对新的数据加大权重,对老数据降权,并丢弃部分旧数据。
需要说明的是,对于方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请实施例所必须的。
实施例五
参照图5,示出了本申请的一种分布式集群训练装置实施例的结构框图,具体可以包括如下模块:
样本读取模块510,用于读取样本集;所述样本集包括至少一条样本数据;
迭代训练模块520,用于在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
结果发送模块530,用于如果接收到汇集指令,则将所述第一梯度发送至汇集服务器;所述汇集服务器汇总各第一梯度并计算第二权重;
更新模块540,用于接收汇集服务器发送的第二权重以更新当前权重。
在本申请另一优选的实施例中,所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出,包括:
所述汇集指令由调度服务器在整个集群的集群网络利用率符合第一阈值条件时发出,和/或由调度服务器在整个集群的集群故障率符合第二阈值条件时发出。
在本申请另一优选的实施例中,所述第一阈值条件包括:集群网络利用率低于第一阈值;
所述第二阈值条件包括:集群故障率低于第二阈值。
在本申请另一优选的实施例中,所述样本读取模块之后,还包括:
第三权重计算模块,用于利用每条样本数据的时间信息,计算所述样本数据的第三权重;
样本丢弃模块,用于当所述第三权重小于第三阈值,则丢弃相应的样本数据。
在本申请另一优选的实施例中,所述第三权重计算模块包括:
指数计算模块,用于将每条样本数据的时间信息,代入指数函数的负的指数参数,计算第三权重。
在本申请另一优选的实施例中,在第三权重计算模块之前,还包括:
归并模块,用于对样本集中的各样本数据进行归并;
归并记录模块,用于对归并后的样本数据,记录所述样本数据的归并数量。
在本申请另一优选的实施例中,所述第三权重计算模块,包括:
降权系数计算模块,用于利用每条样本数据的时间信息,计算降权系数;
第一计算模块,用于计算所述降权系数与归并数量之积,得到第三权重。
在本申请另一优选的实施例中,所述结果发送模块还用于,如果接收到汇集指令,将各个样本数据的第三权重进行汇总得到的第一系数发送至汇集服务器;
则,所述汇集服务器包括:第一加权汇总模块,用于根据各第一梯度及与各第一梯度相应的第一系数,进行加权计算得到第二梯度;
第二权重计算模块,用于根据第二梯度计算第二权重。
在本申请另一优选的实施例中,所述汇集服务器还包括:
备份模块,用于将新得到的第二权重进行备份。
在本申请另一优选的实施例中,所述备份模块包括:
变化计算模块,用于所述汇集服务器判断新得到的第二权重与至少前一次备份的第二权重之间的变化量是否超过变化阈值;
第一备份模块,用于如果超过变化阈值,则对所述新得到的第二权重进行备份。
在本申请另一优选的实施例中,所述备份模块之后,还包括:
输出模块,用于将所述第二权重代入目标模型,并输出至业务服务器。
本申请具备如下几个方面的优点:
(1)懒惰通信机制:根据集群环境以及迭代情况,自动判断是否需要所有机器进行权重汇总动作,从而避免每轮训练都汇集一次而可能导致网络占满的情况。
(2)权重备份机制:根据规则,自动备份权重,一旦某些机器出现问题,可以从备份拉回以前的权重,继续进行训练,从而不用从头再次训练,提高训练效率。
(3)数据切分装置:根据数据的时效性,自动对新的数据加大权重,对老数据降权,并丢弃部分旧数据。
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
实施例六
参照图6,示出了本申请的一种分布式集群训练系统实施例的结构框图,该系统具体可以包括:
包括调度服务器610、汇集服务器620、多台训练服务器630。
所述调度服务器610包括:
集群监控模块611,用于监控集群系统环境是否符合阈值条件,如果符合,则向各训练服务器630发出汇集指令。
在本申请另一优选的实施例中,集群监控模块611具体用于在整个集群的集群网络利用率符合第一阈值条件时发出汇集指令,和/或在整个集群的集群故障率符合第二阈值条件时发出汇集指令。
在本申请另一优选的实施例中,所述第一阈值条件包括:集群网络利用率低于第一阈值;
所述第二阈值条件包括:集群故障率低于第二阈值。
所述训练服务器630包括:
样本读取模块631,用于读取样本集;所述样本集包括至少一条样本数据;
迭代训练模块632,用于在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
结果发送模块633,用于如果接收到汇集指令,则将所述第一梯度发送至汇集服务器;
更新模块634,用于接收第二权重以更新当前权重;
在本申请另一优选的实施例中,所述样本读取模块631之后,还包括:
第三权重计算模块,用于利用每条样本数据的时间信息,计算所述样本数据的第三权重;
样本丢弃模块,用于当所述第三权重小于第三阈值,则丢弃相应的样本数据。
在本申请另一优选的实施例中,所述第三权重计算模块包括:
指数计算模块,用于将每条样本数据的时间信息,代入指数函数的负的指数参数,计算第三权重。
在本申请另一优选的实施例中,在第三权重计算模块之前,还包括:
归并模块,用于对样本集中的各样本数据进行归并;
归并记录模块,用于对归并后的样本数据,记录所述样本数据的归并数量。
在本申请另一优选的实施例中,所述第三权重计算模块,包括:
降权系数计算模块,用于利用每条样本数据的时间信息,计算降权系数;
第一计算模块,用于计算所述降权系数与归并数量之积,得到第三权重。
在本申请另一优选的实施例中,所述结果发送模块633还用于,如果接收到汇集指令,将各个样本数据的第三权重进行汇总得到的第一系数发送至汇集服务器。
所述汇集服务器620包括:
汇集计算模块621,用于汇总各第一梯度并计算第二权重;
第二权重发送模块622,用于向各训练服务器发送最新的第二权重。
在本申请另一优选的实施例中,所述汇集服务器包括:
第一加权汇总模块,用于根据各第一梯度及与各第一梯度相应的第一系数,进行加权计算得到第二梯度;
第二权重计算模块,用于根据第二梯度计算第二权重。
在本申请另一优选的实施例中,所述汇集服务器还包括:
备份模块,用于将新得到的第二权重进行备份。
在本申请另一优选的实施例中,所述备份模块包括:
变化计算模块,用于所述汇集服务器判断新得到的第二权重与至少前一次备份的第二权重之间的变化量是否超过变化阈值;
第一备份模块,用于如果超过变化阈值,则对所述新得到的第二权重进行备份。
在本申请另一优选的实施例中,所述备份模块之后,还包括:
输出模块,用于将所述第二权重代入目标模型,并输出至业务服务器。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
本领域内的技术人员应明白,本申请实施例的实施例可提供为方法、装置、或计算机程序产品。因此,本申请实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
在一个典型的配置中,所述计算机设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。内存可能包括计算机可读介质中的非永久性存储器、随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。计算机可读介质包括永久性和非永久性、可移动和非可移动媒体,可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带、磁带磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括非持续性的电脑可读媒体(transitory media),如调制的数据信号和载波。
本申请实施例是参照根据本申请实施例的方法、终端设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本申请实施例的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请实施例范围的所有变更和修改。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。
以上对本申请所提供的一种分布式集群训练方法和一种分布式集群训练装置,进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (22)

  1. 一种分布式集群训练方法,其特征在于,包括:
    读取样本集;所述样本集包括至少一条样本数据;
    在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
    如果接收到汇集指令,则将所述第一梯度发送至汇集服务器;所述汇集服务器汇总各第一梯度并计算第二权重;
    接收汇集服务器发送的第二权重以更新当前权重。
  2. 根据权利要求1所述的方法,其特征在于,所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出,包括:
    所述汇集指令由调度服务器在整个集群的集群网络利用率符合第一阈值条件时发出,和/或由调度服务器在整个集群的集群故障率符合第二阈值条件时发出。
  3. 根据权利要求2所述的方法,其特征在于:
    所述第一阈值条件包括:集群网络利用率低于第一阈值;
    所述第二阈值条件包括:集群故障率低于第二阈值。
  4. 根据权利要求1所述的方法,其特征在于,所述样本数据包括时间信息,在读取样本集的步骤之后,还包括:
    利用每条样本数据的时间信息,计算所述样本数据的第三权重;
    当所述第三权重小于第三阈值,则丢弃相应的样本数据。
  5. 根据权利要求4所述的方法,其特征在于,所述利用每条样本数据的时间信息,计算所述样本数据的第三权重的步骤包括:
    将每条样本数据的时间信息,代入指数函数的负的指数参数,计算第三权重。
  6. 根据权利要求4所述的方法,其特征在于,在利用每条样本数据的时间信息,计算所述样本数据的第三权重的步骤之前,还包括:
    对样本集中的各样本数据进行归并;
    对归并后的样本数据,记录所述样本数据的归并数量。
  7. 根据权利要求6所述的方法,其特征在于,所述利用每条样本数据的时间信息,计算所述样本数据的第三权重的步骤,包括:
    利用每条样本数据的时间信息,计算降权系数;
    计算所述降权系数与归并数量之积,得到第三权重。
  8. 根据权利要求4所述的方法,其特征在于,如果接收到汇集指令,还包括:
    将各个样本数据的第三权重进行汇总得到的第一系数发送至汇集服务器;
    则,所述汇集服务器汇总各第一梯度并计算第二权重包括:
    根据各第一梯度及与各第一梯度相应的第一系数,进行加权计算得到第二梯度;
    根据第二梯度计算第二权重。
  9. 根据权利要求1-8其中之一所述的方法,其特征在于,所述汇集服务器汇总各第一梯度并计算第二权重之后,还包括:
    所述汇集服务器将新得到的第二权重进行备份。
  10. 根据权利要求9所述的方法,其特征在于,所述汇集服务器将新得到的第二权重进行备份包括:
    所述汇集服务器判断新得到的第二权重与至少前一次备份的第二权重之间的变化量是否超过变化阈值;
    如果超过变化阈值,则对所述新得到的第二权重进行备份。
  11. 根据权利要求9所述的方法,其特征在于,所述汇集服务器将新得到的第二权重进行备份之后,还包括:
    将所述第二权重代入目标模型,并输出至业务服务器。
  12. 一种分布式集群训练装置,其特征在于,包括:
    样本读取模块,用于读取样本集;所述样本集包括至少一条样本数据;
    迭代训练模块,用于在接收到汇集指令之前,利用所述样本数据和当前权重,代入目标模型训练函数进行迭代训练,得到第一梯度;所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出;其中,如果在接收到汇集指令之前,有多轮迭代训练,则基于前一次训练得到的第一梯度生成第一权重作为后一轮迭代训练的当前权重;
    结果发送模块,用于如果接收到汇集指令,则将所述第一梯度发送至汇集服务器;所述汇集服务器汇总各第一梯度并计算第二权重;
    更新模块,用于接收汇集服务器发送的第二权重以更新当前权重。
  13. 根据权利要求12所述的装置,其特征在于,所述汇集指令由调度服务器在集群系统环境符合阈值条件时发出,包括:
    所述汇集指令由调度服务器在整个集群的集群网络利用率符合第一阈值条件时发出,和/或由调度服务器在整个集群的集群故障率符合第二阈值条件时发出。
  14. 根据权利要求13所述的装置,其特征在于:
    所述第一阈值条件包括:集群网络利用率低于第一阈值;
    所述第二阈值条件包括:集群故障率低于第二阈值。
  15. 根据权利要求12所述的装置,其特征在于,所述样本读取模块之后,还包括:
    第三权重计算模块,用于利用每条样本数据的时间信息,计算所述样本数据的第三权重;
    样本丢弃模块,用于当所述第三权重小于第三阈值,则丢弃相应的样本数据。
  16. 根据权利要求15所述的装置,其特征在于,所述第三权重计算模块包括:
    指数计算模块,用于将每条样本数据的时间信息,代入指数函数的负的指数参数,计算第三权重。
  17. 根据权利要求15所述的装置,其特征在于,在第三权重计算模块之前,还包括:
    归并模块,用于对样本集中的各样本数据进行归并;
    归并记录模块,用于对归并后的样本数据,记录所述样本数据的归并数量。
  18. 根据权利要求17所述的装置,其特征在于,所述第三权重计算模块,包括:
    降权系数计算模块,用于利用每条样本数据的时间信息,计算降权系数;
    第一计算模块,用于计算所述降权系数与归并数量之积,得到第三权重。
  19. 根据权利要求15所述的装置,其特征在于,所述结果发送模块还用于,如果接收到汇集指令,将各个样本数据的第三权重进行汇总得到的第一系数发送至汇集服务器;
    则,所述汇集服务器包括:第一加权汇总模块,用于根据各第一梯度及与各第一梯度相应的第一系数,进行加权计算得到第二梯度;
    第二权重计算模块,用于根据第二梯度计算第二权重。
  20. 根据权利要求12-19其中之一所述的装置,其特征在于,所述汇集服务器还包括:
    备份模块,用于将新得到的第二权重进行备份。
  21. 根据权利要求20所述的装置,其特征在于,所述备份模块包括:
    变化计算模块,用于所述汇集服务器判断新得到的第二权重与至少前一次备份的第二权重之间的变化量是否超过变化阈值;
    第一备份模块,用于如果超过变化阈值,则对所述新得到的第二权重进行备份。
  22. 根据权利要求20所述的装置,其特征在于,所述备份模块之后,还包括:
    输出模块,用于将所述第二权重代入目标模型,并输出至业务服务器。
PCT/CN2017/077246 2016-03-26 2017-03-20 一种分布式集群训练方法和装置 WO2017167044A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2018549518A JP6949045B2 (ja) 2016-03-26 2017-03-20 分散クラスタ型訓練方法及び装置
US16/141,886 US11636379B2 (en) 2016-03-26 2018-09-25 Distributed cluster training method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610180393.8A CN107229518B (zh) 2016-03-26 2016-03-26 一种分布式集群训练方法和装置
CN201610180393.8 2016-03-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/141,886 Continuation US11636379B2 (en) 2016-03-26 2018-09-25 Distributed cluster training method and apparatus

Publications (1)

Publication Number Publication Date
WO2017167044A1 true WO2017167044A1 (zh) 2017-10-05

Family

ID=59932603

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077246 WO2017167044A1 (zh) 2016-03-26 2017-03-20 一种分布式集群训练方法和装置

Country Status (5)

Country Link
US (1) US11636379B2 (zh)
JP (1) JP6949045B2 (zh)
CN (1) CN107229518B (zh)
TW (1) TWI712900B (zh)
WO (1) WO2017167044A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222779A (zh) * 2019-06-11 2019-09-10 腾讯科技(深圳)有限公司 分布式数据处理方法及系统
WO2019239821A1 (ja) * 2018-06-15 2019-12-19 日本電信電話株式会社 分散処理システムおよび分散処理方法
CN111226238A (zh) * 2017-11-07 2020-06-02 华为技术有限公司 一种预测方法及终端、服务器
CN112235384A (zh) * 2020-10-09 2021-01-15 腾讯科技(深圳)有限公司 分布式系统中的数据传输方法、装置、设备及存储介质
CN114900482A (zh) * 2022-03-28 2022-08-12 中国科学技术大学苏州高等研究院 Ps架构下基于可编程交换机的梯度调度方法和装置
US11610110B2 (en) 2018-12-05 2023-03-21 Bank Of America Corporation De-conflicting data labeling in real time deep learning systems

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046475A1 (en) * 2016-08-11 2018-02-15 Twitter, Inc. Detecting scripted or otherwise anomalous interactions with social media platform
CN107423883B (zh) * 2017-06-15 2020-04-07 创新先进技术有限公司 待处理业务的风险识别方法及装置、电子设备
CN112955909A (zh) * 2019-02-01 2021-06-11 华为技术有限公司 神经网络的分布式训练方法及装置
CN109871702A (zh) * 2019-02-18 2019-06-11 深圳前海微众银行股份有限公司 联邦模型训练方法、系统、设备及计算机可读存储介质
CN110084380A (zh) * 2019-05-10 2019-08-02 深圳市网心科技有限公司 一种迭代训练方法、设备、系统及介质
US11321207B2 (en) * 2019-07-09 2022-05-03 Cisco Technology, Inc. Seamless multi-cloud SDWAN distaster recovery using orchestration plane
CN111144584B (zh) * 2019-12-31 2024-01-19 深圳Tcl新技术有限公司 参数调优方法、装置及计算机存储介质
CN113469206A (zh) * 2020-03-31 2021-10-01 华为技术有限公司 获取人工智能模型的方法、装置、设备及存储介质
CN112016699B (zh) * 2020-08-31 2024-02-02 北京灵汐科技有限公司 一种深度学习模型训练方法、工作节点和参数服务器
CN111931947B (zh) * 2020-10-12 2021-02-05 支付宝(杭州)信息技术有限公司 一种用于分布式模型训练的训练样本重组方法及系统
CN112863175B (zh) * 2020-12-31 2022-11-22 平安科技(深圳)有限公司 汽车道路监测数据处理方法、装置、设备及存储介质
EP4307037A1 (en) * 2021-03-12 2024-01-17 Japan Display Inc. Liquid crystal device
CN112862111B (zh) * 2021-04-26 2021-08-24 之江实验室 一种加速分布式机器学习梯度汇聚的方法和装置
CN114723071B (zh) * 2022-04-26 2023-04-07 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 一种基于客户端分类和信息熵的联邦学习方法及装置
CN116980420B (zh) * 2023-09-22 2023-12-15 新华三技术有限公司 一种集群通信方法、系统、装置、设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463324A (zh) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 一种基于大规模高性能集群的卷积神经网络并行处理方法
CN104714852A (zh) * 2015-03-17 2015-06-17 华中科技大学 一种适用于分布式机器学习的参数同步优化方法及其系统
CN105005911A (zh) * 2015-06-26 2015-10-28 深圳市腾讯计算机系统有限公司 深度神经网络的运算系统及运算方法
US20150324690A1 (en) * 2014-05-08 2015-11-12 Microsoft Corporation Deep Learning Training System

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1223757B1 (en) * 2001-01-09 2006-03-22 Metabyte Networks, Inc. System, method, and software application for targeted advertising via behavioral model clustering, and preference programming based on behavioral model clusters
US20060123421A1 (en) * 2002-12-27 2006-06-08 Loboz Charles Z Streamlining cpu utilization by delaying transactions
US20050289089A1 (en) * 2004-06-28 2005-12-29 Naoki Abe Methods for multi-class cost-sensitive learning
US8150723B2 (en) * 2009-01-09 2012-04-03 Yahoo! Inc. Large-scale behavioral targeting for advertising over a network
JP5557590B2 (ja) * 2010-05-06 2014-07-23 株式会社日立製作所 負荷分散装置及びシステム
JP5584914B2 (ja) * 2010-07-15 2014-09-10 株式会社日立製作所 分散計算システム
US8924314B2 (en) * 2010-09-28 2014-12-30 Ebay Inc. Search result ranking using machine learning
US9569401B2 (en) 2011-12-06 2017-02-14 Akamai Technologies, Inc. Parallel training of a support vector machine (SVM) with distributed block minimization
US9633315B2 (en) * 2012-04-27 2017-04-25 Excalibur Ip, Llc Method and system for distributed machine learning
US9390370B2 (en) 2012-08-28 2016-07-12 International Business Machines Corporation Training deep neural network acoustic models using distributed hessian-free optimization
CN103559504B (zh) * 2013-11-04 2016-08-31 北京京东尚科信息技术有限公司 图像目标类别识别方法及装置
CN103544528A (zh) * 2013-11-15 2014-01-29 南京大学 一种基于Hadoop的BP神经网络分类方法
US9858534B2 (en) * 2013-11-22 2018-01-02 California Institute Of Technology Weight generation in machine learning
TWI524307B (zh) * 2013-11-22 2016-03-01 Univ Nat Yunlin Sci & Tech Two - dimensional image depth value estimation method and its system
US9984337B2 (en) 2014-10-08 2018-05-29 Nec Corporation Parallelized machine learning with distributed lockless training
US10229357B2 (en) * 2015-09-11 2019-03-12 Facebook, Inc. High-capacity machine learning system
US11087234B2 (en) 2016-01-29 2021-08-10 Verizon Media Inc. Method and system for distributed deep machine learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150324690A1 (en) * 2014-05-08 2015-11-12 Microsoft Corporation Deep Learning Training System
CN104463324A (zh) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 一种基于大规模高性能集群的卷积神经网络并行处理方法
CN104714852A (zh) * 2015-03-17 2015-06-17 华中科技大学 一种适用于分布式机器学习的参数同步优化方法及其系统
CN105005911A (zh) * 2015-06-26 2015-10-28 深圳市腾讯计算机系统有限公司 深度神经网络的运算系统及运算方法

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111226238A (zh) * 2017-11-07 2020-06-02 华为技术有限公司 一种预测方法及终端、服务器
CN111226238B (zh) * 2017-11-07 2023-10-24 华为技术有限公司 一种预测方法及终端、服务器
WO2019239821A1 (ja) * 2018-06-15 2019-12-19 日本電信電話株式会社 分散処理システムおよび分散処理方法
US11610110B2 (en) 2018-12-05 2023-03-21 Bank Of America Corporation De-conflicting data labeling in real time deep learning systems
CN110222779A (zh) * 2019-06-11 2019-09-10 腾讯科技(深圳)有限公司 分布式数据处理方法及系统
CN110222779B (zh) * 2019-06-11 2023-08-01 腾讯科技(深圳)有限公司 分布式数据处理方法及系统
CN112235384A (zh) * 2020-10-09 2021-01-15 腾讯科技(深圳)有限公司 分布式系统中的数据传输方法、装置、设备及存储介质
CN112235384B (zh) * 2020-10-09 2023-10-31 腾讯科技(深圳)有限公司 分布式系统中的数据传输方法、装置、设备及存储介质
CN114900482A (zh) * 2022-03-28 2022-08-12 中国科学技术大学苏州高等研究院 Ps架构下基于可编程交换机的梯度调度方法和装置
CN114900482B (zh) * 2022-03-28 2023-05-30 中国科学技术大学苏州高等研究院 Ps架构下基于可编程交换机的梯度调度方法和装置

Also Published As

Publication number Publication date
US11636379B2 (en) 2023-04-25
TW201734863A (zh) 2017-10-01
TWI712900B (zh) 2020-12-11
JP2019511054A (ja) 2019-04-18
CN107229518A (zh) 2017-10-03
US20190026657A1 (en) 2019-01-24
CN107229518B (zh) 2020-06-30
JP6949045B2 (ja) 2021-10-13

Similar Documents

Publication Publication Date Title
WO2017167044A1 (zh) 一种分布式集群训练方法和装置
JP6731201B2 (ja) 時間ベースのノード選出方法及び装置
CN106293892B (zh) 分布式流计算系统、方法和装置
US10394821B2 (en) Providing reconstructed data based on stored aggregate data in response to queries for unavailable data
US10909018B2 (en) System and method for end-to-end application root cause recommendation
US11119662B2 (en) Determining when to perform a data integrity check of copies of a data set using a machine learning module
US8904144B1 (en) Methods and systems for determining at risk index for storage capacity
CN112751726B (zh) 一种数据处理方法、装置、电子设备和存储介质
CN112988398A (zh) 一种微服务动态伸缩及迁移方法和装置
CN110187995B (zh) 一种熔断对端节点的方法及熔断装置
CN111966289A (zh) 基于Kafka集群的分区优化方法和系统
CN112882889A (zh) 异常监控方法、系统、电子设备和存储介质
JP2023534696A (ja) ネットワークトポロジーにおけるアノマリー検知
JP6252309B2 (ja) 監視漏れ特定処理プログラム,監視漏れ特定処理方法及び監視漏れ特定処理装置
CN107391230B (zh) 一种确定虚拟机负载的实现方法和装置
CN108932241B (zh) 日志数据统计方法、装置及节点
WO2022252546A1 (zh) 一种信息调节方法、设备及存储介质
WO2021257263A1 (en) Techniques for generating a consistent view of an eventually consistent database
CN116662022B (zh) 分布式消息处理方法、系统、装置、通信设备及存储介质
WO2021147319A1 (zh) 一种数据处理方法、装置、设备及介质
CN111445027B (zh) 机器学习模型的训练方法和装置
CN112241240A (zh) 用于并行传输数据的方法、设备和计算机程序产品
CN116804957A (zh) 一种系统监控方法及装置
WO2020211719A1 (zh) 一种数据获取方法、装置及设备
CN113467982A (zh) 异常客户端设备确定方法及装置

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018549518

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17773074

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17773074

Country of ref document: EP

Kind code of ref document: A1