WO2022161081A1 - Training method, apparatus and system for integrated learning model, and related device - Google Patents

Training method, apparatus and system for integrated learning model, and related device

Info

Publication number
WO2022161081A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
learning model
node
sub
nodes
Prior art date
Application number
PCT/CN2021/142240
Other languages
French (fr)
Chinese (zh)
Inventor
余思
贾佳峰
熊钦
王工艺
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2022161081A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/20 - Ensemble learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present application relates to the field of computer technology, and in particular, to a training method, apparatus, system, device, and computer-readable storage medium for an integrated learning model.
  • an ensemble learning model can be used to handle classification, regression, and other problems in the field of machine learning, in order to obtain better classification accuracy and prediction performance.
  • ensemble learning models are widely used in industries such as Anping (public safety), telecom operators, and finance, as well as in various production systems.
  • at present, in order to meet the performance and accuracy requirements of model training, an ensemble learning model is usually trained on a cluster including multiple servers; specifically, each server uses a set of training samples to train the ensemble learning model. Because each server can obtain only part of the training results of the ensemble learning model through its own model training, different servers also exchange the partial training results obtained by their respective training, so that the trained ensemble learning model is determined based on the aggregated training results.
  • however, when the data size of the training samples is large, the amount of partial training results that needs to be exchanged between different servers is also large. This makes the training efficiency of the ensemble learning model low because of the large amount of data exchanged between servers, and may even cause the training of the ensemble learning model to fail due to training timeout or bandwidth overload between servers, making it difficult to meet the user's application needs. Therefore, how to provide an efficient training method for an ensemble learning model has become an urgent technical problem to be solved.
  • the present application provides an integrated learning model training method, apparatus, device, system, computer-readable storage medium and computer program product, so as to improve the training efficiency and training success rate of the integrated learning model.
  • a training method for an integrated learning model is provided, and the method can be applied to a model training system including a control node and a working node.
  • when training the ensemble learning model, the control node obtains a training request for the ensemble learning model and generates a training task set according to the training request, where the generated training task set includes a plurality of training tasks. Then, the control node sends the training tasks in the training task set to a plurality of worker nodes in a worker node set, respectively. Each training task can be assigned to one worker node and executed by that worker node, each training task is used to train at least one sub-learning model in the ensemble learning model, and different training tasks are used to train different sub-learning models.
  • a worker node may be assigned the training task of only one sub-learning model, or may be assigned the training tasks of multiple sub-learning models.
  • because each sub-learning model in the ensemble learning model is trained by a single worker node, the training results of each sub-learning model can be processed entirely by one worker node, so that after the worker node completes the training of the sub-learning model, it does not need to obtain training results for this sub-learning model from other worker nodes. In this way, the amount of data that needs to be communicated between worker nodes in the process of training the sub-learning models can be effectively reduced, which not only reduces the resource consumption required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model.
  • in a possible implementation, the training request includes an instruction to train the ensemble learning model and the number of sub-learning models in the ensemble learning model. After receiving the training request, the control node may trigger the training of the ensemble learning model based on the instruction in the training request, and generate an equal number of training tasks for training the sub-learning models according to the number of sub-learning models included in the training request.
  • of course, in other possible implementations, the number of sub-learning models in the ensemble learning model may be fixed; in this case, the training request may include only a training instruction for the ensemble learning model, and the control node may generate a fixed number of sub-learning models based on the training instruction to complete the training of the ensemble learning model.
  • in another possible implementation, when the control node generates the training task set according to the training request, it may specifically generate the training task set according to the number of sub-learning models included in the training request, where the number of training tasks included in the training task set is equal to the number of sub-learning models; for example, each training task may be used to train one sub-learning model.
  • of course, in other possible implementations, multiple training tasks may also be used to train one sub-learning model, which is not limited in this embodiment.
  • in another possible implementation, when the control node sends the training tasks in the training task set to the plurality of worker nodes in the worker node set, it may specifically obtain the load of each worker node in the worker node set and, according to the load of each worker node, send one training task to each worker node in a first part of the worker nodes, where the worker node set includes a second part of worker nodes in addition to the first part, and the load of each worker node in the first part that receives a training task is smaller than the load of each worker node in the second part.
  • for example, the control node can sort the worker nodes in the worker node set by load and assign training tasks to the first n worker nodes with lighter loads in turn, while the last m worker nodes with heavier loads may not be assigned training tasks (the worker node set includes m+n worker nodes).
  • in this way, in the process of training the ensemble learning model, it can be avoided that an excessively high load on some worker nodes lowers the training efficiency of the whole ensemble learning model or even causes the training process to fail.
  • other possible implementations may also be used to assign training tasks to the working nodes, etc., which is not limited in this embodiment of the present application.
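  • as an illustration of the load-based assignment described above, the following Python sketch dispatches one training task to each of the least-loaded worker nodes; the function and field names (assign_tasks_by_load, "load", "id") are illustrative assumptions, not terms from the present application.

```python
def assign_tasks_by_load(training_tasks, workers):
    """Assign each training task to one of the least-loaded worker nodes.

    training_tasks: list of task objects, one per sub-learning model.
    workers: list of dicts such as {"id": 3, "load": 0.42}.
    Returns a mapping {worker_id: [tasks...]}.
    """
    if not workers or not training_tasks:
        return {}
    # Sort worker nodes by current load (ascending).
    ranked = sorted(workers, key=lambda w: w["load"])
    # Only the n least-loaded nodes receive tasks; the remaining heavily
    # loaded nodes are skipped, as described above.
    n = min(len(training_tasks), len(ranked))
    assignment = {ranked[i]["id"]: [] for i in range(n)}
    for i, task in enumerate(training_tasks):
        # Cycle over the n selected nodes so that one node may hold the
        # training tasks of several sub-learning models if needed.
        worker_id = ranked[i % n]["id"]
        assignment[worker_id].append(task)
    return assignment
```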
  • the sub-learning model in the ensemble learning model may specifically be a decision tree model.
  • the sub-learning model may also be other types of models, which are not limited in this embodiment of the present application.
  • in another possible implementation, the training termination condition of the sub-learning model includes at least one of the following conditions: the number of training samples corresponding to each leaf node of the sub-learning model is less than or equal to a number threshold, or the impurity index of the training sample set used to train the sub-learning model is less than an impurity threshold, or the depth of the nodes in the sub-learning model is greater than or equal to a depth threshold.
  • when the training termination condition is satisfied, the control node may end the training of the ensemble learning model.
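  • a minimal sketch of such a termination check is given below; the function name meets_termination_condition and the threshold values are illustrative assumptions rather than values prescribed by the present application.

```python
def meets_termination_condition(tree, sample_counts, impurity,
                                count_threshold=10,
                                impurity_threshold=0.01,
                                depth_threshold=16):
    """Return True if any of the three termination conditions holds.

    tree: object exposing leaf_nodes() and depth().
    sample_counts: dict mapping leaf node id -> number of training samples.
    impurity: impurity index (e.g. Gini) of the training sample set.
    """
    # Condition 1: every leaf node holds at most count_threshold samples.
    if all(sample_counts[leaf] <= count_threshold for leaf in tree.leaf_nodes()):
        return True
    # Condition 2: the impurity of the training sample set is already low.
    if impurity < impurity_threshold:
        return True
    # Condition 3: the tree has reached the maximum allowed depth.
    if tree.depth() >= depth_threshold:
        return True
    return False
```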
  • the present application also provides a training apparatus for an ensemble learning model, where the training apparatus includes the individual modules for performing the training method of the ensemble learning model in the first aspect or any possible implementation of the first aspect.
  • the present application also provides a device, comprising a processor and a memory; the memory is used to store instructions, and when the device is running, the processor executes the instructions stored in the memory, so that the device executes the training method of the ensemble learning model in the first aspect or any implementation of the first aspect; the memory may be integrated in the processor, or may be independent of the processor.
  • optionally, the device may also include a bus, through which the processor is connected to the memory.
  • the memory may include read-only memory and random access memory.
  • the present application also provides a model training system, where the model training system includes a control node and worker nodes; the control node is used to execute the training method of the ensemble learning model in the first aspect or any implementation of the first aspect, and the worker nodes are used to execute the training tasks sent by the control node.
  • the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer is caused to execute the method described in the first aspect or any one of the implementations of the first aspect.
  • the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to execute the method described in the first aspect or any one of the implementations of the first aspect.
  • on the basis of the implementations provided by the above aspects, the present application may further combine them to provide more implementations.
  • FIG. 1 is a schematic diagram of the architecture of an exemplary model training system provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an exemplary application scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a training method for an integrated learning model provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of an exemplary configuration interface provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of each worker node using a training sample set to train each sub-learning model;
  • FIG. 6 is a schematic structural diagram of a training device for an integrated learning model provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.
  • the model training system includes a control node (driver) 100 and a plurality of worker nodes (workers). Among them, the control node 100 and the worker nodes cooperate to complete the training of the integrated learning model, and the control node 100 is used to generate the training task of the integrated learning model, and send the above training tasks to the worker nodes; the worker nodes are used to execute the received training tasks.
  • the purpose of executing the training tasks is to obtain the sub-learning models in the ensemble learning model. It is worth noting that FIG. 1 takes a model training system including 10 worker nodes, namely worker nodes 201 to 210, as an example for illustration.
  • in practical applications, the model training system may include any number (more than one) of worker nodes; meanwhile, the number of control nodes 100 is also not limited to one.
  • the control node 100 and each working node may interact through an intermediate device (such as a switch, etc., not shown in FIG. 1 ).
  • in a possible application scenario, the model training system shown in FIG. 1 can be deployed in a cluster including multiple servers, where the control node 100 and worker node 201 to worker node 210 may each be deployed on servers in the cluster, and the servers in the cluster are divided into control nodes and worker nodes according to the functions they perform.
  • server 1 to server n form a cluster, in which a big data platform such as a database (for example, Hadoop) or a computing engine (for example, Spark) can be deployed, and the model training system for training the ensemble learning model can be deployed based on the above big data platform to implement training and inference of the ensemble learning model.
  • the model training system can train an ensemble learning model on the big data platform, and the ensemble learning model can include multiple sub-learning models; then, the ensemble learning model can be used to perform inference on known input data. Specifically, the known input data is input into the multiple sub-learning models respectively, each sub-learning model performs inference on the known input data and outputs a corresponding inference result, and finally the inference result of the ensemble learning model is determined from the inference results of the multiple sub-learning models by voting; for example, the inference result that receives the most votes among the inference results of the multiple sub-learning models may be taken as the inference result of the ensemble learning model.
  • the ensemble learning model may be a random-forest-based ensemble learning model, such as the random forest implementations in Spark ML or Spark MLlib; the ensemble learning model may also be another type of ensemble learning model, which is not limited in this application.
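  • as an illustration of the voting-based inference described above, the following sketch aggregates the outputs of the sub-learning models by majority vote; the function name predict_by_voting and the assumed predict() method on each sub-model are illustrative assumptions, not the API of any particular library.

```python
from collections import Counter

def predict_by_voting(sub_models, input_data):
    """Infer with the ensemble: each sub-learning model votes, majority wins.

    sub_models: list of trained sub-learning models, each with a predict() method.
    input_data: a single sample (feature vector) to classify.
    """
    # Collect one inference result (class label) from every sub-learning model.
    votes = [model.predict(input_data) for model in sub_models]
    # The label with the most votes is the ensemble's inference result.
    label, _count = Counter(votes).most_common(1)[0]
    return label
```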
  • it should be noted that the application scenario of deploying the model training system shown in FIG. 2 is only an exemplary illustration, and the model training system can also be deployed in other possible scenarios in practical applications. For example, the control node 100 and the worker nodes in FIG. 1 may also be deployed on a single server; in this case, the control node 100 and the worker nodes may specifically be different computing units on that server, divided into control node and worker nodes by function. In this embodiment, the applicable application scenarios of the model training system are not limited.
  • in the process of training the ensemble learning model in the model training system, each sub-learning model can be trained by one worker node, so that the training results of each sub-learning model can all be located on one worker node; thus, after the worker node completes the training of the sub-learning model, it does not need to obtain the training results for that sub-learning model from other worker nodes. In this way, the amount of data that needs to be communicated between worker nodes in the process of training the sub-learning models can be effectively reduced, which not only reduces the resource consumption (mainly communication resources) required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model.
  • specifically, when training the ensemble learning model, the model training system can generate one training task for each sub-learning model. For example, when the model training system receives a training request for the ensemble learning model, it can parse from the training request the number of sub-learning models included in the ensemble learning model and generate the same number of training tasks; the model training system can then assign each training task to a worker node for execution. Optionally, the model training system can also assign the training tasks of multiple sub-learning models to one worker node for execution.
  • in a further possible implementation, the model training system may allocate training tasks to corresponding worker nodes according to the load of the worker nodes. For example, the model training system (specifically, the control node 100 in the model training system) obtains the load of each worker node and sorts the worker nodes by load; then, the model training system can preferentially allocate training tasks to the worker nodes with smaller loads, while the worker nodes with larger loads may not be allocated training tasks, so that load balance can be achieved in the model training system. Of course, in other possible examples, the model training system can also directly issue the training tasks in sequence according to the numbering order of the worker nodes; the specific implementation is not limited.
  • the sub-learning model in the ensemble learning model may be a decision tree model.
  • the decision tree model refers to a tree diagram composed of decision points, strategy points (event points), and results, and can be applied to sequential decision-making. In practical applications, the maximum expected benefit value or the lowest expected cost can be used as the decision criterion, the decision results of various schemes under different conditions are solved graphically, and the final decision is then given by comparing the decision results of the various schemes.
  • the sub-learning model may also be other models having a tree structure, etc., which is not limited in this embodiment.
  • FIG. 3 is a schematic flowchart of the training method of the integrated learning model provided by the embodiment of the present application.
  • in the following description, the sub-learning model being a decision tree model is taken as an example; the method can be applied to the model training system shown in FIG. 1 above, or to other applicable model training systems.
  • the method may include:
  • the control node 100 obtains a training request of the ensemble learning model.
  • the model training system may trigger the training process of the ensemble learning model when receiving a training request for the ensemble learning model.
  • in an exemplary implementation, the model training system may have a communication connection with a user terminal, the user may perform a trigger operation for training the ensemble learning model on the user terminal, and the user terminal generates a corresponding training request for the ensemble learning model based on the operation and then sends it to the model training system, so that the control node 100 in the model training system obtains the training request and triggers the execution of the subsequent model training process.
  • the control node 100 generates a training task set according to the received training request, where the training task set includes multiple training tasks, each training task in the multiple training tasks is executed by one worker node, each training task is used for training at least one sub-learning model in the ensemble learning model, and different training tasks are used for training different sub-learning models.
  • in this embodiment, the control node 100 in the model training system may obtain training samples for training the multiple sub-learning models in the ensemble learning model, where the number of training samples is usually more than one; the control node 100 may further generate multiple training sample sets from these training samples, and the samples included in different training sample sets may differ.
  • control node 100 may sample the training samples by means of sampling with replacement to obtain P training sample sets.
  • control node 100 may also generate multiple training sample sets in other manners, which is not limited in this embodiment.
  • the number (P) of training sample sets sampled by the control node 100 may be determined by the user. For example, the control node can present the configuration interface shown in FIG. 4 to the user, so that the user can configure the parallelism parameter of model training to P on the configuration interface, that is, P model training processes are executed at the same time; the control node 100 can then generate P training sample sets based on the user's configuration, each of which can support the training of one sub-learning model, as shown in FIG. 4.
  • in this embodiment, the nodes that execute the training process are specifically the worker nodes. Therefore, the control node 100 can distribute the generated multiple training sample sets to different worker nodes respectively; for example, each training sample set can be distributed to one worker node.
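  • a minimal sketch of the sampling-with-replacement step described above is given below; the name build_sample_sets and the use of Python's random module are illustrative assumptions, and the present application does not prescribe a specific API.

```python
import random

def build_sample_sets(training_samples, p):
    """Generate p training sample sets by sampling with replacement (bootstrap).

    training_samples: list of all available training samples.
    p: parallelism parameter configured by the user (number of sample sets,
       one per sub-learning model to be trained).
    """
    n = len(training_samples)
    sample_sets = []
    for _ in range(p):
        # Each set has the same size as the original data and is drawn with
        # replacement, so different sets generally contain different samples.
        sample_sets.append([random.choice(training_samples) for _ in range(n)])
    return sample_sets
```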
  • in addition, the control node 100 can also generate a training task set based on the received training request, where the training task set includes a plurality of training tasks, and each training task can be used to instruct a worker node to execute the training process of one sub-learning model based on the training sample set it has received.
  • in an exemplary implementation, the training request received by the control node 100 may specifically include an instruction to train the ensemble learning model and the number of sub-learning models included in the ensemble learning model, so that, after receiving the training request, the control node 100 can start the training process of the ensemble learning model based on the instruction in the training request and generate an equal number of training tasks according to the number of sub-learning models, where each training task is used to instruct a worker node to complete the training process of one sub-learning model, and different training tasks train different sub-learning models.
  • the control node 100 respectively sends the training tasks in the training task set to the plurality of working nodes in the working node set.
  • after the control node 100 generates the training task set, it can issue the training tasks in the training task set to the worker nodes, where each training task can be issued to one worker node for execution, so that the training of the sub-learning model corresponding to the training task can be completed entirely by that one worker node.
  • in an exemplary implementation, the control node 100 may select, based on the load of each worker node, the worker nodes that execute the training tasks.
  • specifically, the control node 100 may determine a first part of worker nodes and a second part of worker nodes in the worker node set according to the load ranking of each worker node in the set, where the load of each worker node in the first part is less than or equal to the load of each worker node in the second part; then, the control node 100 can issue the training tasks one by one to the worker nodes in the first part, while, because the load of the worker nodes in the second part is relatively high, the control node 100 may not assign training tasks to them.
  • of course, in other embodiments, the model training system may also randomly select worker nodes to perform the training tasks, or issue the training tasks sequentially in the order of the worker node numbers, etc.; in this embodiment, the specific implementation of which worker node each training task is delivered to for execution is not limited.
  • in practical applications, each worker node may include multiple executors, where an executor may be, for example, a logical execution unit in the worker node, and the executors are used to execute the training tasks received by the worker node.
  • meanwhile, each training task may include multiple subtasks (tasks), and each subtask may be used to instruct a worker node to execute part of the model training process of one sub-learning model, so that the control node 100 can use multiple executors on the worker node to execute different subtasks in the training task. In this way, by executing multiple different subtasks in parallel with multiple executors, the efficiency with which a worker node trains a single sub-learning model can be improved.
  • for example, each executor on the worker node can use part of the samples (a sample block) in training sample set 1 to train a tree node, and different executors use different sample blocks when training the tree node.
  • by aggregating the training results of the tree node of decision tree model 1 obtained on each sample block, the complete training result of decision tree model 1 on training sample set 1 for that tree node can be obtained, as illustrated in the sketch below.
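  • the following sketch illustrates, under assumed names (partial_histogram, train_tree_node_in_parallel, partial results represented as counters), how the partial results produced by each executor on its own sample block could be merged into the complete training result for a tree node; it is a simplification for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_histogram(sample_block, attribute):
    """One executor's work: count attribute values within its own sample block."""
    return Counter(sample[attribute] for sample in sample_block)

def train_tree_node_in_parallel(sample_blocks, attribute):
    """Train one tree node: each executor processes a different sample block,
    and the partial histograms are merged into the complete training result."""
    merged = Counter()
    if not sample_blocks:
        return merged
    with ThreadPoolExecutor(max_workers=len(sample_blocks)) as pool:
        for partial in pool.map(lambda b: partial_histogram(b, attribute),
                                sample_blocks):
            merged.update(partial)  # aggregation stays on the same worker node
    return merged
```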
  • in this embodiment, the number of training tasks can be the same as the number of training sample sets, so that when multiple worker nodes train the sub-learning models, each worker node can use the training sample set it has received and perform the corresponding model training process according to the subtasks included in a single training task; moreover, the number of subtasks in each training task can be determined according to the number of sample blocks into which the training sample set is divided.
  • for example, when a training sample set is divided into 5 sample blocks, the control node 100 can generate 5 subtasks according to these 5 sample blocks, with each subtask corresponding to one sample block and different subtasks corresponding to different sample blocks; the 5 generated subtasks can constitute one training task, so that the control node 100 can generate different training tasks based on different training sample sets, as shown in the sketch below.
  • the number of sample blocks for each training sample set can be configured by the user on the configuration interface shown in FIG. 4, which is not limited in this embodiment.
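  • a sketch of how a training task might be assembled from sample blocks is given below; the names TrainingTask, split_into_blocks, and build_training_task are hypothetical and used purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingTask:
    sub_model_id: int                               # the sub-learning model this task trains
    subtasks: list = field(default_factory=list)    # one subtask per sample block

def split_into_blocks(sample_set, num_blocks):
    """Partition one training sample set into num_blocks contiguous blocks."""
    block_size = (len(sample_set) + num_blocks - 1) // num_blocks
    return [sample_set[i:i + block_size]
            for i in range(0, len(sample_set), block_size)]

def build_training_task(sub_model_id, sample_set, num_blocks):
    """Generate one training task whose subtasks each cover one sample block."""
    task = TrainingTask(sub_model_id=sub_model_id)
    for block_id, block in enumerate(split_into_blocks(sample_set, num_blocks)):
        task.subtasks.append({"block_id": block_id, "samples": block})
    return task
```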
  • the number of training tasks and the number of worker nodes may be the same or different.
  • each worker node can execute all the subtasks in one training task.
  • one worker node can execute all subtasks in multiple training tasks, that is, one worker node can complete the training of multiple sub-learning models.
  • it should be noted that the model training system usually performs multiple rounds of iterative training when training the ensemble learning model, and during each round of model training, the control node 100 can regenerate multiple training tasks and send them to the worker nodes for execution. During the multiple rounds of iterative training, the training sample set used to train each sub-learning model can remain unchanged, while the content of the subtasks included in the training task for a sub-learning model in the current round may differ from the subtasks included in the training task for that sub-learning model in the previous round.
  • for example, when the sub-learning model is a decision tree model, a subtask in the previous round of training may be used to train tree node 1 in the decision tree model, while a subtask in the current round of training is used to train a different tree node of the decision tree model.
  • the current round of training refers to the round that the model training system is currently performing in the process of training the ensemble learning model; for example, if the model training system is performing the second round of model training on the ensemble learning model, the second round of model training is the current round of training.
  • it should be noted that the control node 100 may first allocate one sub-learning model to each training task, that is, the subtasks in one training task are used to train one sub-learning model. When the number of sub-learning models is greater than the number of training tasks, after allocating one sub-learning model to each training task, the control node 100 continues to allocate another sub-learning model to each training task from the remaining sub-learning models; in this case, the subtasks in one training task can be used to train multiple sub-learning models.
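  • a sketch of this allocation strategy, cycling over the training tasks when there are more sub-learning models than tasks, is shown below; the function name allocate_sub_models is an assumption used only for illustration.

```python
def allocate_sub_models(sub_model_ids, num_tasks):
    """Allocate sub-learning models to training tasks.

    Each task first receives one sub-learning model; if there are more
    sub-learning models than tasks, the remaining models are distributed
    over the tasks again, so one task may train several sub-learning models.
    Returns a list of length num_tasks, each entry a list of sub-model ids.
    """
    allocation = [[] for _ in range(num_tasks)]
    for i, model_id in enumerate(sub_model_ids):
        allocation[i % num_tasks].append(model_id)
    return allocation

# Example: 5 sub-learning models and 3 training tasks
# allocate_sub_models([0, 1, 2, 3, 4], 3) -> [[0, 3], [1, 4], [2]]
```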
  • the sub-task in the training task may specifically be used to determine the best split point of a tree node in the decision tree model.
  • determining the best split point refers to determining a tree node in the decision tree model that is suitable for sample splitting, where the training samples contained in the two child nodes obtained after the tree node is split fall into different attribute-value ranges respectively.
  • the control node 100 may create a list of tree nodes to be trained (to-be-trained tree node list) and initialize the tree nodes in the list as the root nodes of each decision tree model; that is, when the number of decision tree models is x, the tree node list includes x root nodes.
  • when performing each round of model training, the control node 100 can select tree nodes from the to-be-trained tree node list and add them to the current round training tree node list (cur-tree node list).
  • specifically, the control node 100 may add the tree nodes in the to-be-trained tree node list to the current round training tree node list one by one according to the order of the tree nodes' index numbers.
  • further, the control node 100 may limit the number of nodes added to the current round training node list so that it does not exceed a node number threshold, that is, limit the length of the current round training node list, while the nodes in the to-be-trained tree node list that do not participate in the current round of training can participate in the next round of model training; in this way, when there are too many nodes in the to-be-trained tree node list, the control node 100 can control the training to be performed in batches, as sketched below.
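  • a sketch of this batched selection is given below; the function name select_current_round_nodes and the default node threshold are illustrative assumptions.

```python
def select_current_round_nodes(to_be_trained, node_threshold=64):
    """Move at most node_threshold tree nodes from the to-be-trained list
    into the current round training list; the rest wait for later rounds."""
    # Nodes are taken in index order, as described above.
    current_round = to_be_trained[:node_threshold]
    remaining = to_be_trained[node_threshold:]
    return current_round, remaining
```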
  • control node 100 can generate the mapping relationship between the subtask and the tree node to be trained, and broadcast it to each worker node.
  • the control node 100 may further deliver the subtasks in the generated training tasks to the worker nodes, for example, by delivering the subtasks to the worker nodes through a round-robin mechanism.
  • the worker node can perform the corresponding model training task by using the training sample set corresponding to the training task according to the received training task.
  • in specific implementation, the executors on each worker node can determine the tree node to be trained according to the mapping relationship, broadcast by the control node 100, between the subtasks and the tree nodes to be trained, together with the subtasks issued by the control node 100, and use the training samples corresponding to the subtask to train that tree node. Specifically, the executor first determines a sample attribute used for splitting the tree node, such as age, and then determines, for the samples in the sample block corresponding to the subtask, which training samples are classified into one class, for example those whose attribute value of the age attribute is greater than 23, and which training samples are classified into another class, for example those whose attribute value of the age attribute is smaller than 23.
  • the sample attribute may be selected by a preset random algorithm, or selected in other ways. In this way, after the multiple executors on the worker node have each completed their model training, the worker node can obtain the training result of all training samples in the entire training sample set for that tree node, and the training result may be, for example, a distribution histogram of the training samples for that tree node.
  • because all the subtasks in one training task are completed by the executors on one worker node, the worker node can directly obtain the training results of the current round of training for the tree nodes without needing to obtain them from other worker nodes, so that different worker nodes do not need to exchange their training results for the decision tree model. In this way, the amount of data communicated between worker nodes during the training of the ensemble learning model can be effectively reduced.
  • after the worker node obtains the complete training result for a tree node in the current round of training, it can calculate the best split point for that tree node, where the best split point is used to split the training samples contained in the tree node (for the root node, the corresponding training samples are all the training samples in the entire training sample set).
  • specifically, the complete training result may indicate the sample value (that is, the attribute value) of each training sample for a predetermined sample attribute, and the worker node may determine, from the sample values of all training samples for that sample attribute, the sample value that achieves the maximum information gain; taking this sample value as the boundary, the training samples are divided into two parts, and the sample value that achieves the maximum information gain is the best split point.
  • in an exemplary implementation, the worker node can determine, from the obtained complete training result, the sample values of all training samples in the training sample set for the above attribute feature, and determine by traversal which of these sample values is the best split point.
  • specifically, let the best split point be a variable s whose value is one of the above sample values. The training sample set D of size N is divided into two sets, namely a left training sample set D_left and a right training sample set D_right; for example, the training samples whose sample values are smaller than the variable s are classified into the left training sample set, and the training samples whose sample values are greater than or equal to the variable s are classified into the right training sample set.
  • in this way, the worker node can calculate the information gain IG(D, s) corresponding to the value of the variable s, where the information gain IG can be calculated, for example, by the following formulas (1) and (2):

    IG(D, s) = Impurity(D) - (|D_left| / |D|) * Impurity(D_left) - (|D_right| / |D|) * Impurity(D_right)    (1)

    Impurity(D) = 1 - sum_{i=1}^{K} (p_i)^2    (2)

  • here, Impurity refers to the impurity of the training sample set, which can also be called the "impurity index", |D| = N is the total number of training samples, |D_left| and |D_right| are the numbers of training samples in D_left and D_right, K is the number of sample categories in the training sample set, and p_i is the probability of the i-th sample category in the training sample set. It is worth noting that Impurity in formula (2) is given using the Gini index as an example; in practical applications, Impurity may also be measured by entropy, variance, etc., which is not limited in this embodiment.
  • the worker node can traverse and calculate the information gain corresponding to each candidate value of the variable s, so that the value of s corresponding to the maximum information gain can be used as the best split point.
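  • putting formulas (1) and (2) together, a worker node's search for the best split point could look like the following sketch; the function names gini and best_split_point are illustrative, and the Gini index is used as the impurity measure as in formula (2).

```python
from collections import Counter

def gini(labels):
    """Impurity of a sample set, formula (2): 1 - sum_i (p_i)^2 (Gini index)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_point(values, labels):
    """Traverse the candidate values of s and return the one with maximum
    information gain IG(D, s), formula (1)."""
    n = len(values)
    base_impurity = gini(labels)
    best_s, best_gain = None, float("-inf")
    for s in sorted(set(values)):
        # Samples with value < s go to D_left, the rest to D_right.
        left = [lab for v, lab in zip(values, labels) if v < s]
        right = [lab for v, lab in zip(values, labels) if v >= s]
        if not left or not right:
            continue
        gain = (base_impurity
                - len(left) / n * gini(left)
                - len(right) / n * gini(right))
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain
```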
  • the worker node then feeds back the calculated best split point to the control node 100 as the final training result of the current round of training, so that the control node 100 can obtain, from the multiple worker nodes, the training results of the multiple decision tree models for the current round of training (for example, the first round of training results).
  • for each decision tree model, the control node 100 may split a node according to the best split point corresponding to the decision tree model, for example splitting the root node into a left node and a right node, where the training samples corresponding to the left node may be the training samples in the above left training sample set D_left, and the training samples corresponding to the right node may be the training samples in the above right training sample set D_right.
  • then, the control node 100 can judge whether the decision tree models obtained after the current round of training satisfy the training termination condition; if so, the multiple decision tree models in the ensemble learning model are the decision tree models obtained in the current round of training, and if not, the control node 100 may continue to perform the next round of training on the decision tree models.
  • the training termination condition includes at least one of the following:
  • Mode 1: the numbers of training samples corresponding to the leaf nodes of the decision tree model are all smaller than a number threshold.
  • Mode 2: the impurity index of the training sample set used for training the sub-learning model is smaller than an impurity threshold.
  • Mode 3: the depth of the nodes in the sub-learning model is greater than or equal to a depth threshold.
  • control node 100 can stop the model training process, that is, complete the training of the ensemble learning model.
  • the above training termination condition is only an example, and in practical application, the training termination condition may also be implemented in other ways.
  • control node 100 may continue to perform the next round of training process for each decision tree model.
  • specifically, the control node 100 may remove, from the to-be-trained tree node list and the current round training tree node list, the tree nodes that have already been split, and add the nodes obtained by splitting each decision tree model in the previous round to the to-be-trained tree node list. Then, the control node 100 adds the tree nodes in the current to-be-trained tree node list to the current round training tree node list and, based on a process similar to the above, uses the worker nodes to split the tree nodes in the current round training tree node list again. In this way, each decision tree model of the ensemble learning model is trained through multiple iterations until each decision tree model satisfies the training termination condition and the training of the ensemble learning model is completed.
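  • the overall multi-round loop described above can be summarized by the following sketch; all names are illustrative assumptions, and split_node stands in for the per-node training and splitting that, in the system described here, is carried out on the worker nodes.

```python
def train_decision_trees(roots, split_node, meets_termination_condition,
                         node_threshold=64):
    """Iteratively train the decision tree models of the ensemble.

    roots: list of root nodes, one per decision tree model.
    split_node: callable(node) -> (left_child, right_child) or None,
                performed on the worker nodes in a real system.
    meets_termination_condition: callable() -> bool, checked each round.
    """
    to_be_trained = list(roots)
    while to_be_trained and not meets_termination_condition():
        # Select at most node_threshold nodes for the current round (batching).
        current_round = to_be_trained[:node_threshold]
        to_be_trained = to_be_trained[node_threshold:]
        for node in current_round:
            children = split_node(node)  # best-split computation on a worker
            if children:
                # Newly split child nodes are trained in later rounds.
                to_be_trained.extend(children)
    return roots
```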
  • based on the above implementation, each sub-learning model in the ensemble learning model is trained by one worker node, so that the training results of each sub-learning model can all be located on one worker node, and the worker node therefore does not need to obtain the training results for the sub-learning model from other worker nodes. In this way, the amount of data that needs to be communicated between worker nodes in the process of training the sub-learning models can be effectively reduced, which not only reduces the resource consumption required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model.
  • it is worth noting that, in the above embodiments, the control node and the worker nodes being deployed in a cluster including multiple servers is taken as an example for illustration; in other embodiments, the above training process of the ensemble learning model can also be implemented by a cloud service provided by a cloud data center.
  • specifically, the user can send a training request for the ensemble learning model to the cloud data center through a corresponding terminal device, so as to request the cloud data center to train the ensemble learning model and feed the trained model back to the user.
  • the cloud data center can call the corresponding computing resources to complete the training of the integrated learning model.
  • for example, the cloud data center can invoke part of its computing resources (such as one server supporting the cloud service) to realize the functions of the above control node 100, and invoke another part of its computing resources (such as multiple servers supporting the cloud service) to realize the functions of the above multiple worker nodes.
  • the cloud data center completes the training process of the integrated learning model based on the invoked computing resources, and reference may be made to the relevant descriptions in the above embodiments, which will not be repeated here.
  • after the cloud data center completes the training of the ensemble learning model, it can send the trained ensemble learning model to the terminal device on the user side, so that the user can obtain the required ensemble learning model.
  • the training method of the ensemble learning model provided by the present application is described in detail above with reference to FIG. 1 to FIG. 5; the training apparatus for the ensemble learning model and the device for training the ensemble learning model provided by the present application will be described below with reference to FIG. 6 and FIG. 7.
  • FIG. 6 is a schematic structural diagram of an integrated learning model training device provided by the present application.
  • the integrated learning model training device 600 can be applied to a control node in a model training system, and the model training system further includes a working node.
  • the device 600 includes:
  • a generating module 602, configured to generate a training task set according to the training request, where the training task set includes multiple training tasks, each training task in the multiple training tasks is executed by one worker node, each training task is used for training at least one sub-learning model in the ensemble learning model, and different training tasks are used for training different sub-learning models;
  • the communication module 603 is configured to send the training tasks in the training task set to a plurality of working nodes in the working node set respectively.
  • it should be understood that the apparatus 600 in this embodiment of the present application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the apparatus 600 and its respective modules can also be software modules.
  • the training request includes an instruction to train the ensemble learning model and the number of sub-learning models in the ensemble learning model.
  • the generating module 602 is specifically configured to generate the training task set according to the number of sub-learning models included in the training request, where the number of training tasks included in the training task set is equal to the number of sub-learning models.
  • the communication module 603 specifically includes:
  • a load obtaining unit configured to obtain the load of each working node in the set of working nodes
  • the sending unit is configured to send a training task to each worker node in the first part of worker nodes according to the load of each worker node, where the worker node set includes the first part of worker nodes and the second part of worker nodes, and the load of each worker node in the first part is smaller than the load of each worker node in the second part.
  • the sub-learning model includes a decision tree model.
  • the training termination condition of the sub-learning model includes at least one of the following conditions:
  • the number of training samples corresponding to the leaf nodes of the sub-learning model is less than the number threshold; or,
  • the impurity index of the training sample set used to train the sub-learning model is less than the impurity threshold; or,
  • the depth of the nodes in the sub-learning model is greater than or equal to the depth threshold.
  • in this way, each sub-learning model in the ensemble learning model is trained by one worker node, so that the training results of each sub-learning model can all be located on one worker node, and after completing the training of the sub-learning model, the worker node does not need to obtain the training results for that sub-learning model from other worker nodes. As a result, the amount of data that needs to be communicated between worker nodes in the process of training the sub-learning models can be effectively reduced, which not only reduces the resource consumption required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model.
  • the training apparatus 600 for the ensemble learning model according to the embodiments of the present application may correspond to executing the methods described in the embodiments of the present application, and the above and other operations and/or functions of the respective modules of the training apparatus 600 are respectively intended to implement the corresponding processes of the methods in FIG. 3; for the sake of brevity, details are not repeated here.
  • FIG. 7 is a schematic diagram of a device 700 provided by this application.
  • the device 700 includes a processor 701 , a memory 702 , and a communication interface 703 .
  • the processor 701 , the memory 702 , and the communication interface 703 communicate through the bus 704 , and can also communicate through other means such as wireless transmission.
  • the memory 702 is used for storing instructions
  • the processor 701 is used for executing the instructions stored in the memory 702 .
  • optionally, the device 700 may further include a memory unit 705, and the memory unit 705 may be connected to the processor 701, the memory 702, and the communication interface 703 through the bus 704.
  • the memory 702 stores program codes
  • the processor 701 can call the program codes stored in the memory 702 to perform the following operations:
  • a training task set is generated according to the training request, where the training task set includes a plurality of training tasks, each training task in the plurality of training tasks is executed by one worker node, each training task is used to train at least one sub-learning model in the ensemble learning model, and different training tasks are used to train different sub-learning models;
  • the training tasks in the training task set are respectively sent to a plurality of working nodes in the working node set.
  • it should be understood that, in this embodiment of the present application, the processor 701 may be a CPU, and the processor 701 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), an artificial intelligence (AI) chip, or at least one of these.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the memory 702, which may include read-only memory and random access memory, provides instructions and data to the processor 701.
  • Memory 702 may also include non-volatile random access memory.
  • memory 702 may also store device type information.
  • the memory 702 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically programmable Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • the volatile memory may be random access memory (RAM), which acts as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory, and direct rambus RAM (DR RAM).
  • the communication interface 703 is used to communicate with other devices connected to the device 700 .
  • the bus 704 may also include a power bus, a control bus, a status signal bus, and the like.
  • the various buses are labeled as bus 704 in the figure.
  • it should be understood that the device 700 according to this embodiment of the present application may correspond to the training apparatus 600 of the ensemble learning model in the embodiments of the present application, and may correspond to the control node 100 that executes the method shown in FIG. 3; the above and other operations and/or functions implemented by the device 700 are respectively intended to implement the corresponding processes of the methods in FIG. 3, and are not repeated here for brevity.
  • as a possible embodiment, the device provided by the present application may also be composed of multiple devices as shown in FIG. 7 that communicate with one another through a network, and these devices are used to implement the corresponding processes of the methods in FIG. 3 above; for the sake of brevity, details are not repeated here.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or in a wireless manner (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A training method for an integrated learning model, which method is applied to a model training system comprising a control node and a working node. The method comprises: when training an integrated learning model, a control node acquiring a training request for the integrated learning model; generating, according to the training request, a training task set comprising a plurality of training tasks; and then, the control node respectively sending the training tasks in the training task set to a plurality of working nodes in a working node set, wherein each training task is executed by one working node, each training task is used for training at least one learning sub-model in the integrated learning model, and different training tasks are used for training different learning sub-models. A training result of each learning sub-model can be processed by a working node, such that the amount of data needing to be communicated between working nodes during the process of training the learning sub-models can be effectively reduced, and the training efficiency and success rate of an integrated learning model are improved.

Description

Training method, device, system and related equipment for ensemble learning model

Technical Field
The present application relates to the field of computer technology, and in particular, to a training method, apparatus, system, device, and computer-readable storage medium for an ensemble learning model.
Background
An ensemble learning model can be used to handle classification, regression, and other problems in the field of machine learning, in order to obtain better classification accuracy and prediction performance. Ensemble learning models are widely used in industries such as Anping (public safety), telecom operators, and finance, as well as in various production systems.
At present, in order to meet the performance and accuracy requirements of model training, an ensemble learning model is usually trained on a cluster including multiple servers; specifically, each server uses a set of training samples to train the ensemble learning model. Because each server can obtain only part of the training results of the ensemble learning model through its own model training, different servers also exchange the partial training results obtained by their respective training, so that the trained ensemble learning model is determined based on the aggregated training results. However, when the data size of the training samples is large, the amount of partial training results that needs to be exchanged between different servers is also large, which makes the training efficiency of the ensemble learning model low because of the large amount of data exchanged between servers, and may even cause the training of the ensemble learning model to fail due to training timeout or bandwidth overload between servers, making it difficult to meet the user's application needs. Therefore, how to provide an efficient training method for an ensemble learning model has become an urgent technical problem to be solved.
Summary of the Invention
The present application provides a training method, apparatus, device, system, computer-readable storage medium, and computer program product for an ensemble learning model, so as to improve the training efficiency and training success rate of the ensemble learning model.
In a first aspect, a training method for an ensemble learning model is provided, and the method can be applied to a model training system including a control node and worker nodes. When training the ensemble learning model, the control node obtains a training request for the ensemble learning model and generates a training task set according to the training request, where the generated training task set includes a plurality of training tasks. Then, the control node sends the training tasks in the training task set to a plurality of worker nodes in a worker node set, respectively. Each training task can be assigned to one worker node and executed by that worker node, each training task is used to train at least one sub-learning model in the ensemble learning model, and different training tasks are used to train different sub-learning models. Of course, a worker node may be assigned the training task of only one sub-learning model, or may be assigned the training tasks of multiple sub-learning models.
Since each sub-learning model in the ensemble learning model is trained by one working node, the training results of each sub-learning model can all be handled by that working node, so that after the working node completes the training of a sub-learning model, it does not need to obtain training results for that sub-learning model from other working nodes. In this way, the amount of data that needs to be communicated between working nodes during the training of the sub-learning models can be effectively reduced, which not only reduces the resource consumption required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model.
In a possible implementation, the training request includes an instruction to train the ensemble learning model and the number of sub-learning models in the ensemble learning model. After receiving the training request, the control node can trigger the training of the ensemble learning model based on the instruction in the training request, and generate an equal number of training tasks for training the sub-learning models according to the number of sub-learning models included in the training request. Of course, in other possible implementations, the number of sub-learning models in the ensemble learning model may be fixed; in this case, the training request may include only a training instruction for the ensemble learning model, and the control node may generate a fixed number of sub-learning models based on that training instruction, so as to complete the training of the ensemble learning model.
In another possible implementation, when the control node generates the training task set according to the training request, it may specifically generate the training task set according to the number of sub-learning models included in the training request, and the number of training tasks included in the training task set is equal to the number of sub-learning models, e.g., each training task may be used to train one sub-learning model. Of course, in other possible implementations, multiple training tasks may also be used to train one sub-learning model, which is not limited in this embodiment.
In another possible implementation, when the control node sends the training tasks in the training task set to the plurality of working nodes in the working node set, it may specifically obtain the load of each working node in the working node set and, according to the loads of the working nodes, send one training task to each working node in a first part of working nodes, where, in addition to the first part of working nodes, the working node set further includes a second part of working nodes, and the load of each working node in the first part of working nodes that receive training tasks is smaller than the load of each working node in the second part. For example, the control node may sort the working nodes in the working node set by load and assign the training tasks to the n working nodes with the smaller loads, while the m working nodes with the larger loads are not assigned training tasks (the working node set includes m+n working nodes). In this way, during the training of the ensemble learning model, it can be avoided that an overloaded working node drags down the training efficiency of the whole ensemble learning model or even causes the training process to fail. Of course, in practical applications, other possible implementations may also be used to assign training tasks to the working nodes, which is not limited in this embodiment of the present application.
在另一种可能的实施方式中,集成学习模型中的子学习模型具体可以是决策树模型。当然,实际应用中,该子学习模型也可以是其它类型的模型,本申请实施例对此并不进行限定。In another possible implementation, the sub-learning model in the ensemble learning model may specifically be a decision tree model. Of course, in practical applications, the sub-learning model may also be other types of models, which are not limited in this embodiment of the present application.
In another possible implementation, when the sub-learning model is specifically a decision tree model, the training termination condition of the sub-learning model includes at least one of the following conditions: the numbers of training samples corresponding to the leaf nodes of the sub-learning model are all smaller than a number threshold; or the impurity index of the training sample set used to train the sub-learning model is smaller than an impurity threshold; or the depth of the nodes in the sub-learning model is greater than or equal to a depth threshold. When each sub-learning model in the ensemble learning model satisfies at least one of the above conditions, the control node may end the training of the ensemble learning model.
In a second aspect, the present application further provides a training apparatus for an ensemble learning model, where the training apparatus includes modules for performing the training method of the ensemble learning model in the first aspect or any possible implementation of the first aspect.
In a third aspect, the present application further provides a device, including a processor and a memory. The memory is used to store instructions, and when the device runs, the processor executes the instructions stored in the memory, so that the device performs the training method of the ensemble learning model in the first aspect or any implementation of the first aspect. It should be noted that the memory may be integrated in the processor, or may be independent of the processor. The device may further include a bus, through which the processor is connected to the memory. The memory may include a readable memory and a random access memory.
In a fourth aspect, the present application further provides a model training system. The model training system includes a control node and working nodes, where the control node is configured to perform the training method of the ensemble learning model in the first aspect or any implementation of the first aspect, and the working nodes are configured to execute the training tasks sent by the control node.
In a fifth aspect, the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the method described in the first aspect or any implementation of the first aspect.
第六方面,本申请提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使 得计算机执行上述第一方面以及第一方面中任意一种实施方式所述的方法。In a sixth aspect, the present application provides a computer program product comprising instructions, which, when run on a computer, cause the computer to execute the method described in the first aspect and any one of the embodiments of the first aspect.
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。On the basis of the implementation manners provided by the above aspects, the present application may further combine to provide more implementation manners.
附图说明Description of drawings
图1为本申请实施例提供的一示例性模型训练系统的架构示意图;FIG. 1 is a schematic diagram of the architecture of an exemplary model training system provided by an embodiment of the present application;
图2为本申请实施例提供的一示例性应用场景示意图;FIG. 2 is a schematic diagram of an exemplary application scenario provided by an embodiment of the present application;
图3为本申请实施例提供的一种集成学习模型的训练方法的流程示意图;3 is a schematic flowchart of a training method for an integrated learning model provided by an embodiment of the present application;
图4为本申请实施例提供的一示例性配置界面的示意图;4 is a schematic diagram of an exemplary configuration interface provided by an embodiment of the present application;
图5为各工作节点利用训练样本集训练各个子学习模型的示意图;Fig. 5 is the schematic diagram that each work node utilizes training sample set to train each sub-learning model;
图6为本申请实施例提供的一种集成学习模型的训练装置的结构示意图;6 is a schematic structural diagram of a training device for an integrated learning model provided by an embodiment of the present application;
图7为本申请实施例提供的一种设备的硬件结构示意图。FIG. 7 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请中的技术方案进行描述。The technical solutions in the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
As shown in FIG. 1, an embodiment of the present application provides a model training system. The model training system includes a control node (driver) 100 and a plurality of worker nodes (workers). The control node 100 and the working nodes cooperate to complete the training of the ensemble learning model: the control node 100 is used to generate the training tasks of the ensemble learning model and send the training tasks to the working nodes, and the working nodes are used to execute the training tasks they receive to obtain the sub-learning models in the ensemble learning model. It is worth noting that FIG. 1 takes a model training system including 10 working nodes, namely working node 201 to working node 210, as an example for illustration; in practical applications, the system may include any number (more than one) of working nodes, and the number of control nodes 100 is likewise not limited to one. The control node 100 and each working node may interact through an intermediate device (such as a switch, not shown in FIG. 1).
For example, in an enterprise application scenario, in order to meet the performance and accuracy requirements of model training, the model training system shown in FIG. 1 can be deployed in a cluster including multiple servers, and the control node 100 and working nodes 201 to 210 of the model training system can each be deployed on servers in the cluster, with the servers in the cluster divided into control nodes and working nodes according to the functions they perform. For example, server 1 to server n form a cluster in which a big data platform such as a database (for example, Hadoop) or a computing engine (for example, Spark) can be deployed, and the model training system used to train the ensemble learning model can be deployed on this big data platform to implement training and inference of the ensemble learning model. Specifically, the model training system can train an ensemble learning model on the big data platform, and the ensemble learning model can include multiple sub-learning models; the ensemble learning model is then used to perform inference on known input data. Specifically, the known input data is input into each of the multiple sub-learning models, each sub-learning model performs inference on the known input data and outputs a corresponding inference result, and the inference result of the ensemble learning model is finally determined from the inference results of the multiple sub-learning models by voting; for example, the inference result that receives the most votes among the inference results of the multiple sub-learning models may be taken as the inference result of the ensemble learning model.
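As a minimal sketch of the voting step described above (assuming each trained sub-learning model exposes a hypothetical predict() method returning a class label, which is an illustrative interface rather than anything prescribed by the embodiments), the majority vote over the sub-model outputs could be computed as follows:

```python
from collections import Counter

def ensemble_predict(sub_models, x):
    """Run one input through every sub-learning model and take a majority vote."""
    votes = [model.predict(x) for model in sub_models]   # one inference result per sub-model
    label, _count = Counter(votes).most_common(1)[0]     # the most-voted result wins
    return label
```

For instance, if three sub-models return "A", "B" and "A" for the same input, the ensemble result under this sketch is "A".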
具体实施时,集成学习模型可以是基于随机森林的集成学习模型,如Spark ML、Spark MLlib等;集成学习模型也可以是其它类型的集成学习模型,本申请对此并不进行限定。并且,图2所示的部署模型训练系统的应用场景仅作为一种示例性说明,实际应用时该模型训练系统也可以是部署于其它可能的场景中,比如,图1中的控制节点100以及各个工作节点也可以是部署于一个服务器上,此时,控制节点100以及各个工作节点,具体可以是该服务器上的不同计算单元,并且通过功能将其划分为控制节点以及工作节点等。本实施例中,对 于模型训练系统所适用的应用场景并不进行限定。During specific implementation, the ensemble learning model may be an ensemble learning model based on random forests, such as Spark ML, Spark MLlib, etc.; the ensemble learning model may also be other types of ensemble learning models, which are not limited in this application. In addition, the application scenario of deploying the model training system shown in FIG. 2 is only an exemplary illustration, and the model training system can also be deployed in other possible scenarios in practical application, for example, the control node 100 in FIG. 1 and Each worker node may also be deployed on a server. In this case, the control node 100 and each worker node may specifically be different computing units on the server, and are divided into control nodes and worker nodes by function. In this embodiment, the applicable application scenarios of the model training system are not limited.
To improve the training efficiency of the ensemble learning model, during the training of the ensemble learning model by the model training system, each sub-learning model can be trained by one working node. In this way, the training results of each sub-learning model can all be located on one working node, so that after completing the training of a sub-learning model, the working node does not need to obtain training results for that sub-learning model from other working nodes. This effectively reduces the amount of data that needs to be communicated between working nodes during the training of the sub-learning models, which not only reduces the resource consumption (mainly the consumption of communication resources) required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model. In addition, when the ensemble learning model needs to go through multiple rounds of training, each sub-learning model can also be trained by one working node in each round of training according to the above method, effectively reducing the amount of data that needs to be communicated between working nodes during the training of the sub-learning models.
When training the ensemble learning model, the model training system can generate one training task for each sub-learning model. For example, when the model training system receives a training request for the ensemble learning model, it can parse the number of sub-learning models included in the ensemble learning model from the training request and generate the same number of training tasks, and the model training system can assign each training task to one working node for execution. Optionally, the model training system can also assign the training of multiple sub-learning models to one working node.
In one example of assigning training tasks, the model training system may assign training tasks to the corresponding working nodes according to the loads of the working nodes. For example, the model training system (specifically, the control node 100 in the model training system) obtains the load of each working node and sorts the working nodes by load; then, the model training system can preferentially assign training tasks to the working nodes with smaller loads, while the working nodes with larger loads may not be assigned training tasks, so that load balancing can be achieved in the model training system. Of course, in other possible examples, the model training system can also issue the training tasks in sequence directly according to the numbering order of the working nodes; in this embodiment, the specific implementation of which working node each training task is issued to for execution is not limited.
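A minimal sketch of such load-aware assignment, assuming the per-worker load has already been collected into a simple mapping (the function and parameter names are illustrative assumptions, not part of the embodiments):

```python
def assign_tasks_by_load(tasks, worker_loads):
    """Give each training task to one of the least-loaded workers.

    tasks: list of training-task identifiers.
    worker_loads: dict mapping worker id -> current load (smaller means idler).
    Workers beyond the first len(tasks) entries (the more heavily loaded ones)
    receive no task, mirroring the first/second partition described above.
    """
    idle_first = sorted(worker_loads, key=worker_loads.get)   # lightest load first
    return dict(zip(idle_first, tasks))                       # worker id -> assigned task
```

With loads {"w1": 0.9, "w2": 0.1, "w3": 0.4} and two tasks, for example, w2 and w3 would receive the tasks while w1 receives none.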
Exemplarily, the sub-learning model in the ensemble learning model may be a decision tree model. A decision tree model is a tree diagram composed of decision points, strategy points (event points) and outcomes, and can be applied to sequential decision-making. In practical applications, the maximum expected benefit or the lowest expected cost can be used as the decision criterion, the decision results of various schemes under different conditions can be solved graphically, and the final decision result can then be given by comparing the decision results of the various schemes. Of course, the sub-learning model may also be another model with a tree structure, which is not limited in this embodiment.
The training method of the ensemble learning model provided by the present application is further introduced below with reference to FIG. 3. FIG. 3 is a schematic flowchart of a training method for an ensemble learning model provided by an embodiment of the present application. For ease of understanding, the training process of the ensemble learning model is described below taking the case where the sub-learning model is specifically a decision tree model as an example. The method can be applied to the model training system shown in FIG. 1 above, or to other applicable model training systems. The method may specifically include:
S301:控制节点100获取集成学习模型的训练请求。S301: The control node 100 obtains a training request of the ensemble learning model.
In this embodiment, the model training system may trigger the training process of the ensemble learning model when it receives a training request for the ensemble learning model. As an example, the model training system may have a communication connection with a user terminal; the user may perform, on the user terminal, a trigger operation for training the ensemble learning model, and the user terminal generates a corresponding training request for the ensemble learning model based on the operation and sends it to the model training system, so that the control node 100 in the model training system obtains the training request and triggers the execution of the subsequent model training process.
S302: The control node 100 generates a training task set according to the received training request, where the training task set includes a plurality of training tasks, each of the plurality of training tasks is executed by one working node, each training task is used for training at least one sub-learning model in the ensemble learning model, and different training tasks are used for training different sub-learning models.
After receiving the training request, the control node 100 in the model training system may obtain training samples for training the multiple sub-learning models in the ensemble learning model, where the number of samples is usually more than one; and the control node 100 may further generate multiple training sample sets from these training samples, and the samples included in different training sample sets may differ.
In one example, after obtaining the training samples (which may, for example, be provided to the control node 100 by the user), the control node 100 may sample the training samples with replacement to obtain P training sample sets. Of course, the control node 100 may also generate the multiple training sample sets in other ways, which is not limited in this embodiment.
The number (P) of training sample sets obtained by sampling by the control node 100 may be determined by the user. For example, the control node may present to the user the configuration interface shown in FIG. 4, so that the user can set the parallelism parameter of model training to P on the configuration interface, i.e., P model training processes are executed simultaneously; the control node 100 can then generate P training sample sets based on the user's configuration, and each training sample set can support the training of one sub-learning model, as shown in FIG. 4. In practical applications, the nodes that execute the training process may specifically be the working nodes; therefore, the control node 100 can distribute the generated training sample sets to different working nodes, for example, each training sample set can be distributed to one working node.
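A minimal sketch of the sampling-with-replacement step is given below. It assumes each returned sample set has the same size as the original sample list, which is a common convention for random-forest style training and not something the text above prescribes:

```python
import random

def bootstrap_sample_sets(samples, p, seed=0):
    """Draw P training sample sets by sampling the original samples with replacement."""
    rng = random.Random(seed)
    return [
        [rng.choice(samples) for _ in range(len(samples))]   # with replacement
        for _ in range(p)                                    # one set per sub-learning model
    ]
```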
Meanwhile, the control node 100 can also generate a training task set based on the received training request. The training task set includes multiple training tasks, and each training task can be used to instruct a working node to perform the training process of one sub-learning model based on the training sample set it receives. As an example, the training request received by the control node 100 may specifically include an instruction to train the ensemble learning model and the number of sub-learning models included in the ensemble learning model, so that after receiving the training request, the control node 100 can start the training process of the ensemble learning model based on the instruction in the training request and generate an equal number of training tasks according to the number of sub-learning models, where each training task is used to instruct a working node to complete the training process of one sub-learning model, and different training tasks instruct the training of different sub-learning models.
S303:控制节点100向工作节点集合中多个工作节点分别发送该训练任务集合中训练任务。S303: The control node 100 respectively sends the training tasks in the training task set to the plurality of working nodes in the working node set.
After generating the training task set, the control node 100 can issue the training tasks in the training task set to the working nodes, and each training task can be issued to one working node for execution, so that the training of the sub-learning model corresponding to a training task can be completed entirely by one working node. In practical applications, the control node 100 may select the working nodes that execute the training tasks based on the loads of the working nodes. For example, for the working node set in the model training system, the control node 100 can determine, according to the load ranking of the working nodes in the set, a first part of working nodes and a second part of working nodes, where the load of each working node in the first part is less than or equal to the load of each working node in the second part; then, the control node 100 can issue the training tasks one by one to the working nodes in the first part, while the control node 100 may not assign training tasks to the more heavily loaded second part of working nodes. In this way, it can be avoided as much as possible that some working nodes in the model training system are overloaded and thereby lower the model training efficiency and success rate, achieving load balancing among the working nodes. Of course, in other possible examples, the model training system may also randomly select the working nodes that execute the training tasks, or issue the training tasks in sequence according to the numbering order of the working nodes; in this embodiment, the specific implementation of which working node each training task is issued to for execution is not limited.
In some practical application scenarios, each working node may include multiple executors. An executor may, for example, be a logical execution unit in the working node and is used to execute the training tasks that the working node needs to execute. Based on this, in a possible implementation, each training task may include multiple subtasks (tasks), and each subtask may be used to instruct the working node to perform part of the model training process of one sub-learning model, so that after the control node 100 issues a training task to a working node, the multiple executors on that working node can execute different subtasks of the training task. In this way, by having multiple executors execute multiple different subtasks in parallel, the efficiency with which a working node trains a single sub-learning model can be improved.
As an example, when the sub-learning model is specifically a decision tree model, as shown in FIG. 5, each executor on a working node can use part of the samples in training sample set 1 to train a tree node of decision tree model 1, and the block of samples used by each executor when training that tree node is different. In this way, on one working node, the training results of the tree node of decision tree model 1 based on each block of samples can be obtained, that is, the complete training result of decision tree model 1 using training sample set 1 can be obtained.
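A minimal sketch of running the subtasks of one training task in parallel on a single working node is shown below; the train_block callable is a hypothetical stand-in for the real per-block training routine and is not part of the embodiments:

```python
from concurrent.futures import ThreadPoolExecutor

def run_training_task(sample_blocks, train_block, num_executors=4):
    """Execute the subtasks of one training task with several executors in parallel.

    sample_blocks: one block of samples per subtask.
    train_block: callable returning the partial training result for one block.
    All partial results stay on the same working node, so no cross-node
    exchange is needed to assemble the complete training result.
    """
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        return list(pool.map(train_block, sample_blocks))
```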
The number of training tasks may be the same as the number of training sample sets, so that when multiple working nodes train the sub-learning models, each working node can use the training sample set it receives and perform the corresponding model training process according to the subtasks included in a single training task. The number of subtasks in each training task may be determined according to the number of blocks into which the training samples in the training sample set are divided. For example, when the training samples in the training sample set are divided into 5 equal parts, the control node 100 can generate 5 subtasks according to the 5 parts of training samples, where each subtask corresponds to one of the sample blocks, different subtasks correspond to different sample blocks, and the 5 generated subtasks can constitute one training task; in this way, the control node 100 can generate different training tasks based on different training sample sets. The number of blocks for each training sample set can be configured by the user on the configuration interface shown in FIG. 4, or it can be a default value preset by technical personnel, which is not limited in this embodiment.
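The block-to-subtask mapping above can be sketched as follows (the dict layout of a subtask is an illustrative assumption):

```python
def build_training_task(sample_set, num_blocks):
    """Split one training sample set into blocks and create one subtask per block."""
    block_size = (len(sample_set) + num_blocks - 1) // num_blocks   # ceiling division
    return [
        {"subtask_id": i, "block": sample_set[i * block_size:(i + 1) * block_size]}
        for i in range(num_blocks)
    ]
```

The subtasks returned for one sample set together form one training task, and different sample sets yield different training tasks.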
示例性的,训练任务的数量与工作节点的数量可以相同,也可以是不同。例如,当训练任务的数量与工作节点相同时,每个工作节点可以执行一个训练任务中的所有任务。而当训练任务的数量与工作节点不同时,一个工作节点可以执行多个训练任务中的所有子任务,即一个工作节点可以完成多个子学习模型的训练。Exemplarily, the number of training tasks and the number of worker nodes may be the same or different. For example, when the number of training tasks is the same as worker nodes, each worker node can execute all the tasks in one training task. When the number of training tasks is different from that of worker nodes, one worker node can execute all subtasks in multiple training tasks, that is, one worker node can complete the training of multiple sub-learning models.
It is worth noting that when training the ensemble learning model, the model training system usually performs multiple rounds of an iterative training process, and in each round of model training the control node 100 can regenerate multiple training tasks and issue them to the working nodes for execution. During the multiple rounds of iterative training, the training sample set used to train each sub-learning model can remain unchanged, while the content of the subtasks included in the training task for a sub-learning model in a given round may differ from the subtasks included in the training task for that sub-learning model in the previous round. For example, when the sub-learning model is specifically a decision tree model, the subtasks in the previous round of training are used to train tree node 1 of the decision tree model, while the subtasks in the current round of training are used to train tree node 2 and tree node 3 of the decision tree model. Here, the current round of training refers to the round of training that the model training system is currently performing on the ensemble learning model; for example, when the model training system is performing the second round of model training on the ensemble learning model, that second round of model training is the current round of training.
In each round of model training, if the number of training tasks to be generated is fixed (for example, it may be determined according to the number of working nodes), the control node 100 may first assign one sub-learning model to each training task, i.e., the subtasks in each training task are used to train one sub-learning model. When the number of sub-learning models is greater than the number of training tasks, after assigning one sub-learning model to each training task, the control node 100 continues to assign one more sub-learning model to each training task from the remaining sub-learning models; in this case, the subtasks in one training task can be used to train multiple sub-learning models.
In one example, when the sub-learning model is specifically a decision tree model, a subtask in a training task may specifically be used to determine the best split point of a tree node of the decision tree model. Determining the best split point means determining, in the decision tree model, a tree node suitable for sample splitting, such that the training samples contained in the two child nodes obtained after splitting that tree node fall into different ranges of a certain attribute.
In the first round of training the decision tree models, the control node 100 may create a list of tree nodes to be trained (tree node list) and initialize the tree nodes in the list as the root nodes of the decision tree models, i.e., when the number of decision tree models is x, the tree node list includes x root nodes.
Since the decision tree models undergo multiple rounds of iterative training, and the task of each round of training a decision tree model is to split tree nodes in the decision tree model, the control node 100 may, in each round of model training, select tree nodes from the list of tree nodes to be trained and add them to the current-round training tree node list (cur-tree node list). Specifically, the control node 100 may add the tree nodes in the list of tree nodes to be trained to the current-round training tree node list one by one in the order of the tree nodes' index numbers. Optionally, after multiple rounds of training, a decision tree model may have split into a relatively large number of nodes. To prevent too many nodes from being added to the current-round training tree node list, which would make the load of the working nodes splitting these nodes too heavy, in a possible implementation the control node 100 may limit the number of nodes added to the current-round training node list so that it does not exceed a node number threshold, i.e., limit the length of the current-round training node list; the nodes in the list of tree nodes to be trained that do not participate in the current round of training can participate in the next round of model training, so that when there are too many nodes in the list of tree nodes to be trained, the control node 100 can control their training in batches.
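A minimal sketch of this batching step, assuming the pending list is already ordered by node index (the function name is an illustrative assumption):

```python
def select_nodes_for_round(pending_nodes, max_nodes_per_round):
    """Move at most max_nodes_per_round nodes into the current-round list.

    Returns (current_round_nodes, remaining_nodes); the remaining nodes
    wait for a later round, which realizes the batch-by-batch training.
    """
    return pending_nodes[:max_nodes_per_round], pending_nodes[max_nodes_per_round:]
```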
然后,控制节点100可以生成子任务与待训练的树节点的映射关系,并将其广播至各个工作节点。并且,控制节点100可以进一步将生成的训练任务中的子任务下发至工作节点,如通过轮询(round-robin)机制下发子任务至每个工作节点等。Then, the control node 100 can generate the mapping relationship between the subtask and the tree node to be trained, and broadcast it to each worker node. In addition, the control node 100 may further deliver the subtasks in the generated training task to the working nodes, for example, through a round-robin mechanism to deliver the subtasks to each working node and the like.
A working node can perform the corresponding model training task according to the training task it receives, using the training sample set corresponding to that training task. In a specific implementation, the executors on each working node can determine the tree node to be trained according to the mapping between subtasks and tree nodes to be trained broadcast by the control node 100 and the subtasks issued by the control node 100, and train that tree node using the training samples corresponding to the subtask. Specifically, the executor can first determine the sample attribute used to split the tree node, such as age, and then determine the sample distribution of the block of samples corresponding to the subtask with respect to that sample attribute, i.e., determine which training samples in the block belong to one class (for example, training samples whose age attribute value is greater than 23 belong to one class) and which training samples belong to another class (for example, training samples whose age attribute value is smaller than 23 belong to another class). The sample attribute may be selected by a preset random algorithm, or selected in another way. In this way, after the multiple executors on the working node have each completed their model training, the working node can obtain the training result of all training samples in the entire training sample set for that tree node; the training result may, for example, be a distribution histogram of the training samples for that tree node.
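A simplified stand-in for the per-block distribution statistics described above is sketched below; it assumes each training sample is represented as a dict of attribute values plus a "label" key, which is an illustrative layout and not something the embodiments prescribe:

```python
def block_class_counts(block, attribute, threshold):
    """Count, per class label, how many samples of one block fall on each side
    of a candidate split value of the chosen attribute (e.g. age < 23 vs >= 23)."""
    counts = {"left": {}, "right": {}}
    for sample in block:
        side = "left" if sample[attribute] < threshold else "right"
        label = sample["label"]
        counts[side][label] = counts[side].get(label, 0) + 1
    return counts
```

The per-block counts produced by the different executors can then be summed on the same working node to obtain the complete distribution for the tree node.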
When all the subtasks in a training task are completed by the executors on one working node, that working node can directly obtain the training result of the current round of training for the tree node without obtaining it from other working nodes, so that different working nodes do not need to exchange their respective training results for the decision tree models. In this way, the amount of data communicated between working nodes during the training of the ensemble learning model can be effectively reduced.
After obtaining the complete training result for the tree node in the current round of training, the working node can calculate the best split point of that tree node; the best split point is used to split the training samples contained in the tree node (for the root node, the training samples corresponding to the root node are all the training samples in the entire training sample set). In a specific implementation, the complete training result can indicate the sample value (i.e., the attribute value) of each training sample for the predetermined sample attribute, and the working node can determine, from all the sample values of the training samples for that sample attribute, the sample value that achieves the maximum information gain, so as to divide the training samples into two parts using the determined sample value as the boundary; the sample value achieving the maximum information gain is the best split point.
As an example of determining the best split point, the working node can, based on the obtained complete training result, determine the sample values of the above-mentioned sample attribute for all training samples in the training sample set, and determine by traversal which of these sample values serves as the best split point. Specifically, suppose the best split point is a variable s, whose value is any one of the above sample values. According to the value of the variable s, the training sample set D of size N is divided into two sets, namely the left training sample set D_left and the right training sample set D_right; for example, the training samples whose sample value is smaller than the variable s are placed in the left training sample set, and the training samples whose sample value is greater than or equal to the variable s are placed in the right training sample set. Then, the working node can calculate the information gain IG(D, s) for that value of the variable s, where the information gain IG can, for example, be calculated by the following formulas (1) and (2):
$$IG(D, s) = \mathrm{Impurity}(D) - \frac{|D_{left}|}{N}\,\mathrm{Impurity}(D_{left}) - \frac{|D_{right}|}{N}\,\mathrm{Impurity}(D_{right}) \qquad (1)$$

$$\mathrm{Impurity}(D) = \sum_{i=1}^{K} p_i\,(1 - p_i) \qquad (2)$$
Here, Impurity is an index of the impurities contained in the training sample set, which may also be called the "impurity index"; K is the number of sample classes in the training sample set, and p_i is the probability of the i-th sample class in the training sample set. It is worth noting that the Impurity in formula (2) takes Gini as an example; in practical applications, the Impurity may also be entropy, variance, etc., which is not limited in this embodiment.
Through the above formulas (1) and (2), the working node can traverse and calculate the information gain corresponding to each value of the variable s, so that the value of s corresponding to the maximum information gain can be used as the best split point.
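A minimal sketch of this traversal, again assuming each training sample is a dict holding its attribute values and a "label" key (an illustrative layout, not part of the embodiments), with Gini as the impurity of formula (2):

```python
def gini(labels):
    """Gini impurity as in formula (2): sum over classes of p_i * (1 - p_i)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return sum((c / n) * (1 - c / n) for c in counts.values())


def best_split(samples, attribute):
    """Scan every observed value of the attribute as candidate split point s and
    return the value with the largest information gain IG(D, s) per formula (1)."""
    labels = [s["label"] for s in samples]
    parent_impurity = gini(labels)
    best_value, best_gain = None, float("-inf")
    for candidate in sorted({s[attribute] for s in samples}):
        left = [s["label"] for s in samples if s[attribute] < candidate]
        right = [s["label"] for s in samples if s[attribute] >= candidate]
        if not left or not right:
            continue                      # this value does not actually split the set
        gain = (parent_impurity
                - len(left) / len(samples) * gini(left)
                - len(right) / len(samples) * gini(right))
        if gain > best_gain:
            best_value, best_gain = candidate, gain
    return best_value, best_gain
```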
The working node feeds back the calculated best split point to the control node 100 as the final training result of the current round of training, so that the control node 100 can obtain the current-round training results of the multiple decision tree models from the multiple working nodes, i.e., in the first round of training, the best split point of the root node of each decision tree model.
For each decision tree model, the control node 100 can calculate the split nodes according to the best split point corresponding to that decision tree model, for example splitting the root node into a left node and a right node, where the training samples corresponding to the left node may be the training samples in the above left training sample set D_left, and the training samples corresponding to the right node may be the training samples in the above right training sample set D_right. In this way, based on the above process, one round of the training process for each decision tree model can be completed.
Then, the control node 100 can judge whether the decision tree models obtained after the current round of training satisfy the training termination condition; if so, the multiple decision tree models in the ensemble learning model are the decision tree models obtained in the current round of training, and if not, the control node 100 can continue to perform the next round of training on the decision tree models.
在一些示例中,训练终止条件包括以下方式中至少一种:In some examples, the training termination condition includes at least one of the following:
Manner 1: the numbers of training samples corresponding to the leaf nodes of the decision tree model are all smaller than a number threshold.
方式2,用于训练子学习模型的训练样本集的不纯度指标小于不纯度阈值。Mode 2, the impurity index of the training sample set used for training the sub-learning model is smaller than the impurity threshold.
方式3,子学习模型中节点的深度大于等于深度阈值。In mode 3, the depth of the nodes in the sub-learning model is greater than or equal to the depth threshold.
当集成学习模型中的每个决策树模型,均满足上述训练终止条件中的任意一种或多种时,控制节点100可以停止模型训练过程,即完成对于集成学习模型的训练。当然,上述训练终止条件仅作为一种示例,实际应用时,该训练终止条件也可以是采用其它方式进行实现。When each decision tree model in the ensemble learning model satisfies any one or more of the above training termination conditions, the control node 100 can stop the model training process, that is, complete the training of the ensemble learning model. Of course, the above training termination condition is only an example, and in practical application, the training termination condition may also be implemented in other ways.
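A minimal sketch of this check for a single decision tree is given below; it assumes the per-leaf sample counts, the impurity of the training sample set and the maximum node depth have already been computed elsewhere (how they are obtained is outside this sketch):

```python
def tree_training_finished(leaf_sample_counts, training_set_impurity, max_node_depth,
                           count_threshold, impurity_threshold, depth_threshold):
    """Return True if the decision tree satisfies at least one of the three
    termination conditions listed above (manner 1, 2 or 3)."""
    if all(n < count_threshold for n in leaf_sample_counts):    # manner 1
        return True
    if training_set_impurity < impurity_threshold:              # manner 2
        return True
    if max_node_depth >= depth_threshold:                        # manner 3
        return True
    return False
```

The control node would stop the overall training only when every decision tree in the ensemble passes such a check.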
当确定当轮训练得到的各个决策树模型不满足训练终止条件时,控制节点100可以继续执行下一轮针对于各个决策树模型的训练过程。When it is determined that each decision tree model obtained by the round of training does not meet the training termination condition, the control node 100 may continue to perform the next round of training process for each decision tree model.
Specifically, the control node 100 can clear the list of tree nodes to be trained and the tree nodes in the current-round training tree node list that have already been split, and add the nodes split in the previous round for each decision tree model to the list of tree nodes to be trained. Then, the control node 100 adds the tree nodes in the current list of tree nodes to be trained to the current-round training tree node list and, based on a process similar to the above, uses the working nodes to split the tree nodes in the current-round training tree node list again. In this way, the decision tree models of the ensemble learning model are trained through multiple iterations until each decision tree model satisfies the training termination condition, at which point the training of the ensemble learning model is completed.
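A skeleton of this round-by-round loop, as driven by the control node, might look as follows; split_round and finished are hypothetical callables standing in for one round of task dispatch plus worker-side splitting and for the termination check, respectively:

```python
def train_decision_trees(initial_nodes, split_round, finished):
    """Round-by-round training loop sketch.

    initial_nodes: the root nodes of the decision trees (the first round's work).
    split_round: given the nodes of the current round, splits them on the
        workers (growing the trees as a side effect) and returns the newly
        created child nodes.
    finished: applies the termination conditions to every decision tree.
    """
    pending = list(initial_nodes)
    while pending and not finished():
        current_round = pending            # nodes split in this round
        children = split_round(current_round)
        pending = list(children)           # children join the next round
```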
In this embodiment, since each sub-learning model in the ensemble learning model is trained by one working node when the ensemble learning model is trained, the training results of each sub-learning model can all be located on one working node, so that after completing the training of a sub-learning model, the working node does not need to obtain training results for that sub-learning model from other working nodes. In this way, the amount of data that needs to be communicated between working nodes during the training of the sub-learning models can be effectively reduced, which not only reduces the resource consumption required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model.
It is worth noting that the above embodiments are described by taking the case where the control node and the working nodes are deployed in a cluster including multiple servers as an example; in other possible embodiments, the above training process for the ensemble learning model can also be implemented by a cloud service provided by a cloud data center.
Specifically, the user can send a training request for the ensemble learning model to the cloud data center through a corresponding terminal device, so as to request the cloud data center to train the ensemble learning model and feed it back to the user. After receiving the training request, the cloud data center can invoke corresponding computing resources to complete the training of the ensemble learning model; specifically, it can invoke part of the computing resources (such as one server supporting the cloud service) to implement the functions implemented by the above control node 100, and invoke another part of the computing resources (such as multiple servers supporting the cloud service) to implement the functions of the above multiple working nodes. For the process in which the cloud data center completes the training of the ensemble learning model based on the invoked computing resources, reference may be made to the relevant descriptions in the above embodiments, which are not repeated here. After completing the training of the ensemble learning model, the cloud data center can send the trained ensemble learning model to the terminal device on the user side, so that the user can obtain the required ensemble learning model.
值得注意的是,上述实施例仅作为对于本申请技术方案的示例性说明,本领域的技术人员根据以上描述的内容,能够想到的其他合理的步骤组合,也属于本申请的保护范围内。其次,本领域技术人员也应该熟悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请所必须的。It is worth noting that the above-mentioned embodiments are only illustrative of the technical solutions of the present application, and other reasonable step combinations that those skilled in the art can think of based on the above descriptions also fall within the protection scope of the present application. Secondly, those skilled in the art should also be familiar with that, the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the present application.
The training method of the ensemble learning model provided by the present application has been described in detail above with reference to FIG. 1 to FIG. 5. The training apparatus for the ensemble learning model and the device for training the ensemble learning model provided by the present application are described below with reference to FIG. 6 to FIG. 7.
图6为本申请提供的一种集成学习模型的训练装置的结构示意图,该集成学习模型的训练装置600可以应用于模型训练系统中的控制节点,并且,该模型训练系统还包括工作节点。其中,该装置600包括:FIG. 6 is a schematic structural diagram of an integrated learning model training device provided by the present application. The integrated learning model training device 600 can be applied to a control node in a model training system, and the model training system further includes a working node. Wherein, the device 600 includes:
获取模块601,用于获取集成学习模型的训练请求;Obtaining module 601, for obtaining the training request of the integrated learning model;
a generating module 602, configured to generate a training task set according to the training request, where the training task set includes multiple training tasks, each of the multiple training tasks is executed by one working node, each training task is used for training at least one sub-learning model in the ensemble learning model, and different training tasks are used for training different sub-learning models;
通信模块603,用于向工作节点集合中多个工作节点分别发送所述训练任务集合中训练任务。The communication module 603 is configured to send the training tasks in the training task set to a plurality of working nodes in the working node set respectively.
It should be understood that the apparatus 600 in this embodiment of the present application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof. When the training method of the ensemble learning model shown in FIG. 3 is implemented by software, the apparatus 600 and its modules may also be software modules.
可选地,所述训练请求包括训练所述集成学习模型的指示以及所述集成学习模型中子学习模型的数量。Optionally, the training request includes an instruction to train the ensemble learning model and the number of sub-learning models in the ensemble learning model.
Optionally, the generating module 602 is specifically configured to generate the training task set according to the number of sub-learning models included in the training request, where the number of training tasks included in the training task set is equal to the number of sub-learning models.
可选地,所述通信模块603,具体包括:Optionally, the communication module 603 specifically includes:
负载获取单元,用于获取所述工作节点集合中各个工作节点的负载;a load obtaining unit, configured to obtain the load of each working node in the set of working nodes;
a sending unit, configured to send one training task to each working node in a first part of working nodes according to the loads of the working nodes, where the working node set includes the first part of working nodes and a second part of working nodes, and the load of each working node in the first part of working nodes is smaller than the load of each working node in the second part of working nodes.
Optionally, the sub-learning model includes a decision tree model.
Optionally, the training termination condition of the sub-learning model includes at least one of the following conditions:
the number of training samples corresponding to each leaf node of the sub-learning model is smaller than a number threshold; or
the impurity index of the training sample set used to train the sub-learning model is smaller than an impurity threshold; or
the depth of a node in the sub-learning model is greater than or equal to a depth threshold.
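A minimal sketch of checking these termination conditions for a decision tree sub-model is shown below; the threshold defaults are illustrative assumptions, not values taken from the application.

```python
def should_stop(leaf_sample_counts, impurity, node_depth,
                sample_threshold=10, impurity_threshold=0.01, depth_threshold=16):
    """Return True if any of the three termination conditions is met.

    The threshold values here are illustrative defaults, not values from the application.
    """
    # Condition 1: every leaf node holds fewer training samples than the number threshold.
    if all(count < sample_threshold for count in leaf_sample_counts):
        return True
    # Condition 2: the impurity index (e.g. Gini) of the training sample set is below the threshold.
    if impurity < impurity_threshold:
        return True
    # Condition 3: the node depth has reached the depth threshold.
    if node_depth >= depth_threshold:
        return True
    return False
```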
Because each sub-learning model in the ensemble learning model is trained on a single worker node when the apparatus 600 trains the ensemble learning model, all training results of each sub-learning model reside on one worker node. After a worker node finishes training a sub-learning model, it therefore does not need to obtain training results for that sub-learning model from other worker nodes. In this way, the amount of data that the worker nodes need to exchange while training the sub-learning models can be effectively reduced, which not only reduces the resource consumption required for training the ensemble learning model, but also effectively improves the training efficiency and success rate of the ensemble learning model.
The training apparatus 600 for the ensemble learning model according to this embodiment of the present application may correspond to performing the methods described in the embodiments of the present application, and the foregoing and other operations and/or functions of the modules of the training apparatus 600 are respectively intended to implement the corresponding procedures of the methods in FIG. 3. For brevity, details are not repeated here.
FIG. 7 is a schematic diagram of a device 700 provided by this application. As shown in FIG. 7, the device 700 includes a processor 701, a memory 702, and a communication interface 703. The processor 701, the memory 702, and the communication interface 703 communicate through a bus 704, and may also communicate by other means such as wireless transmission. The memory 702 is configured to store instructions, and the processor 701 is configured to execute the instructions stored in the memory 702. Further, the device 700 may also include a memory unit 705, and the memory unit 705 may be connected to the processor 701, the memory 702, and the communication interface 703 through the bus 704. The memory 702 stores program code, and the processor 701 may invoke the program code stored in the memory 702 to perform the following operations:
obtaining a training request for an ensemble learning model;
generating a training task set according to the training request, where the training task set includes a plurality of training tasks, each of the plurality of training tasks is executed by one worker node, each training task is used to train at least one sub-learning model in the ensemble learning model, and different training tasks are used to train different sub-learning models; and
separately sending the training tasks in the training task set to a plurality of worker nodes in a worker node set.
It should be understood that, in this embodiment of the present application, the processor 701 may be a CPU, or the processor 701 may be at least one of another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, a graphics processing unit (GPU), a neural processing unit (NPU), a tensor processing unit (TPU), an artificial intelligence (AI) chip, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 702 may include a read-only memory and a random access memory, and provides instructions and data to the processor 701. The memory 702 may further include a non-volatile random access memory. For example, the memory 702 may also store information about a device type.
The memory 702 may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM).
The communication interface 703 is configured to communicate with other devices connected to the device 700. In addition to a data bus, the bus 704 may further include a power bus, a control bus, a status signal bus, and the like. However, for clarity of description, the various buses are all labeled as the bus 704 in the figure.
It should be understood that the device 700 according to this embodiment of the present application may correspond to the training apparatus 600 for the ensemble learning model in the embodiments of the present application, and may correspond to the control node 100 that performs the method shown in FIG. 3 according to the embodiments of the present application. The foregoing and other operations and/or functions implemented by the device 700 are respectively intended to implement the corresponding procedures of the methods in FIG. 3. For brevity, details are not repeated here.
As a possible embodiment, the device provided by the present application may also be composed of a plurality of devices as shown in FIG. 7, where the plurality of devices communicate with each other through a network and are used to implement the corresponding procedures of the methods in FIG. 3. For brevity, details are not repeated here.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, the foregoing embodiments may be implemented completely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the present application are generated completely or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
The foregoing descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any equivalent modification or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

  1. A training method for an ensemble learning model, wherein the method is applied to a model training system, the model training system comprises a control node and worker nodes, and the method comprises:
    obtaining, by the control node, a training request for the ensemble learning model;
    generating, by the control node, a training task set according to the training request, wherein the training task set comprises a plurality of training tasks, each of the plurality of training tasks is executed by one worker node, each training task is used to train at least one sub-learning model in the ensemble learning model, and different training tasks are used to train different sub-learning models; and
    separately sending, by the control node, the training tasks in the training task set to a plurality of worker nodes in a worker node set.
  2. The method according to claim 1, wherein the training request comprises an indication to train the ensemble learning model and the number of sub-learning models in the ensemble learning model.
  3. The method according to claim 2, wherein the generating, by the control node, a training task set according to the training request comprises:
    generating, by the control node, the training task set according to the number of sub-learning models included in the training request, wherein the number of training tasks included in the training task set is equal to the number of sub-learning models.
  4. The method according to any one of claims 1 to 3, wherein the separately sending, by the control node, the training tasks in the training task set to a plurality of worker nodes in a worker node set comprises:
    obtaining, by the control node, the load of each worker node in the worker node set; and
    sending, by the control node according to the load of each worker node, one training task to each worker node in a first part of worker nodes, wherein the worker node set comprises the first part of worker nodes and a second part of worker nodes, and the load of each worker node in the first part of worker nodes is smaller than the load of each worker node in the second part of worker nodes.
  5. The method according to any one of claims 1 to 4, wherein the sub-learning model comprises a decision tree model.
  6. The method according to claim 5, wherein the training termination condition of the sub-learning model comprises at least one of the following conditions:
    the number of training samples corresponding to each leaf node of the sub-learning model is smaller than a number threshold; or
    the impurity index of the training sample set used to train the sub-learning model is smaller than an impurity threshold; or
    the depth of a node in the sub-learning model is greater than or equal to a depth threshold.
  7. A training apparatus for an ensemble learning model, wherein the apparatus is applied to a control node in a model training system, the model training system further comprises worker nodes, and the apparatus comprises:
    an obtaining module, configured to obtain a training request for the ensemble learning model;
    a generating module, configured to generate a training task set according to the training request, wherein the training task set comprises a plurality of training tasks, each of the plurality of training tasks is executed by one worker node, each training task is used to train at least one sub-learning model in the ensemble learning model, and different training tasks are used to train different sub-learning models; and
    a communication module, configured to separately send the training tasks in the training task set to a plurality of worker nodes in a worker node set.
  8. The apparatus according to claim 7, wherein the training request comprises an indication to train the ensemble learning model and the number of sub-learning models in the ensemble learning model.
  9. The apparatus according to claim 8, wherein the generating module is specifically configured to generate the training task set according to the number of sub-learning models included in the training request, wherein the number of training tasks included in the training task set is equal to the number of sub-learning models.
  10. The apparatus according to any one of claims 7 to 9, wherein the communication module specifically comprises:
    a load obtaining unit, configured to obtain the load of each worker node in the worker node set; and
    a sending unit, configured to send, according to the load of each worker node, one training task to each worker node in a first part of worker nodes, wherein the worker node set comprises the first part of worker nodes and a second part of worker nodes, and the load of each worker node in the first part of worker nodes is smaller than the load of each worker node in the second part of worker nodes.
  11. The apparatus according to any one of claims 7 to 10, wherein the sub-learning model comprises a decision tree model.
  12. The apparatus according to claim 11, wherein the training termination condition of the sub-learning model comprises at least one of the following conditions:
    the number of training samples corresponding to each leaf node of the sub-learning model is smaller than a number threshold; or
    the impurity index of the training sample set used to train the sub-learning model is smaller than an impurity threshold; or
    the depth of a node in the sub-learning model is greater than or equal to a depth threshold.
  13. A device, comprising a processor and a memory, wherein
    the memory is configured to store computer instructions; and
    the processor is configured to perform, according to the computer instructions, the operation steps of the method according to any one of claims 1 to 6.
  14. A model training system, wherein the model training system comprises the control node according to any one of claims 1 to 6 and worker nodes.
  15. A computer-readable storage medium, comprising instructions, wherein the instructions are used to implement the operation steps of the method according to any one of claims 1 to 6.
PCT/CN2021/142240 2021-01-28 2021-12-28 Training method, apparatus and system for integrated learning model, and related device WO2022161081A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110121743.4 2021-01-28
CN202110121743.4A CN114819195A (en) 2021-01-28 2021-01-28 Training method, device and system of ensemble learning model and related equipment

Publications (1)

Publication Number Publication Date
WO2022161081A1 true WO2022161081A1 (en) 2022-08-04

Family

ID=82526675

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142240 WO2022161081A1 (en) 2021-01-28 2021-12-28 Training method, apparatus and system for integrated learning model, and related device

Country Status (2)

Country Link
CN (1) CN114819195A (en)
WO (1) WO2022161081A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024031986A1 (en) * 2022-08-12 2024-02-15 华为云计算技术有限公司 Model management method and related device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815644A (en) * 2017-01-26 2017-06-09 北京航空航天大学 Machine learning method and from node
US20180374105A1 (en) * 2017-05-26 2018-12-27 Get Attached, Inc. Leveraging an intermediate machine learning analysis
CN109409738A (en) * 2018-10-25 2019-03-01 平安科技(深圳)有限公司 Method, the electronic device of deep learning are carried out based on block platform chain
CN111444019A (en) * 2020-03-31 2020-07-24 中国科学院自动化研究所 Cloud-end-collaborative deep learning model distributed training method and system
CN111768006A (en) * 2020-06-24 2020-10-13 北京金山云网络技术有限公司 Artificial intelligence model training method, device, equipment and storage medium
CN111860835A (en) * 2020-07-17 2020-10-30 苏州浪潮智能科技有限公司 Neural network model training method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG CHUANGCHUANG, FANG YONG; HUANG CHENG; LIU LIANG: "Password strength estimation model based on ensemble learning", JOURNAL OF COMPUTER APPLICATIONS, JISUANJI YINGYONG, CN, vol. 38, no. 5, 10 May 2018 (2018-05-10), CN , pages 1383 - 1388, XP055954980, ISSN: 1001-9081, DOI: 10.11772/j.issn.1001-9081.2017102516 *

Also Published As

Publication number Publication date
CN114819195A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
US20180144251A1 (en) Server and cloud computing resource optimization method thereof for cloud big data computing architecture
CN111625331B (en) Task scheduling method, device, platform, server and storage medium
US20200351207A1 (en) Method and system of limiting traffic
WO2022151668A1 (en) Data task scheduling method and apparatus, storage medium, and scheduling tool
CN107908536B (en) Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment
CN108270805B (en) Resource allocation method and device for data processing
CN111190703B (en) Real-time data processing method and device, computer equipment and storage medium
CN110347515B (en) Resource optimization allocation method suitable for edge computing environment
WO2023087658A1 (en) Task scheduling method, apparatus and device, and readable storage medium
WO2023066084A1 (en) Computing power distribution method and apparatus, and computing power server
CN111638948B (en) Multi-channel high-availability big data real-time decision making system and decision making method
CN112463390A (en) Distributed task scheduling method and device, terminal equipment and storage medium
WO2022161081A1 (en) Training method, apparatus and system for integrated learning model, and related device
CN112367363A (en) Information sharing method, device, server and storage medium
US11675515B2 (en) Intelligent partitioning engine for cluster computing
WO2021147815A1 (en) Data calculation method and related device
CN112631754A (en) Data processing method, data processing device, storage medium and electronic device
CN116775041A (en) Big data real-time decision engine based on stream computing framework and RETE algorithm
CN106502842A (en) Data reconstruction method and system
CN115658311A (en) Resource scheduling method, device, equipment and medium
US20220229692A1 (en) Method and device for data task scheduling, storage medium, and scheduling tool
CN112732451A (en) Load balancing system in cloud environment
CN117667602B (en) Cloud computing-based online service computing power optimization method and device
CN114443258B (en) Resource scheduling method, device, equipment and storage medium for virtual machine
CN117891584B (en) Task parallelism scheduling method, medium and device based on DAG grouping

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21922670; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21922670; Country of ref document: EP; Kind code of ref document: A1)