WO2024021555A1 - Resource examination and approval method and device, and random forest model training method and device - Google Patents

Resource examination and approval method and device, and random forest model training method and device

Info

Publication number
WO2024021555A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
resource
decision tree
approval
training set
Prior art date
Application number
PCT/CN2023/074133
Other languages
French (fr)
Chinese (zh)
Inventor
常三强
胡成倩
张麒
韩冬
Original Assignee
京东科技信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东科技信息技术有限公司 filed Critical 京东科技信息技术有限公司
Publication of WO2024021555A1 publication Critical patent/WO2024021555A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/10 Office automation; Time management
    • G06Q 10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database

Definitions

  • the present disclosure relates to the field of cloud computing technology, and in particular to resource approval methods, random forest model training methods and devices, and computer-readable storage media.
  • Private cloud provides services to organizations within the enterprise and can provide a variety of cloud products to enterprise users, thus forming a complex cloud ecological chain. It has the characteristics of high data security and strong controllability of IT infrastructure.
  • Enterprise-level users usually have complex multi-layered internal organizational structures.
  • users at different levels within the enterprise are given differentiated permissions, and the resources that users with different permissions can use also have different specifications.
  • when users need to use resources beyond their own permissions, they need to submit an application and wait for the application to be approved before they can use the product normally.
  • a resource approval method including:
  • the approval result is predicted, where the approval result indicates whether to release the resource requested by the user;
  • the method of synthesizing the approval results of each decision tree to determine whether to release the resources requested by the user includes:
  • determining whether to release the resource requested by the user based on the proportion of the number of decision trees that generate the same approval result to the total number of decision trees and the first preset threshold includes:
  • when the ratio of the number of decision trees that generate the same approval result to the total number of decision trees does not exceed the first preset threshold, it is determined whether to release the resource requested by the user based on multiple characteristics.
  • the user's request for resource usage also includes the user's historical usage request for the resource.
  • the characteristics of the user's resource usage request include at least one of: the type of resource the user requests to use, the specifications of the resource the user requests to use, the number of resources the user requests to use, the user's resource usage permissions, and the user's reason for requesting to use the resource.
  • a training method for a random forest model including:
  • the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
  • Each decision tree is trained based on the values of its candidate features, as well as the labels of the samples.
  • training each decision tree based on the value of the candidate feature of each decision tree and the label of the sample includes:
  • according to the values of the features of the samples in the training sets corresponding to the child nodes of the current node, and the labels of the samples, the features corresponding to the child nodes of the current node are selected from the remaining candidate features;
  • the child nodes of the current node include a first child node and a second child node of the current node; determining the training sets corresponding to the child nodes of the current node according to the values of the feature corresponding to the current node in the samples of the training set corresponding to the current node, and the labels of the samples, includes:
  • selecting, based on those values and labels, a feature value from the value range of the feature corresponding to the current node as the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node;
  • after the split point is determined, deciding, for each sample in the training set corresponding to the current node, whether it is divided into the training set of the first child node or the training set of the second child node.
  • the cutoff conditions include at least one of: there are no remaining candidate features; the number of samples in the training set corresponding to the current node is less than a second preset threshold; and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold.
  • training each decision tree based on the value of the candidate feature of each decision tree and the label of the sample includes:
  • the decision tree is trained based on the values of the candidate features corresponding to the decision tree in the samples in the training set of the decision tree and the labels indicating whether to release the resources requested by the user.
  • determining the multiple characteristics of the user's resource usage request includes: determining the value of any missing feature of the sample.
  • a resource approval device including:
  • the acquisition module is configured to obtain the user's request for resource use
  • a first determination module configured to determine a plurality of characteristics of the user's request for resource use
  • the selection module is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features;
  • the prediction module is configured to predict the approval result based on the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resources requested by the user;
  • the second determination module is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user.
  • a training device for a random forest model including:
  • the acquisition module is configured to obtain a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
  • a determining module configured to determine a plurality of characteristics of the user's request for resource usage
  • the extraction module is configured to extract some features from multiple features for each decision tree in the random forest model as candidate features for the decision tree;
  • the training module is configured to train each decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
  • an electronic device including:
  • a processor coupled to the memory, the processor being configured, based on instructions stored in the memory, to execute the resource approval method according to any embodiment of the present disclosure, or the training method of the random forest model according to any embodiment of the present disclosure.
  • a computer-storable medium is provided, with computer program instructions stored thereon.
  • when the instructions are executed by a processor, the resource approval method according to any embodiment of the present disclosure, or the training method of the random forest model according to any embodiment of the present disclosure, is implemented.
  • Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure
  • Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure
  • Figure 3 shows a schematic diagram of a random forest model determining whether to release resources according to some embodiments of the present disclosure
  • Figure 4 shows a flowchart of resource approval according to some embodiments of the present disclosure
  • Figure 5 shows a flow chart of a training method of a random forest model according to some embodiments of the present disclosure
  • Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure
  • Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure
  • Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure
  • Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure.
  • Figure 10 illustrates a block diagram of a computer system for implementing some embodiments of the present disclosure.
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • the nodes in the approval flow are often responsible persons at all levels of enterprises or institutions. Each node needs to approve the resource use requests of many users. It is inevitable that erroneous approval operations will occur, reducing the accuracy of approval.
  • the approval process often goes through multiple nodes, and the approval time at each node depends on the situation of the person in charge of the node. Any obstruction at any node will cause the entire approval flow to stagnate, reducing the efficiency of approval completion.
  • some embodiments of the present disclosure provide a resource approval method, a random forest training method and device, and a computer-readable storage medium.
  • Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure.
  • the resource approval method includes steps S110 to S150.
  • the following resource approval method is executed by the resource approval device.
  • resource approval devices include input and output devices and processors.
  • the resource approval method includes: obtaining the user's resource usage request through input and output devices (such as an interactive panel); using the processor to determine multiple characteristics of the user's resource usage request; using the processor to select, for each decision tree in the random forest model, the features corresponding to that decision tree from the multiple features; using the decision tree algorithm, executed by the processor, to predict the approval result based on the values of the features corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user; and using the processor to synthesize the approval results of each decision tree to determine whether to release the resources requested by the user.
  • Steps S110 to S150 will be introduced in detail below.
  • step S110 the user's request for resource use is obtained.
  • when the user needs to use resources beyond their own permissions, the user is prompted, and the resource usage request filled in by the user on the page is obtained.
  • step S120 multiple characteristics of the user's request for resource usage are determined.
  • the characteristics of the user's request to use resources include at least one of: the type of resource the user requests to use, the specifications of the resource the user requests to use, the number of resources the user requests to use, the user's resource usage permissions, and the user's reason for requesting to use the resource.
  • multiple features such as the user's position, rank, and responsible work are extracted as multiple features of the user's resource use request.
  • step S130 for each decision tree in the random forest model, a feature corresponding to the decision tree is selected from a plurality of features.
  • a random forest model consists of multiple decision trees, each of which makes predictions based on only a subset of the multiple features of the usage request.
  • Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure.
  • each decision tree includes multiple nodes, and each node of the decision tree corresponds to a feature. Therefore, the features corresponding to the decision tree are also the features corresponding to the multiple nodes of the decision tree.
  • the characteristics corresponding to the decision tree include: whether the user has the same historical usage request, the user's rank, the permission group to which the user belongs, the value of the resources applied by the user, and the number of resources applied by the user.
  • the permission group is the set of decision-making scopes and degrees over certain matters that a position holder must have in order to ensure the effective performance of their duties.
  • since each decision tree generates approval results based on only part of the features, that is, the multiple features are distributed among multiple decision trees for processing, the number of features each decision tree needs to process is smaller than the total number of features in the usage request. The model can therefore handle high-dimensional (multi-feature) usage requests without prior feature selection or feature dimensionality reduction, solving the approval problem for complex (multi-feature) user requests and improving the accuracy and efficiency of approving resource usage requests.
  • step S140 the approval result is predicted based on the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user.
  • the values of the multiple features corresponding to the decision tree include: no identical historical usage request exists; the user belongs to the high-permission group; the user's rank is PY; the value of the resource requested by the user is C; and the number of resources requested by the user is D, where PY is a higher rank than PX, C > A, and D > B.
  • the input of the decision tree is the characteristic values of multiple features corresponding to the decision tree.
  • the decision tree first determines whether an identical historical usage request exists. If the judgment result is no, it enters the "Does the user belong to the high-permission group?" node. At that node, if the judgment result is yes, it enters the "Is the resource value greater than A?" node. If the judgment result there is also yes, the final approval result of this decision tree is to release the resource to the user.
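The single-tree walk described above can be sketched as follows; the feature keys, the fallback for an existing historical request, and the threshold A are illustrative assumptions, not values taken from the disclosure.

```python
# Hypothetical sketch of the example decision tree's traversal.
# Feature keys and the threshold A are illustrative assumptions.

def approve(features, value_threshold_a):
    """Walk the example decision tree; return True to release resources."""
    if features["has_identical_historical_request"]:
        # Assumed branch: reuse the outcome of the identical past request.
        return features["historical_request_approved"]
    if not features["in_high_permission_group"]:
        return False
    # "Is the resource value greater than A?"
    return features["resource_value"] > value_threshold_a

request = {
    "has_identical_historical_request": False,
    "historical_request_approved": False,
    "in_high_permission_group": True,
    "resource_value": 120,  # C, with C > A
}
print(approve(request, value_threshold_a=100))  # True: release resources
```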
  • the approval results are predicted based on the values of the features corresponding to each decision tree, including multiple decision trees generating the approval results in parallel.
  • the multiple decision trees of the random forest can make predictions independently and in parallel, thereby increasing the speed of approval.
  • step S150 the approval results of each decision tree are integrated to determine whether to release the resources requested by the user.
  • Figure 3 shows a schematic diagram of a random forest model determining whether to release resources according to some embodiments of the present disclosure.
  • each decision tree generates an approval result based on some characteristics of the user's request for resource use.
  • the results are then jointly decided by multiple trees, and based on the majority voting mechanism, it is decided whether to release the resources requested by the user.
  • combining the approval results of each decision tree to determine whether to release the resources requested by the user includes: determining whether to release the resources requested by the user based on the proportion of the number of decision trees that generate the same approval result to the total number of decision trees, and on the first preset threshold.
  • determining whether to release the resource requested by the user based on that proportion and the first preset threshold includes: when the ratio of the number of decision trees that generate the same approval result to the total number of decision trees exceeds the first preset threshold, determining whether to release the resources requested by the user based on that approval result.
  • the random forest algorithm predicts results by constructing a large number of independent decision trees.
  • the number of decision trees predicting the same result needs to reach a preset threshold before the result will be accepted.
  • the voting mechanism can be a one-vote veto system, simple majority rule, weighted majority, etc. Once the vote passes, the approval flow automatically releases the resources and ends the approval immediately.
  • for example, let the first preset threshold be 0.8 and let the random forest consist of n decision trees. If the approval result of m decision trees is "release resources to the user" and m/n exceeds 0.8, the random forest determines that the final result is to release the resources requested by the user.
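A minimal sketch of this thresholded majority vote; the function name and vote labels below are hypothetical, while the 0.8 threshold matches the example above.

```python
from collections import Counter

def forest_decision(tree_results, threshold=0.8):
    """Combine per-tree approval results.

    Returns the shared result if its share of votes exceeds the threshold,
    otherwise None, meaning the request falls back to another approval method.
    """
    result, count = Counter(tree_results).most_common(1)[0]
    if count / len(tree_results) > threshold:
        return result
    return None  # below threshold: cannot decide automatically

# n = 10 trees, m = 9 of them vote "release": 9/10 > 0.8
votes = ["release"] * 9 + ["reject"]
print(forest_decision(votes))                             # release
print(forest_decision(["release"] * 5 + ["reject"] * 5))  # None
```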
  • determining whether to release the resource requested by the user based on that proportion and the first preset threshold includes: when the ratio of the number of decision trees that generate the same approval result to the total number of decision trees does not exceed the first preset threshold, determining whether to release the resources requested by the user based on multiple characteristics.
  • in this case, the random forest cannot automatically determine whether the approval passes, and the approval form is transferred to other approval methods: for example, an approver determines whether to release the resources based on the type of resource the user requests, the specifications of the resources requested, the number of resources requested, the user's resource usage permissions, and the user's reasons for requesting to use the resources.
  • This disclosure integrates the results of all decision trees to determine whether to release the resources requested by the user, reduces the impact of errors of a single decision tree on the final result, and improves the accuracy of approval.
  • Figure 4 illustrates a flowchart of resource approval according to some embodiments of the present disclosure.
  • the approval flow starts. At this time, the approval flow automatically enters the Listen state, waiting for the user to enter approval information at the front end.
  • the user's request for resource usage also includes the user's historical usage request for the resource.
  • information such as the specifications of the resources the user applies to use is automatically obtained and filled in with the specifications the user previously selected or commonly uses; the applicant only needs to add the scenarios and reasons for the requested resources.
  • the status of the approval flow Listen lasts for 30 minutes. If the user does not fill out and submit the approval form within 30 minutes, the approval form will be automatically closed.
  • after the user fills in the approval form information, the first node reached by the approval flow is the process engine.
  • a certain number of preset rules will be built into the Process engine. These rules support customization.
  • the platform administrator can customize, according to the needs of the company's organizational structure, which permission groups (such as which departments and positions) can have which resource calls exempted from approval.
  • the process engine stipulates that testers can be exempted from approval when they apply for cloud hosts that exceed the specified specifications within a certain range for link stress testing.
  • Some rules are preset based on hard conditions such as security requirements, company rules and regulations, and management requirements.
  • the process engine uses these rules to directly filter out which usage requests need to enter other approval methods, such as for some cross-border requests.
  • these special resource usage requests are divided into other approval methods. The remaining resource usage requests enter the processing flow of the random forest algorithm.
  • the system captures the relevant features of the applicant's usage request, such as the applicant's position, rank, responsible work, and other factors that affect the approval result. These features are input into the decision trees as the basis for prediction. Finally, the random forest formed by multiple decision trees determines whether to pass the user's request.
  • the results of random forest predictions need to reach a certain threshold before they will be accepted. If the prediction results do not reach the threshold, it will not be possible to automatically determine whether the approval has been passed, and the approval form will be transferred to other approval methods. For resource usage requests that pass the approval, the resources will be automatically released to the user and the approval process will be completed. For resource usage requests that fail, the Listen state will be returned, waiting for the user to modify the information.
  • Resource usage requests that enter other approval methods will also have two statuses: passed and failed. For resource usage requests that pass the approval, the resources will be released to the user and the approval process will end. For resource usage requests that fail, the Listen state will be returned, waiting for the user to modify the information.
  • Figure 5 shows a flow chart of a training method of a random forest model according to some embodiments of the present disclosure.
  • the training method of the random forest model includes steps S210-S240.
  • the training method of the random forest model is executed by a training device of the random forest model.
  • in step S210, a training set is obtained, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user.
  • a usage request sample includes a user-filled request for resource usage, as well as a label indicating whether to release the resource requested by the user.
  • step S220 multiple characteristics of the user's resource usage request are determined.
  • multiple features such as the user's position, rank, and responsible work are extracted as multiple features of the user's resource use request, and the values of these features are determined.
  • step S230 for each decision tree in the random forest model, some features are extracted from multiple features as candidate features of the decision tree.
  • a part of features are randomly extracted and used as candidate features for decision tree training.
  • the sample has Y features, and T (T ≤ Y) features are randomly selected from all features of the sample as candidate features for a decision tree.
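A minimal sketch of drawing T candidate features from the Y available ones; the feature names below are hypothetical examples, not a list from the disclosure.

```python
import random

def sample_candidate_features(all_features, t, seed=None):
    """Randomly pick T of the Y features (T <= Y) as a tree's candidates."""
    rng = random.Random(seed)
    return rng.sample(all_features, t)  # sampling without replacement

# Y = 6 hypothetical features of a resource usage request
features = ["resource_type", "resource_spec", "resource_count",
            "user_permission", "request_reason", "user_rank"]
candidates = sample_candidate_features(features, t=3, seed=0)
print(len(candidates))  # 3
```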
  • Random sampling with replacement makes the probability of each sample being drawn conform to a uniform distribution.
  • training each decision tree according to the value of the candidate feature of each decision tree and the label of the sample includes: for each decision tree, extracting multiple samples from the training set as training for the decision tree Set; train the decision tree based on the values of candidate features corresponding to the decision tree in the training set of the decision tree and the label indicating whether to release the resource requested by the user.
  • random sampling with replacement is also used to extract samples from the training set.
  • S (S ≤ X) samples are randomly sampled from the data set with replacement as a training set for a decision tree.
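The bootstrap draw of S samples with replacement can be sketched as follows; each draw is uniform over the X samples, so some samples may repeat while others are left out.

```python
import random

def bootstrap_sample(dataset, s, seed=None):
    """Draw S samples (S <= X) with replacement, uniformly per draw."""
    rng = random.Random(seed)
    return [rng.choice(dataset) for _ in range(s)]

dataset = list(range(100))           # X = 100 labelled usage-request samples
training_set = bootstrap_sample(dataset, s=80, seed=0)
print(len(training_set))             # 80, possibly with repeated samples
```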
  • the training set is divided into multiple subsets.
  • Each decision tree is constructed using a subset as a training set.
  • multiple trained decision trees form a forest.
  • rows and columns are randomly selected, which can truly randomly divide the entire data table into multiple parts, and use one part for each decision tree.
  • when the number of decision trees is large enough, there is always a decision tree that can capture the value of the data set to the greatest extent, thereby improving the accuracy of resource approval by the random forest model.
  • step S240 each decision tree is trained according to the value of the candidate feature of each decision tree and the label of the sample.
  • the following describes the training method of a single decision tree.
  • training each decision tree according to the values of its candidate features and the labels of the samples includes: taking the root node of the decision tree as the current node, and selecting the feature corresponding to the root node from the candidate features according to the training set; according to the values of the feature corresponding to the current node in the samples of the training set corresponding to the current node, and the labels of the samples, determining the training sets corresponding to the child nodes of the current node; according to the values of the features in the samples of the training sets corresponding to the child nodes of the current node, and the labels of the samples, selecting the features corresponding to the child nodes of the current node from the remaining candidate features; and taking each child node of the current node as the current node in turn, looping to determine the child nodes of the current node.
  • the child nodes of the current node include a first child node and a second child node of the current node; determining the training sets corresponding to the child nodes of the current node according to the values of the feature corresponding to the current node in the samples of the training set corresponding to the current node, and the labels of the samples, includes: selecting, based on those values and labels, a feature value from the value range of the feature corresponding to the current node as the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node; and, according to that split point, determining whether each sample in the training set corresponding to the current node is divided into the training set of the first child node or the training set of the second child node.
  • the S samples, with their T extracted features, are used to train the decision tree model.
  • first, the feature corresponding to the root node is selected.
  • for example, a CART (classification and regression tree) is used. CART is a binary tree that splits recursively downward from the root; that is, each of its nodes has only two choices, "yes" and "no". By continuously dividing the feature space into a finite number of units, the predicted probability distribution on these units is determined.
  • the Gini coefficient represents the impurity of the model. The smaller the Gini coefficient, the lower the impurity, and the better the feature.
  • the purity of data set D can be measured by the Gini value. Assuming the samples in the set belong to K classes, and p_k is the proportion of samples of class k, the Gini coefficient is calculated as: Gini(D) = 1 − Σ_{k=1}^{K} p_k².
  • Gini(D) reflects the probability that two samples randomly selected from the data set D have inconsistent class labels. Therefore, the smaller Gini(D), the higher the purity of data set D.
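The Gini value described above can be computed, for example, as follows; a pure set has Gini 0, and an evenly mixed binary set has Gini 0.5.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_k^2) over the K class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["release"] * 4))                             # 0.0 (pure)
print(gini(["release", "release", "reject", "reject"]))  # 0.5 (evenly mixed)
```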
  • the Gini coefficient of each value of each available feature of the current node with respect to data set D is calculated. For example, first determine the value range of feature A based on the samples of the training set. Taking the current node as the root node, the training set corresponding to the root node is D. From the value range of A, select a value a. According to whether feature A takes the value a, the training set D is divided into two parts: when the value of feature A in a sample is a, the sample is divided into training set D1; otherwise, it is divided into training set D2.
  • the Gini coefficient of feature A with cut point a for data set D is calculated as: Gini(D, A = a) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2).
  • Gini(D1) represents the Gini coefficient of data set D1.
  • the decision tree can handle both continuous and discrete values. For continuous values, assuming the continuous feature A takes m distinct values across the samples, arranged from small to large, CART takes the average of each pair of adjacent values as a candidate dividing point, giving m − 1 dividing points in total. The Gini coefficient is calculated separately for each of these m − 1 points used as a binary classification point, and the point with the smallest Gini coefficient is selected as the cut point of the continuous feature. For example, if the point with the smallest Gini coefficient is a, then values less than a form category 1 and values greater than a form category 2, thereby discretizing the continuous feature.
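Selecting the continuous-feature cut point among the m − 1 midpoints can be sketched as follows; the feature values and labels are illustrative, not data from the disclosure.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Try the m-1 midpoints of adjacent sorted values of a continuous
    feature and return the cut point with the smallest weighted Gini."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (cut, weighted)
    return best

values = [1, 2, 3, 10, 11, 12]          # illustrative resource-value feature
labels = ["reject"] * 3 + ["release"] * 3
cut, score = best_split(values, labels)
print(cut, score)                        # 6.5 0.0: a perfect split
```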
  • for a discrete feature, CART uses repeated binary splitting. For a feature A with values a1, a2 and a3, CART considers the three binary groupings ({a1}, {a2, a3}), ({a2}, {a1, a3}) and ({a3}, {a1, a2}), finds the grouping with the smallest Gini coefficient, for example ({a2}, {a1, a3}), and then creates a binary tree node: one child holds the samples corresponding to a2, and the other holds the samples corresponding to a1 and a3. Because the values of feature A are not completely separated by this split, feature A may be used again when splitting the descendant nodes.
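The binary groupings of a discrete feature's values can be enumerated as below; this is a sketch of the grouping step only, without the Gini comparison that follows it:

```python
from itertools import combinations

def binary_partitions(values):
    """All ways to split a discrete feature's value set into two non-empty
    groups; CART then picks the grouping with the smallest weighted Gini."""
    vals = sorted(values)
    parts = []
    for r in range(1, len(vals) // 2 + 1):
        for left in combinations(vals, r):
            right = tuple(v for v in vals if v not in left)
            if r == len(vals) - r and left > right:
                continue  # equal-sized halves: avoid counting a partition twice
            parts.append((left, right))
    return parts

print(binary_partitions(["a1", "a2", "a3"]))
# [(('a1',), ('a2', 'a3')), (('a2',), ('a1', 'a3')), (('a3',), ('a1', 'a2'))]
```

For m distinct values this yields 2^(m-1) - 1 candidate partitions, matching the three cases named in the text for m = 3.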
  • after determining the split point that divides the training set corresponding to the current node into the training set of its first sub-node and the training set of its second sub-node, each sample in the current node's training set is assigned, according to the split point, either to the first sub-node or to the second sub-node, thereby generating the training sets of the first and second sub-nodes.
  • each child node is then taken as the current node in turn, and the above steps of determining the feature of the current node based on its training set and determining the training sets of its child nodes based on that feature are repeated until a cutoff condition is reached, at which point the decision subtree is returned and recursion stops at the current node; in this way the entire decision tree is built.
  • the interaction between different features can also be measured. For example, if the training set in a decision tree is split into two child nodes according to a certain feature M, and splitting on feature J then becomes easier, features M and J are said to interact.
  • the cutoff conditions include at least one of the following: no candidate features remain, the number of samples in the training set corresponding to the current node is less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold.
  • when a cutoff condition is met, the decision subtree is returned and recursion stops at the current node.
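The recursive construction and cutoff conditions above can be sketched as a minimal CART on binary {0, 1} features; the `over_quota` feature and the labels are hypothetical, and the sketch omits the empty-branch and multi-way cases a production implementation would need:

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def build_tree(rows, labels, features, min_samples=1):
    """Recursive CART sketch: pick the feature with the smallest weighted
    Gini, split, and recurse until a cutoff condition is met (pure node,
    no remaining candidate features, or too few samples)."""
    if len(set(labels)) == 1 or not features or len(rows) <= min_samples:
        return max(set(labels), key=labels.count)  # leaf: majority label
    best_f, best_g = None, float("inf")
    for f in features:
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if g < best_g:
            best_f, best_g = f, g
    rest = [f for f in features if f != best_f]
    return {
        "feature": best_f,
        0: build_tree([r for r in rows if r[best_f] == 0],
                      [y for r, y in zip(rows, labels) if r[best_f] == 0], rest),
        1: build_tree([r for r in rows if r[best_f] == 1],
                      [y for r, y in zip(rows, labels) if r[best_f] == 1], rest),
    }

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree[row[tree["feature"]]]
    return tree

# Hypothetical feature: 0 = within quota, 1 = over quota.
rows = [{"over_quota": 0}, {"over_quota": 0}, {"over_quota": 1}, {"over_quota": 1}]
labels = ["approve", "approve", "reject", "reject"]
tree = build_tree(rows, labels, ["over_quota"])
print(predict(tree, {"over_quota": 1}))  # reject
```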
  • the present disclosure automatically determines the importance of the features of a user's resource usage request based on the Gini coefficient. In addition, it can measure the interaction between different features, build decision trees, and generate resource approval results without dimensionality reduction or feature selection, which improves the accuracy and efficiency of approving resource usage requests.
  • multiple decision trees are constructed, which together form the random forest model.
  • determining the multiple characteristics of the user's resource usage request includes: when the value of a feature is missing from a sample of the user's resource usage request, calculating the similarity between the paths that this sample and other samples take through the nodes of the decision tree, and determining the value of the missing feature of the sample based on that similarity.
  • for example, first preset estimates for the missing values in the samples: for a numeric variable, use the median of the remaining data as the estimate of the missing value; for a categorical variable, use the mode. Then build a random forest based on the estimated values, run all the data through it again, and record the step-by-step classification path of each group of data in the decision trees to determine which groups of data are most similar to the data with missing values. A similarity matrix is introduced to record the similarity between the data; for example, if there are N groups of data, the similarity matrix has size N*N. A missing numeric value is then re-estimated by a similarity-weighted average, and a missing categorical value by similarity-weighted voting, and this is repeated until the estimates become stable.
  • the filled-in data is random rather than deterministic, and can better reflect the true distribution of the unknown data.
  • because each node uses a random subset of features instead of all the features of the training set, the method is well suited to filling in high-dimensional data. Therefore, the present disclosure can reduce the interference of missing values with resource approval and improve the accuracy of approval of resource usage requests.
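One refinement step of the similarity-weighted fill described above can be sketched as follows; the leaf ids and observed values are made up, and in practice they would come from running the samples through the trained forest:

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two samples land in the same leaf,
    i.e. how similar their paths through the forest are."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)

def impute_numeric(missing_leaves, others):
    """Proximity-weighted average of the other samples' observed values,
    as one refinement step of the iterative fill described above.
    `others` is a list of (leaf_ids, observed_value) pairs."""
    weights = [proximity(missing_leaves, lv) for lv, _ in others]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, others)) / total

# Hypothetical leaf ids in a 3-tree forest; the first two samples take
# nearly the same path as the sample with the missing value.
others = [([1, 4, 7], 10.0), ([1, 4, 9], 12.0), ([2, 5, 8], 40.0)]
print(impute_numeric([1, 4, 7], others))  # 10.8: dominated by similar samples
```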
  • training each decision tree includes pruning the decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
  • Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure.
  • the post-pruning method is used: a decision tree is first generated, then all pruned CART subtrees are generated from it, cross-validation is used to test the effect of each pruning, and the pruning strategy with the best generalization performance is selected.
  • the loss function of the subtree T_t rooted at node t is:

    C_α(T_t) = C(T_t) + α·|T_t|
  • after pruning T_t back to its root, the loss function of the single node t is:

    C_α(t) = C(t) + α
  • α is the regularization parameter (analogous to the regularization coefficient in linear regression)
  • C(T_t) is the prediction error on the verification data (that is, the Gini coefficient measured on the verification data)
  • |T_t| is the number of leaf nodes of the subtree T_t.
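The prune-or-keep comparison implied by the two loss functions above can be sketched as plain arithmetic; the error values and α are illustrative:

```python
def cost_complexity(error, leaves, alpha):
    """C_alpha(T) = C(T) + alpha * |T|: verification error plus a penalty
    proportional to the number of leaves."""
    return error + alpha * leaves

def should_prune(node_error, subtree_error, subtree_leaves, alpha):
    """Collapse the subtree to a single leaf when the pruned cost is no
    worse than the cost of keeping the subtree."""
    return cost_complexity(node_error, 1, alpha) <= cost_complexity(
        subtree_error, subtree_leaves, alpha)

# A 5-leaf subtree barely beats its root on verification error: with a
# large enough alpha, the leaf-count penalty tips the balance.
print(should_prune(0.20, 0.18, 5, alpha=0.001))  # False: keep the subtree
print(should_prune(0.20, 0.18, 5, alpha=0.01))   # True: prune it
```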
  • each decision tree is trained based on the values of its candidate features and the labels of the samples, and the multiple decision trees can be trained in parallel. For example, training the decision trees of the random forest in parallel and independently improves the training speed of the random forest model.
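Because the trees are independent, they can be fitted concurrently; a sketch with a thread pool, where `train_one_tree` is a placeholder standing in for fitting one CART on a bootstrap sample:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def train_one_tree(seed, data):
    """Placeholder for fitting one decision tree on a bootstrap sample."""
    rng = random.Random(seed)
    bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
    return {"seed": seed, "n_samples": len(bootstrap)}  # stand-in "model"

data = list(range(100))
with ThreadPoolExecutor(max_workers=4) as pool:
    forest = list(pool.map(lambda s: train_one_tree(s, data), range(10)))
print(len(forest))  # 10 trees trained independently
```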
  • Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure.
  • the resource approval device 7 includes an acquisition module 71 , a first determination module 72 , a selection module 73 , a prediction module 74 , and a second determination module 75 .
  • the obtaining module 71 is configured to obtain the user's request for resource use, for example, performing step S110 as shown in Figure 1 .
  • the first determination module 72 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S120 as shown in FIG. 1 .
  • the selection module 73 is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features, for example, perform step S130 as shown in FIG. 1 .
  • the prediction module 74 is configured to predict the approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user, for example, perform step S140 as shown in FIG. 1 .
  • the second determination module 75 is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user, for example, performing step S150 as shown in FIG. 1 .
  • Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure.
  • the training device of the random forest model includes an acquisition module 81, a determination module 82, an extraction module 83, and a training module 84.
  • the acquisition module 81 is configured to acquire a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to issue the resources requested by the user. For example, step S210 shown in Figure 5 is performed.
  • the determination module 82 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S220 as shown in FIG. 5 .
  • the extraction module 83 is configured to, for each decision tree in the random forest model, extract some features from multiple features as candidate features of the decision tree, for example, perform step S230 as shown in FIG. 5 .
  • the training module 84 is configured to train each decision tree according to the value of the candidate feature of each decision tree and the label of the sample, for example, perform step S240 as shown in FIG. 5 .
  • Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure.
  • the electronic device 9 includes a memory 91; and a processor 92 coupled to the memory 91.
  • the memory 91 is used to store instructions for executing corresponding embodiments of the resource approval method or the training method of the random forest model.
  • the processor 92 is configured to execute the resource approval method or the random forest model training method in any embodiment of the present disclosure based on instructions stored in the memory 91 .
  • Figure 10 illustrates a block diagram of a computer system for implementing some embodiments of the present disclosure.
  • Computer system 100 may be embodied in the form of a general purpose computing device.
  • Computer system 100 includes memory 1010, a processor 1020, and a bus 1000 that connects various system components.
  • the memory 1010 may include, for example, system memory, non-volatile storage media, and the like.
  • System memory stores, for example, operating systems, applications, boot loaders, and other programs.
  • System memory may include volatile storage media such as random access memory (RAM) and/or cache memory.
  • the non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one of the resource approval methods or the random forest model training methods in any embodiments of the present disclosure.
  • Non-volatile storage media includes but is not limited to disk storage, optical storage, flash memory, etc.
  • the processor 1020 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or other discrete hardware components.
  • each module, such as the judgment module and the determination module, can be implemented by a central processing unit (CPU) executing instructions stored in memory that perform the corresponding steps, or by dedicated circuits that perform the corresponding steps.
  • Bus 1000 may use any of a variety of bus structures.
  • bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
  • the computer system 100 may also include an input/output interface 1030, a network interface 1040, a storage interface 1050, and the like. These interfaces 1030, 1040, 1050, the memory 1010 and the processor 1020 may be connected through a bus 1000.
  • the input and output interface 1030 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard.
  • Network interface 1040 provides connection interfaces for various networked devices.
  • the storage interface 1050 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.
  • these computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable device to produce a machine, such that the instructions, when executed by the processor, produce means for implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • these computer-readable program instructions may also be stored in a computer-readable memory, causing the computer to operate in a specific manner to produce an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • the disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.

Abstract

The present disclosure relates to the technical field of cloud computing, and relates to a resource examination and approval method and device, and a random forest model training method and device. The resource examination and approval method comprises: acquiring a resource usage request of a user; determining a plurality of features of the resource usage request of the user; for each decision tree in a random forest model, selecting, from the plurality of features, a feature corresponding to the decision tree; predicting an examination and approval result according to the value of the feature corresponding to each decision tree, wherein the examination and approval result represents whether a resource requested by the user is issued; and in view of the examination and approval results of the decision trees, determining whether to issue the resource requested by the user.

Description

Resource approval method, random forest model training method and device
Cross-references to related applications
This application is based on, and claims priority to, the application with Chinese application number 202210905742.3 filed on July 29, 2022; the disclosure of that Chinese application is hereby incorporated into this application in its entirety.
Technical field
The present disclosure relates to the field of cloud computing technology, and in particular to resource approval methods, random forest model training methods and devices, and computer-readable storage media.
Background
A private cloud provides services to organizations within an enterprise and can offer enterprise users a variety of cloud products, forming a complex cloud ecological chain with high data security and strong controllability of the IT infrastructure.
Enterprise-level users usually have complex multi-layered internal organizational structures. In a private cloud scenario, users at different levels within the enterprise are given differentiated permissions, and the specifications of the resources that users with different permissions can use also differ. When a user needs to use resources beyond their own permissions, they must submit an application and wait for it to be approved before they can use the product normally.
Contents of the invention
According to a first aspect of the present disclosure, a resource approval method is provided, including:
obtaining a user's resource usage request;
determining multiple features of the user's resource usage request;
for each decision tree in a random forest model, selecting, from the multiple features, the feature corresponding to that decision tree;
predicting an approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to issue the resource requested by the user;
determining, based on the approval results of all the decision trees, whether to issue the resource requested by the user.
In some embodiments, determining whether to issue the resource requested by the user based on the approval results of all the decision trees includes:
determining whether to issue the resource requested by the user according to a first preset threshold and the proportion of the total number of decision trees represented by the decision trees that generate the same approval result.
In some embodiments, this determination includes:
when the proportion of decision trees that generate the same approval result exceeds the first preset threshold, determining, based on that approval result, whether to issue the resource requested by the user.
In some embodiments, this determination includes:
when the proportion of decision trees that generate the same approval result does not exceed the first preset threshold, determining, based on the multiple features, whether to issue the resource requested by the user.
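The two threshold cases above can be combined into one decision routine; this is a sketch, and the 0.8 threshold and the "manual_review" fallback (standing in for the feature-based re-examination) are assumptions, not values from the disclosure:

```python
def decide(votes, threshold=0.8):
    """votes: one 0/1 entry per decision tree (1 = approve).
    If the share of trees agreeing on one result exceeds the threshold,
    adopt that result; otherwise fall back to a fuller review of the
    features, signalled here as 'manual_review'."""
    approve_share = sum(votes) / len(votes)
    reject_share = 1 - approve_share
    if approve_share > threshold:
        return "approve"
    if reject_share > threshold:
        return "reject"
    return "manual_review"

print(decide([1] * 9 + [0]))      # approve: 90% of trees agree
print(decide([1] * 6 + [0] * 4))  # manual_review: no strong majority
```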
In some embodiments, the user's resource usage request further includes the user's historical resource usage requests.
In some embodiments, the features of the user's resource usage request include at least one of: the type of resource the user requests to use, the specification of the resource, the quantity of resources, the user's resource usage permission, and the user's reason for requesting the resource.
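A request with the features listed above might be encoded as a numeric vector like this; the field names and category codes are illustrative, not from the disclosure:

```python
# Hypothetical category codings for the non-numeric features.
RESOURCE_TYPES = {"vm": 0, "storage": 1, "database": 2}
REASONS = {"testing": 0, "production": 1, "analytics": 2}

def encode_request(req):
    """Map one resource usage request to a feature vector."""
    return [
        RESOURCE_TYPES[req["resource_type"]],  # type of resource requested
        req["spec_cpus"],                      # specification (e.g. CPU count)
        req["quantity"],                       # number of resources
        req["user_permission_level"],          # user's permission level
        REASONS[req["reason"]],                # stated reason, coded
    ]

req = {"resource_type": "vm", "spec_cpus": 16, "quantity": 2,
       "user_permission_level": 3, "reason": "testing"}
print(encode_request(req))  # [0, 16, 2, 3, 0]
```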
According to a second aspect of the present disclosure, a training method for a random forest model is provided, including:
obtaining a training set, where the training set includes samples of users' resource usage requests, and each sample further includes a label indicating whether the requested resource was issued;
determining multiple features of the users' resource usage requests;
for each decision tree in the random forest model, extracting some of the multiple features as candidate features of that decision tree;
training each decision tree according to the values of its candidate features and the labels of the samples.
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes:
taking the root node of the decision tree as the current node and, based on the training set, selecting the feature corresponding to the root node from the candidate features;
determining the training set corresponding to each child node of the current node based on the values, in the samples of the current node's training set, of the feature corresponding to the current node, and the labels of those samples;
selecting, from the remaining candidate features, the feature corresponding to each child node of the current node based on the values of the current node's feature in the samples of that child node's training set and the labels of those samples;
taking each child node as the current node in turn, and repeating the steps of determining the training sets of the current node's child nodes and selecting features for them from the remaining candidate features, until a cutoff condition is reached.
In some embodiments, the child nodes of the current node include a first child node and a second child node, and determining the training set corresponding to each child node includes:
selecting, based on the values of the current node's feature in the samples of its training set and the labels of those samples, one value from the value range of that feature as the split point dividing the training set of the first child node from the training set of the second child node;
judging, according to that split point, whether each sample in the current node's training set is assigned to the training set of the first child node or to that of the second child node.
In some embodiments, the cutoff condition includes at least one of: no candidate features remain, the number of samples in the training set corresponding to the current node is less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold.
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes:
for each decision tree, extracting multiple samples from the training set as the training set of that decision tree;
training the decision tree according to the values, in the samples of its training set, of the candidate features corresponding to that decision tree, and the labels indicating whether the requested resources were issued.
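The per-tree sample extraction above, combined with the per-tree candidate-feature extraction, can be sketched as follows; the feature names are hypothetical:

```python
import random

def tree_inputs(n_rows, all_features, n_feats, seed):
    """One tree's training inputs: a bootstrap sample of row indices plus
    a random subset of the features (the two randomization steps above)."""
    rng = random.Random(seed)
    row_idx = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
    feats = rng.sample(all_features, n_feats)                 # without replacement
    return row_idx, feats

rows, feats = tree_inputs(6, ["type", "spec", "qty", "perm", "reason"], 3, seed=1)
print(len(rows), sorted(feats))
```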
In some embodiments, determining the multiple features of the user's resource usage request includes:
when the value of a feature is missing from a sample of the user's resource usage request, calculating the similarity between the paths that this sample and other samples take through the nodes of the decision tree;
determining the value of the missing feature of the sample based on the similarity of those paths.
According to a third aspect of the present disclosure, a resource approval device is provided, including:
an acquisition module configured to obtain a user's resource usage request;
a first determination module configured to determine multiple features of the user's resource usage request;
a selection module configured to, for each decision tree in a random forest model, select, from the multiple features, the feature corresponding to that decision tree;
a prediction module configured to predict an approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to issue the resource requested by the user;
a second determination module configured to determine, based on the approval results of all the decision trees, whether to issue the resource requested by the user.
According to a fourth aspect of the present disclosure, a training device for a random forest model is provided, including:
an acquisition module configured to obtain a training set, where the training set includes samples of users' resource usage requests, and each sample further includes a label indicating whether the requested resource was issued;
a determination module configured to determine multiple features of the users' resource usage requests;
an extraction module configured to, for each decision tree in the random forest model, extract some of the multiple features as candidate features of that decision tree;
a training module configured to train each decision tree according to the values of its candidate features and the labels of the samples.
According to a fifth aspect of the present disclosure, an electronic device is provided, including:
a memory; and
a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the resource approval method according to any embodiment of the present disclosure, or the random forest model training method according to any embodiment of the present disclosure.
According to a sixth aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored which, when executed by a processor, implement the resource approval method according to any embodiment of the present disclosure, or the random forest model training method according to any embodiment of the present disclosure.
Description of drawings
The accompanying drawings, which constitute a part of this specification, describe embodiments of the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
The present disclosure may be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure;
Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure;
Figure 3 shows a schematic diagram of a random forest model determining whether to issue resources according to some embodiments of the present disclosure;
Figure 4 shows a flow chart of resource approval according to some embodiments of the present disclosure;
Figure 5 shows a flow chart of a training method for a random forest model according to some embodiments of the present disclosure;
Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure;
Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure;
Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure;
Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure;
Figure 10 shows a block diagram of a computer system for implementing some embodiments of the present disclosure.
Detailed description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
At the same time, it should be understood that, for convenience of description, the dimensions of the various parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, such techniques, methods, and devices should be considered part of the specification.
In all the examples shown and discussed herein, any specific value should be interpreted as merely illustrative rather than limiting; other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
In the related art, when a user needs to use a product beyond their permissions, the request generally has to pass through layer-by-layer approval at multiple nodes. This approach has the following problems.
First, the nodes in the approval flow are often the persons in charge at various levels of an enterprise or institution. Each node has to approve the resource usage requests of many users, so erroneous approval operations are hard to avoid, which reduces the accuracy of approval.
Second, the approval process often passes through multiple nodes, and the time taken at each node depends on the circumstances of the person in charge of that node; a blockage at any node stalls the entire approval flow, reducing the efficiency with which approvals are completed.
Finally, this approval method is difficult to adapt to complex business needs. Sometimes users need to combine resources from different cloud products to complete their target tasks, but the nodes of the traditional approval method are fixed, making it difficult to adjust in time to users' differentiated needs.
To solve the above problems, some embodiments of the present disclosure provide a resource approval method, a random forest training method and device, and a computer-readable storage medium.
Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure.
As shown in Figure 1, the resource approval method includes steps S110 to S150. In some embodiments, the following resource approval method is executed by a resource approval device.
For example, the resource approval device includes an input/output device and a processor. The resource approval method includes: obtaining a user's resource usage request through the input/output device (such as an interactive panel); using the processor to determine multiple features of the user's resource usage request; using the processor to select, for each decision tree in a random forest model, the features corresponding to that decision tree from among the multiple features; using a decision tree algorithm, executed by the processor, to predict an approval result from the values of the features corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user; and using the processor to combine the approval results of all the decision trees to determine whether to release the resource requested by the user.
Steps S110 to S150 are described in detail below.
In step S110, the user's resource usage request is obtained.
For example, when the user needs to use a resource beyond the user's own permissions, the user is prompted, and the resource usage request that the user fills in on a page is obtained.
In step S120, multiple features of the user's resource usage request are determined.
In some embodiments, the features of the user's resource usage request include at least one of: the type of the requested resource, the specification of the requested resource, the quantity of the requested resource, the user's resource usage permissions, and the user's reason for requesting the resource.
For example, features such as the user's position, rank, and work responsibilities are extracted from the resource usage request submitted by the user as the multiple features of the request.
In step S130, for each decision tree in the random forest model, the features corresponding to that decision tree are selected from among the multiple features.
The random forest model includes multiple decision trees, and each decision tree makes its prediction based on only a subset of the multiple features of the usage request.
Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure.
As shown in Figure 2, each decision tree includes multiple nodes, and each node of the decision tree corresponds to one feature. The features corresponding to a decision tree are therefore the features corresponding to its nodes. For example, the features corresponding to the decision tree include: whether the user has made the same historical usage request, the user's rank, the permission group to which the user belongs, the value of the requested resource, and the quantity of the requested resource. Here, a permission group is the set of decision-making scopes and degrees over certain matters that a position holder must have in order to perform the duties of the position effectively.
For a trained decision tree, which feature each node corresponds to is known; that is, it is known which of the multiple features each decision tree needs to use. Therefore, when using the decision trees for prediction, the features obtained in step S120 only need to be distributed to each trained decision tree according to that tree's nodes.
Because each decision tree generates its approval result from only a part of the features, that is, the multiple features are divided among multiple decision trees for processing, each decision tree handles fewer features than the original total number of features in the usage request. The model can therefore handle high-dimensional usage requests (requests with many features) without feature selection in advance and without feature dimensionality reduction, can solve the approval problem for complex (multi-feature) user requests, and improves the accuracy and efficiency of approving resource usage requests.
In step S140, an approval result is predicted from the values of the features corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user.
For example, in the user's resource usage request, the values of the features corresponding to the decision tree include: no identical historical usage request exists, the user belongs to a high-permission group, the user's rank is PY, the value of the requested resource is C, and the quantity of the requested resource is D, where rank PY is higher than rank PX, C > A, and D > B.
As shown in Figure 2, the input of the decision tree is the values of the features corresponding to the tree. The decision tree first judges whether "the same historical usage request exists"; since the result is no, it proceeds to the node "does the user belong to a high-permission group?". At that node the result is yes, so it proceeds to the judgment "is the resource value greater than A?". At that node the result is yes, so the final approval result of this decision tree is to release the resource to the user.
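The decision path described above can be expressed as a chain of feature tests. The following is a minimal sketch, not the patent's implementation (the language, the field names, and the outcomes of the branches that the text does not describe are all assumptions):

```python
def approve(request, a_threshold):
    """Follow the Figure 2 path; return True to release the resource."""
    if request["has_same_historical_request"]:      # node 1 (branch assumed)
        return True
    if not request["in_high_permission_group"]:     # node 2 (branch assumed)
        return False
    # node 3: "is the resource value greater than A?"
    return request["resource_value"] > a_threshold

req = {"has_same_historical_request": False,
       "in_high_permission_group": True,
       "resource_value": 100}
print(approve(req, a_threshold=50))  # True: matches the path in the example
```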
In some embodiments, predicting the approval result from the values of the features corresponding to each decision tree includes multiple decision trees generating their approval results in parallel.
For example, the multiple decision trees of the random forest can be run in parallel, each making its prediction independently, which increases approval speed.
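Because the trees predict independently, the parallel evaluation can be sketched with a standard thread pool; the stand-in lambda "trees" below are assumptions, since real trees would be trained models:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_predict(trees, features):
    """Run every tree's prediction independently and collect the votes."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda tree: tree(features), trees))

# Stand-in "trees": each callable votes True (release) or False (deny).
trees = [lambda f: f["rank"] > 3,
         lambda f: f["value"] < 100,
         lambda f: not f["has_history"]]
votes = parallel_predict(trees, {"rank": 5, "value": 50, "has_history": False})
print(votes)  # [True, True, True]
```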
In step S150, the approval results of all the decision trees are combined to determine whether to release the resource requested by the user.
Figure 3 shows a schematic diagram of a random forest model determining whether to release a resource according to some embodiments of the present disclosure.
As shown in Figure 3, each decision tree generates an approval result from part of the features of the user's resource usage request; the trees then decide jointly, using a majority voting mechanism, whether to release the resource requested by the user.
In some embodiments, combining the approval results of all the decision trees to determine whether to release the resource requested by the user includes: determining whether to release the resource according to the proportion of decision trees generating the same approval result relative to the total number of decision trees, and a first preset threshold.
In some embodiments, this includes: when the proportion of decision trees generating the same approval result exceeds the first preset threshold, determining whether to release the resource requested by the user according to that approval result.
The random forest algorithm predicts the result by constructing a large number of independent decision trees; the number of decision trees predicting the same result must reach a preset threshold before the result is accepted. The voting mechanism may be a one-vote veto, a simple majority, a weighted majority, and so on. An approval flow that passes will automatically release the resource, and the approval then ends.
For example, let the first preset threshold be 0.8, and let the random forest consist of n decision trees, of which m decision trees produce the approval result "release the resource to the user". Then, when m/n > 0.8, the random forest determines that the final result is to release the resource requested by the user. Setting a threshold excludes untrustworthy prediction results, thereby improving prediction accuracy.
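The thresholded vote in this example can be sketched as follows; this is a simplified illustration, where returning None stands for "transfer to another approval method":

```python
def forest_decision(votes, threshold=0.8):
    """Accept the shared result only if its share of trees exceeds the
    threshold; otherwise return None (fall back to another approval path)."""
    approve = sum(votes)   # votes is a list of booleans, True = release
    n = len(votes)
    if approve / n > threshold:
        return True
    if (n - approve) / n > threshold:
        return False
    return None

print(forest_decision([True] * 9 + [False] * 1))  # True: 9/10 > 0.8
print(forest_decision([True] * 6 + [False] * 4))  # None: neither side > 0.8
```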
In some embodiments, determining whether to release the resource according to the proportion and the first preset threshold includes: when the proportion of decision trees generating the same approval result does not exceed the first preset threshold, determining whether to release the resource requested by the user according to the multiple features.
For example, if the prediction result obtained by the random forest fails to reach the threshold, the random forest cannot automatically decide whether the approval passes, and the approval form is transferred to another approval method; for example, an approver determines whether to release the resource according to features such as the type of the requested resource, its specification, its quantity, the user's resource usage permissions, and the user's reason for requesting the resource.
The present disclosure combines the results of all decision trees to determine whether to release the resource requested by the user, which reduces the influence of a single decision tree's error on the final result and improves approval accuracy.
Figure 4 shows a flow chart of resource approval according to some embodiments of the present disclosure.
As shown in Figure 4, when a user needs to call resources across permission groups, the approval flow starts. The approval flow automatically enters the Listen state and waits for the user to enter approval information on the front end.
In some embodiments, the user's resource usage request also includes the user's historical usage requests for the resource. For example, in the approval form for a resource usage request, information such as the specification of the requested resource is automatically filled in with the specification the user selected previously or the specification the user commonly uses; the applicant only needs to add the scenario in which the requested resource will be used and the reason for requesting it.
The Listen state of the approval flow lasts 30 minutes; if the user has not filled in and submitted the approval form within 30 minutes, the approval form is automatically closed.
After the user fills in the approval form, the first node the approval flow reaches is the process engine. A certain number of preset rules are built into the process engine, and these rules support customization: the platform administrator can customize, according to the needs of the company's organizational structure, which permission groups (for example, which departments or positions) may call which resources without approval. For example, the process engine may stipulate that testers applying for cloud hosts within a certain range beyond the standard specification for link stress testing are exempt from approval.
For resource usage requests that are exempt from approval, the resource is automatically released to the user, after which the approval flow ends.
Some rules are preset according to hard constraints such as security requirements, company rules and regulations, and management requirements. For resource usage requests that do require approval, the process engine uses these rules to filter out directly which usage requests need to enter another approval method, for example requests for resources far above the user's rank, or approvals that have a large impact on the stability of the company's resource usage; to improve approval accuracy, these special resource usage requests are routed to other approval methods. The remaining resource usage requests enter the processing flow of the random forest algorithm.
In the processing flow of the random forest algorithm, the system captures the relevant features of the applicant's usage request, such as the applicant's position, rank, work responsibilities, and other factors that influence the approval result; these features are input into the decision trees as the basis for prediction. Finally, the random forest formed by the multiple decision trees decides whether to grant the user's request.
The prediction result of the random forest is accepted only if it reaches a certain threshold. If the prediction result fails to reach the threshold, whether the approval passes cannot be determined automatically, and the approval form is transferred to another approval method. For resource usage requests that pass approval, the resource is automatically released to the user and the approval flow ends. For requests that do not pass, the flow returns to the Listen state and waits for the user to modify the information.
Resource usage requests that enter other approval methods likewise have two outcomes, pass and fail. For requests that pass, the resource is released to the user and the approval flow ends; for requests that fail, the flow returns to the Listen state and waits for the user to modify the information.
Figure 5 shows a flow chart of a training method of a random forest model according to some embodiments of the present disclosure.
As shown in Figure 5, the training method of the random forest model includes steps S210 to S240. In some embodiments, the training method is executed by a training device of the random forest model.
In step S210, a training set is obtained, where the training set includes samples of users' resource usage requests, and each sample also includes a label indicating whether the resource requested by the user was released.
For example, a usage request sample contains the resource usage request filled in by a user, together with an annotated label indicating whether the requested resource was released.
In step S220, multiple features of the user's resource usage request are determined.
For example, features such as the user's position, rank, and work responsibilities are extracted from the sample of the resource usage request submitted by the user as the multiple features of the request, and the values of these features are determined.
In step S230, for each decision tree in the random forest model, some of the multiple features are extracted as candidate features for that decision tree.
For example, for each decision tree, a portion of the features is randomly drawn to serve as candidate features for training that tree. If a sample has Y features, T features (T < Y) are randomly selected from all the features of the sample as the candidate features for one decision tree.
The random extraction of features may use random sampling with replacement, for example bagging: each time a feature is drawn, it is put back before the next draw, rather than drawing, say, 10 features at once. Sampling with replacement makes the probability of each item being drawn follow a uniform distribution.
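With-replacement sampling as described here can be sketched in a few lines; the feature names below are hypothetical:

```python
import random

def sample_with_replacement(pool, t, seed=0):
    """Draw t items one at a time, returning each to the pool after drawing,
    so every item has the same probability on every draw."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(t)]

features = ["type", "spec", "quantity", "permission", "reason", "rank"]
print(sample_with_replacement(features, 3))
```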
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes: for each decision tree, drawing multiple samples from the training set as the training set of that decision tree; and training the decision tree according to the values, in its training set, of the candidate features corresponding to that tree, and the labels indicating whether the requested resources were released.
For example, similar to the random sampling of features, random sampling with replacement is also used to draw samples from the training set. If the training set has X samples, S samples (S < X) are randomly drawn with replacement from the data set as the training set of one decision tree. By drawing a training set for each decision tree, the training set is divided into multiple subsets; each decision tree is constructed using one subset as its training set, and the multiple trained decision trees finally form a forest.
When the random forest model of the present disclosure is trained, both rows and columns (samples and features) are drawn at random, so the entire data table can be genuinely split at random into multiple parts, with each decision tree using one part. As long as there are enough decision trees, some decision tree will be able to extract the value of the data set to the greatest extent, thereby improving the accuracy of resource approval by the random forest model.
In step S240, each decision tree is trained according to the values of its candidate features and the labels of the samples.
The training method of a single decision tree is described below.
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes: taking the root node of the decision tree as the current node and, according to the training set, selecting the feature corresponding to the root node from the candidate features; determining the training sets corresponding to the child nodes of the current node according to the values, for the samples in the training set corresponding to the current node, of the feature corresponding to the current node, and the labels of those samples; selecting the feature corresponding to a child node of the current node from the remaining candidate features, according to the values of the feature corresponding to the current node and the labels for the samples in that child node's training set; and taking each child node in turn as the new current node and repeating the steps of determining the training sets of the current node's child nodes and selecting the features corresponding to those child nodes from the remaining candidate features, until a stopping condition is reached.
In some embodiments, the child nodes of the current node include a first child node and a second child node. Determining the training sets corresponding to the child nodes of the current node includes: according to the values, for the samples in the current node's training set, of the feature corresponding to the current node, and the labels of those samples, selecting one value from the value range of that feature as the split point that divides the training set of the first child node from the training set of the second child node; and, according to this split point, judging whether each sample in the current node's training set is assigned to the training set of the first child node or that of the second child node.
For example, a decision tree model is trained on the S sampled samples using the T extracted features, starting from the root node by first selecting the feature corresponding to the root node. Take a CART (classification and regression tree) classification tree as an example: CART is a binary tree that splits downward from the root, that is, each of its nodes offers only two choices, "yes" and "no". Through continual splitting, the feature space is divided into a finite number of cells, and the predicted probability distribution is determined on these cells.
The Gini index is used to measure the importance of a feature; it represents the impurity of the model, and the smaller the Gini index, the lower the impurity and the better the feature. The purity of a data set D can be measured by its Gini value. Assuming the set contains K classes of samples, the Gini value is calculated as follows:
Gini(D) = 1 - Σ_{k=1}^{K} ( |C_k| / |D| )^2
Here |C_k| is the number of samples whose label belongs to class k, |D| is the number of samples in the training set D, and |C_k|/|D| is the probability that a sample's label belongs to class k. Gini(D) reflects the probability that two samples drawn at random from the data set D have inconsistent class labels; therefore, the smaller Gini(D) is, the higher the purity of data set D.
Next, the Gini index of data set D is computed for each value of each feature available at the current node. For example, the value range of feature A is first determined from the samples of the training set. Taking the current node to be the root node, whose training set is D, a value a is selected from the value range of A, and the training set D is divided into two parts according to whether feature A takes the value a: when the value of feature A of a sample equals a, the sample is assigned to training set D1, otherwise to training set D2. The Gini index of feature A with split point a on data set D is calculated as follows:
Gini(D, A=a) = ( |D1| / |D| ) Gini(D1) + ( |D2| / |D| ) Gini(D2)
where Gini(D1) is the Gini value of data set D1, and Gini(D2) that of D2.
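The two Gini formulas translate directly into code. The following is a small self-contained sketch of the computation (the labels "approve"/"reject" are hypothetical):

```python
from collections import Counter

def gini(labels):
    """Gini value of a label list: Gini(D) = 1 - sum_k (|C_k|/|D|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, a):
    """Gini index of splitting D on feature value a:
    Gini(D, A=a) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    d1 = [y for x, y in zip(values, labels) if x == a]
    d2 = [y for x, y in zip(values, labels) if x != a]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(gini(["approve"] * 4))        # 0.0: a pure set
print(gini(["approve", "reject"]))  # 0.5: maximally mixed for two classes
```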
Among the computed Gini indexes for each value of each feature, the feature with the smallest Gini index is selected as the feature corresponding to the current node, and the value at which that feature's Gini index is smallest is taken as the split point dividing the training set of the current node's first child node from that of its second child node.
When determining the split point, the decision tree can handle both continuous and discrete values. For example, for continuous values, suppose the continuous feature A takes m values over the m samples, arranged from smallest to largest. CART then takes the mean of each pair of adjacent sample values as a candidate split point, giving m-1 split points in total, and computes the Gini index with each of these m-1 points used as a binary split point. The point with the smallest Gini index is selected as the split point of the continuous feature. For instance, if the point with the smallest Gini index is a, then values smaller than a form class 1 and values larger than a form class 2, thereby discretizing the continuous feature.
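The m-1 candidate split points for a continuous feature, the midpoints of adjacent sorted values, can be sketched as:

```python
def candidate_splits(values):
    """Midpoints of adjacent sorted values: m values yield m-1 candidates."""
    v = sorted(values)
    return [(v[i] + v[i + 1]) / 2 for i in range(len(v) - 1)]

print(candidate_splits([3, 1, 2, 4]))  # [1.5, 2.5, 3.5]
```

Each candidate would then be scored with the Gini index, and the smallest one chosen as the split point.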
For discrete values, CART uses repeated binary splitting. CART divides the values of feature A into the cases (a1, a2 a3), (a1 a2, a3), and (a2, a1 a3), finds the combination with the smallest Gini index, for example (a2, a1 a3), and then creates two binary tree branches: one branch holds the samples corresponding to a2, and the other holds the samples corresponding to a1 and a3. Because this split does not completely separate the values of feature A, feature A may participate in splits again at later nodes.
After the split point dividing the training sets of the current node's first and second child nodes is determined, each sample in the current node's training set is assigned, according to the split point, to the first child node or the second child node, thereby generating the training sets of the two child nodes.
Each child node is then taken as the current node, and the above steps of determining the current node's feature from its training set and determining the child nodes' training sets from that feature are repeated until a stopping condition is reached, at which point the decision subtree is returned and recursion stops at the current node; finally, the entire decision tree is built.
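The recursive construction described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: it handles only discrete feature values, labels leaves with the majority class, and removes each chosen feature from the candidates of its subtree, matching the "remaining candidate features" wording above:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, feat, val):
    """Weighted Gini of splitting on rows[feat] == val."""
    d1 = [y for r, y in zip(rows, labels) if r[feat] == val]
    d2 = [y for r, y in zip(rows, labels) if r[feat] != val]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def build_tree(rows, labels, features, min_samples=2):
    # stopping conditions: pure node, too few samples, no candidates left
    if len(set(labels)) == 1 or len(rows) < min_samples or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    # choose the (feature, value) pair with the smallest weighted Gini
    feat, val = min(((f, r[f]) for r in rows for f in features),
                    key=lambda fv: split_gini(rows, labels, fv[0], fv[1]))
    yes = [(r, y) for r, y in zip(rows, labels) if r[feat] == val]
    no = [(r, y) for r, y in zip(rows, labels) if r[feat] != val]
    if not yes or not no:
        return Counter(labels).most_common(1)[0][0]
    rest = [f for f in features if f != feat]
    return {"feature": feat, "value": val,
            "yes": build_tree([r for r, _ in yes], [y for _, y in yes], rest),
            "no": build_tree([r for r, _ in no], [y for _, y in no], rest)}

rows = [{"group": "high"}, {"group": "high"}, {"group": "low"}, {"group": "low"}]
tree = build_tree(rows, ["release", "release", "deny", "deny"], ["group"])
print(tree)
```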
The above method can also measure the interaction between different features. For example, if within the same decision tree a training set split into two child nodes by some feature M is easier to split further on feature J, then features M and J interact.
在一些实施例中,截止条件包括不存在剩余的候选特征、与当前节点对应的训练集中样本的数量小于第二预设阈值,以及与当前节点对应的训练集的基尼系数小于第三预设阈 值的至少一个。In some embodiments, the cutoff conditions include that there are no remaining candidate features, the number of samples in the training set corresponding to the current node is less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold. value of at least one.
例如,如果D的样本个数小于阈值,或已经没有特征可供选择,或当前节点的训练集的基尼系数小于阈值,则返回决策树子树,当前节点停止递归。For example, if the number of samples in D is less than the threshold, or there are no features to choose from, or the Gini coefficient of the training set of the current node is less than the threshold, the decision tree subtree is returned and the current node stops recursing.
本公开根据基尼系数自动判断用户对资源的使用请求的特征的重要程度,此外,能够衡量不同特征间的交互性,构建决策树并生成资源审批结果,无需降维,无需做特征选择,提高了对资源的使用请求的审批的准确度和效率。The present disclosure automatically determines the importance of the features of a user's resource usage request based on the Gini coefficient. In addition, it can measure the interaction between different features, build decision trees, and generate resource approval results without dimensionality reduction or feature selection, improving the accuracy and efficiency of approving resource usage requests.
按照上述方法,构建多个决策树,最终构成随机森林模型。According to the above method, multiple decision trees are constructed to finally form a random forest model.
在一些实施例中,确定用户对资源的使用请求的多个特征,包括在用户对资源的使用请求的样本缺失特征的值的情况下,计算该样本和其他样本在决策树中经过节点的路径的相似度;根据样本和其他样本在决策树中经过节点的路径的相似度,确定该样本缺失的特征的值。In some embodiments, determining the multiple features of the user's resource usage request includes: when a sample of the user's resource usage request is missing the value of a feature, computing the similarity between the path that this sample takes through the nodes of the decision tree and the paths taken by other samples; and determining the value of the sample's missing feature based on that path similarity.
例如,首先,给样本中的缺失值预设一些估计值:对于数值型变量,选择其余数据的中位数作为当前缺失值的估计值;对于类别型变量,选择其余数据的众数。然后,根据估计的数值,建立随机森林,把所有的数据放进随机森林里面跑一遍。记录每一组数据在决策树中一步一步分类的路径,判断哪组数据和缺失数据路径最相似,引入一个相似度矩阵,来记录数据之间的相似度,比如有N组数据,相似度矩阵大小就是N*N。如果缺失值是数值型变量,通过加权平均得到新的估计值;如果是类别型变量,通过加权投票得到新的估计值。如此迭代,直到得到稳定的估计值。For example, first preset estimates for the missing values in the samples: for a numeric variable, use the median of the remaining data as the estimate of the current missing value; for a categorical variable, use the mode. Then, based on these estimates, build a random forest and run all the data through it. Record the step-by-step classification path of each group of data in the decision trees and determine which group of data has the path most similar to that of the missing data; to this end, a similarity matrix is introduced to record the similarity between data — with N groups of data, the similarity matrix has size N*N. If the missing value is a numeric variable, a new estimate is obtained by weighted averaging; if it is a categorical variable, a new estimate is obtained by weighted voting. This iterates until stable estimates are obtained.
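The proximity-and-update step of this imputation loop can be sketched as follows. This is an illustrative fragment under simplifying assumptions, not the disclosure's code: `leaf_ids[i][t]` is assumed to record which leaf sample i reaches in tree t, and only the numeric (weighted-average) update is shown; a categorical feature would use a proximity-weighted vote instead.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[i][t] = index of the leaf that sample i reaches in tree t.
    Proximity of two samples = fraction of trees in which they land in
    the same leaf (an N*N similarity matrix for N samples)."""
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    return [[sum(leaf_ids[i][t] == leaf_ids[j][t] for t in range(n_trees)) / n_trees
             for j in range(n)] for i in range(n)]

def impute_numeric(values, missing, prox):
    """One update step: replace each missing numeric value with the
    proximity-weighted average of the observed values."""
    out = list(values)
    for i, is_missing in enumerate(missing):
        if not is_missing:
            continue
        pairs = [(prox[i][j], values[j])
                 for j, m in enumerate(missing) if not m]
        total = sum(w for w, _ in pairs)
        if total > 0:
            out[i] = sum(w * v for w, v in pairs) / total
    return out
```

In the full loop, the forest is rebuilt on the updated data and the two functions are reapplied until the estimates stabilize.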
通过构造多棵决策树对缺失值进行填补,使得填补得到的数据具有随机性和不确定性,更能反映出这些未知数据的真实分布。此外,由于在构造决策树过程中,每个节点使用的都是随机的部分特征而不是训练集的全部特征,所以能很好地应用到高维数据的填补。因此,本公开能够减少缺失值对资源审批的干扰,提高对资源的使用请求的审批的准确度。Filling in missing values by constructing multiple decision trees gives the imputed data randomness and uncertainty, so it better reflects the true distribution of the unknown data. In addition, since each node uses a random subset of features rather than all features of the training set during tree construction, the method applies well to imputing high-dimensional data. Therefore, the present disclosure can reduce the interference of missing values with resource approval and improve the accuracy of approving resource usage requests.
在一些实施例中,根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树包括对决策树进行剪枝。In some embodiments, training each decision tree includes pruning the decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
图6示出了根据本公开一些实施例的对决策树进行剪枝的示意图。Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure.
如图6所示,采用后剪枝法,即先生成决策树,然后在已经生成的决策树的基础上,产生所有剪枝后的CART树,然后使用交叉验证检验剪枝的效果,选择泛化能力最好的剪枝策略。As shown in Figure 6, a post-pruning method is used: a decision tree is generated first; then, based on the generated tree, all pruned CART trees are produced; cross-validation is then used to test the effect of pruning, and the pruning strategy with the best generalization ability is selected.
对于位于节点t的任意一颗子树Tt,如果没有剪枝,则子树Tt的损失函数是: For any subtree T t located at node t, if there is no pruning, the loss function of the subtree T t is:
Cα(Tt)=C(Tt)+α|Tt|C α (T t )=C (T t )+α|T t |
如果将其剪掉,仅保留根节点,则根节点的损失函数如下:If it is cut off and only the root node is retained, the loss function of the root node is as follows:
Cα(T)=C(T)+αC α (T) = C (T) + α
其中,α为正则化参数(和线性回归的正则化一样),C(Tt)为验证数据的预测误差(即验证数据的基尼系数),|Tt|是子树Tt的叶子节点数量。Here, α is the regularization parameter (as in the regularization of linear regression), C(T t ) is the prediction error on the validation data (i.e., the Gini coefficient of the validation data), and |T t | is the number of leaf nodes of the subtree T t .
按照损失函数最小原则,如果满足下式,则需要对子树Tt进行剪枝:

α ≥ (C(T) - C(Tt)) / (|Tt| - 1)

According to the principle of minimizing the loss function, the subtree T t needs to be pruned if the following holds:

α ≥ (C(T) - C(T t )) / (|T t | - 1)
通过剪枝,能够砍掉决策树的冗余部分,避免对训练集过拟合,提升泛化能力。Through pruning, redundant parts of the decision tree can be cut off to avoid overfitting to the training set and improve generalization capabilities.
在一些实施例中,根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,包括并行训练多个决策树。例如,将随机森林的多个决策树并行地、独立地训练,从而可以提高随机森林模型的训练速度。In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes training multiple decision trees in parallel. For example, the multiple decision trees of the random forest are trained in parallel and independently, which can increase the training speed of the random forest model.
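Because the trees are fitted independently on their own bootstrap samples, the training loop parallelizes naturally. The sketch below is illustrative only: `train_one_tree` is a placeholder for a real CART fit, and a thread pool is used for simplicity (a process pool would be the usual choice for CPU-bound tree fitting).

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_one_tree(args):
    """Placeholder for fitting one CART tree on a bootstrap sample;
    here it only draws the bootstrap sample and records its size."""
    samples, seed = args
    rng = random.Random(seed)
    bootstrap = [rng.choice(samples) for _ in range(len(samples))]
    return {"n_samples": len(bootstrap)}  # a real implementation returns the tree

def train_forest(samples, n_trees=10, workers=4):
    """Fit the forest's trees in parallel; each tree gets its own seed
    so the bootstrap samples differ."""
    jobs = [(samples, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_one_tree, jobs))
```

Since the trees never communicate during training, the result is identical to sequential training, only faster on multi-core hardware.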
图7示出根据本公开一些实施例的资源审批装置的框图。Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure.
如图7所示,资源审批装置7包括获取模块71、第一确定模块72、选择模块73、预测模块74、第二确定模块75。As shown in FIG. 7 , the resource approval device 7 includes an acquisition module 71 , a first determination module 72 , a selection module 73 , a prediction module 74 , and a second determination module 75 .
获取模块71,被配置为获取用户对资源的使用请求,例如执行如图1所示的步骤S110。The obtaining module 71 is configured to obtain the user's request for resource use, for example, performing step S110 as shown in Figure 1 .
第一确定模块72,被配置为确定用户对资源的使用请求的多个特征,例如执行如图1所示的步骤S120。The first determination module 72 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S120 as shown in FIG. 1 .
选择模块73,被配置为针对随机森林模型中的每个决策树,从多个特征中,选择与该决策树对应的特征,例如执行如图1所示的步骤S130。The selection module 73 is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features, for example, perform step S130 as shown in FIG. 1 .
预测模块74,被配置为根据与每个决策树对应的特征的值,预测审批结果,其中,审批结果表示是否发放用户所请求的资源,例如执行如图1所示的步骤S140。The prediction module 74 is configured to predict the approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user, for example, perform step S140 as shown in FIG. 1 .
第二确定模块75,被配置为综合每个决策树的审批结果,确定是否发放用户所请求的资源,例如执行如图1所示的步骤S150。The second determination module 75 is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user, for example, performing step S150 as shown in FIG. 1 .
图8示出了根据本公开一些实施例的随机森林模型的训练装置的框图。Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure.
如图8所示,随机森林模型的训练装置包括获取模块81、确定模块82、抽取模块83、训练模块84。As shown in Figure 8, the training device of the random forest model includes an acquisition module 81, a determination module 82, an extraction module 83, and a training module 84.
获取模块81,被配置为获取训练集,其中,训练集包括用户对资源的使用请求的样本,样本还包括表示是否发放用户所请求的资源的标签,例如执行如图5所示的步骤S210。 The acquisition module 81 is configured to acquire a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to issue the resources requested by the user. For example, step S210 shown in Figure 5 is performed.
确定模块82,被配置为确定用户对资源的使用请求的多个特征,例如执行如图5所示的步骤S220。The determination module 82 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S220 as shown in FIG. 5 .
抽取模块83,被配置为针对随机森林模型中的每个决策树,从多个特征中抽取部分特征,作为该决策树的候选特征,例如执行如图5所示的步骤S230。The extraction module 83 is configured to, for each decision tree in the random forest model, extract some features from multiple features as candidate features of the decision tree, for example, perform step S230 as shown in FIG. 5 .
训练模块84,被配置为根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,例如执行如图5所示的步骤S240。The training module 84 is configured to train each decision tree according to the value of the candidate feature of each decision tree and the label of the sample, for example, perform step S240 as shown in FIG. 5 .
图9示出根据本公开另一些实施例的电子设备的框图。Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure.
如图9所示,电子设备9包括存储器91;以及耦接至该存储器91的处理器92,存储器91用于存储执行资源审批方法或随机森林模型的训练方法对应实施例的指令。处理器92被配置为基于存储在存储器91中的指令,执行本公开中任意一些实施例中的资源审批方法或随机森林模型的训练方法。As shown in Figure 9, the electronic device 9 includes a memory 91; and a processor 92 coupled to the memory 91. The memory 91 is used to store instructions for executing corresponding embodiments of the resource approval method or the training method of the random forest model. The processor 92 is configured to execute the resource approval method or the random forest model training method in any embodiment of the present disclosure based on instructions stored in the memory 91 .
图10示出用于实现本公开一些实施例的计算机系统的框图。Figure 10 illustrates a block diagram of a computer system for implementing some embodiments of the present disclosure.
如图10所示,计算机系统100可以通用计算设备的形式表现。计算机系统100包括存储器1010、处理器1020和连接不同系统组件的总线1000。As shown in Figure 10, computer system 100 may be embodied in the form of a general purpose computing device. Computer system 100 includes memory 1010, a processor 1020, and a bus 1000 that connects various system components.
存储器1010例如可以包括系统存储器、非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序(Boot Loader)以及其他程序等。系统存储器可以包括易失性存储介质,例如随机存取存储器(RAM)和/或高速缓存存储器。非易失性存储介质例如存储有执行本公开中任意一些实施例中的资源审批方法或随机森林模型的训练方法中的至少一种的对应实施例的指令。非易失性存储介质包括但不限于磁盘存储器、光学存储器、闪存等。The memory 1010 may include, for example, system memory, non-volatile storage media, and the like. System memory stores, for example, operating systems, applications, boot loaders, and other programs. System memory may include volatile storage media such as random access memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one of the resource approval methods or the random forest model training methods in any embodiments of the present disclosure. Non-volatile storage media includes but is not limited to disk storage, optical storage, flash memory, etc.
处理器1020可以用通用处理器、数字信号处理器(DSP)、应用专用集成电路(ASIC)、现场可编程门阵列(FPGA)或其它可编程逻辑设备、分立门或晶体管等分立硬件组件方式来实现。相应地,诸如判断模块和确定模块的每个模块,可以通过中央处理器(CPU)运行存储器中执行相应步骤的指令来实现,也可以通过执行相应步骤的专用电路来实现。The processor 1020 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors. Correspondingly, each module, such as the judgment module and the determination module, can be implemented by a central processing unit (CPU) executing instructions in memory that perform the corresponding steps, or by a dedicated circuit that performs the corresponding steps.
总线1000可以使用多种总线结构中的任意总线结构。例如,总线结构包括但不限于工业标准体系结构(ISA)总线、微通道体系结构(MCA)总线、外围组件互连(PCI)总线。Bus 1000 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
计算机系统100还可以包括输入输出接口1030、网络接口1040、存储接口1050等。这些接口1030、1040、1050以及存储器1010和处理器1020之间可以通过总线1000连接。 输入输出接口1030可以为显示器、鼠标、键盘等输入输出设备提供连接接口。网络接口1040为各种联网设备提供连接接口。存储接口1050为软盘、U盘、SD卡等外部存储设备提供连接接口。The computer system 100 may also include an input/output interface 1030, a network interface 1040, a storage interface 1050, and the like. These interfaces 1030, 1040, 1050, the memory 1010 and the processor 1020 may be connected through a bus 1000. The input and output interface 1030 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard. Network interface 1040 provides connection interfaces for various networked devices. The storage interface 1050 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.
这里,参照根据本公开实施例的方法、装置和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个框以及各框的组合,都可以由计算机可读程序指令实现。Various aspects of the disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.
这些计算机可读程序指令可提供到通用计算机、专用计算机或其他可编程装置的处理器,以产生一个机器,使得通过处理器执行指令产生实现在流程图和/或框图中一个或多个框中指定的功能的装置。These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable device to produce a machine, such that the instructions, when executed by the processor, produce an apparatus that implements the functions specified in one or more blocks of the flowcharts and/or block diagrams.
这些计算机可读程序指令也可存储在计算机可读存储器中,这些指令使得计算机以特定方式工作,从而产生一个制造品,包括实现在流程图和/或框图中一个或多个框中指定的功能的指令。These computer-readable program instructions may also be stored in a computer-readable memory; these instructions cause the computer to operate in a specific manner, thereby producing an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。The disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.
通过上述实施例中的资源审批方法、随机森林模型的训练方法及装置、计算机可存储介质,提高了资源审批的效率和准确率。Through the resource approval method, random forest model training method and device, and computer storage medium in the above embodiments, the efficiency and accuracy of resource approval are improved.
至此,已经详细描述了根据本公开的资源审批方法、随机森林模型的训练方法及装置、计算机可存储介质。为了避免遮蔽本公开的构思,没有描述本领域所公知的一些细节。本领域技术人员根据上面的描述,完全可以明白如何实施这里公开的技术方案。 So far, the resource approval method, the training method and device of the random forest model, and the computer storage medium according to the present disclosure have been described in detail. To avoid obscuring the concepts of the present disclosure, some details that are well known in the art have not been described. Based on the above description, those skilled in the art can completely understand how to implement the technical solution disclosed here.

Claims (17)

  1. 一种资源审批方法,包括:A resource approval method including:
    获取用户对资源的使用请求;Obtain the user's request for resource use;
    确定用户对资源的使用请求的多个特征;Determine multiple characteristics of a user's request for resource usage;
    针对随机森林模型中的每个决策树,从多个特征中,选择与该决策树对应的特征;For each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features;
    根据与每个决策树对应的特征的值,预测审批结果,其中,审批结果表示是否发放用户所请求的资源;According to the value of the feature corresponding to each decision tree, the approval result is predicted, where the approval result indicates whether to release the resource requested by the user;
    综合每个决策树的审批结果,确定是否发放用户所请求的资源。Based on the approval results of each decision tree, it is determined whether to release the resources requested by the user.
  2. 根据权利要求1所述的资源审批方法,其中,所述综合每个决策树的审批结果,确定是否发放用户所请求的资源,包括:The resource approval method according to claim 1, wherein integrating the approval results of each decision tree to determine whether to release the resources requested by the user includes:
    根据生成相同的审批结果的决策树的数量占决策树的总数的比例、和第一预设阈值,确定是否发放用户所请求的资源。Based on the proportion of the number of decision trees that generate the same approval result to the total number of decision trees and the first preset threshold, it is determined whether to release the resource requested by the user.
  3. 根据权利要求2所述的资源审批方法,其中,所述根据生成相同的审批结果的决策树的数量占决策树的总数的比例、和第一预设阈值,确定是否发放用户所请求的资源,包括:The resource approval method according to claim 2, wherein determining whether to release the resource requested by the user according to the ratio of the number of decision trees that generate the same approval result to the total number of decision trees, and the first preset threshold, includes:
    在生成相同的审批结果的决策树的数量占决策树的总数的比例超过第一预设阈值的情况下,根据该审批结果,确定是否发放用户所请求的资源。When the ratio of the number of decision trees that generate the same approval result to the total number of decision trees exceeds the first preset threshold, it is determined whether to release the resource requested by the user based on the approval result.
  4. 根据权利要求3所述的资源审批方法,其中,所述根据生成相同的审批结果的决策树的数量占决策树的总数的比例、和第一预设阈值,确定是否发放用户所请求的资源,包括:The resource approval method according to claim 3, wherein determining whether to release the resource requested by the user according to the ratio of the number of decision trees that generate the same approval result to the total number of decision trees, and the first preset threshold, includes:
    在生成相同的审批结果的决策树的数量占决策树的总数的比例不超过第一预设阈值的情况下,根据多个特征,确定是否发放用户所请求的资源。When the ratio of the number of decision trees that generate the same approval result to the total number of decision trees does not exceed the first preset threshold, it is determined whether to release the resource requested by the user based on multiple characteristics.
  5. 根据权利要求1所述的资源审批方法,其中,所述用户对资源的使用请求还包括用户对资源的历史使用请求。The resource approval method according to claim 1, wherein the user's resource usage request also includes the user's historical usage request for the resource.
  6. 根据权利要求1所述的资源审批方法,其中,所述用户对资源的使用请求的特征包括:用户请求使用的资源的类型、用户请求使用的资源的规格、用户请求使用的资源的数量、用户的资源使用权限和用户请求使用资源的原因的至少一个。The resource approval method according to claim 1, wherein the features of the user's resource usage request include at least one of: the type of the resource the user requests to use, the specifications of the resource the user requests to use, the quantity of the resource the user requests to use, the user's resource usage permission, and the reason why the user requests to use the resource.
  7. 一种随机森林模型的训练方法,包括: A training method for a random forest model, including:
    获取训练集,其中,训练集包括用户对资源的使用请求的样本,样本还包括表示是否发放用户所请求的资源的标签;Obtain a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
    确定用户对资源的使用请求的多个特征;Determine multiple characteristics of a user's request for resource usage;
    针对随机森林模型中的每个决策树,从多个特征中抽取部分特征,作为该决策树的候选特征;For each decision tree in the random forest model, some features are extracted from multiple features as candidate features for the decision tree;
    根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树。Each decision tree is trained based on the values of its candidate features, as well as the labels of the samples.
  8. 根据权利要求7所述的随机森林模型的训练方法,其中,所述根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,包括:The training method of a random forest model according to claim 7, wherein said training each decision tree according to the value of the candidate feature of each decision tree and the label of the sample includes:
    将决策树的根节点作为当前节点,根据训练集,从候选特征中选择与根节点对应的特征;Use the root node of the decision tree as the current node, and select the feature corresponding to the root node from the candidate features based on the training set;
    根据与当前节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,确定与当前节点的子节点对应的训练集;Determine the training set corresponding to the child node of the current node based on the value of the characteristic of the sample in the training set corresponding to the current node and the label of the sample;
    根据与当前节点的子节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,从剩余的候选特征中选择与当前节点的子节点对应的特征;According to the value of the feature corresponding to the current node in the training set sample corresponding to the child node of the current node, and the label of the sample, select the feature corresponding to the child node of the current node from the remaining candidate features;
    将当前子节点的子节点作为当前节点,循环确定与当前节点的子节点对应的训练集、从剩余的候选特征中选择与当前节点的子节点对应的特征的步骤,直至达到截止条件。Taking the child nodes of the current child node as the current node, iterate through the steps of determining the training set corresponding to the child node of the current node and selecting features corresponding to the child nodes of the current node from the remaining candidate features until the cutoff condition is reached.
  9. 根据权利要求8所述的随机森林模型的训练方法,其中,所述当前节点的子节点包括当前节点的第一子节点和当前节点的第二子节点,所述根据与当前节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,确定与当前节点的子节点对应的训练集,包括:The training method of the random forest model according to claim 8, wherein the child nodes of the current node include a first child node of the current node and a second child node of the current node, and determining the training sets corresponding to the child nodes of the current node according to the values, for the samples in the training set corresponding to the current node, of the feature corresponding to the current node, and the labels of the samples, includes:
    根据与当前节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,从与当前节点对应的特征的取值范围中选择一个特征的值,作为划分与当前节点的第一子节点对应的训练集和与当前节点的第二子节点对应的训练集的切分点;according to the values of the feature corresponding to the current node for the samples in the training set corresponding to the current node, and the labels of the samples, selecting a value of the feature from the value range of the feature corresponding to the current node, as the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node;
    根据划分与当前节点的第一子节点对应的训练集和与当前节点的第二子节点对应的训练集的切分点,判断将与当前节点对应的训练集中的样本划分到第一子节点的训练集还是第二子节点的训练集。according to the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node, determining whether each sample in the training set corresponding to the current node is assigned to the training set of the first child node or the training set of the second child node.
  10. 根据权利要求8所述的随机森林模型的训练方法,其中,所述截止条件包括不存在剩余的候选特征、与当前节点对应的训练集中样本的数量小于第二预设阈值,以及与当前节点对应的训练集的基尼系数小于第三预设阈值的至少一个。The training method of the random forest model according to claim 8, wherein the cutoff condition includes at least one of: no candidate features remaining, the number of samples in the training set corresponding to the current node being less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node being less than a third preset threshold.
  11. 根据权利要求7所述的随机森林模型的训练方法,其中,所述根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,包括:The training method of a random forest model according to claim 7, wherein said training each decision tree according to the value of the candidate feature of each decision tree and the label of the sample includes:
    针对每个决策树,从训练集中抽取多个样本,作为该决策树的训练集;For each decision tree, multiple samples are extracted from the training set as the training set for the decision tree;
    根据决策树的训练集中样本的与该决策树对应的候选特征的值、和表示是否发放用户所请求的资源的标签,训练决策树。The decision tree is trained based on the values of the candidate features corresponding to the decision tree in the samples in the training set of the decision tree and the labels indicating whether to release the resources requested by the user.
  12. 根据权利要求7所述的随机森林模型的训练方法,其中,所述确定用户对资源的使用请求的多个特征,包括:The training method of the random forest model according to claim 7, wherein determining the multiple features of the user's resource usage request includes:
    在用户对资源的使用请求的样本缺失特征的值的情况下,计算该样本和其他样本在决策树中经过节点的路径的相似度;in a case where a sample of the user's resource usage request is missing the value of a feature, calculating the similarity between the paths along which the sample and other samples pass through nodes in the decision tree;
    根据样本和其他样本在决策树中经过节点的路径的相似度,确定该样本缺失的特征的值。Based on the similarity of the path between the sample and other samples passing through the node in the decision tree, the value of the missing feature of the sample is determined.
  13. 一种资源审批装置,包括:A resource approval device, including:
    获取模块,被配置为获取用户对资源的使用请求;The acquisition module is configured to obtain the user's request for resource use;
    第一确定模块,被配置为确定用户对资源的使用请求的多个特征;A first determination module configured to determine a plurality of characteristics of the user's request for resource use;
    选择模块,被配置为针对随机森林模型中的每个决策树,从多个特征中,选择与该决策树对应的特征;The selection module is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features;
    预测模块,被配置为根据与每个决策树对应的特征的值,预测审批结果,其中,审批结果表示是否发放用户所请求的资源;The prediction module is configured to predict the approval result based on the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resources requested by the user;
    第二确定模块,被配置为综合每个决策树的审批结果,确定是否发放用户所请求的资源。The second determination module is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user.
  14. 一种随机森林模型的训练装置,包括:A training device for a random forest model, including:
    获取模块,被配置为获取训练集,其中,训练集包括用户对资源的使用请求的样本,样本还包括表示是否发放用户所请求的资源的标签;The acquisition module is configured to obtain a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
    确定模块,被配置为确定用户对资源的使用请求的多个特征;a determining module configured to determine a plurality of characteristics of the user's request for resource usage;
    抽取模块,被配置为针对随机森林模型中的每个决策树,从多个特征中抽取部分特征,作为该决策树的候选特征;The extraction module is configured to extract some features from multiple features for each decision tree in the random forest model as candidate features for the decision tree;
    训练模块,被配置为根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树。The training module is configured to train each decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
  15. 一种电子设备,包括: An electronic device including:
    存储器;以及memory; and
    耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器的指令,执行根据权利要求1至6任一项所述的资源审批方法,或根据权利要求7至12任一项所述的随机森林模型的训练方法。a processor coupled to the memory, the processor being configured to, based on instructions stored in the memory, execute the resource approval method according to any one of claims 1 to 6, or the training method of the random forest model according to any one of claims 7 to 12.
  16. 一种计算机可存储介质,其上存储有计算机程序指令,该指令被处理器执行时,实现根据权利要求1至6任一项所述的资源审批方法,或根据权利要求7至12任一项所述的随机森林模型的训练方法。A computer storage medium having computer program instructions stored thereon which, when executed by a processor, implement the resource approval method according to any one of claims 1 to 6, or the training method of the random forest model according to any one of claims 7 to 12.
  17. 一种计算机程序,包括:A computer program, comprising:
    指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1至6任一项所述的资源审批方法,或根据权利要求7至12任一项所述的随机森林模型的训练方法。instructions which, when executed by a processor, cause the processor to execute the resource approval method according to any one of claims 1 to 6, or the training method of the random forest model according to any one of claims 7 to 12.
PCT/CN2023/074133 2022-07-29 2023-02-01 Resource examination and approval method and device, and random forest model training method and device WO2024021555A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210905742.3A CN115147092A (en) 2022-07-29 2022-07-29 Resource approval method and training method and device of random forest model
CN202210905742.3 2022-07-29

Publications (1)

Publication Number Publication Date
WO2024021555A1 true WO2024021555A1 (en) 2024-02-01

Family

ID=83413509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074133 WO2024021555A1 (en) 2022-07-29 2023-02-01 Resource examination and approval method and device, and random forest model training method and device

Country Status (2)

Country Link
CN (1) CN115147092A (en)
WO (1) WO2024021555A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147092A (en) * 2022-07-29 2022-10-04 京东科技信息技术有限公司 Resource approval method and training method and device of random forest model
CN115616204A (en) * 2022-12-21 2023-01-17 金发科技股份有限公司 Method and system for identifying polyethylene terephthalate reclaimed materials
CN116739719B (en) * 2023-08-14 2023-11-03 南京大数据集团有限公司 Flow configuration system and method of transaction platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264342A (en) * 2019-06-19 2019-09-20 深圳前海微众银行股份有限公司 A kind of business audit method and device based on machine learning
CN111709828A (en) * 2020-06-12 2020-09-25 中国建设银行股份有限公司 Resource processing method, device, equipment and system
WO2021077011A1 (en) * 2019-10-18 2021-04-22 Solstice Initiative, Inc. Systems and methods for shared utility accessibility
CN113505936A (en) * 2021-07-26 2021-10-15 平安信托有限责任公司 Project approval result prediction method, device, equipment and storage medium
CN115147092A (en) * 2022-07-29 2022-10-04 京东科技信息技术有限公司 Resource approval method and training method and device of random forest model


Also Published As

Publication number Publication date
CN115147092A (en) 2022-10-04

Similar Documents

Publication Publication Date Title
WO2024021555A1 (en) Resource examination and approval method and device, and random forest model training method and device
US20230126005A1 (en) Consistent filtering of machine learning data
US20210374610A1 (en) Efficient duplicate detection for machine learning data sets
JP6771751B2 (en) Risk assessment method and system
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US11182691B1 (en) Category-based sampling of machine learning data
EP3161635B1 (en) Machine learning service
US11100420B2 (en) Input processing for machine learning
WO2019218699A1 (en) Fraud transaction determining method and apparatus, computer device, and storage medium
US10891325B2 (en) Defect record classification
EP3991044A1 (en) Diagnosing &amp; triaging performance issues in large-scale services
WO2023056723A1 (en) Fault diagnosis method and apparatus, and electronic device and storage medium
US11860905B2 (en) Scanning for information according to scan objectives
US11567735B1 (en) Systems and methods for integration of multiple programming languages within a pipelined search query
CN116235158A (en) System and method for implementing automated feature engineering
US11500840B2 (en) Contrasting document-embedded structured data and generating summaries thereof
Thurow et al. Imputing missings in official statistics for general tasks–our vote for distributional accuracy
Perkins et al. Practical Data Science for Actuarial Tasks
CN112100165A (en) Traffic data processing method, system, device and medium based on quality evaluation
US11715037B2 (en) Validation of AI models using holdout sets
US20230010147A1 (en) Automated determination of accurate data schema
TWI755702B (en) Method for testing real data and computer-readable medium
CN112199603B (en) Information pushing method and device based on countermeasure network and computer equipment
US11573721B2 (en) Quality-performance optimized identification of duplicate data
CN110471962B (en) Method and system for generating active data report

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23844774

Country of ref document: EP

Kind code of ref document: A1