WO2024021555A1 - Resource examination and approval method and device, and random forest model training method and device - Google Patents

Resource examination and approval method and device, and random forest model training method and device

Info

Publication number
WO2024021555A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
resource
decision tree
approval
training set
Prior art date
Application number
PCT/CN2023/074133
Other languages
French (fr)
Chinese (zh)
Inventor
常三强
胡成倩
张麒
韩冬
Original Assignee
京东科技信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东科技信息技术有限公司 filed Critical 京东科技信息技术有限公司
Publication of WO2024021555A1 publication Critical patent/WO2024021555A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/10 Office automation; Time management
    • G06Q 10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database

Definitions

  • the present disclosure relates to the field of cloud computing technology, and in particular to resource approval methods, random forest model training methods and devices, and computer-readable storage media.
  • Private cloud provides services to organizations within the enterprise and can provide a variety of cloud products to enterprise users, thus forming a complex cloud ecological chain. It has the characteristics of high data security and strong controllability of IT infrastructure.
  • Enterprise-level users usually have complex multi-layered internal organizational structures.
  • users at different levels within the enterprise are given differentiated permissions, and the resources that users with different permissions can use also have different specifications.
  • when users need to use resources beyond their own permissions, they need to submit an application and wait for the application to be approved before they can use the product normally.
  • a resource approval method including:
  • the approval result is predicted, where the approval result indicates whether to release the resource requested by the user;
  • the method of synthesizing the approval results of each decision tree to determine whether to release the resources requested by the user includes:
  • determining whether to release the resource requested by the user based on the proportion of the number of decision trees that generate the same approval result to the total number of decision trees and the first preset threshold includes:
  • when the ratio of the number of decision trees that generate the same approval result to the total number of decision trees does not exceed the first preset threshold, it is determined whether to release the resource requested by the user based on multiple characteristics.
  • the user's request for resource usage also includes the user's historical usage request for the resource.
  • the characteristics of the user's resource usage request include at least one of: the type of resource the user requests to use, the specifications of the resource the user requests to use, the number of resources the user requests to use, the user's resource usage permissions, and the user's reason for requesting to use the resource.
  • a training method for a random forest model including:
  • the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
  • Each decision tree is trained based on the values of its candidate features, as well as the labels of the samples.
  • training each decision tree based on the value of the candidate feature of each decision tree and the label of the sample includes:
  • according to the values of the features of the samples in the training sets corresponding to the child nodes of the current node, and the labels of the samples, the features corresponding to the child nodes of the current node are selected from the remaining candidate features;
  • the child nodes of the current node include a first child node and a second child node of the current node; determining the training sets corresponding to the child nodes of the current node according to the values of the feature corresponding to the current node in the samples of the training set corresponding to the current node, and the labels of the samples, includes:
  • selecting, based on those values and labels, a feature value from the value range of the feature corresponding to the current node as the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node;
  • after the split point is determined, deciding, for each sample in the training set corresponding to the current node, whether it is divided into the training set of the first child node or the training set of the second child node.
  • the cutoff conditions include at least one of: there are no remaining candidate features; the number of samples in the training set corresponding to the current node is less than a second preset threshold; and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold.
  • training each decision tree based on the value of the candidate feature of each decision tree and the label of the sample includes:
  • the decision tree is trained based on the values of the candidate features corresponding to the decision tree in the samples in the training set of the decision tree and the labels indicating whether to release the resources requested by the user.
  • determining the multiple characteristics of the user's resource usage request includes: determining the value of any missing feature of the sample.
  • a resource approval device including:
  • the acquisition module is configured to obtain the user's request for resource use
  • a first determination module configured to determine a plurality of characteristics of the user's request for resource use
  • the selection module is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features;
  • the prediction module is configured to predict the approval result based on the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resources requested by the user;
  • the second determination module is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user.
  • a training device for a random forest model including:
  • the acquisition module is configured to obtain a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
  • a determining module configured to determine a plurality of characteristics of the user's request for resource usage
  • the extraction module is configured to extract some features from multiple features for each decision tree in the random forest model as candidate features for the decision tree;
  • the training module is configured to train each decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
  • an electronic device including:
  • a processor coupled to the memory, the processor being configured, based on instructions stored in the memory, to execute the resource approval method according to any embodiment of the present disclosure, or the training method of the random forest model according to any embodiment of the present disclosure.
  • a computer-storable medium is provided, with computer program instructions stored thereon.
  • when the instructions are executed by a processor, the resource approval method according to any embodiment of the present disclosure, or the training method of the random forest model according to any embodiment of the present disclosure, is implemented.
  • Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure
  • Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure
  • Figure 3 shows a schematic diagram of a random forest model determining whether to release resources according to some embodiments of the present disclosure
  • Figure 4 shows a flowchart of resource approval according to some embodiments of the present disclosure
  • Figure 5 shows a flow chart of a training method of a random forest model according to some embodiments of the present disclosure
  • Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure
  • Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure
  • Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure
  • Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure.
  • Figure 10 illustrates a block diagram of a computer system for implementing some embodiments of the present disclosure.
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • the nodes in the approval flow are often responsible persons at all levels of enterprises or institutions. Each node needs to approve the resource use requests of many users. It is inevitable that erroneous approval operations will occur, reducing the accuracy of approval.
  • the approval process often goes through multiple nodes, and the approval time at each node depends on the situation of the person in charge of the node. Any obstruction at any node will cause the entire approval flow to stagnate, reducing the efficiency of approval completion.
  • some embodiments of the present disclosure provide a resource approval method, a random forest training method and device, and a computer-readable storage medium.
  • Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure.
  • the resource approval method includes steps S110 to S150.
  • the following resource approval method is executed by the resource approval device.
  • resource approval devices include input and output devices and processors.
  • the resource approval method includes: obtaining the user's resource usage request through input and output devices (such as an interactive panel); using the processor to determine multiple characteristics of the user's resource usage request; using the processor to select, for each decision tree in the random forest model, the features corresponding to that decision tree from the multiple features; using the decision tree algorithm, executed by the processor, to predict the approval result based on the values of the features corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user; and using the processor to synthesize the approval results of each decision tree to determine whether to release the resources requested by the user.
  • Steps S110 to S150 will be introduced in detail below.
  • step S110 the user's request for resource use is obtained.
  • when the user needs to use resources beyond their own permissions, the user is prompted, and the resource usage request filled in by the user on the page is obtained.
  • step S120 multiple characteristics of the user's request for resource usage are determined.
  • the characteristics of the user's request to use resources include at least one of: the type of resource the user requests to use, the specifications of the resource the user requests to use, the number of resources the user requests to use, the user's resource usage permissions, and the user's reason for requesting to use the resource.
  • multiple features such as the user's position, rank, and responsible work are extracted as multiple features of the user's resource use request.
  • step S130 for each decision tree in the random forest model, a feature corresponding to the decision tree is selected from a plurality of features.
  • a random forest model consists of multiple decision trees, each of which makes predictions based on only a subset of the multiple features of the usage request.
  • Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure.
  • each decision tree includes multiple nodes, and each node of the decision tree corresponds to a feature. Therefore, the features corresponding to the decision tree are also the features corresponding to the multiple nodes of the decision tree.
  • the characteristics corresponding to the decision tree include: whether the user has the same historical usage request, the user's rank, the permission group to which the user belongs, the value of the resources applied by the user, and the number of resources applied by the user.
  • the permission group is the set of decision-making scopes and degrees over certain matters that a position holder must have in order to ensure the effective performance of their duties.
  • since each decision tree generates approval results based on only part of the features, that is, the multiple features are distributed among multiple decision trees for processing, the number of features each decision tree needs to process is smaller than the total number of features in the usage request. The model can therefore handle high-dimensional (multi-feature) usage requests without prior feature selection or feature dimensionality reduction, solving the approval problem for complex (multi-feature) user requests and improving the accuracy and efficiency of approving resource usage requests.
  • step S140 the approval result is predicted based on the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user.
  • the values of the multiple features corresponding to the decision tree include: no identical historical usage request exists; the user belongs to the high-permission group; the user's rank is PY; the value of the resource requested by the user is C; and the number of resources requested by the user is D, where PY is a higher rank than PX, C > A, and D > B.
  • the input of the decision tree is the characteristic values of multiple features corresponding to the decision tree.
  • the decision tree first determines whether an identical historical usage request exists. If the judgment result is no, it enters the "Does the user belong to the high-permission group?" node. At that node, if the judgment result is yes, it enters the "Is the resource value greater than A?" node. If the judgment result there is also yes, the final approval result of this decision tree is to release the resource to the user.
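The single-tree walk described above can be sketched as follows; the feature keys, the fallback for an existing historical request, and the threshold A are illustrative assumptions, not values taken from the disclosure.

```python
# Hypothetical sketch of the example decision tree's traversal.
# Feature keys and the threshold A are illustrative assumptions.

def approve(features, value_threshold_a):
    """Walk the example decision tree; return True to release resources."""
    if features["has_identical_historical_request"]:
        # Assumed branch: reuse the outcome of the identical past request.
        return features["historical_request_approved"]
    if not features["in_high_permission_group"]:
        return False
    # "Is the resource value greater than A?"
    return features["resource_value"] > value_threshold_a

request = {
    "has_identical_historical_request": False,
    "historical_request_approved": False,
    "in_high_permission_group": True,
    "resource_value": 120,  # C, with C > A
}
print(approve(request, value_threshold_a=100))  # True: release resources
```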
  • the approval results are predicted based on the values of the features corresponding to each decision tree, including multiple decision trees generating the approval results in parallel.
  • the multiple decision trees of the random forest can make predictions independently and in parallel, thereby increasing the speed of approval.
  • step S150 the approval results of each decision tree are integrated to determine whether to release the resources requested by the user.
  • Figure 3 shows a schematic diagram of a random forest model determining whether to release resources according to some embodiments of the present disclosure.
  • each decision tree generates an approval result based on some characteristics of the user's request for resource use.
  • the results are then jointly decided by multiple trees, and based on the majority voting mechanism, it is decided whether to release the resources requested by the user.
  • combining the approval results of each decision tree to determine whether to release the resources requested by the user includes: determining whether to release the resources requested by the user based on the proportion of the number of decision trees that generate the same approval result to the total number of decision trees, and on the first preset threshold.
  • determining whether to release the resource requested by the user based on that proportion and the first preset threshold includes: when the ratio of the number of decision trees that generate the same approval result to the total number of decision trees exceeds the first preset threshold, determining whether to release the resources requested by the user based on that approval result.
  • the random forest algorithm predicts results by constructing a large number of independent decision trees.
  • the number of decision trees predicting the same result needs to reach a preset threshold before the result will be accepted.
  • the voting mechanism can be a one-vote veto system, simple majority rule, weighted majority, etc. Once the vote passes, the approval flow automatically releases the resources and ends the approval immediately.
  • for example, let the first preset threshold be 0.8 and let the random forest consist of n decision trees. If the approval result of m decision trees is "release resources to the user" and m/n exceeds 0.8, the random forest determines that the final result is to release the resources requested by the user.
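A minimal sketch of this thresholded majority vote; the function name and vote labels below are hypothetical, while the 0.8 threshold matches the example above.

```python
from collections import Counter

def forest_decision(tree_results, threshold=0.8):
    """Combine per-tree approval results.

    Returns the shared result if its share of votes exceeds the threshold,
    otherwise None, meaning the request falls back to another approval method.
    """
    result, count = Counter(tree_results).most_common(1)[0]
    if count / len(tree_results) > threshold:
        return result
    return None  # below threshold: cannot decide automatically

# n = 10 trees, m = 9 of them vote "release": 9/10 > 0.8
votes = ["release"] * 9 + ["reject"]
print(forest_decision(votes))                             # release
print(forest_decision(["release"] * 5 + ["reject"] * 5))  # None
```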
  • determining whether to release the resource requested by the user based on that proportion and the first preset threshold includes: when the ratio of the number of decision trees that generate the same approval result to the total number of decision trees does not exceed the first preset threshold, determining whether to release the resources requested by the user based on multiple characteristics.
  • in this case, the random forest cannot automatically determine whether the approval passes, and the approval form is transferred to other approval methods: for example, an approver determines whether to release the resources based on the type of resource the user requests, the specifications of the resources requested, the number of resources requested, the user's resource usage permissions, and the user's reasons for requesting to use the resources.
  • This disclosure integrates the results of all decision trees to determine whether to release the resources requested by the user, reduces the impact of errors of a single decision tree on the final result, and improves the accuracy of approval.
  • Figure 4 illustrates a flowchart of resource approval according to some embodiments of the present disclosure.
  • the approval flow starts. At this time, the approval flow automatically enters the Listen state, waiting for the user to enter approval information at the front end.
  • the user's request for resource usage also includes the user's historical usage request for the resource.
  • information such as the specifications of the resources the user applies to use is automatically obtained and filled in with the specifications the user previously selected or commonly uses; the applicant only needs to add the scenarios and reasons for the requested resources.
  • the status of the approval flow Listen lasts for 30 minutes. If the user does not fill out and submit the approval form within 30 minutes, the approval form will be automatically closed.
  • after the user fills in the approval form information, the first node reached by the approval flow is the process engine.
  • a certain number of preset rules will be built into the Process engine. These rules support customization.
  • the platform administrator can customize, according to the needs of the company's organizational structure, which permission groups (such as which departments and positions) can have which resource calls exempted from approval.
  • the process engine stipulates that testers can be exempted from approval when they apply for cloud hosts that exceed the specified specifications within a certain range for link stress testing.
  • Some rules are preset based on hard conditions such as security requirements, company rules and regulations, and management requirements.
  • the process engine uses these rules to directly filter out which usage requests need to enter other approval methods, such as for some cross-border requests.
  • these special resource usage requests are divided into other approval methods. The remaining resource usage requests enter the processing flow of the random forest algorithm.
  • the system captures the relevant features of the applicant's usage request, such as the applicant's position, rank, responsible work, and other factors that affect the approval result. These features are input into the decision trees as the basis for prediction. Finally, the random forest formed by multiple decision trees determines whether to pass the user's request.
  • the results of random forest predictions need to reach a certain threshold before they will be accepted. If the prediction results do not reach the threshold, it will not be possible to automatically determine whether the approval has been passed, and the approval form will be transferred to other approval methods. For resource usage requests that pass the approval, the resources will be automatically released to the user and the approval process will be completed. For resource usage requests that fail, the Listen state will be returned, waiting for the user to modify the information.
  • Resource usage requests that enter other approval methods will also have two statuses: passed and failed. For resource usage requests that pass the approval, the resources will be released to the user and the approval process will end. For resource usage requests that fail, the Listen state will be returned, waiting for the user to modify the information.
  • Figure 5 shows a flow chart of a training method of a random forest model according to some embodiments of the present disclosure.
  • the training method of the random forest model includes steps S210-S240.
  • the training method of the random forest model is executed by a training device of the random forest model.
  • in step S210, a training set is obtained, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user.
  • a usage request sample includes a user-filled request for resource usage, as well as a label indicating whether to release the resource requested by the user.
  • step S220 multiple characteristics of the user's resource usage request are determined.
  • multiple features such as the user's position, rank, and responsible work are extracted as multiple features of the user's resource use request, and the values of these features are determined.
  • step S230 for each decision tree in the random forest model, some features are extracted from multiple features as candidate features of the decision tree.
  • a part of features are randomly extracted and used as candidate features for decision tree training.
  • the sample has Y features, and T (T ≤ Y) features are randomly selected from all features of the sample as candidate features for a decision tree.
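A minimal sketch of drawing T candidate features from the Y available ones; the feature names below are hypothetical examples, not a list from the disclosure.

```python
import random

def sample_candidate_features(all_features, t, seed=None):
    """Randomly pick T of the Y features (T <= Y) as a tree's candidates."""
    rng = random.Random(seed)
    return rng.sample(all_features, t)  # sampling without replacement

# Y = 6 hypothetical features of a resource usage request
features = ["resource_type", "resource_spec", "resource_count",
            "user_permission", "request_reason", "user_rank"]
candidates = sample_candidate_features(features, t=3, seed=0)
print(len(candidates))  # 3
```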
  • Random sampling with replacement makes the probability of each sample being drawn conform to a uniform distribution.
  • training each decision tree according to the value of the candidate feature of each decision tree and the label of the sample includes: for each decision tree, extracting multiple samples from the training set as training for the decision tree Set; train the decision tree based on the values of candidate features corresponding to the decision tree in the training set of the decision tree and the label indicating whether to release the resource requested by the user.
  • random sampling with replacement is also used to extract samples from the training set.
  • S (S ≤ X) samples are randomly sampled from the data set with replacement as a training set for a decision tree.
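The bootstrap draw of S samples with replacement can be sketched as follows; each draw is uniform over the X samples, so some samples may repeat while others are left out.

```python
import random

def bootstrap_sample(dataset, s, seed=None):
    """Draw S samples (S <= X) with replacement, uniformly per draw."""
    rng = random.Random(seed)
    return [rng.choice(dataset) for _ in range(s)]

dataset = list(range(100))           # X = 100 labelled usage-request samples
training_set = bootstrap_sample(dataset, s=80, seed=0)
print(len(training_set))             # 80, possibly with repeated samples
```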
  • the training set is divided into multiple subsets.
  • Each decision tree is constructed using a subset as a training set.
  • multiple trained decision trees form a forest.
  • rows and columns are randomly selected, which can truly randomly divide the entire data table into multiple parts, and use one part for each decision tree.
  • when the number of decision trees is large enough, there is always a decision tree that can capture the value of the data set to the greatest extent, thereby improving the accuracy of resource approval by the random forest model.
  • step S240 each decision tree is trained according to the value of the candidate feature of each decision tree and the label of the sample.
  • the following describes the training method of a single decision tree.
  • training each decision tree according to the values of its candidate features and the labels of the samples includes: taking the root node of the decision tree as the current node, and selecting the feature corresponding to the root node from the candidate features according to the training set; according to the values of the feature corresponding to the current node in the samples of the training set corresponding to the current node, and the labels of the samples, determining the training sets corresponding to the child nodes of the current node; according to the values of the features in the samples of the training sets corresponding to the child nodes of the current node, and the labels of the samples, selecting the features corresponding to the child nodes of the current node from the remaining candidate features; and taking each child node of the current node as the current node in turn, looping to determine the child nodes of the current node.
  • the child nodes of the current node include a first child node and a second child node of the current node; determining the training sets corresponding to the child nodes of the current node according to the values of the feature corresponding to the current node in the samples of the training set corresponding to the current node, and the labels of the samples, includes: selecting, based on those values and labels, a feature value from the value range of the feature corresponding to the current node as the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node; and, according to that split point, determining whether each sample in the training set corresponding to the current node is divided into the training set of the first child node or the training set of the second child node.
  • the S samples, with their T extracted features, are used to train the decision tree model.
  • first, the feature corresponding to the root node is selected.
  • for example, a CART (classification and regression tree) is used. CART is a binary tree that splits recursively downward from the root; that is, each of its nodes has only two choices, "yes" and "no". By continuously dividing the feature space into a finite number of units, the predicted probability distribution on these units is determined.
  • the Gini coefficient represents the impurity of the model. The smaller the Gini coefficient, the lower the impurity, and the better the feature.
  • the purity of data set D can be measured by the Gini value. Assuming the samples in the set belong to K classes, and p_k is the proportion of samples of class k, the Gini coefficient is calculated as: Gini(D) = 1 − Σ_{k=1}^{K} p_k².
  • Gini(D) reflects the probability that two samples randomly selected from the data set D have inconsistent class labels. Therefore, the smaller Gini(D), the higher the purity of data set D.
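The Gini value described above can be computed, for example, as follows; a pure set has Gini 0, and an evenly mixed binary set has Gini 0.5.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_k^2) over the K class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["release"] * 4))                             # 0.0 (pure)
print(gini(["release", "release", "reject", "reject"]))  # 0.5 (evenly mixed)
```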
  • the Gini coefficient of each value of each available feature of the current node with respect to data set D is calculated. For example, first determine the value range of feature A based on the samples of the training set. Taking the current node as the root node, the training set corresponding to the root node is D. From the value range of A, select a value a. According to whether feature A takes the value a, the training set D is divided into two parts: when the value of feature A in a sample is a, the sample is divided into training set D1; otherwise, it is divided into training set D2.
  • the Gini coefficient of feature A with cut point a for data set D is calculated as: Gini(D, A = a) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2).
  • Gini(D1) represents the Gini coefficient of data set D1.
  • the decision tree can handle both continuous and discrete values. For continuous values, assuming the continuous feature A takes m distinct values across the samples, arranged from small to large, CART takes the average of each pair of adjacent values as a candidate dividing point, giving m − 1 dividing points in total. The Gini coefficient is calculated separately for each of these m − 1 points used as a binary classification point, and the point with the smallest Gini coefficient is selected as the cut point of the continuous feature. For example, if the point with the smallest Gini coefficient is a, then values less than a form category 1 and values greater than a form category 2, thereby discretizing the continuous feature.
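Selecting the continuous-feature cut point among the m − 1 midpoints can be sketched as follows; the feature values and labels are illustrative, not data from the disclosure.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Try the m-1 midpoints of adjacent sorted values of a continuous
    feature and return the cut point with the smallest weighted Gini."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (cut, weighted)
    return best

values = [1, 2, 3, 10, 11, 12]          # illustrative resource-value feature
labels = ["reject"] * 3 + ["release"] * 3
cut, score = best_split(values, labels)
print(cut, score)                        # 6.5 0.0: a perfect split
```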
  • for a discrete feature, CART uses repeated binary splitting. For a feature A with values a1, a2 and a3, CART considers the three binary groupings ({a1}, {a2, a3}), ({a2}, {a1, a3}) and ({a3}, {a1, a2}), finds the grouping with the smallest Gini coefficient, for example ({a2}, {a1, a3}), and then creates a binary tree node: one child holds the samples corresponding to a2, and the other holds the samples corresponding to a1 and a3. Because the values of feature A are not completely separated by this split, feature A may be used again when splitting the descendant nodes.
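The binary groupings of a discrete feature's values can be enumerated as below; this is a sketch of the grouping step only, without the Gini comparison that follows it:

```python
from itertools import combinations

def binary_partitions(values):
    """All ways to split a discrete feature's value set into two non-empty
    groups; CART then picks the grouping with the smallest weighted Gini."""
    vals = sorted(values)
    parts = []
    for r in range(1, len(vals) // 2 + 1):
        for left in combinations(vals, r):
            right = tuple(v for v in vals if v not in left)
            if r == len(vals) - r and left > right:
                continue  # equal-sized halves: avoid counting a partition twice
            parts.append((left, right))
    return parts

print(binary_partitions(["a1", "a2", "a3"]))
# [(('a1',), ('a2', 'a3')), (('a2',), ('a1', 'a3')), (('a3',), ('a1', 'a2'))]
```

For m distinct values this yields 2^(m-1) - 1 candidate partitions, matching the three cases named in the text for m = 3.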
  • after determining the split point that divides the training set corresponding to the current node into the training set of its first sub-node and the training set of its second sub-node, each sample in the current node's training set is assigned, according to the split point, either to the first sub-node or to the second sub-node, thereby generating the training sets of the first and second sub-nodes.
  • each child node is then taken as the current node in turn, and the above steps of determining the feature of the current node based on its training set and determining the training sets of its child nodes based on that feature are repeated until a cutoff condition is reached, at which point the decision subtree is returned and recursion stops at the current node; in this way the entire decision tree is built.
  • the interaction between different features can also be measured. For example, if the training set in a decision tree is split into two child nodes according to a certain feature M, and splitting on feature J then becomes easier, features M and J are said to interact.
  • the cutoff conditions include at least one of the following: no candidate features remain, the number of samples in the training set corresponding to the current node is less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold.
  • when a cutoff condition is met, the decision subtree is returned and recursion stops at the current node.
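The recursive construction and cutoff conditions above can be sketched as a minimal CART on binary {0, 1} features; the `over_quota` feature and the labels are hypothetical, and the sketch omits the empty-branch and multi-way cases a production implementation would need:

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def build_tree(rows, labels, features, min_samples=1):
    """Recursive CART sketch: pick the feature with the smallest weighted
    Gini, split, and recurse until a cutoff condition is met (pure node,
    no remaining candidate features, or too few samples)."""
    if len(set(labels)) == 1 or not features or len(rows) <= min_samples:
        return max(set(labels), key=labels.count)  # leaf: majority label
    best_f, best_g = None, float("inf")
    for f in features:
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if g < best_g:
            best_f, best_g = f, g
    rest = [f for f in features if f != best_f]
    return {
        "feature": best_f,
        0: build_tree([r for r in rows if r[best_f] == 0],
                      [y for r, y in zip(rows, labels) if r[best_f] == 0], rest),
        1: build_tree([r for r in rows if r[best_f] == 1],
                      [y for r, y in zip(rows, labels) if r[best_f] == 1], rest),
    }

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree[row[tree["feature"]]]
    return tree

# Hypothetical feature: 0 = within quota, 1 = over quota.
rows = [{"over_quota": 0}, {"over_quota": 0}, {"over_quota": 1}, {"over_quota": 1}]
labels = ["approve", "approve", "reject", "reject"]
tree = build_tree(rows, labels, ["over_quota"])
print(predict(tree, {"over_quota": 1}))  # reject
```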
  • the present disclosure automatically determines the importance of the features of a user's resource usage request based on the Gini coefficient. In addition, it can measure the interaction between different features, build decision trees, and generate resource approval results without dimensionality reduction or feature selection, which improves the accuracy and efficiency of approving resource usage requests.
  • multiple decision trees are constructed, which together form the random forest model.
  • determining the multiple characteristics of the user's resource usage request includes: when the value of a feature is missing from a sample of the user's resource usage request, calculating the similarity between the paths that this sample and other samples take through the nodes of the decision tree, and determining the value of the missing feature of the sample based on that similarity.
  • for example, first preset estimates for the missing values in the samples: for a numeric variable, use the median of the remaining data as the estimate of the missing value; for a categorical variable, use the mode. Then build a random forest based on the estimated values, run all the data through it again, and record the step-by-step classification path of each group of data in the decision trees to determine which groups of data are most similar to the data with missing values. A similarity matrix is introduced to record the similarity between the data; for example, if there are N groups of data, the similarity matrix has size N*N. A missing numeric value is then re-estimated by a similarity-weighted average, and a missing categorical value by similarity-weighted voting, and this is repeated until the estimates become stable.
  • the filled-in data is random rather than deterministic, and can better reflect the true distribution of the unknown data.
  • because each node uses a random subset of features instead of all the features of the training set, the method is well suited to filling in high-dimensional data. Therefore, the present disclosure can reduce the interference of missing values with resource approval and improve the accuracy of approval of resource usage requests.
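One refinement step of the similarity-weighted fill described above can be sketched as follows; the leaf ids and observed values are made up, and in practice they would come from running the samples through the trained forest:

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two samples land in the same leaf,
    i.e. how similar their paths through the forest are."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)

def impute_numeric(missing_leaves, others):
    """Proximity-weighted average of the other samples' observed values,
    as one refinement step of the iterative fill described above.
    `others` is a list of (leaf_ids, observed_value) pairs."""
    weights = [proximity(missing_leaves, lv) for lv, _ in others]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, others)) / total

# Hypothetical leaf ids in a 3-tree forest; the first two samples take
# nearly the same path as the sample with the missing value.
others = [([1, 4, 7], 10.0), ([1, 4, 9], 12.0), ([2, 5, 8], 40.0)]
print(impute_numeric([1, 4, 7], others))  # 10.8: dominated by similar samples
```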
  • training each decision tree includes pruning the decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
  • Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure.
  • the post-pruning method is used: a decision tree is first generated, then all pruned CART subtrees are generated from it, cross-validation is used to test the effect of each pruning, and the pruning strategy with the best generalization performance is selected.
  • the loss function of the subtree T_t rooted at node t is:

    C_α(T_t) = C(T_t) + α·|T_t|
  • after pruning T_t back to its root, the loss function of the single node t is:

    C_α(t) = C(t) + α
  • α is the regularization parameter (analogous to the regularization coefficient in linear regression)
  • C(T_t) is the prediction error on the verification data (that is, the Gini coefficient measured on the verification data)
  • |T_t| is the number of leaf nodes of the subtree T_t.
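The prune-or-keep comparison implied by the two loss functions above can be sketched as plain arithmetic; the error values and α are illustrative:

```python
def cost_complexity(error, leaves, alpha):
    """C_alpha(T) = C(T) + alpha * |T|: verification error plus a penalty
    proportional to the number of leaves."""
    return error + alpha * leaves

def should_prune(node_error, subtree_error, subtree_leaves, alpha):
    """Collapse the subtree to a single leaf when the pruned cost is no
    worse than the cost of keeping the subtree."""
    return cost_complexity(node_error, 1, alpha) <= cost_complexity(
        subtree_error, subtree_leaves, alpha)

# A 5-leaf subtree barely beats its root on verification error: with a
# large enough alpha, the leaf-count penalty tips the balance.
print(should_prune(0.20, 0.18, 5, alpha=0.001))  # False: keep the subtree
print(should_prune(0.20, 0.18, 5, alpha=0.01))   # True: prune it
```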
  • each decision tree is trained based on the values of its candidate features and the labels of the samples, and the multiple decision trees can be trained in parallel. For example, training the decision trees of the random forest in parallel and independently improves the training speed of the random forest model.
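Because the trees are independent, they can be fitted concurrently; a sketch with a thread pool, where `train_one_tree` is a placeholder standing in for fitting one CART on a bootstrap sample:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def train_one_tree(seed, data):
    """Placeholder for fitting one decision tree on a bootstrap sample."""
    rng = random.Random(seed)
    bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
    return {"seed": seed, "n_samples": len(bootstrap)}  # stand-in "model"

data = list(range(100))
with ThreadPoolExecutor(max_workers=4) as pool:
    forest = list(pool.map(lambda s: train_one_tree(s, data), range(10)))
print(len(forest))  # 10 trees trained independently
```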
  • Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure.
  • the resource approval device 7 includes an acquisition module 71 , a first determination module 72 , a selection module 73 , a prediction module 74 , and a second determination module 75 .
  • the obtaining module 71 is configured to obtain the user's request for resource use, for example, performing step S110 as shown in Figure 1 .
  • the first determination module 72 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S120 as shown in FIG. 1 .
  • the selection module 73 is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features, for example, perform step S130 as shown in FIG. 1 .
  • the prediction module 74 is configured to predict the approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user, for example, perform step S140 as shown in FIG. 1 .
  • the second determination module 75 is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user, for example, performing step S150 as shown in FIG. 1 .
  • Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure.
  • the training device of the random forest model includes an acquisition module 81, a determination module 82, an extraction module 83, and a training module 84.
  • the acquisition module 81 is configured to acquire a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to issue the resources requested by the user. For example, step S210 shown in Figure 5 is performed.
  • the determination module 82 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S220 as shown in FIG. 5 .
  • the extraction module 83 is configured to, for each decision tree in the random forest model, extract some features from multiple features as candidate features of the decision tree, for example, perform step S230 as shown in FIG. 5 .
  • the training module 84 is configured to train each decision tree according to the value of the candidate feature of each decision tree and the label of the sample, for example, perform step S240 as shown in FIG. 5 .
  • Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure.
  • the electronic device 9 includes a memory 91; and a processor 92 coupled to the memory 91.
  • the memory 91 is used to store instructions for executing corresponding embodiments of the resource approval method or the training method of the random forest model.
  • the processor 92 is configured to execute the resource approval method or the random forest model training method in any embodiment of the present disclosure based on instructions stored in the memory 91 .
  • Figure 10 illustrates a block diagram of a computer system for implementing some embodiments of the present disclosure.
  • Computer system 100 may be embodied in the form of a general purpose computing device.
  • Computer system 100 includes memory 1010, a processor 1020, and a bus 1000 that connects various system components.
  • the memory 1010 may include, for example, system memory, non-volatile storage media, and the like.
  • System memory stores, for example, operating systems, applications, boot loaders, and other programs.
  • System memory may include volatile storage media such as random access memory (RAM) and/or cache memory.
  • the non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one of the resource approval methods or the random forest model training methods in any embodiments of the present disclosure.
  • Non-volatile storage media includes but is not limited to disk storage, optical storage, flash memory, etc.
  • the processor 1020 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or other discrete hardware components.
  • each module, such as the judgment module and the determination module, can be implemented by a central processing unit (CPU) executing instructions stored in memory that perform the corresponding steps, or by dedicated circuits that perform the corresponding steps.
  • Bus 1000 may use any of a variety of bus structures.
  • bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
  • the computer system 100 may also include an input/output interface 1030, a network interface 1040, a storage interface 1050, and the like. These interfaces 1030, 1040, 1050, the memory 1010 and the processor 1020 may be connected through a bus 1000.
  • the input and output interface 1030 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard.
  • Network interface 1040 provides connection interfaces for various networked devices.
  • the storage interface 1050 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.
  • these computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable device to produce a machine, such that the instructions, when executed by the processor, produce means for implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • these computer-readable program instructions may also be stored in a computer-readable memory, causing the computer to operate in a specific manner to produce an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • the disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.

Abstract

The present disclosure relates to the technical field of cloud computing, and relates to a resource examination and approval method and device, and a random forest model training method and device. The resource examination and approval method comprises: acquiring a resource usage request of a user; determining a plurality of features of the resource usage request of the user; for each decision tree in a random forest model, selecting, from the plurality of features, a feature corresponding to the decision tree; predicting an examination and approval result according to the value of the feature corresponding to each decision tree, wherein the examination and approval result represents whether a resource requested by the user is issued; and in view of the examination and approval results of the decision trees, determining whether to issue the resource requested by the user.

Description

Resource approval method, random forest model training method and device
Cross-references to related applications
This application is based on, and claims priority to, the application with Chinese application number 202210905742.3 filed on July 29, 2022; the disclosure of that Chinese application is hereby incorporated into this application in its entirety.
Technical field
The present disclosure relates to the field of cloud computing technology, and in particular to resource approval methods, random forest model training methods and devices, and computer-readable storage media.
Background
A private cloud provides services to organizations within an enterprise and can offer enterprise users a variety of cloud products, forming a complex cloud ecological chain with high data security and strong controllability of the IT infrastructure.
Enterprise-level users usually have complex multi-layered internal organizational structures. In a private cloud scenario, users at different levels within the enterprise are given differentiated permissions, and the specifications of the resources that users with different permissions can use also differ. When a user needs to use resources beyond their own permissions, they must submit an application and wait for it to be approved before they can use the product normally.
Contents of the invention
According to a first aspect of the present disclosure, a resource approval method is provided, including:
obtaining a user's resource usage request;
determining multiple features of the user's resource usage request;
for each decision tree in a random forest model, selecting, from the multiple features, the feature corresponding to that decision tree;
predicting an approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to issue the resource requested by the user;
determining, based on the approval results of all the decision trees, whether to issue the resource requested by the user.
In some embodiments, determining whether to issue the resource requested by the user based on the approval results of all the decision trees includes:
determining whether to issue the resource requested by the user according to a first preset threshold and the proportion of the total number of decision trees represented by the decision trees that generate the same approval result.
In some embodiments, this determination includes:
when the proportion of decision trees that generate the same approval result exceeds the first preset threshold, determining, based on that approval result, whether to issue the resource requested by the user.
In some embodiments, this determination includes:
when the proportion of decision trees that generate the same approval result does not exceed the first preset threshold, determining, based on the multiple features, whether to issue the resource requested by the user.
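The two threshold cases above can be combined into one decision routine; this is a sketch, and the 0.8 threshold and the "manual_review" fallback (standing in for the feature-based re-examination) are assumptions, not values from the disclosure:

```python
def decide(votes, threshold=0.8):
    """votes: one 0/1 entry per decision tree (1 = approve).
    If the share of trees agreeing on one result exceeds the threshold,
    adopt that result; otherwise fall back to a fuller review of the
    features, signalled here as 'manual_review'."""
    approve_share = sum(votes) / len(votes)
    reject_share = 1 - approve_share
    if approve_share > threshold:
        return "approve"
    if reject_share > threshold:
        return "reject"
    return "manual_review"

print(decide([1] * 9 + [0]))      # approve: 90% of trees agree
print(decide([1] * 6 + [0] * 4))  # manual_review: no strong majority
```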
In some embodiments, the user's resource usage request further includes the user's historical resource usage requests.
In some embodiments, the features of the user's resource usage request include at least one of: the type of resource the user requests to use, the specification of the resource, the quantity of resources, the user's resource usage permission, and the user's reason for requesting the resource.
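A request with the features listed above might be encoded as a numeric vector like this; the field names and category codes are illustrative, not from the disclosure:

```python
# Hypothetical category codings for the non-numeric features.
RESOURCE_TYPES = {"vm": 0, "storage": 1, "database": 2}
REASONS = {"testing": 0, "production": 1, "analytics": 2}

def encode_request(req):
    """Map one resource usage request to a feature vector."""
    return [
        RESOURCE_TYPES[req["resource_type"]],  # type of resource requested
        req["spec_cpus"],                      # specification (e.g. CPU count)
        req["quantity"],                       # number of resources
        req["user_permission_level"],          # user's permission level
        REASONS[req["reason"]],                # stated reason, coded
    ]

req = {"resource_type": "vm", "spec_cpus": 16, "quantity": 2,
       "user_permission_level": 3, "reason": "testing"}
print(encode_request(req))  # [0, 16, 2, 3, 0]
```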
According to a second aspect of the present disclosure, a training method for a random forest model is provided, including:
obtaining a training set, where the training set includes samples of users' resource usage requests, and each sample further includes a label indicating whether the requested resource was issued;
determining multiple features of the users' resource usage requests;
for each decision tree in the random forest model, extracting some of the multiple features as candidate features of that decision tree;
training each decision tree according to the values of its candidate features and the labels of the samples.
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes:
taking the root node of the decision tree as the current node and, based on the training set, selecting the feature corresponding to the root node from the candidate features;
determining the training set corresponding to each child node of the current node based on the values, in the samples of the current node's training set, of the feature corresponding to the current node, and the labels of those samples;
selecting, from the remaining candidate features, the feature corresponding to each child node of the current node based on the values of the current node's feature in the samples of that child node's training set and the labels of those samples;
taking each child node as the current node in turn, and repeating the steps of determining the training sets of the current node's child nodes and selecting features for them from the remaining candidate features, until a cutoff condition is reached.
In some embodiments, the child nodes of the current node include a first child node and a second child node, and determining the training set corresponding to each child node includes:
selecting, based on the values of the current node's feature in the samples of its training set and the labels of those samples, one value from the value range of that feature as the split point dividing the training set of the first child node from the training set of the second child node;
judging, according to that split point, whether each sample in the current node's training set is assigned to the training set of the first child node or to that of the second child node.
In some embodiments, the cutoff condition includes at least one of: no candidate features remain, the number of samples in the training set corresponding to the current node is less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold.
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes:
for each decision tree, extracting multiple samples from the training set as the training set of that decision tree;
training the decision tree according to the values, in the samples of its training set, of the candidate features corresponding to that decision tree, and the labels indicating whether the requested resources were issued.
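The per-tree sample extraction above, combined with the per-tree candidate-feature extraction, can be sketched as follows; the feature names are hypothetical:

```python
import random

def tree_inputs(n_rows, all_features, n_feats, seed):
    """One tree's training inputs: a bootstrap sample of row indices plus
    a random subset of the features (the two randomization steps above)."""
    rng = random.Random(seed)
    row_idx = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
    feats = rng.sample(all_features, n_feats)                 # without replacement
    return row_idx, feats

rows, feats = tree_inputs(6, ["type", "spec", "qty", "perm", "reason"], 3, seed=1)
print(len(rows), sorted(feats))
```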
In some embodiments, determining the multiple features of the user's resource usage request includes:
when the value of a feature is missing from a sample of the user's resource usage request, calculating the similarity between the paths that this sample and other samples take through the nodes of the decision tree;
determining the value of the missing feature of the sample based on the similarity of those paths.
According to a third aspect of the present disclosure, a resource approval device is provided, including:
an acquisition module configured to obtain a user's resource usage request;
a first determination module configured to determine multiple features of the user's resource usage request;
a selection module configured to, for each decision tree in a random forest model, select, from the multiple features, the feature corresponding to that decision tree;
a prediction module configured to predict an approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to issue the resource requested by the user;
a second determination module configured to determine, based on the approval results of all the decision trees, whether to issue the resource requested by the user.
According to a fourth aspect of the present disclosure, a training device for a random forest model is provided, including:
an acquisition module configured to obtain a training set, where the training set includes samples of users' resource usage requests, and each sample further includes a label indicating whether the requested resource was issued;
a determination module configured to determine multiple features of the users' resource usage requests;
an extraction module configured to, for each decision tree in the random forest model, extract some of the multiple features as candidate features of that decision tree;
a training module configured to train each decision tree according to the values of its candidate features and the labels of the samples.
According to a fifth aspect of the present disclosure, an electronic device is provided, including:
a memory; and
a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the resource approval method according to any embodiment of the present disclosure, or the random forest model training method according to any embodiment of the present disclosure.
According to a sixth aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored which, when executed by a processor, implement the resource approval method according to any embodiment of the present disclosure, or the random forest model training method according to any embodiment of the present disclosure.
Description of drawings
The accompanying drawings, which constitute a part of this specification, describe embodiments of the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
The present disclosure may be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure;
Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure;
Figure 3 shows a schematic diagram of a random forest model determining whether to issue resources according to some embodiments of the present disclosure;
Figure 4 shows a flow chart of resource approval according to some embodiments of the present disclosure;
Figure 5 shows a flow chart of a training method for a random forest model according to some embodiments of the present disclosure;
Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure;
Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure;
Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure;
Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure;
Figure 10 shows a block diagram of a computer system for implementing some embodiments of the present disclosure.
Detailed description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
At the same time, it should be understood that, for convenience of description, the dimensions of the various parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, such techniques, methods, and devices should be considered part of the specification.
In all the examples shown and discussed herein, any specific value should be interpreted as merely illustrative rather than limiting; other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
In the related art, when a user needs to use a product beyond their permissions, the request generally has to pass through layer-by-layer approval at multiple nodes. This approach has the following problems.
First, the nodes in the approval flow are often the persons in charge at various levels of an enterprise or institution. Each node has to approve the resource usage requests of many users, so erroneous approval operations are hard to avoid, which reduces the accuracy of approval.
Second, the approval process often passes through multiple nodes, and the time taken at each node depends on the circumstances of the person in charge of that node; a blockage at any node stalls the entire approval flow, reducing the efficiency with which approvals are completed.
Finally, this approval method is difficult to adapt to complex business needs. Sometimes users need to combine resources from different cloud products to complete their target tasks, but the nodes of the traditional approval method are fixed, making it difficult to adjust in time to users' differentiated needs.
To solve the above problems, some embodiments of the present disclosure provide a resource approval method, a random forest training method and device, and a computer-readable storage medium.
Figure 1 shows a flow chart of a resource approval method according to some embodiments of the present disclosure.
As shown in Figure 1, the resource approval method includes steps S110 to S150. In some embodiments, the following resource approval method is executed by a resource approval device.
For example, the resource approval device includes an input/output device and a processor. The resource approval method includes: obtaining a user's resource usage request through the input/output device (such as an interactive panel); using the processor to determine multiple features of the user's resource usage request; using the processor to select, for each decision tree in a random forest model, the features corresponding to that decision tree from among the multiple features; using a decision tree algorithm, executed by the processor, to predict an approval result from the values of the features corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user; and using the processor to combine the approval results of all the decision trees to determine whether to release the resource requested by the user.
Steps S110 to S150 are described in detail below.
In step S110, the user's resource usage request is obtained.
For example, when the user needs to use a resource beyond the user's own permissions, the user is prompted, and the resource usage request that the user fills in on a page is obtained.
In step S120, multiple features of the user's resource usage request are determined.
In some embodiments, the features of the user's resource usage request include at least one of: the type of the requested resource, the specification of the requested resource, the quantity of the requested resource, the user's resource usage permissions, and the user's reason for requesting the resource.
For example, features such as the user's position, rank, and work responsibilities are extracted from the resource usage request submitted by the user as the multiple features of the request.
In step S130, for each decision tree in the random forest model, the features corresponding to that decision tree are selected from among the multiple features.
The random forest model includes multiple decision trees, and each decision tree makes its prediction based on only a subset of the multiple features of the usage request.
Figure 2 shows a schematic diagram of a decision tree according to some embodiments of the present disclosure.
As shown in Figure 2, each decision tree includes multiple nodes, and each node of the decision tree corresponds to one feature. The features corresponding to a decision tree are therefore the features corresponding to its nodes. For example, the features corresponding to the decision tree include: whether the user has made the same historical usage request, the user's rank, the permission group to which the user belongs, the value of the requested resource, and the quantity of the requested resource. Here, a permission group is the set of decision-making scopes and degrees over certain matters that a position holder must have in order to perform the duties of the position effectively.
For a trained decision tree, which feature each node corresponds to is known; that is, it is known which of the multiple features each decision tree needs to use. Therefore, when using the decision trees for prediction, the features obtained in step S120 only need to be distributed to each trained decision tree according to that tree's nodes.
Because each decision tree generates its approval result from only a part of the features, that is, the multiple features are divided among multiple decision trees for processing, each decision tree handles fewer features than the original total number of features in the usage request. The model can therefore handle high-dimensional usage requests (requests with many features) without feature selection in advance and without feature dimensionality reduction, can solve the approval problem for complex (multi-feature) user requests, and improves the accuracy and efficiency of approving resource usage requests.
In step S140, an approval result is predicted from the values of the features corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user.
For example, in the user's resource usage request, the values of the features corresponding to the decision tree include: no identical historical usage request exists, the user belongs to a high-permission group, the user's rank is PY, the value of the requested resource is C, and the quantity of the requested resource is D, where rank PY is higher than rank PX, C > A, and D > B.
As shown in Figure 2, the input of the decision tree is the values of the features corresponding to the tree. The decision tree first judges whether "the same historical usage request exists"; since the result is no, it proceeds to the node "does the user belong to a high-permission group?". At that node the result is yes, so it proceeds to the judgment "is the resource value greater than A?". At that node the result is yes, so the final approval result of this decision tree is to release the resource to the user.
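The decision path described above can be expressed as a chain of feature tests. The following is a minimal sketch, not the patent's implementation (the language, the field names, and the outcomes of the branches that the text does not describe are all assumptions):

```python
def approve(request, a_threshold):
    """Follow the Figure 2 path; return True to release the resource."""
    if request["has_same_historical_request"]:      # node 1 (branch assumed)
        return True
    if not request["in_high_permission_group"]:     # node 2 (branch assumed)
        return False
    # node 3: "is the resource value greater than A?"
    return request["resource_value"] > a_threshold

req = {"has_same_historical_request": False,
       "in_high_permission_group": True,
       "resource_value": 100}
print(approve(req, a_threshold=50))  # True: matches the path in the example
```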
In some embodiments, predicting the approval result from the values of the features corresponding to each decision tree includes multiple decision trees generating their approval results in parallel.
For example, the multiple decision trees of the random forest can be run in parallel, each making its prediction independently, which increases approval speed.
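Because the trees predict independently, the parallel evaluation can be sketched with a standard thread pool; the stand-in lambda "trees" below are assumptions, since real trees would be trained models:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_predict(trees, features):
    """Run every tree's prediction independently and collect the votes."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda tree: tree(features), trees))

# Stand-in "trees": each callable votes True (release) or False (deny).
trees = [lambda f: f["rank"] > 3,
         lambda f: f["value"] < 100,
         lambda f: not f["has_history"]]
votes = parallel_predict(trees, {"rank": 5, "value": 50, "has_history": False})
print(votes)  # [True, True, True]
```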
In step S150, the approval results of all the decision trees are combined to determine whether to release the resource requested by the user.
Figure 3 shows a schematic diagram of a random forest model determining whether to release a resource according to some embodiments of the present disclosure.
As shown in Figure 3, each decision tree generates an approval result from part of the features of the user's resource usage request; the trees then decide jointly, using a majority voting mechanism, whether to release the resource requested by the user.
In some embodiments, combining the approval results of all the decision trees to determine whether to release the resource requested by the user includes: determining whether to release the resource according to the proportion of decision trees generating the same approval result relative to the total number of decision trees, and a first preset threshold.
In some embodiments, this includes: when the proportion of decision trees generating the same approval result exceeds the first preset threshold, determining whether to release the resource requested by the user according to that approval result.
The random forest algorithm predicts the result by constructing a large number of independent decision trees; the number of decision trees predicting the same result must reach a preset threshold before the result is accepted. The voting mechanism may be a one-vote veto, a simple majority, a weighted majority, and so on. An approval flow that passes will automatically release the resource, and the approval then ends.
For example, let the first preset threshold be 0.8, and let the random forest consist of n decision trees, of which m decision trees produce the approval result "release the resource to the user". Then, when m/n > 0.8, the random forest determines that the final result is to release the resource requested by the user. Setting a threshold excludes untrustworthy prediction results, thereby improving prediction accuracy.
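The thresholded vote in this example can be sketched as follows; this is a simplified illustration, where returning None stands for "transfer to another approval method":

```python
def forest_decision(votes, threshold=0.8):
    """Accept the shared result only if its share of trees exceeds the
    threshold; otherwise return None (fall back to another approval path)."""
    approve = sum(votes)   # votes is a list of booleans, True = release
    n = len(votes)
    if approve / n > threshold:
        return True
    if (n - approve) / n > threshold:
        return False
    return None

print(forest_decision([True] * 9 + [False] * 1))  # True: 9/10 > 0.8
print(forest_decision([True] * 6 + [False] * 4))  # None: neither side > 0.8
```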
In some embodiments, determining whether to release the resource according to the proportion and the first preset threshold includes: when the proportion of decision trees generating the same approval result does not exceed the first preset threshold, determining whether to release the resource requested by the user according to the multiple features.
For example, if the prediction result obtained by the random forest fails to reach the threshold, the random forest cannot automatically decide whether the approval passes, and the approval form is transferred to another approval method; for example, an approver determines whether to release the resource according to features such as the type of the requested resource, its specification, its quantity, the user's resource usage permissions, and the user's reason for requesting the resource.
The present disclosure combines the results of all decision trees to determine whether to release the resource requested by the user, which reduces the influence of a single decision tree's error on the final result and improves approval accuracy.
Figure 4 shows a flow chart of resource approval according to some embodiments of the present disclosure.
As shown in Figure 4, when a user needs to call resources across permission groups, the approval flow starts. The approval flow automatically enters the Listen state and waits for the user to enter approval information on the front end.
In some embodiments, the user's resource usage request also includes the user's historical usage requests for the resource. For example, in the approval form for a resource usage request, information such as the specification of the requested resource is automatically filled in with the specification the user selected previously or the specification the user commonly uses; the applicant only needs to add the scenario in which the requested resource will be used and the reason for requesting it.
The Listen state of the approval flow lasts 30 minutes; if the user has not filled in and submitted the approval form within 30 minutes, the approval form is automatically closed.
After the user fills in the approval form, the first node the approval flow reaches is the process engine. A certain number of preset rules are built into the process engine, and these rules support customization: the platform administrator can customize, according to the needs of the company's organizational structure, which permission groups (for example, which departments or positions) may call which resources without approval. For example, the process engine may stipulate that testers applying for cloud hosts within a certain range beyond the standard specification for link stress testing are exempt from approval.
For resource usage requests that are exempt from approval, the resource is automatically released to the user, after which the approval flow ends.
Some rules are preset according to hard constraints such as security requirements, company rules and regulations, and management requirements. For resource usage requests that do require approval, the process engine uses these rules to filter out directly which usage requests need to enter another approval method, for example requests for resources far above the user's rank, or approvals that have a large impact on the stability of the company's resource usage; to improve approval accuracy, these special resource usage requests are routed to other approval methods. The remaining resource usage requests enter the processing flow of the random forest algorithm.
In the processing flow of the random forest algorithm, the system captures the relevant features of the applicant's usage request, such as the applicant's position, rank, work responsibilities, and other factors that influence the approval result; these features are input into the decision trees as the basis for prediction. Finally, the random forest formed by the multiple decision trees decides whether to grant the user's request.
The prediction result of the random forest is accepted only if it reaches a certain threshold. If the prediction result fails to reach the threshold, whether the approval passes cannot be determined automatically, and the approval form is transferred to another approval method. For resource usage requests that pass approval, the resource is automatically released to the user and the approval flow ends. For requests that do not pass, the flow returns to the Listen state and waits for the user to modify the information.
Resource usage requests that enter other approval methods likewise have two outcomes, pass and fail. For requests that pass, the resource is released to the user and the approval flow ends; for requests that fail, the flow returns to the Listen state and waits for the user to modify the information.
Figure 5 shows a flow chart of a training method of a random forest model according to some embodiments of the present disclosure.
As shown in Figure 5, the training method of the random forest model includes steps S210 to S240. In some embodiments, the training method is executed by a training device of the random forest model.
In step S210, a training set is obtained, where the training set includes samples of users' resource usage requests, and each sample also includes a label indicating whether the resource requested by the user was released.
For example, a usage request sample contains the resource usage request filled in by a user, together with an annotated label indicating whether the requested resource was released.
In step S220, multiple features of the user's resource usage request are determined.
For example, features such as the user's position, rank, and work responsibilities are extracted from the sample of the resource usage request submitted by the user as the multiple features of the request, and the values of these features are determined.
In step S230, for each decision tree in the random forest model, some of the multiple features are extracted as candidate features for that decision tree.
For example, for each decision tree, a portion of the features is randomly drawn to serve as candidate features for training that tree. If a sample has Y features, T features (T < Y) are randomly selected from all the features of the sample as the candidate features for one decision tree.
The random extraction of features may use random sampling with replacement, for example bagging: each time a feature is drawn, it is put back before the next draw, rather than drawing, say, 10 features at once. Sampling with replacement makes the probability of each item being drawn follow a uniform distribution.
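With-replacement sampling as described here can be sketched in a few lines; the feature names below are hypothetical:

```python
import random

def sample_with_replacement(pool, t, seed=0):
    """Draw t items one at a time, returning each to the pool after drawing,
    so every item has the same probability on every draw."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(t)]

features = ["type", "spec", "quantity", "permission", "reason", "rank"]
print(sample_with_replacement(features, 3))
```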
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes: for each decision tree, drawing multiple samples from the training set as the training set of that decision tree; and training the decision tree according to the values, in its training set, of the candidate features corresponding to that tree, and the labels indicating whether the requested resources were released.
For example, similar to the random sampling of features, random sampling with replacement is also used to draw samples from the training set. If the training set has X samples, S samples (S < X) are randomly drawn with replacement from the data set as the training set of one decision tree. By drawing a training set for each decision tree, the training set is divided into multiple subsets; each decision tree is constructed using one subset as its training set, and the multiple trained decision trees finally form a forest.
When the random forest model of the present disclosure is trained, both rows and columns (samples and features) are drawn at random, so the entire data table can be genuinely split at random into multiple parts, with each decision tree using one part. As long as there are enough decision trees, some decision tree will be able to extract the value of the data set to the greatest extent, thereby improving the accuracy of resource approval by the random forest model.
In step S240, each decision tree is trained according to the values of its candidate features and the labels of the samples.
The training method of a single decision tree is described below.
In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes: taking the root node of the decision tree as the current node and, according to the training set, selecting the feature corresponding to the root node from the candidate features; determining the training sets corresponding to the child nodes of the current node according to the values, for the samples in the training set corresponding to the current node, of the feature corresponding to the current node, and the labels of those samples; selecting the feature corresponding to a child node of the current node from the remaining candidate features, according to the values of the feature corresponding to the current node and the labels for the samples in that child node's training set; and taking each child node in turn as the new current node and repeating the steps of determining the training sets of the current node's child nodes and selecting the features corresponding to those child nodes from the remaining candidate features, until a stopping condition is reached.
In some embodiments, the child nodes of the current node include a first child node and a second child node. Determining the training sets corresponding to the child nodes of the current node includes: according to the values, for the samples in the current node's training set, of the feature corresponding to the current node, and the labels of those samples, selecting one value from the value range of that feature as the split point that divides the training set of the first child node from the training set of the second child node; and, according to this split point, judging whether each sample in the current node's training set is assigned to the training set of the first child node or that of the second child node.
For example, a decision tree model is trained on the S sampled samples using the T extracted features, starting from the root node by first selecting the feature corresponding to the root node. Take a CART (classification and regression tree) classification tree as an example: CART is a binary tree that splits downward from the root, that is, each of its nodes offers only two choices, "yes" and "no". Through continual splitting, the feature space is divided into a finite number of cells, and the predicted probability distribution is determined on these cells.
The Gini index is used to measure the importance of a feature; it represents the impurity of the model, and the smaller the Gini index, the lower the impurity and the better the feature. The purity of a data set D can be measured by its Gini value. Assuming the set contains K classes of samples, the Gini value is calculated as follows:
Gini(D) = 1 - Σ_{k=1}^{K} ( |C_k| / |D| )^2
Here |C_k| is the number of samples whose label belongs to class k, |D| is the number of samples in the training set D, and |C_k|/|D| is the probability that a sample's label belongs to class k. Gini(D) reflects the probability that two samples drawn at random from the data set D have inconsistent class labels; therefore, the smaller Gini(D) is, the higher the purity of data set D.
Next, the Gini index of data set D is computed for each value of each feature available at the current node. For example, the value range of feature A is first determined from the samples of the training set. Taking the current node to be the root node, whose training set is D, a value a is selected from the value range of A, and the training set D is divided into two parts according to whether feature A takes the value a: when the value of feature A of a sample equals a, the sample is assigned to training set D1, otherwise to training set D2. The Gini index of feature A with split point a on data set D is calculated as follows:
Gini(D, A=a) = ( |D1| / |D| ) Gini(D1) + ( |D2| / |D| ) Gini(D2)
where Gini(D1) is the Gini value of data set D1, and Gini(D2) that of D2.
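The two Gini formulas translate directly into code. The following is a small self-contained sketch of the computation (the labels "approve"/"reject" are hypothetical):

```python
from collections import Counter

def gini(labels):
    """Gini value of a label list: Gini(D) = 1 - sum_k (|C_k|/|D|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, a):
    """Gini index of splitting D on feature value a:
    Gini(D, A=a) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    d1 = [y for x, y in zip(values, labels) if x == a]
    d2 = [y for x, y in zip(values, labels) if x != a]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(gini(["approve"] * 4))        # 0.0: a pure set
print(gini(["approve", "reject"]))  # 0.5: maximally mixed for two classes
```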
Among the computed Gini indexes for each value of each feature, the feature with the smallest Gini index is selected as the feature corresponding to the current node, and the value at which that feature's Gini index is smallest is taken as the split point dividing the training set of the current node's first child node from that of its second child node.
When determining the split point, the decision tree can handle both continuous and discrete values. For example, for continuous values, suppose the continuous feature A takes m values over the m samples, arranged from smallest to largest. CART then takes the mean of each pair of adjacent sample values as a candidate split point, giving m-1 split points in total, and computes the Gini index with each of these m-1 points used as a binary split point. The point with the smallest Gini index is selected as the split point of the continuous feature. For instance, if the point with the smallest Gini index is a, then values smaller than a form class 1 and values larger than a form class 2, thereby discretizing the continuous feature.
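The m-1 candidate split points for a continuous feature, the midpoints of adjacent sorted values, can be sketched as:

```python
def candidate_splits(values):
    """Midpoints of adjacent sorted values: m values yield m-1 candidates."""
    v = sorted(values)
    return [(v[i] + v[i + 1]) / 2 for i in range(len(v) - 1)]

print(candidate_splits([3, 1, 2, 4]))  # [1.5, 2.5, 3.5]
```

Each candidate would then be scored with the Gini index, and the smallest one chosen as the split point.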
For discrete values, CART uses repeated binary splitting. CART divides the values of feature A into the cases (a1, a2 a3), (a1 a2, a3), and (a2, a1 a3), finds the combination with the smallest Gini index, for example (a2, a1 a3), and then creates two binary tree branches: one branch holds the samples corresponding to a2, and the other holds the samples corresponding to a1 and a3. Because this split does not completely separate the values of feature A, feature A may participate in splits again at later nodes.
After the split point dividing the training sets of the current node's first and second child nodes is determined, each sample in the current node's training set is assigned, according to the split point, to the first child node or the second child node, thereby generating the training sets of the two child nodes.
Each child node is then taken as the current node, and the above steps of determining the current node's feature from its training set and determining the child nodes' training sets from that feature are repeated until a stopping condition is reached, at which point the decision subtree is returned and recursion stops at the current node; finally, the entire decision tree is built.
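The recursive construction described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: it handles only discrete feature values, labels leaves with the majority class, and removes each chosen feature from the candidates of its subtree, matching the "remaining candidate features" wording above:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, feat, val):
    """Weighted Gini of splitting on rows[feat] == val."""
    d1 = [y for r, y in zip(rows, labels) if r[feat] == val]
    d2 = [y for r, y in zip(rows, labels) if r[feat] != val]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def build_tree(rows, labels, features, min_samples=2):
    # stopping conditions: pure node, too few samples, no candidates left
    if len(set(labels)) == 1 or len(rows) < min_samples or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    # choose the (feature, value) pair with the smallest weighted Gini
    feat, val = min(((f, r[f]) for r in rows for f in features),
                    key=lambda fv: split_gini(rows, labels, fv[0], fv[1]))
    yes = [(r, y) for r, y in zip(rows, labels) if r[feat] == val]
    no = [(r, y) for r, y in zip(rows, labels) if r[feat] != val]
    if not yes or not no:
        return Counter(labels).most_common(1)[0][0]
    rest = [f for f in features if f != feat]
    return {"feature": feat, "value": val,
            "yes": build_tree([r for r, _ in yes], [y for _, y in yes], rest),
            "no": build_tree([r for r, _ in no], [y for _, y in no], rest)}

rows = [{"group": "high"}, {"group": "high"}, {"group": "low"}, {"group": "low"}]
tree = build_tree(rows, ["release", "release", "deny", "deny"], ["group"])
print(tree)
```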
The above method can also measure the interaction between different features. For example, if within the same decision tree a training set split into two child nodes by some feature M is easier to split further on feature J, then features M and J interact.
在一些实施例中,截止条件包括不存在剩余的候选特征、与当前节点对应的训练集中样本的数量小于第二预设阈值,以及与当前节点对应的训练集的基尼系数小于第三预设阈 值的至少一个。In some embodiments, the cutoff conditions include that there are no remaining candidate features, the number of samples in the training set corresponding to the current node is less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node is less than a third preset threshold. value of at least one.
例如,如果D的样本个数小于阈值,或已经没有特征可供选择,或当前节点的训练集的基尼系数小于阈值,则返回决策树子树,当前节点停止递归。For example, if the number of samples in D is less than the threshold, or there are no features to choose from, or the Gini coefficient of the training set of the current node is less than the threshold, the decision tree subtree is returned and the current node stops recursing.
本公开根据基尼系数自动判断用户对资源的使用请求的特征的重要程度,此外,能够衡量不同特征间的交互性,构建决策树并生成资源审批结果,无需降维,无需做特征选择,提高了对资源的使用请求的审批的准确度和效率。The present disclosure automatically determines the importance of the features of a user's resource usage request based on the Gini coefficient. In addition, it can measure the interaction between different features, build decision trees, and generate resource approval results without dimensionality reduction or feature selection, improving the accuracy and efficiency of approving resource usage requests.
按照上述方法,构建多个决策树,最终构成随机森林模型。According to the above method, multiple decision trees are constructed to finally form a random forest model.
在一些实施例中,确定用户对资源的使用请求的多个特征,包括在用户对资源的使用请求的样本缺失特征的值的情况下,计算该样本和其他样本在决策树中经过节点的路径的相似度;根据样本和其他样本在决策树中经过节点的路径的相似度,确定该样本缺失的特征的值。In some embodiments, determining the multiple features of the user's resource usage request includes: when a sample of the user's resource usage request is missing the value of a feature, computing the similarity between the path that this sample takes through the nodes of the decision tree and the paths taken by other samples; and determining the value of the sample's missing feature based on that path similarity.
例如,首先,给样本中的缺失值预设一些估计值:对于数值型变量,选择其余数据的中位数作为当前缺失值的估计值;对于类别型变量,选择其余数据的众数。然后,根据估计的数值,建立随机森林,把所有的数据放进随机森林里面跑一遍。记录每一组数据在决策树中一步一步分类的路径,判断哪组数据和缺失数据路径最相似,引入一个相似度矩阵,来记录数据之间的相似度,比如有N组数据,相似度矩阵大小就是N*N。如果缺失值是数值型变量,通过加权平均得到新的估计值;如果是类别型变量,通过加权投票得到新的估计值。如此迭代,直到得到稳定的估计值。For example, first preset estimates for the missing values in the samples: for a numeric variable, use the median of the remaining data as the estimate of the current missing value; for a categorical variable, use the mode. Then, based on these estimates, build a random forest and run all the data through it. Record the step-by-step classification path of each group of data in the decision trees and determine which group of data has the path most similar to that of the missing data; to this end, a similarity matrix is introduced to record the similarity between data — with N groups of data, the similarity matrix has size N*N. If the missing value is a numeric variable, a new estimate is obtained by weighted averaging; if it is a categorical variable, a new estimate is obtained by weighted voting. This iterates until stable estimates are obtained.
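The proximity-and-update step of this imputation loop can be sketched as follows. This is an illustrative fragment under simplifying assumptions, not the disclosure's code: `leaf_ids[i][t]` is assumed to record which leaf sample i reaches in tree t, and only the numeric (weighted-average) update is shown; a categorical feature would use a proximity-weighted vote instead.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[i][t] = index of the leaf that sample i reaches in tree t.
    Proximity of two samples = fraction of trees in which they land in
    the same leaf (an N*N similarity matrix for N samples)."""
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    return [[sum(leaf_ids[i][t] == leaf_ids[j][t] for t in range(n_trees)) / n_trees
             for j in range(n)] for i in range(n)]

def impute_numeric(values, missing, prox):
    """One update step: replace each missing numeric value with the
    proximity-weighted average of the observed values."""
    out = list(values)
    for i, is_missing in enumerate(missing):
        if not is_missing:
            continue
        pairs = [(prox[i][j], values[j])
                 for j, m in enumerate(missing) if not m]
        total = sum(w for w, _ in pairs)
        if total > 0:
            out[i] = sum(w * v for w, v in pairs) / total
    return out
```

In the full loop, the forest is rebuilt on the updated data and the two functions are reapplied until the estimates stabilize.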
通过构造多棵决策树对缺失值进行填补,使得填补得到的数据具有随机性和不确定性,更能反映出这些未知数据的真实分布。此外,由于在构造决策树过程中,每个节点使用的都是随机的部分特征而不是训练集的全部特征,所以能很好地应用到高维数据的填补。因此,本公开能够减少缺失值对资源审批的干扰,提高对资源的使用请求的审批的准确度。Filling in missing values by constructing multiple decision trees gives the imputed data randomness and uncertainty, so it better reflects the true distribution of the unknown data. In addition, since each node uses a random subset of features rather than all features of the training set during tree construction, the method applies well to imputing high-dimensional data. Therefore, the present disclosure can reduce the interference of missing values with resource approval and improve the accuracy of approving resource usage requests.
在一些实施例中,根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树包括对决策树进行剪枝。In some embodiments, training each decision tree includes pruning the decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
图6示出了根据本公开一些实施例的对决策树进行剪枝的示意图。Figure 6 shows a schematic diagram of pruning a decision tree according to some embodiments of the present disclosure.
如图6所示,采用后剪枝法,即先生成决策树,然后在已经生成的决策树的基础上,产生所有剪枝后的CART树,然后使用交叉验证检验剪枝的效果,选择泛化能力最好的剪枝策略。As shown in Figure 6, a post-pruning method is used: a decision tree is generated first; then, based on the generated tree, all pruned CART trees are produced; cross-validation is then used to test the effect of pruning, and the pruning strategy with the best generalization ability is selected.
对于位于节点t的任意一颗子树Tt,如果没有剪枝,则子树Tt的损失函数是: For any subtree T t located at node t, if there is no pruning, the loss function of the subtree T t is:
Cα(Tt)=C(Tt)+α|Tt|C α (T t )=C (T t )+α|T t |
如果将其剪掉,仅保留根节点,则根节点的损失函数如下:If it is cut off and only the root node is retained, the loss function of the root node is as follows:
Cα(T)=C(T)+αC α (T) = C (T) + α
其中,α为正则化参数(和线性回归的正则化一样),C(Tt)为验证数据的预测误差(即验证数据的基尼系数),|Tt|是子树Tt的叶子节点数量。Here, α is the regularization parameter (as in the regularization of linear regression), C(T t ) is the prediction error on the validation data (i.e., the Gini coefficient of the validation data), and |T t | is the number of leaf nodes of the subtree T t .
按照损失函数最小原则,如果满足下式,则需要对子树Tt进行剪枝:

α ≥ (C(T) - C(Tt)) / (|Tt| - 1)

According to the principle of minimizing the loss function, the subtree T t needs to be pruned if the following holds:

α ≥ (C(T) - C(T t )) / (|T t | - 1)
通过剪枝,能够砍掉决策树的冗余部分,避免对训练集过拟合,提升泛化能力。Through pruning, redundant parts of the decision tree can be cut off to avoid overfitting to the training set and improve generalization capabilities.
在一些实施例中,根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,包括并行训练多个决策树。例如,将随机森林的多个决策树并行地、独立地训练,从而可以提高随机森林模型的训练速度。In some embodiments, training each decision tree according to the values of its candidate features and the labels of the samples includes training multiple decision trees in parallel. For example, the multiple decision trees of the random forest are trained in parallel and independently, which can increase the training speed of the random forest model.
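Because the trees are fitted independently on their own bootstrap samples, the training loop parallelizes naturally. The sketch below is illustrative only: `train_one_tree` is a placeholder for a real CART fit, and a thread pool is used for simplicity (a process pool would be the usual choice for CPU-bound tree fitting).

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_one_tree(args):
    """Placeholder for fitting one CART tree on a bootstrap sample;
    here it only draws the bootstrap sample and records its size."""
    samples, seed = args
    rng = random.Random(seed)
    bootstrap = [rng.choice(samples) for _ in range(len(samples))]
    return {"n_samples": len(bootstrap)}  # a real implementation returns the tree

def train_forest(samples, n_trees=10, workers=4):
    """Fit the forest's trees in parallel; each tree gets its own seed
    so the bootstrap samples differ."""
    jobs = [(samples, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_one_tree, jobs))
```

Since the trees never communicate during training, the result is identical to sequential training, only faster on multi-core hardware.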
图7示出根据本公开一些实施例的资源审批装置的框图。Figure 7 shows a block diagram of a resource approval device according to some embodiments of the present disclosure.
如图7所示,资源审批装置7包括获取模块71、第一确定模块72、选择模块73、预测模块74、第二确定模块75。As shown in FIG. 7 , the resource approval device 7 includes an acquisition module 71 , a first determination module 72 , a selection module 73 , a prediction module 74 , and a second determination module 75 .
获取模块71,被配置为获取用户对资源的使用请求,例如执行如图1所示的步骤S110。The obtaining module 71 is configured to obtain the user's request for resource use, for example, performing step S110 as shown in Figure 1 .
第一确定模块72,被配置为确定用户对资源的使用请求的多个特征,例如执行如图1所示的步骤S120。The first determination module 72 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S120 as shown in FIG. 1 .
选择模块73,被配置为针对随机森林模型中的每个决策树,从多个特征中,选择与该决策树对应的特征,例如执行如图1所示的步骤S130。The selection module 73 is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features, for example, perform step S130 as shown in FIG. 1 .
预测模块74,被配置为根据与每个决策树对应的特征的值,预测审批结果,其中,审批结果表示是否发放用户所请求的资源,例如执行如图1所示的步骤S140。The prediction module 74 is configured to predict the approval result according to the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resource requested by the user, for example, perform step S140 as shown in FIG. 1 .
第二确定模块75,被配置为综合每个决策树的审批结果,确定是否发放用户所请求的资源,例如执行如图1所示的步骤S150。The second determination module 75 is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user, for example, performing step S150 as shown in FIG. 1 .
图8示出了根据本公开一些实施例的随机森林模型的训练装置的框图。Figure 8 shows a block diagram of a training device for a random forest model according to some embodiments of the present disclosure.
如图8所示,随机森林模型的训练装置包括获取模块81、确定模块82、抽取模块83、训练模块84。As shown in Figure 8, the training device of the random forest model includes an acquisition module 81, a determination module 82, an extraction module 83, and a training module 84.
获取模块81,被配置为获取训练集,其中,训练集包括用户对资源的使用请求的样本,样本还包括表示是否发放用户所请求的资源的标签,例如执行如图5所示的步骤S210。 The acquisition module 81 is configured to acquire a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to issue the resources requested by the user. For example, step S210 shown in Figure 5 is performed.
确定模块82,被配置为确定用户对资源的使用请求的多个特征,例如执行如图5所示的步骤S220。The determination module 82 is configured to determine multiple characteristics of the user's resource usage request, for example, perform step S220 as shown in FIG. 5 .
抽取模块83,被配置为针对随机森林模型中的每个决策树,从多个特征中抽取部分特征,作为该决策树的候选特征,例如执行如图5所示的步骤S230。The extraction module 83 is configured to, for each decision tree in the random forest model, extract some features from multiple features as candidate features of the decision tree, for example, perform step S230 as shown in FIG. 5 .
训练模块84,被配置为根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,例如执行如图5所示的步骤S240。The training module 84 is configured to train each decision tree according to the value of the candidate feature of each decision tree and the label of the sample, for example, perform step S240 as shown in FIG. 5 .
图9示出根据本公开另一些实施例的电子设备的框图。Figure 9 shows a block diagram of an electronic device according to other embodiments of the present disclosure.
如图9所示,电子设备9包括存储器91;以及耦接至该存储器91的处理器92,存储器91用于存储执行资源审批方法或随机森林模型的训练方法对应实施例的指令。处理器92被配置为基于存储在存储器91中的指令,执行本公开中任意一些实施例中的资源审批方法或随机森林模型的训练方法。As shown in Figure 9, the electronic device 9 includes a memory 91; and a processor 92 coupled to the memory 91. The memory 91 is used to store instructions for executing corresponding embodiments of the resource approval method or the training method of the random forest model. The processor 92 is configured to execute the resource approval method or the random forest model training method in any embodiment of the present disclosure based on instructions stored in the memory 91 .
图10示出用于实现本公开一些实施例的计算机系统的框图。Figure 10 illustrates a block diagram of a computer system for implementing some embodiments of the present disclosure.
如图10所示,计算机系统100可以通用计算设备的形式表现。计算机系统100包括存储器1010、处理器1020和连接不同系统组件的总线1000。As shown in Figure 10, computer system 100 may be embodied in the form of a general purpose computing device. Computer system 100 includes memory 1010, a processor 1020, and a bus 1000 that connects various system components.
存储器1010例如可以包括系统存储器、非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序(Boot Loader)以及其他程序等。系统存储器可以包括易失性存储介质,例如随机存取存储器(RAM)和/或高速缓存存储器。非易失性存储介质例如存储有执行本公开中任意一些实施例中的资源审批方法或随机森林模型的训练方法中的至少一种的对应实施例的指令。非易失性存储介质包括但不限于磁盘存储器、光学存储器、闪存等。The memory 1010 may include, for example, system memory, non-volatile storage media, and the like. System memory stores, for example, operating systems, applications, boot loaders, and other programs. System memory may include volatile storage media such as random access memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one of the resource approval methods or the random forest model training methods in any embodiments of the present disclosure. Non-volatile storage media includes but is not limited to disk storage, optical storage, flash memory, etc.
处理器1020可以用通用处理器、数字信号处理器(DSP)、应用专用集成电路(ASIC)、现场可编程门阵列(FPGA)或其它可编程逻辑设备、分立门或晶体管等分立硬件组件方式来实现。相应地,诸如判断模块和确定模块的每个模块,可以通过中央处理器(CPU)运行存储器中执行相应步骤的指令来实现,也可以通过执行相应步骤的专用电路来实现。The processor 1020 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors. Correspondingly, each module, such as the judgment module and the determination module, can be implemented by a central processing unit (CPU) executing instructions in memory that perform the corresponding steps, or by a dedicated circuit that performs the corresponding steps.
总线1000可以使用多种总线结构中的任意总线结构。例如,总线结构包括但不限于工业标准体系结构(ISA)总线、微通道体系结构(MCA)总线、外围组件互连(PCI)总线。Bus 1000 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
计算机系统100还可以包括输入输出接口1030、网络接口1040、存储接口1050等。这些接口1030、1040、1050以及存储器1010和处理器1020之间可以通过总线1000连接。 输入输出接口1030可以为显示器、鼠标、键盘等输入输出设备提供连接接口。网络接口1040为各种联网设备提供连接接口。存储接口1050为软盘、U盘、SD卡等外部存储设备提供连接接口。The computer system 100 may also include an input/output interface 1030, a network interface 1040, a storage interface 1050, and the like. These interfaces 1030, 1040, 1050, the memory 1010 and the processor 1020 may be connected through a bus 1000. The input and output interface 1030 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard. Network interface 1040 provides connection interfaces for various networked devices. The storage interface 1050 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.
这里,参照根据本公开实施例的方法、装置和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个框以及各框的组合,都可以由计算机可读程序指令实现。Various aspects of the disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.
这些计算机可读程序指令可提供到通用计算机、专用计算机或其他可编程装置的处理器,以产生一个机器,使得通过处理器执行指令产生实现在流程图和/或框图中一个或多个框中指定的功能的装置。These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable device to produce a machine, such that the instructions, when executed by the processor, produce an apparatus that implements the functions specified in one or more blocks of the flowcharts and/or block diagrams.
这些计算机可读程序指令也可存储在计算机可读存储器中,这些指令使得计算机以特定方式工作,从而产生一个制造品,包括实现在流程图和/或框图中一个或多个框中指定的功能的指令。These computer-readable program instructions may also be stored in a computer-readable memory; these instructions cause the computer to operate in a specific manner, thereby producing an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。The disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.
通过上述实施例中的资源审批方法、随机森林模型的训练方法及装置、计算机可存储介质,提高了资源审批的效率和准确率。Through the resource approval method, random forest model training method and device, and computer storage medium in the above embodiments, the efficiency and accuracy of resource approval are improved.
至此,已经详细描述了根据本公开的资源审批方法、随机森林模型的训练方法及装置、计算机可存储介质。为了避免遮蔽本公开的构思,没有描述本领域所公知的一些细节。本领域技术人员根据上面的描述,完全可以明白如何实施这里公开的技术方案。 So far, the resource approval method, the training method and device of the random forest model, and the computer storage medium according to the present disclosure have been described in detail. To avoid obscuring the concepts of the present disclosure, some details that are well known in the art have not been described. Based on the above description, those skilled in the art can completely understand how to implement the technical solution disclosed here.

Claims (17)

  1. 一种资源审批方法,包括:A resource approval method including:
    获取用户对资源的使用请求;Obtain the user's request for resource use;
    确定用户对资源的使用请求的多个特征;Determine multiple characteristics of a user's request for resource usage;
    针对随机森林模型中的每个决策树,从多个特征中,选择与该决策树对应的特征;For each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features;
    根据与每个决策树对应的特征的值,预测审批结果,其中,审批结果表示是否发放用户所请求的资源;According to the value of the feature corresponding to each decision tree, the approval result is predicted, where the approval result indicates whether to release the resource requested by the user;
    综合每个决策树的审批结果,确定是否发放用户所请求的资源。Based on the approval results of each decision tree, it is determined whether to release the resources requested by the user.
  2. 根据权利要求1所述的资源审批方法,其中,所述综合每个决策树的审批结果,确定是否发放用户所请求的资源,包括:The resource approval method according to claim 1, wherein integrating the approval results of each decision tree to determine whether to release the resources requested by the user includes:
    根据生成相同的审批结果的决策树的数量占决策树的总数的比例、和第一预设阈值,确定是否发放用户所请求的资源。Based on the proportion of the number of decision trees that generate the same approval result to the total number of decision trees and the first preset threshold, it is determined whether to release the resource requested by the user.
  3. 根据权利要求2所述的资源审批方法,其中,所述根据生成相同的审批结果的决策树的数量占决策树的总数的比例、和第一预设阈值,确定是否发放用户所请求的资源,包括:The resource approval method according to claim 2, wherein determining whether to release the resource requested by the user according to the ratio of the number of decision trees that generate the same approval result to the total number of decision trees, and the first preset threshold, includes:
    在生成相同的审批结果的决策树的数量占决策树的总数的比例超过第一预设阈值的情况下,根据该审批结果,确定是否发放用户所请求的资源。When the ratio of the number of decision trees that generate the same approval result to the total number of decision trees exceeds the first preset threshold, it is determined whether to release the resource requested by the user based on the approval result.
  4. 根据权利要求3所述的资源审批方法,其中,所述根据生成相同的审批结果的决策树的数量占决策树的总数的比例、和第一预设阈值,确定是否发放用户所请求的资源,包括:The resource approval method according to claim 3, wherein determining whether to release the resource requested by the user according to the ratio of the number of decision trees that generate the same approval result to the total number of decision trees, and the first preset threshold, includes:
    在生成相同的审批结果的决策树的数量占决策树的总数的比例不超过第一预设阈值的情况下,根据多个特征,确定是否发放用户所请求的资源。When the ratio of the number of decision trees that generate the same approval result to the total number of decision trees does not exceed the first preset threshold, it is determined whether to release the resource requested by the user based on multiple characteristics.
  5. 根据权利要求1所述的资源审批方法,其中,所述用户对资源的使用请求还包括用户对资源的历史使用请求。The resource approval method according to claim 1, wherein the user's resource usage request also includes the user's historical usage request for the resource.
  6. 根据权利要求1所述的资源审批方法,其中,所述用户对资源的使用请求的特征包括:用户请求使用的资源的类型、用户请求使用的资源的规格、用户请求使用的资源的数量、用户的资源使用权限和用户请求使用资源的原因的至少一个。The resource approval method according to claim 1, wherein the features of the user's resource usage request include at least one of: the type of the resource the user requests to use, the specifications of the resource the user requests to use, the quantity of the resource the user requests to use, the user's resource usage permission, and the reason why the user requests to use the resource.
  7. 一种随机森林模型的训练方法,包括: A training method for a random forest model, including:
    获取训练集,其中,训练集包括用户对资源的使用请求的样本,样本还包括表示是否发放用户所请求的资源的标签;Obtain a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
    确定用户对资源的使用请求的多个特征;Determine multiple characteristics of a user's request for resource usage;
    针对随机森林模型中的每个决策树,从多个特征中抽取部分特征,作为该决策树的候选特征;For each decision tree in the random forest model, some features are extracted from multiple features as candidate features for the decision tree;
    根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树。Each decision tree is trained based on the values of its candidate features, as well as the labels of the samples.
  8. 根据权利要求7所述的随机森林模型的训练方法,其中,所述根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,包括:The training method of a random forest model according to claim 7, wherein said training each decision tree according to the value of the candidate feature of each decision tree and the label of the sample includes:
    将决策树的根节点作为当前节点,根据训练集,从候选特征中选择与根节点对应的特征;Use the root node of the decision tree as the current node, and select the feature corresponding to the root node from the candidate features based on the training set;
    根据与当前节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,确定与当前节点的子节点对应的训练集;Determine the training set corresponding to the child node of the current node based on the value of the characteristic of the sample in the training set corresponding to the current node and the label of the sample;
    根据与当前节点的子节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,从剩余的候选特征中选择与当前节点的子节点对应的特征;According to the value of the feature corresponding to the current node in the training set sample corresponding to the child node of the current node, and the label of the sample, select the feature corresponding to the child node of the current node from the remaining candidate features;
    将当前子节点的子节点作为当前节点,循环确定与当前节点的子节点对应的训练集、从剩余的候选特征中选择与当前节点的子节点对应的特征的步骤,直至达到截止条件。Taking the child nodes of the current child node as the current node, iterate through the steps of determining the training set corresponding to the child node of the current node and selecting features corresponding to the child nodes of the current node from the remaining candidate features until the cutoff condition is reached.
  9. 根据权利要求8所述的随机森林模型的训练方法,其中,所述当前节点的子节点包括当前节点的第一子节点和当前节点的第二子节点,所述根据与当前节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,确定与当前节点的子节点对应的训练集,包括:The training method of the random forest model according to claim 8, wherein the child nodes of the current node include a first child node of the current node and a second child node of the current node, and determining the training sets corresponding to the child nodes of the current node according to the values, for the samples in the training set corresponding to the current node, of the feature corresponding to the current node, and the labels of the samples, includes:
    根据与当前节点对应的训练集中样本的与当前节点对应的特征的值,以及样本的标签,从与当前节点对应的特征的取值范围中选择一个特征的值,作为划分与当前节点的第一子节点对应的训练集和与当前节点的第二子节点对应的训练集的切分点;according to the values of the feature corresponding to the current node for the samples in the training set corresponding to the current node, and the labels of the samples, selecting a value of the feature from the value range of the feature corresponding to the current node, as the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node;
    根据划分与当前节点的第一子节点对应的训练集和与当前节点的第二子节点对应的训练集的切分点,判断将与当前节点对应的训练集中的样本划分到第一子节点的训练集还是第二子节点的训练集。according to the split point that divides the training set corresponding to the first child node of the current node from the training set corresponding to the second child node of the current node, determining whether each sample in the training set corresponding to the current node is assigned to the training set of the first child node or the training set of the second child node.
  10. 根据权利要求8所述的随机森林模型的训练方法,其中,所述截止条件包括不存在剩余的候选特征、与当前节点对应的训练集中样本的数量小于第二预设阈值,以及与当前节点对应的训练集的基尼系数小于第三预设阈值的至少一个。The training method of the random forest model according to claim 8, wherein the cutoff condition includes at least one of: no candidate features remaining, the number of samples in the training set corresponding to the current node being less than a second preset threshold, and the Gini coefficient of the training set corresponding to the current node being less than a third preset threshold.
  11. 根据权利要求7所述的随机森林模型的训练方法,其中,所述根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树,包括:The training method of a random forest model according to claim 7, wherein said training each decision tree according to the value of the candidate feature of each decision tree and the label of the sample includes:
    针对每个决策树,从训练集中抽取多个样本,作为该决策树的训练集;For each decision tree, multiple samples are extracted from the training set as the training set for the decision tree;
    根据决策树的训练集中样本的与该决策树对应的候选特征的值、和表示是否发放用户所请求的资源的标签,训练决策树。The decision tree is trained based on the values of the candidate features corresponding to the decision tree in the samples in the training set of the decision tree and the labels indicating whether to release the resources requested by the user.
  12. 根据权利要求7所述的随机森林模型的训练方法,其中,所述确定用户对资源的使用请求的多个特征,包括:The training method of the random forest model according to claim 7, wherein determining the multiple features of the user's resource usage request includes:
    在用户对资源的使用请求的样本缺失特征的值的情况下,计算该样本和其他样本在决策树中经过节点的路径的相似度;in a case where a sample of the user's resource usage request is missing the value of a feature, calculating the similarity between the paths along which the sample and other samples pass through nodes in the decision tree;
    根据样本和其他样本在决策树中经过节点的路径的相似度,确定该样本缺失的特征的值。Based on the similarity of the path between the sample and other samples passing through the node in the decision tree, the value of the missing feature of the sample is determined.
  13. 一种资源审批装置,包括:A resource approval device, including:
    获取模块,被配置为获取用户对资源的使用请求;The acquisition module is configured to obtain the user's request for resource use;
    第一确定模块,被配置为确定用户对资源的使用请求的多个特征;A first determination module configured to determine a plurality of characteristics of the user's request for resource use;
    选择模块,被配置为针对随机森林模型中的每个决策树,从多个特征中,选择与该决策树对应的特征;The selection module is configured to, for each decision tree in the random forest model, select the feature corresponding to the decision tree from multiple features;
    预测模块,被配置为根据与每个决策树对应的特征的值,预测审批结果,其中,审批结果表示是否发放用户所请求的资源;The prediction module is configured to predict the approval result based on the value of the feature corresponding to each decision tree, where the approval result indicates whether to release the resources requested by the user;
    第二确定模块,被配置为综合每个决策树的审批结果,确定是否发放用户所请求的资源。The second determination module is configured to synthesize the approval results of each decision tree and determine whether to release the resources requested by the user.
  14. 一种随机森林模型的训练装置,包括:A training device for a random forest model, including:
    获取模块,被配置为获取训练集,其中,训练集包括用户对资源的使用请求的样本,样本还包括表示是否发放用户所请求的资源的标签;The acquisition module is configured to obtain a training set, where the training set includes samples of user requests for resource use, and the samples also include labels indicating whether to release the resources requested by the user;
    确定模块,被配置为确定用户对资源的使用请求的多个特征;a determining module configured to determine a plurality of characteristics of the user's request for resource usage;
    抽取模块,被配置为针对随机森林模型中的每个决策树,从多个特征中抽取部分特征,作为该决策树的候选特征;The extraction module is configured to extract some features from multiple features for each decision tree in the random forest model as candidate features for the decision tree;
    训练模块,被配置为根据每个决策树的候选特征的值,以及样本的标签,训练每个决策树。The training module is configured to train each decision tree based on the values of the candidate features of each decision tree and the labels of the samples.
  15. 一种电子设备,包括: An electronic device including:
    存储器;以及memory; and
    耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器的指令,执行根据权利要求1至6任一项所述的资源审批方法,或根据权利要求7至12任一项所述的随机森林模型的训练方法。a processor coupled to the memory, the processor being configured to, based on instructions stored in the memory, execute the resource approval method according to any one of claims 1 to 6, or the training method of the random forest model according to any one of claims 7 to 12.
  16. 一种计算机可存储介质,其上存储有计算机程序指令,该指令被处理器执行时,实现根据权利要求1至6任一项所述的资源审批方法,或根据权利要求7至12任一项所述的随机森林模型的训练方法。A computer storage medium having computer program instructions stored thereon which, when executed by a processor, implement the resource approval method according to any one of claims 1 to 6, or the training method of the random forest model according to any one of claims 7 to 12.
  17. 一种计算机程序,包括:A computer program, comprising:
    指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1至6任一项所述的资源审批方法,或根据权利要求7至12任一项所述的随机森林模型的训练方法。instructions which, when executed by a processor, cause the processor to execute the resource approval method according to any one of claims 1 to 6, or the training method of the random forest model according to any one of claims 7 to 12.
PCT/CN2023/074133 2022-07-29 2023-02-01 Resource examination and approval method and device, and random forest model training method and device WO2024021555A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210905742.3A CN115147092A (en) 2022-07-29 2022-07-29 Resource approval method and training method and device of random forest model
CN202210905742.3 2022-07-29

Publications (1)

Publication Number Publication Date
WO2024021555A1 true WO2024021555A1 (en) 2024-02-01

Family

ID=83413509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074133 WO2024021555A1 (en) 2022-07-29 2023-02-01 Resource examination and approval method and device, and random forest model training method and device

Country Status (2)

Country Link
CN (1) CN115147092A (en)
WO (1) WO2024021555A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147092A (en) * 2022-07-29 2022-10-04 京东科技信息技术有限公司 Resource approval method and training method and device of random forest model
CN115616204A (en) * 2022-12-21 2023-01-17 金发科技股份有限公司 Method and system for identifying polyethylene terephthalate reclaimed materials
CN116739719B (en) * 2023-08-14 2023-11-03 南京大数据集团有限公司 Flow configuration system and method of transaction platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264342A (en) * 2019-06-19 2019-09-20 深圳前海微众银行股份有限公司 A kind of business audit method and device based on machine learning
CN111709828A (en) * 2020-06-12 2020-09-25 中国建设银行股份有限公司 Resource processing method, device, equipment and system
WO2021077011A1 (en) * 2019-10-18 2021-04-22 Solstice Initiative, Inc. Systems and methods for shared utility accessibility
CN113505936A (en) * 2021-07-26 2021-10-15 平安信托有限责任公司 Project approval result prediction method, device, equipment and storage medium
CN115147092A (en) * 2022-07-29 2022-10-04 京东科技信息技术有限公司 Resource approval method and training method and device of random forest model


Also Published As

Publication number Publication date
CN115147092A (en) 2022-10-04

Similar Documents

Publication Publication Date Title
WO2024021555A1 (en) Resource examination and approval method and device, and random forest model training method and device
US20230126005A1 (en) Consistent filtering of machine learning data
US20210374610A1 (en) Efficient duplicate detection for machine learning data sets
JP6771751B2 (en) Risk assessment method and system
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US11182691B1 (en) Category-based sampling of machine learning data
EP3161635B1 (en) Machine learning service
US11100420B2 (en) Input processing for machine learning
WO2019218699A1 (en) Fraud transaction determining method and apparatus, computer device, and storage medium
US10891325B2 (en) Defect record classification
EP3991044A1 (en) Diagnosing &amp; triaging performance issues in large-scale services
WO2023056723A1 (en) Fault diagnosis method and apparatus, and electronic device and storage medium
US11860905B2 (en) Scanning for information according to scan objectives
US11567735B1 (en) Systems and methods for integration of multiple programming languages within a pipelined search query
CN116235158A (en) System and method for implementing automated feature engineering
US11500840B2 (en) Contrasting document-embedded structured data and generating summaries thereof
Thurow et al. Imputing missings in official statistics for general tasks–our vote for distributional accuracy
Perkins et al. Practical Data Science for Actuarial Tasks
CN112100165A (en) Traffic data processing method, system, device and medium based on quality evaluation
US11715037B2 (en) Validation of AI models using holdout sets
US20230010147A1 (en) Automated determination of accurate data schema
TWI755702B (en) Method for testing real data and computer-readable medium
CN112199603B (en) Information pushing method and device based on countermeasure network and computer equipment
US11573721B2 (en) Quality-performance optimized identification of duplicate data
CN110471962B (en) Method and system for generating active data report

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23844774

Country of ref document: EP

Kind code of ref document: A1