CN111105305A

CN111105305A - Machine learning-based accounts receivable redemption risk control method and system

Info

Publication number: CN111105305A
Application number: CN201911244056.0A
Authority: CN
Inventors: 黄林; 梁樑; 曾水保; 吴斌; 朱香友; 黄晓漫; 黄超
Original assignee: Anhui Sea Converge Financial Investment Group Co ltd
Current assignee: Anhui Sea Converge Financial Investment Group Co ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-05-05

Abstract

The invention discloses a machine learning-based accounts receivable redemption risk control method and system. The method comprises the following steps: establishing a decision tree according to a recursive method, training a plurality of decision trees with the acquired sample data, and constructing a random forest model; training the random forest model with the acquired verification data to obtain the output accuracy of the random forest model based on the verification data; optimizing calling parameters in the random forest model, and selecting the random forest model whose output accuracy is greater than a preset accuracy as the optimized prediction model, wherein the calling parameters comprise the maximum number of features that the random forest model allows a single decision tree to use, the number of subtrees built, and the minimum leaf node number; and classifying and outputting accounts receivable with the optimized random forest model so as to predict whether debtors or initial creditors will default. The redemption risk of accounts receivable at maturity is reduced, and the sustainability of accounts receivable circulation is improved.

Description

Machine learning-based accounts receivable redemption risk control method and system

Technical Field

The invention relates to the technical field of financial risk management, and in particular to a machine learning-based accounts receivable redemption risk control method and system.

Background

The most important risk of accounts receivable is credit risk: if the debtor cannot pay, the creditor cannot recover the funds on the due date. How to effectively identify the redemption risk of a debtor who has committed to pay, or of an initial creditor who has committed to repurchase, and how to take targeted risk control measures against the redemption risk of accounts receivable at maturity, is an important link in ensuring the sustained and healthy development of the accounts receivable creditor's-rights transfer business. Traditional risk control modes and techniques rely on manual investigation and judgment, such as due diligence and expert scoring, and can hardly meet the need for fast, efficient and low-cost business development.

Disclosure of Invention

In view of the technical problems described in the background, the invention provides a machine learning-based accounts receivable redemption risk control method and system, which reduce the redemption risk of accounts receivable at maturity and improve the sustainability of accounts receivable circulation.

The invention provides a machine learning-based accounts receivable redemption risk control method, which comprises the following steps:

establishing a decision tree according to a recursion method, training a plurality of decision trees through the obtained sample data, and establishing a random forest model;

training the random forest model by using the acquired verification data to obtain the output accuracy of the random forest model based on the verification data;

optimizing calling parameters in the random forest model, and selecting the random forest model with the output accuracy rate larger than the preset accuracy rate as an optimized prediction model, wherein the calling parameters comprise the maximum number of the use characteristics of a single decision tree allowed by the random forest model, the number of the built subtrees and the minimum leaf node number;

classifying and outputting accounts receivable by using the optimized random forest model so as to predict whether debtors or initial creditors default;

further, the establishing the decision tree according to the recursive method includes:

s11: sequentially traversing possible values a of each feature A in the current data set, and calculating the Gini index of each segmentation point (A, a);

s12: selecting the segmentation point with the minimum Gini index as the optimal segmentation point, and segmenting the current data set D into two subsets D₁ and D₂ through the optimal segmentation point, wherein D₁ is the set of samples in the current data set that satisfy A = a, and D₂ is the set of samples in the current data set that do not satisfy A = a;

s13: for the segmented D₁ and D₂, step S11 and step S12 are executed cyclically, respectively, until the Gini index of the sample set is smaller than the preset threshold;

s14: generating a decision tree based on the Gini index minimization criterion.

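Further, in the step of calculating the Gini index of each segmentation point (A, a), M classes are preset, and the probability that a sample in the current data set belongs to the k-th of the M classes is p_k; the Gini index of the probability distribution is then:

$$\mathrm{Gini}(p) = \sum_{k=1}^{M} p_k (1 - p_k) = 1 - \sum_{k=1}^{M} p_k^2$$

In a binary decision tree (M = 2), where p is the probability of one of the two classes, the Gini index of the distribution is:

$$\mathrm{Gini}(p) = 2p(1-p)$$

For a sample set D, the Gini index is:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{M} \left( \frac{|C_k|}{|D|} \right)^2$$

where C_k is the subset of samples in D that belong to the k-th class.

The Gini index Gini(D, A) of the current data set D under the condition of feature A = a is:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where Gini(D₁) is the Gini index of subset D₁ and Gini(D₂) is the Gini index of subset D₂.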
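The Gini calculations and the split selection of steps S11 and S12 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes each sample is a tuple of already-encoded feature values, that a split tests equality (A = a), and the function names `gini`, `gini_split` and `best_split` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a sample set: 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(rows, labels, feature, value):
    """Weighted Gini index of D after splitting on rows[i][feature] == value."""
    d1 = [y for x, y in zip(rows, labels) if x[feature] == value]   # satisfies A = a
    d2 = [y for x, y in zip(rows, labels) if x[feature] != value]   # does not
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_split(rows, labels):
    """Traverse every (feature, value) pair and return the one with minimum Gini."""
    candidates = sorted({(f, x[f]) for x in rows for f in range(len(x))})
    return min(candidates, key=lambda fv: gini_split(rows, labels, *fv))

# Toy data (made up): feature 0 fully determines the default label.
rows = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = [0, 0, 1, 1]
print(gini(labels), best_split(rows, labels))
```

On the toy data the unsplit set has a Gini index of 0.5, and splitting on feature 0 separates the two classes completely, so that split is selected.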

Further, the training of the random forest model on the acquired verification data to obtain the output accuracy of the random forest model based on the verification data includes:

s21: acquiring verification data, wherein the verification data comprises training set data and verification set data;

s22: training a classifier of the random forest model by using training set data;

s23: predicting and outputting verification set data through a classifier of the trained random forest model to obtain the output accuracy rate E1 of the random forest model based on the verification data;

s24: repeating steps S21 to S23 to obtain the remaining N-1 output accuracy rates E2, E3, ..., EN of the random forest model based on the verification data;

s25: averaging the N output accuracy rates E1, E2, ..., EN, and using the average value as the final output accuracy rate of the random forest model.
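Steps S21 to S25 amount to repeated random hold-out evaluation. The sketch below is one possible reading under stated assumptions: the 80/20 split ratio is the one given later in the detailed description, the N accuracies are averaged with equal weights, and `fit_majority` is a hypothetical stand-in classifier used only so the example runs without a machine learning library. In practice `fit` would train the random forest.

```python
import random

def accuracy(predicted, actual):
    """Share of validation records whose predicted label matches the actual one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def repeated_holdout(rows, labels, fit, n_rounds=5, train_fraction=0.8, seed=0):
    """Return (mean accuracy, [E1, ..., EN]) over n_rounds random splits."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    scores = []
    for _ in range(n_rounds):
        rng.shuffle(indices)                                   # S21: new random split
        cut = int(len(indices) * train_fraction)
        train, valid = indices[:cut], indices[cut:]
        predict = fit([rows[i] for i in train],                # S22: train classifier
                      [labels[i] for i in train])
        predicted = [predict(rows[i]) for i in valid]          # S23: predict validation set
        scores.append(accuracy(predicted, [labels[i] for i in valid]))
    return sum(scores) / len(scores), scores                   # S25: average E1..EN

def fit_majority(train_rows, train_labels):
    """Stand-in classifier (assumption): always predicts the most common label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda row: majority

mean_acc, per_round = repeated_holdout([(i,) for i in range(10)], [0] * 10, fit_majority)
```

Fixing the seed keeps the splits reproducible, which makes accuracies from different parameter settings comparable.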

Further, obtaining the output accuracy of the random forest model based on the verification data in steps S23 and S24 includes:

outputting, through the random forest model, a fulfillment redemption risk value of the debtor or initial creditor corresponding to the matured accounts receivable;

acquiring the actual fulfillment value of the debtor or initial creditor corresponding to the matured accounts receivable;

and obtaining the output accuracy of the random forest model according to the deviation of the fulfillment redemption risk value from the actual fulfillment value.

Further, in the step of classifying and outputting the accounts receivable through the optimized random forest model to obtain the fulfillment redemption risk value of the accounts receivable at maturity, when the fulfillment redemption risk value is smaller than a preset threshold, credit enhancement measures are added to improve redemption of the accounts receivable at maturity.

Further, in optimizing the calling parameters in the random forest model, a two-level nested loop traverses arrays of candidate values for the number of subtrees and the minimum leaf node number to obtain a plurality of output accuracy rates, and the group with the highest output accuracy rate is selected as the optimal calling parameters of the random forest model.

Further, in the process of selecting the optimized random forest model with the output accuracy rate larger than the preset accuracy rate to output accounts receivable in a classified mode, the random forest model with the output accuracy rate of more than 90% is selected as the optimized random forest model.

An accounts receivable redemption risk control system based on machine learning comprises a construction module, an accuracy module, a parameter optimization module and a prediction output module;

the construction module is used for establishing a decision tree according to a recursive method, training a plurality of decision trees with the acquired sample data, and constructing a random forest model;

the accuracy module is used for training the random forest model with the acquired verification data to obtain the output accuracy of the random forest model based on the verification data;

the parameter optimization module is used for optimizing calling parameters in the random forest model and selecting the random forest model whose output accuracy is greater than the preset accuracy as the optimized prediction model, the calling parameters comprising the maximum number of features that the random forest model allows a single decision tree to use, the number of subtrees built, and the minimum leaf node number;

the prediction output module uses the optimized random forest model obtained by the parameter optimization module to classify and output accounts receivable, so as to predict whether debtors or initial creditors default.

A computer-readable storage medium has stored thereon a plurality of classification programs, which are invoked by a processor to perform the following step:

classifying and outputting accounts receivable by using the optimized random forest model so as to predict whether debtors or initial creditors default.

The machine learning-based accounts receivable redemption risk control method and system provided by the invention have the following advantages: the maturity redemption risk of a debtor of the accounts receivable, or of an initial creditor who has committed to repurchase at maturity, is identified quickly, and the fulfillment redemption risk value of the debtor or initial creditor is obtained directly. The predicted fulfillment redemption risk value reveals in advance how the accounts receivable are likely to be redeemed at maturity; predicting early and adding credit enhancement measures early reduces the redemption risk of accounts receivable at maturity and improves the sustainability of accounts receivable circulation.

Drawings

FIG. 1 is a schematic diagram illustrating the steps of the machine learning-based accounts receivable redemption risk control method according to the present invention;

FIG. 2 is a schematic flow chart of the machine learning-based accounts receivable redemption risk control system according to the present invention.

Reference numerals: 1 - construction module; 2 - accuracy module; 3 - parameter optimization module; 4 - prediction output module.

Detailed Description

The present invention is described in detail below with reference to specific embodiments, and in the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, but rather should be construed as broadly as the present invention is capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Referring to FIG. 1, the machine learning-based accounts receivable redemption risk control method provided by the invention comprises the following steps:

s1: establishing a decision tree according to a recursion method, training a plurality of decision trees through the obtained sample data, and establishing a random forest model;

The random forest model is established using the RandomForestClassifier method of the Python machine learning library scikit-learn.

The features of the sample data include at least: enterprise name, unified social credit code (business registration number), enterprise domicile, region (province, city, county, district), enterprise nature, industry, economic type, enterprise scale, number of employees, industry standing, date of establishment, registered capital, credit status, total operating income, operating profit, ratio of accounts receivable to current assets, default records, and the like.

S2: training the random forest model by using the acquired verification data to obtain the output accuracy of the random forest model based on the verification data;

s3: and optimizing calling parameters in the random forest model, and selecting the random forest model with the output accuracy rate greater than the preset accuracy rate as the optimized prediction model, wherein the calling parameters comprise the maximum number of the use characteristics of a single decision tree allowed by the random forest model, the number of the built subtrees and the minimum leaf node number.

The accuracy of the model's predictions is improved by adjusting the calling parameters of the RandomForestClassifier method, and a random forest model with an output accuracy of more than 90% is selected as the optimized random forest model.

S4: and classifying and outputting the receivable accounts by using the optimized random forest model to obtain a fulfillment exchange risk value of due receivable accounts so as to predict whether debtors or initial creditors default or not.

According to steps S1 to S4, the method is mainly used to predict default by debtors or initial creditors. The prediction result reveals in advance how accounts receivable are likely to be redeemed at maturity; predicting early and adding credit enhancement measures early reduces the redemption risk of accounts receivable at maturity and improves the sustainability of accounts receivable circulation. Specifically, the method quickly identifies the maturity redemption risk of a debtor of the accounts receivable, or of an initial creditor who has committed to repurchase at maturity, and directly yields the fulfillment redemption risk value of that debtor or initial creditor. When the fulfillment redemption risk value is greater than a preset threshold, the redemption risk of the debtor or initial creditor is low and the accounts receivable can be redeemed at maturity. When the fulfillment redemption risk value is smaller than the preset threshold, the debtor or initial creditor presents a redemption risk, and corresponding credit enhancement measures can be required in advance to strengthen the ability to perform at maturity and avoid the risk. The method thereby assists the risk control process of the accounts receivable creditor's-rights management service institution in terms of cost, efficiency and accuracy, and ensures the sustained and healthy development of the accounts receivable creditor's-rights transfer business.

In the random forest model, input values must be integers or floating-point numbers. Data must therefore be preprocessed before being input to the random forest model, converting character strings to integers or floating-point numbers so that the model can accept them.
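The string-to-number preprocessing described above could look like the following sketch. The source does not say which encoding is used, so simple integer codes (label encoding) are an assumption here, and the field names `industry` and `registered_capital` are hypothetical examples drawn loosely from the feature list.

```python
def encode_column(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes

def preprocess(records, string_fields, numeric_fields):
    """Return feature rows that contain only ints and floats."""
    columns = []
    for field in string_fields:
        encoded, _ = encode_column([r[field] for r in records])
        columns.append(encoded)
    for field in numeric_fields:
        columns.append([float(r[field]) for r in records])   # numeric text -> float
    return [list(row) for row in zip(*columns)]

# Made-up records for illustration only.
records = [
    {"industry": "retail", "registered_capital": "500"},
    {"industry": "steel", "registered_capital": "1200.5"},
    {"industry": "retail", "registered_capital": "80"},
]
features = preprocess(records, ["industry"], ["registered_capital"])
```

Integer codes impose an arbitrary order on categories; tree-based models tolerate this reasonably well, but one-hot encoding is a common alternative when that order is a concern.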

Further, in step S1 of establishing the decision tree according to the recursive method, x1, x2, ..., xn denote n attributes of the accounts receivable debtors and initial creditors, and the n-dimensional space is recursively divided into non-overlapping rectangles. The dividing step includes:

M classes are preset. In this application there are two classes, indicating whether a default occurs: yes (1) and no (0). The default record of each debtor or initial creditor can belong to only one of these cases, default or no default, so M = 2. The probability that a sample in the current data set belongs to the k-th of the M classes is p_k, and the Gini index of the probability distribution is:

$$\mathrm{Gini}(p) = \sum_{k=1}^{M} p_k (1 - p_k) = 1 - \sum_{k=1}^{M} p_k^2$$

In a binary decision tree, where p is the probability of one of the two classes, the Gini index of the distribution is:

$$\mathrm{Gini}(p) = 2p(1-p)$$

The Gini index Gini(D, A) of the current data set D under the condition of feature A = a is:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where Gini(D₁) is the Gini index of subset D₁ and represents the uncertainty of the set D₁, and Gini(D₂) is the Gini index of subset D₂ and represents the uncertainty of the set D₂. Gini(D, A) represents the uncertainty of the set D after segmentation by A = a. The larger the Gini index, the greater the uncertainty; the smaller the Gini index, the smaller the uncertainty and the more thorough and clean the data segmentation.

The stop condition for the loop over steps S11 and S12 may be that the Gini index of the sample set is smaller than a preset threshold, that the number of samples is smaller than a preset threshold of 30, or that no more features are available for segmentation.
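The recursion of steps S11 to S14 with these three stop conditions can be sketched as follows. This is an illustrative reading, not the patent's code: splits test equality (A = a) on already-encoded features, the tree is stored as nested dictionaries, the Gini threshold of 0.05 is a hypothetical value, and only the minimum sample count of 30 comes from the text.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, f, v):
    left = [y for x, y in zip(rows, labels) if x[f] == v]
    right = [y for x, y in zip(rows, labels) if x[f] != v]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def build_tree(rows, labels, gini_threshold=0.05, min_samples=30):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: node is pure enough, or too few samples remain.
    if gini(labels) < gini_threshold or len(labels) < min_samples:
        return {"leaf": majority}
    # S11: candidate points (A, a) that leave both subsets non-empty.
    candidates = sorted({(f, x[f]) for x in rows for f in range(len(x))})
    candidates = [(f, v) for f, v in candidates
                  if 0 < sum(x[f] == v for x in rows) < len(rows)]
    if not candidates:                       # Stop: nothing left to split on.
        return {"leaf": majority}
    # S12: choose the point with the minimum Gini index and split D into D1, D2.
    f, v = min(candidates, key=lambda fv: split_gini(rows, labels, *fv))
    d1 = [(x, y) for x, y in zip(rows, labels) if x[f] == v]
    d2 = [(x, y) for x, y in zip(rows, labels) if x[f] != v]
    # S13: recurse on both subsets.
    return {
        "feature": f, "value": v,
        "left": build_tree([x for x, _ in d1], [y for _, y in d1], gini_threshold, min_samples),
        "right": build_tree([x for x, _ in d2], [y for _, y in d2], gini_threshold, min_samples),
    }

def predict(tree, row):
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] == tree["value"] else tree["right"]
    return tree["leaf"]

# Toy data (made up): 40 samples, feature 0 determines the label.
rows = [(0, 1), (0, 0), (1, 1), (1, 0)] * 10
labels = [0, 0, 1, 1] * 10
tree = build_tree(rows, labels)
```

Because every accepted split leaves both subsets non-empty, each recursive call works on strictly fewer samples and the recursion terminates.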

According to the steps S11 to S14, n decision trees can be constructed, and each decision tree can grow to the maximum extent on the premise of not pruning; and finally, forming a random forest by the n generated decision trees, and constructing a random forest model through data training.
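How n trees combine into a forest is not spelled out in the text. The sketch below assumes the standard random forest construction, in which each tree is trained on a bootstrap sample and the forest predicts by majority vote; per-tree random feature selection is omitted for brevity. `train_stump` is a hypothetical stand-in for the tree builder so that the example is self-contained.

```python
import random
from collections import Counter

def train_forest(rows, labels, train_tree, n_trees=10, seed=0):
    """Train n_trees models, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(train_tree([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

def train_stump(rows, labels):
    """Stand-in tree (assumption): majority label per value of the first feature."""
    overall = Counter(labels).most_common(1)[0][0]
    by_value = {}
    for x, y in zip(rows, labels):
        by_value.setdefault(x[0], Counter())[y] += 1
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda row: table.get(row[0], overall)

# Toy data (made up): the single feature determines the label.
rows = [(0,), (0,), (1,), (1,)] * 5
labels = [0, 0, 1, 1] * 5
forest = train_forest(rows, labels, train_stump, n_trees=7)
```

An odd number of trees avoids tied votes in the two-class (default / no default) setting.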

Further, at step S2: the training of the random forest model by using the acquired verification data to obtain the output accuracy of the random forest model based on the verification data comprises the following steps:

randomly selecting 80% of the verification data as training set data, and using the rest 20% of the verification data as verification set data.

The output accuracy of the random forest model is calculated by the following formula:

$$\mathrm{Accuracy} = \frac{TP}{TP + TN}$$

where TP is the number of correctly predicted records, TN is the number of wrongly predicted records, and TP + TN is the total number of predicted records.
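As a worked example of this accuracy measure, the sketch below follows the document's own definitions, in which TP counts correctly predicted records and TN counts wrongly predicted ones (this differs from the usual confusion-matrix meaning of those symbols). The function name and the sample labels are made up for illustration.

```python
def output_accuracy(predicted, actual):
    """Accuracy = TP / (TP + TN), using the definitions given in the text."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(p == a for p, a in zip(predicted, actual))   # correctly predicted records
    tn = len(actual) - tp                                  # wrongly predicted records
    return tp / (tp + tn)

# Five validation records, one of which is predicted wrongly: 4 / (4 + 1).
acc = output_accuracy([1, 0, 0, 1, 0], [1, 0, 1, 1, 0])
```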

The obtaining of the output accuracy of the random forest model based on the verification data in steps S23 and S24 includes:

outputting a fulfillment exchange risk value of the debtor or the initial creditor corresponding to the expired receivable account through a random forest model; and obtaining the output accuracy of the random forest model according to the deviation of the fulfillment redemption risk value from the fulfillment value.

And comparing the result output by the random forest model with the actual result in the operation process to obtain the output accuracy.

Preferably, in step S4, the accounts receivable are classified and output through the optimized random forest model to obtain the fulfillment redemption risk value of the accounts receivable at maturity; when the fulfillment redemption risk value is smaller than a preset threshold, credit enhancement measures are added to improve redemption of the accounts receivable at maturity.

The credit-increasing measures include, but are not limited to, the following ways:

A. paying a certain proportion of performance bond

B. Core enterprise commitment to pay

C. The guarantee company guarantees

D. Third party warranty

E. Insurance company insurance

F. Effective asset pledge

G. Bank preauthorization credit

H. Non-gold company transfer

By combining one or more credit enhancement measures, the risk that the debtor cannot pay, or that the initial creditor cannot repurchase, after the accounts receivable mature can be effectively prevented, so that full settlement is obtained promptly once the accounts receivable creditor's rights mature. Which specific measures to adopt can be decided by the accounts receivable creditor's-rights management service institution, combining the system's recommendation with manual judgment.

At step S3: in optimizing the calling parameters in the random forest model, a two-level nested loop traverses arrays of candidate values for the number of subtrees and the minimum leaf node number to obtain a plurality of output accuracy rates, and the group with the highest output accuracy rate is selected as the optimal calling parameters of the random forest model. One specific example is as follows:
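The threshold rule for credit enhancement can be expressed as a small decision function. This is only a sketch: the threshold of 0.5 is a hypothetical placeholder (the text says "preset threshold" without giving a value), the measure names are abbreviated from the list above, and the final choice of measures remains with the service institution as the text states.

```python
CREDIT_ENHANCEMENT_MEASURES = [
    "performance bond",
    "core enterprise commitment to pay",
    "guarantee company guarantee",
    "third-party guarantee",
    "insurance company insurance",
    "asset pledge",
]

def review_receivable(risk_value, threshold=0.5):
    """Flag a receivable for credit enhancement when its risk value is below the threshold."""
    needs_enhancement = risk_value < threshold
    return {
        "needs_credit_enhancement": needs_enhancement,
        # Candidate measures are only a recommendation for manual review.
        "candidate_measures": list(CREDIT_ENHANCEMENT_MEASURES) if needs_enhancement else [],
    }

low = review_receivable(0.3)    # below threshold: enhancement recommended
high = review_receivable(0.9)   # above threshold: no action needed
```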

The calling parameters of the random forest model are optimized as follows. First, for the maximum number of features that the random forest model allows a single decision tree to use, all features are selected, because the data of the debtors or initial creditors contain few feature fields, only ten in total. The remaining parameters, the number of subtrees (n_estimators) and the minimum leaf node number (min_samples_leaf), are then tuned. Arrays of candidate values are established by loop traversal: n_estimators starts at 1, increments by 3 and has a maximum of 50; min_samples_leaf starts at 1, increments by 5 and has a maximum of 100. A model is built and a prediction obtained for each combination through a two-level nested loop, and the accuracy is calculated by comparing the actual results with the predicted results, as shown in Table 1. The group of parameters with the highest accuracy is finally selected as the optimal parameters of the random forest model.

TABLE 1

Serial number	Number of decision trees	Minimum leaf node number	Accuracy
1	1	1	0.8
2	1	6	0.9
3	1	11	0.9
4	1	16	0.95
5	1	21	0.95
6	1	26	0.95
7	4	1	0.75
8	4	6	0.75
9	4	11	0.75
…	…	…	…
N-2	49	86	0.75
N-1	49	91	0.75
N	49	96	0.75

The optimal parameters (1, 16, 0.95) can be intuitively obtained through table 1, that is, when the number of decision trees is 1 and the number of minimum leaf nodes is 16, the output accuracy of the random forest model is 0.95. Therefore, the random forest model under the parameter can be selected to classify and output accounts receivable, and a relatively accurate fulfillment and payment risk value is obtained.
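The two-level traversal can be sketched as below. The ranges follow the stated start values, increments and maxima; `evaluate` is a hypothetical callback that would, in practice, fit a scikit-learn `RandomForestClassifier` with the given `n_estimators` and `min_samples_leaf` and return its validation accuracy. The `toy_evaluate` stand-in returns the accuracies of the rows shown in Table 1 and a placeholder of 0.75 everywhere else; that placeholder is an assumption for the elided rows, not data from the source.

```python
def tune_parameters(evaluate):
    """Try every (n_estimators, min_samples_leaf) pair and keep the most accurate."""
    best = None
    results = []
    for n_estimators in range(1, 50, 3):             # 1, 4, ..., 49
        for min_samples_leaf in range(1, 100, 5):    # 1, 6, ..., 96
            acc = evaluate(n_estimators, min_samples_leaf)
            results.append((n_estimators, min_samples_leaf, acc))
            # Strict '>' keeps the first combination that reaches the top accuracy.
            if best is None or acc > best[2]:
                best = (n_estimators, min_samples_leaf, acc)
    return best, results

# Stand-in for model training; values for listed rows are copied from Table 1.
TABLE_1_ROWS = {(1, 1): 0.8, (1, 6): 0.9, (1, 11): 0.9,
                (1, 16): 0.95, (1, 21): 0.95, (1, 26): 0.95}

def toy_evaluate(n_estimators, min_samples_leaf):
    return TABLE_1_ROWS.get((n_estimators, min_samples_leaf), 0.75)  # 0.75 is a placeholder

best, results = tune_parameters(toy_evaluate)
```

With ties resolved in favour of the first combination encountered, the stand-in reproduces the selection of (1, 16, 0.95) described above.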

An accounts receivable redemption risk control system based on machine learning comprises a construction module 1, an accuracy module 2, a parameter optimization module 3 and a prediction output module 4;

the construction module 1 is used for establishing a decision tree according to a recursive method, training a plurality of decision trees with the acquired sample data, and constructing a random forest model;

the accuracy module 2 is used for training the random forest model with the acquired verification data to obtain the output accuracy of the random forest model based on the verification data;

the parameter optimization module 3 is used for optimizing calling parameters in the random forest model and selecting the random forest model whose output accuracy is greater than a preset accuracy as the optimized prediction model, the calling parameters comprising the maximum number of features that the random forest model allows a single decision tree to use, the number of subtrees built, and the minimum leaf node number;

the prediction output module 4 uses the optimized random forest model obtained by the parameter optimization module 3 to classify and output accounts receivable, so as to predict whether debtors or initial creditors default.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A machine learning-based receivable cash redemption risk control method is characterized by comprising the following steps:

2. The machine learning-based receivables redemption risk control method of claim 1, wherein said building a decision tree according to a recursive method comprises:

s13: for the segmented D₁ and D₂, step S11 and step S12 are executed cyclically, respectively, until the Gini index of the sample set is smaller than a preset threshold;

3. The machine learning-based accounts receivable redemption risk control method of claim 2, wherein in calculating the Gini index of each segmentation point (A, a), M classes of default conditions are preset, and the probability that the current data belongs to the k-th of the M classes is p_k; the Gini index of the probability distribution is then:

$$\mathrm{Gini}(p) = \sum_{k=1}^{M} p_k (1 - p_k) = 1 - \sum_{k=1}^{M} p_k^2$$

in a binary decision tree, the Gini index of the distribution is:

$$\mathrm{Gini}(p) = 2p(1-p)$$

4. The machine-learning-based receivables redemption risk control method of claim 1, wherein the training of the random forest model using the acquired validation data to obtain the output accuracy of the validation-data-based random forest model comprises:

5. The machine learning-based receivable redemption risk control method of claim 4, wherein obtaining the output accuracy of the random forest model based on the validation data in steps S23 and S24 comprises:

6. The machine learning-based receivable exchange risk control method according to any one of claims 1 to 5, wherein in the step of classifying and outputting the receivable corresponding to the account through the optimized random forest model to obtain a fulfillment exchange risk value at which the receivable is due, when the fulfillment exchange risk value is smaller than a preset threshold value, the due exchange of the receivable is improved by adding a credit-increase measure.

7. The machine-learning-based receivables redemption risk control method of claim 6, wherein in optimizing the invocation parameters in the random forest model, the double-layer loop traversal establishes the array parameters of the number of subtrees and the minimum leaf node number to obtain a plurality of output accuracy rates, and selects the group with the highest output accuracy rate as the optimal invocation parameter of the random forest model.

8. The machine learning-based receivable exchange risk control method according to any one of claims 1-5, wherein in the step of selecting the optimized random forest model with the output accuracy rate greater than the preset accuracy rate to classify and output receivable accounts, the random forest model with the output accuracy rate of more than 90% is selected as the optimized random forest model.

9. A receivable cash redemption risk control system based on machine learning is characterized by comprising a construction module (1), an accuracy rate module (2), a parameter optimization module (3) and a prediction output module (4);

the construction module (1) is used for establishing a decision tree according to a recursion method, training a plurality of decision trees through the acquired sample data, and constructing a random forest model;

the accuracy module (2) is used for training the random forest model with the acquired verification data to obtain the output accuracy of the random forest model based on the verification data;

the parameter optimization module (3) is used for optimizing calling parameters in the random forest model and selecting the random forest model whose output accuracy is greater than the preset accuracy as the optimized prediction model, the calling parameters comprising the maximum number of features that the random forest model allows a single decision tree to use, the number of subtrees built, and the minimum number of leaf nodes;

the prediction output module (4) uses the optimized random forest model obtained by the parameter optimization module (3) to classify and output accounts receivable, so as to predict whether debtors or initial creditors default.

10. A computer-readable storage medium having stored thereon a plurality of classification programs, which are invoked by a processor to perform the following steps: