CN111275547B

CN111275547B - Risk control system and method based on isolation forest

Info

Publication number: CN111275547B
Application number: CN202010196415.6A
Authority: CN
Inventors: 毕艳亮
Original assignee: Chongqing Fumin Bank Co Ltd
Current assignee: Chongqing Fumin Bank Co Ltd
Priority date: 2020-03-19
Filing date: 2020-03-19
Publication date: 2023-07-18
Anticipated expiration: 2040-03-19
Also published as: CN111275547A

Abstract

The invention relates to the technical field of computers, in particular to a risk control system and method based on an isolation forest. The method comprises the following steps: S1, collecting historical customer data and determining the effective features of a data source from that data; S2, estimating the rejection proportion for the data source's features according to the pass rate of the current project and the number of data sources used; S3, feeding the effective feature values into the isolation forest algorithm, determining the proportion of anomalous points in combination with the rejection proportion, and dividing the samples into abnormal samples and normal samples; S4, drawing box plots of the abnormal samples and the normal samples, and formulating a strategy that meets the expected rejection proportion according to the distributions shown in the box plots. The advantages of the invention are as follows. First, the pass rate of the attribute strategies associated with a single data source can be determined quantitatively. Second, the strategy for any new project can be adjusted without waiting for loan performance data to accumulate, which shortens the strategy iteration cycle.

Description

Risk control system and method based on isolation forest

Technical Field

The invention relates to the technical field of computers, in particular to a risk control system and method based on an isolation forest.

Background

Risk control refers to the measures and methods a manager takes to eliminate or reduce the likelihood of risk events, or to reduce the losses they cause. For banks, the main risk is credit risk, of which loan risk is the principal component. At present, most banks adopt a T+1 risk control mode: each day's transaction data is stored in a database, and the risk control system extracts the transaction details for risk assessment at the start of the next working day. This approach cannot control the risk of each transaction in real time and cannot effectively detect abnormal data.

Document CN109345137A discloses an outlier detection method based on agricultural big data, comprising: a data acquisition step of acquiring agricultural production data, agricultural soil data and agricultural meteorological resource data and integrating them into a training data set; an iTree construction step of selecting m sample points from the training data set and repeatedly selecting a random splitting attribute and splitting point until a termination condition is reached; an isolation forest model construction step of initializing the number t of iTrees in the forest and the sub-sample size m drawn when building each iTree, then building independent iTrees in a loop, the set of all iTrees forming the isolation forest model; and an outlier judgment step of calculating an anomaly score s(x) and judging from s(x) whether the test data x is an outlier. That invention applies the isolation forest model to outlier detection in agricultural big data and can effectively improve detection performance.

The core of financial risk control is deciding whether to approve a user according to the user's specific features. Past practice falls into two cases. First, for a newly launched project, the strategies used by other projects are borrowed. In many cases, however, the new project differs significantly from the earlier ones, so the borrowed strategy does not fit and creates risk (either the threshold is too low and too many users are rejected, or the threshold is too high and too few users are rejected). Second, when a new data source is introduced, it is difficult to formulate a reasonable strategy because little is known about the data source and about how well it matches the project to which it is being added.

Disclosure of Invention

The invention provides a risk control method based on an isolation forest, which solves two technical problems: strategies borrowed from other projects do not suit new projects, and reasonable strategies are difficult to formulate when a new data source is introduced.

The basic scheme provided by the invention is as follows. The risk control method based on an isolation forest comprises the following steps: S1, collecting historical customer data and determining the effective features of a data source from that data; S2, estimating the rejection proportion for the data source's features according to the pass rate of the current project and the number of data sources used; S3, feeding the effective feature values into the isolation forest algorithm, determining the proportion of anomalous points in combination with the rejection proportion, and dividing the samples into abnormal samples and normal samples; S4, drawing box plots of the abnormal samples and the normal samples, and formulating a strategy that meets the expected rejection proportion according to the distributions shown in the box plots.

The working principle of the invention is as follows. First, the effective feature values are fed into the isolation forest algorithm, the proportion of anomalous points is determined in combination with the rejection proportion, and the samples are divided into abnormal samples and normal samples. Box plots of the abnormal and normal samples are then drawn, and a strategy that meets the expected rejection proportion is formulated according to the distributions shown in the box plots. The advantages of the invention are as follows. First, the pass rate of the attribute strategies associated with a single data source can be determined quantitatively. Second, the strategy for any new project can be adjusted without waiting for loan performance data to accumulate, which shortens the strategy iteration cycle. Third, a problematic data source can be found promptly: for example, if the distributions of the features of a data source do not differ between the abnormal group and the normal group, the data source can be judged ineffective.

In short, the invention quantitatively determines the pass rate of the attribute strategies associated with a single data source from the distributions shown in the box plots, allows the strategy of any new project to be adjusted before loan performance data exists, shortens the strategy iteration cycle, and identifies problematic data sources promptly.

Further, step S1 includes: S11, extracting historical customer data; S12, extracting the corresponding features of the customer data; S13, converting the extracted feature information into numerical form and performing a clustering operation; S14, calculating the spatial distance between each cluster center and the other points; and S15, presenting the calculated distances as two-dimensional data and giving points farther from the coordinate origin a correspondingly larger weight score. The beneficial effects are that the actual volume of data handled during anomaly detection is reduced, computing resources are saved, and anomaly detection becomes more efficient. In addition, the feature extraction and data analysis step can mitigate some overfitting problems in anomaly detection and makes the anomaly detection algorithm more robust.
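Steps S13 to S15 can be illustrated with a minimal sketch. The source does not name a clustering algorithm or a weighting rule, so the tiny k-means below, the choice of k = 2, the sample points and the "weight proportional to distance from the cluster center" rule are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means (assumed algorithm). Initial centers are the first k points."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def distance_weights(points, k=2):
    """S14/S15 sketch: distance to own cluster center, scaled to [0, 1]."""
    centers, labels = kmeans(points, k)
    dists = [math.dist(p, centers[l]) for p, l in zip(points, labels)]
    top = max(dists) or 1.0  # avoid dividing by zero if all distances are 0
    return [d / top for d in dists]

# Hypothetical two-feature customer records (already numeric).
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (9.1, 8.9), (0.9, 1.1), (14.0, 9.0)]
weights = distance_weights(points, k=2)
```

In this sketch the last point lies farthest from its cluster center and therefore receives the largest weight; any other monotone weighting of the distance would fit the description equally well.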

Further, step S3 includes: S31, training a single tree; S32, integrating the results of all the isolation trees; S33, calculating an anomaly score s(x), and judging from s(x) whether the test data x is an outlier. The beneficial effects are that the isolation forest algorithm has linear time complexity and is an ensemble method, so it can be applied to data sets containing massive amounts of data. Generally, the more trees there are, the more stable the algorithm. Since each tree is generated independently of the others, the algorithm can be deployed on a large-scale distributed system to speed up computation.

Further, step S31 includes: S31a, randomly selecting ψ points from the training data as a sub-sample and placing them in the root node of an isolation tree; S31b, choosing a dimension at random and randomly generating a cut point P within the data range of the current node; S31c, placing points smaller than P in the currently selected dimension on the left branch of the current node, and points greater than or equal to P on the right branch; S31d, recursively constructing new leaf nodes on the left and right branches of the node until a leaf holds only one data point or the tree has grown to the set height. The beneficial effects are that the masking and swamping effects of anomalies are reduced during outlier detection; in addition, because of the linear time complexity, no distance or density needs to be calculated to find the abnormal data, so high-dimensional data and massive data can be processed effectively.
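The tree-building steps S31a to S31d can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the dictionary-based node layout, the synthetic Gaussian data, the sub-sample size ψ = 64 and the height limit ceil(log2 ψ) are assumptions (the source only speaks of "the set height").

```python
import math
import random

def build_tree(points, height, max_height, rng):
    """Recursively build one isolation tree over a list of numeric tuples."""
    if len(points) <= 1 or height >= max_height:
        return {"size": len(points)}                # leaf node
    dim = rng.randrange(len(points[0]))             # S31b: random dimension
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                                    # simplification: no split possible here
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)                       # S31b: random cut point P
    left = [p for p in points if p[dim] < cut]      # S31c: smaller than P
    right = [p for p in points if p[dim] >= cut]    # S31c: greater than or equal to P
    return {"dim": dim, "cut": cut,                 # S31d: recurse on both branches
            "left": build_tree(left, height + 1, max_height, rng),
            "right": build_tree(right, height + 1, max_height, rng)}

def path_length(tree, x, depth=0):
    """Number of edges from the root to the leaf that x falls into."""
    if "dim" not in tree:
        return depth
    branch = "left" if x[tree["dim"]] < tree["cut"] else "right"
    return path_length(tree[branch], x, depth + 1)

def leaf_sizes(tree):
    """Sizes of all leaves, used here only as a sanity check."""
    if "dim" not in tree:
        return [tree["size"]]
    return leaf_sizes(tree["left"]) + leaf_sizes(tree["right"])

rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
psi = 64                                            # sub-sample size (assumed)
sample = rng.sample(data, psi)                      # S31a: random sub-sample
max_height = math.ceil(math.log2(psi))              # assumed height limit
tree = build_tree(sample, 0, max_height, rng)
```

Every sub-sampled point ends up in exactly one leaf, and no path is longer than the height limit. Points that are isolated after only a few cuts are the candidates for anomalies.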

Further, step S32 includes: S32a, repeating the cutting from the beginning; S32b, calculating the average of the results of all the cuts. The beneficial effects are that, since the cutting process is completely random, an ensemble method is needed to make the results converge, that is, the cutting is repeated from scratch many times.

Further, step S33 includes: S33a, calculating the depth h(x) of the test data on each tree; S33b, calculating the average depth E(h(x)) over all the trees; S33c, calculating the anomaly score; S33d, judging from the anomaly score s(x) of the test data x whether x is an outlier. The beneficial effects are as follows: if the anomaly score is close to 1, the point is judged to be an anomaly; if the anomaly score is much less than 0.5, the point is judged not to be an anomaly; if the scores of all points are around 0.5, the sample very probably contains no anomalies. Whether the test data is an outlier can therefore be judged intuitively from the anomaly score.
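The source does not write out the score formula and refers to the prior art instead. A minimal sketch, assuming the score defined in the original isolation forest literature, s(x) = 2^(-E(h(x)) / c(ψ)) with c(ψ) = 2H(ψ-1) - 2(ψ-1)/ψ and H(i) approximated by ln(i) + 0.5772 (Euler's constant), is given below. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0                      # exact value, since H(1) = 1
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_depth, psi):
    """s(x) from the average depth E(h(x)) and the sub-sample size psi."""
    return 2.0 ** (-avg_depth / c(psi))

# A point whose average depth equals c(psi) scores exactly 0.5;
# shallower points score closer to 1, deeper points closer to 0.
baseline = anomaly_score(c(256), 256)
shallow = anomaly_score(1.0, 256)
deep = anomaly_score(20.0, 256)
```

This matches the interpretation given in the text: scores near 1 indicate anomalies, scores well below 0.5 indicate normal points, and a sample in which every score sits near 0.5 probably contains no distinct anomalies.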

The invention also discloses a risk control system based on an isolation forest, which comprises: an acquisition module for collecting historical customer data and determining the effective features of a data source from that data; an evaluation module for estimating the rejection proportion for the data source's features according to the pass rate of the current project and the number of data sources used; an algorithm module for feeding the effective feature values into the isolation forest algorithm, determining the proportion of anomalous points in combination with the rejection proportion, and dividing the samples into abnormal samples and normal samples; and a decision module for drawing box plots of the abnormal samples and the normal samples and formulating a strategy that meets the expected rejection proportion according to the distributions shown in the box plots.

The invention determines the pass rate of the attribute strategies associated with a single data source from the distributions shown in the box plots, and can shorten the strategy iteration cycle.

Further, the algorithm module includes: a training unit for training a single tree; an integrating unit for integrating the results of all the isolation trees; and a judging unit for calculating the anomaly score s(x) and judging from s(x) whether the test data x is an outlier. The beneficial effects are that each tree is independent of the others, so the module can be deployed on a large-scale distributed system to speed up computation.

Further, the step in which the training unit trains a single tree comprises: S51, randomly selecting ψ points from the training data as a sub-sample and placing them in the root node of an isolation tree; S52, choosing a dimension at random and randomly generating a cut point P within the data range of the current node; S53, placing points smaller than P in the currently selected dimension on the left branch of the current node, and points greater than or equal to P on the right branch; S54, recursively constructing new leaf nodes on the left and right branches of the node until a leaf holds only one data point or the tree has grown to the set height. The beneficial effects are that the masking and swamping effects of anomalies are reduced during outlier detection, and high-dimensional data and massive data can be processed effectively.

Further, the step in which the judging unit calculates the anomaly score s(x) and judges from s(x) whether the test data x is an outlier includes: S61, calculating the depth h(x) of the test data on each tree; S62, calculating the average depth E(h(x)) over all the trees; S63, calculating the anomaly score; S64, judging from the anomaly score s(x) of the test data x whether x is an outlier. The beneficial effects are that whether the test data is an outlier can be judged intuitively from the anomaly score.

Drawings

FIG. 1 is a flow chart of an embodiment of the isolation forest based risk control method of the present invention.

Detailed Description

The following is a further detailed description of the embodiments:

Example 1

An embodiment of the risk control method based on an isolation forest is substantially as shown in FIG. 1 and comprises the following steps: S1, collecting historical customer data and determining the effective features of a data source from that data; S2, estimating the rejection proportion for the data source's features according to the pass rate of the current project and the number of data sources used; S3, feeding the effective feature values into the isolation forest algorithm, determining the proportion of anomalous points in combination with the rejection proportion, and dividing the samples into abnormal samples and normal samples; S4, drawing box plots of the abnormal samples and the normal samples, and formulating a strategy that meets the expected rejection proportion according to the distributions shown in the box plots.

When a customer applies to a bank for a loan, the bank audits the borrower's repayment capacity, repayment record and willingness to repay, the profitability of the loan project, the loan guarantee, the legal liability for repayment and so on, and the customer provides the corresponding materials.

First, it is determined which fields of the data source are effective and should be used. There are two ways of doing this. The first is past experience. For example, experience may show that when the borrower's repayment record shows on-schedule repayment with no late payment or default, and the loan project has good profitability and stable cash flow, such borrowers always repay on schedule with no risk of becoming overdue. Experience therefore indicates that the "borrower repayment record" and the "loan project profitability" are effective for judging the customer's credit. The second way is to use a clustering algorithm. Step one: extract historical customer data, such as past credit data of the five loan categories (normal, special-mention, substandard, doubtful and loss). Step two: extract the corresponding features of the customer data, such as repayment record, repayment capacity and cash flow. Step three: convert the extracted feature information into numerical form and perform a clustering operation. Step four: calculate the spatial distance between each cluster center and the other points. Step five: present the calculated distances as two-dimensional data and give points farther from the coordinate origin a correspondingly larger weight score.

Next, the rejection rate for each data source is determined according to the pass rate required by the project and the number of data sources to be used. For example, a loan project requires a pass rate of eighty percent, and only three data sources are available for strategy deployment. The twenty percent rejection can then be roughly allocated among the three data sources according to their quality. For instance, the first data source has the highest quality and is assigned ten percent rejection; the second has the next highest quality and is assigned six percent; the third has low quality and is assigned four percent. In this way, the rejection rate of each individual data source is determined preliminarily.
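The allocation in this example can be reproduced with a short sketch. The quality weights 5 : 3 : 2 are an assumption chosen so that the split comes out as the 10%, 6% and 4% quoted above; the source does not say how quality is quantified, and the function and source names are hypothetical.

```python
def allocate_rejection(pass_rate, quality_weights):
    """Split the total rejection (1 - pass_rate) across data sources by weight."""
    total_reject = 1.0 - pass_rate
    weight_sum = sum(quality_weights.values())
    return {name: total_reject * w / weight_sum
            for name, w in quality_weights.items()}

# Required pass rate of 80% and three data sources ranked by quality.
shares = allocate_rejection(0.80, {"source_1": 5, "source_2": 3, "source_3": 2})
```

The shares are fractions of all applicants, so they add up to the overall twenty percent rejection. This is only the rough initial split the text describes, to be refined by the later steps.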

Next, the effective fields of the data source are extracted, null values are filled in, and the fields are fed into the isolation forest algorithm. A single tree is trained as follows. Step one: ψ points are randomly selected from the training data as a sub-sample and placed in the root node of an isolation tree. For example, five points are selected as the sub-sample: the borrower's repayment capacity, the borrower's repayment record, the borrower's willingness to repay, the profitability of the loan project, and the loan guarantee. Step two: a dimension is chosen at random, and a cut point P is randomly generated within the data range of the current node. For example, the specified dimension is four. Step three: points smaller than P in the currently selected dimension are placed on the left branch of the current node, and points greater than or equal to P on the right branch. For example, the borrower's repayment capacity, repayment record and willingness to repay and the profitability of the loan project are placed on the left branch of the current node, and the loan guarantee is placed on the right branch. Step four: new leaf nodes are constructed recursively on the left and right branches of the node until a leaf holds only one data point or the tree has grown to the set height. For example, the borrower's repayment capacity, repayment record and willingness to repay are placed on the left branch of the current node, and the profitability of the loan project on the right branch. … The borrower's repayment capacity and repayment record are placed on the left branch of the current node, and the borrower's willingness to repay on the right branch.
… The cutting is repeated from the beginning according to the above steps, the results of all the isolation trees are integrated, and the results of the cuts are averaged. An anomaly score s(x) is then calculated, and s(x) is used to judge whether the test data x is an outlier. Step one: the depth h(x) on each tree is calculated. According to the isolation forest algorithm, the depth of the loan guarantee is 4, the depth of the loan project profitability is 3, the depth of the borrower's willingness to repay is 2, and the depths of the borrower's repayment capacity and repayment record are both 1. Step two: the average depth E(h(x)) over all trees is calculated; here the average depth is (4+3+2+1×2)/5 = 2.2. Step three: the anomaly score is calculated, for which the formula in the prior art on the isolation forest algorithm can be used. Step four: the anomaly score s(x) of the test data x is used to judge whether x is an outlier. For example, if the loan guarantee scores 0.85 (close to 1), the loan guarantee data is an outlier; if the loan project profitability scores 0.05 (much less than 0.5), the profitability data is not an outlier. Finally, the proportion of anomalous points is set according to the rejection proportion, and the samples are divided into abnormal samples and normal samples.
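The final step, fixing the share of anomalous points from the rejection proportion, can be sketched as a simple ranking. It is assumed here that the highest-scoring fraction of samples is labelled abnormal, which is similar in spirit to the `contamination` setting of common isolation forest implementations. The scores and the function name are made up for illustration.

```python
import math

def split_by_rejection(scores, reject_ratio):
    """Flag the top `reject_ratio` fraction of samples, ranked by anomaly score."""
    # Small epsilon guards against floating-point products such as 2.9999999.
    n_abnormal = math.floor(len(scores) * reject_ratio + 1e-9)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    abnormal = set(order[:n_abnormal])
    return [i in abnormal for i in range(len(scores))]

# Hypothetical anomaly scores for ten applicants.
scores = [0.42, 0.85, 0.47, 0.39, 0.51, 0.44, 0.71, 0.40, 0.45, 0.43]
flags = split_by_rejection(scores, 0.10)   # 10% rejection share for this data source
```

With a ten percent share only the highest-scoring applicant is flagged; raising the share to twenty percent would also flag the second highest. The flagged and unflagged groups are the abnormal and normal samples compared in the next step.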

Next, a box plot is drawn for each feature of the abnormal samples and of the normal samples, and a reasonable strategy threshold for each attribute is determined from the differences between the box plots. If the box plots of every attribute of a data source differ only slightly between the two groups, the data source can be judged ineffective for the project and should be taken offline at that point. For example, the normal group has a first quartile of 4 and a third quartile of 9; by box plot theory, the normal range of this data is 0 to 16.5. The abnormal group has a first quartile of 13 and a third quartile of 20; by box plot theory, the normal range of the abnormal group is 0 to 30, so the attribute threshold can be set to 17 (allowing a fluctuation of plus or minus 1). In this way the overall rejection rate is ultimately brought within the predetermined interval.

When the query volume of the data source reaches about twenty thousand, the process returns to the box plot step and the same analysis is performed on the data source to determine each strategy threshold. Repeating this allows the whole strategy to be iterated and corrected quickly: the strategy can be adjusted without waiting for loan performance data to accumulate, while its effectiveness is still ensured.
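The ranges quoted in this example are consistent with the common 1.5 × IQR whisker rule, although the source does not state which convention it uses. The sketch below assumes that rule, assumes the attribute cannot go below 0, and treats "the first whole number above the normal group's upper whisker" as one possible reading of how the threshold of 17 is obtained.

```python
import math

def tukey_fences(q1, q3, k=1.5, floor=0.0):
    """Whisker limits of a box plot under the k * IQR rule, clipped below at `floor`."""
    iqr = q3 - q1
    lower = max(floor, q1 - k * iqr)
    upper = q3 + k * iqr
    return lower, upper

normal = tukey_fences(4, 9)        # normal group: Q1 = 4, Q3 = 9
abnormal = tukey_fences(13, 20)    # abnormal group: Q1 = 13, Q3 = 20

# One possible threshold: first integer above the normal group's upper whisker.
threshold = math.ceil(normal[1])
```

Under these assumptions the normal group gives 0 to 16.5, as in the text, and the threshold comes out as 17. For the abnormal group the same rule gives 2.5 to 30.5, close to but not identical with the 0 to 30 quoted, so the text may be rounding or using a slightly different convention.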

Example 2

On the basis of Example 1, a risk control system based on an isolation forest is also disclosed, which comprises: an acquisition module for collecting historical customer data and determining the effective features of a data source from that data; an evaluation module for estimating the rejection proportion for the data source's features according to the pass rate of the current project and the number of data sources used; an algorithm module for feeding the effective feature values into the isolation forest algorithm, determining the proportion of anomalous points in combination with the rejection proportion, and dividing the samples into abnormal samples and normal samples; and a decision module for drawing box plots of the abnormal samples and the normal samples and formulating a strategy that meets the expected rejection proportion according to the distributions shown in the box plots. The pass rate of the attribute strategies associated with a single data source is determined from the distributions shown in the box plots, and the strategy iteration cycle is shortened.

The algorithm module comprises: a training unit for training a single tree; an integrating unit for integrating the results of all the isolation trees; and a judging unit for calculating the anomaly score s(x) and judging from s(x) whether the test data x is an outlier. Each tree is independent of the others, so the module can be deployed on a large-scale distributed system to speed up computation.

The step in which the training unit trains a single tree comprises: S51, randomly selecting ψ points from the training data as a sub-sample and placing them in the root node of an isolation tree; S52, choosing a dimension at random and randomly generating a cut point P within the data range of the current node; S53, placing points smaller than P in the currently selected dimension on the left branch of the current node, and points greater than or equal to P on the right branch; S54, recursively constructing new leaf nodes on the left and right branches of the node until a leaf holds only one data point or the tree has grown to the set height. This reduces the masking and swamping effects of anomalies during outlier detection and allows high-dimensional data and massive data to be processed effectively.

The step in which the judging unit calculates the anomaly score s(x) and judges from s(x) whether the test data x is an outlier includes: S61, calculating the depth h(x) of the test data on each tree; S62, calculating the average depth E(h(x)) over all the trees; S63, calculating the anomaly score; S64, judging from the anomaly score s(x) of the test data x whether x is an outlier. Whether the test data is an outlier can be judged intuitively from the anomaly score.

Example 3

On the basis of Example 2, when a customer applies to a bank for a loan, the customer submits the various materials that need to be audited. While data such as the borrower's repayment capacity, repayment record and willingness to repay and the profitability of the loan project are being extracted, the user's repeated operations are also collected.

After the user's repeated operations are collected, the number of repeated operations is counted and compared with a preset operation threshold. When the number of repeated operations is below the preset operation threshold, processing proceeds as in Example 1; when it reaches or exceeds the preset operation threshold, the application is sent for manual review. This effectively reduces delays for the user and improves the user experience.

The foregoing is merely an embodiment of the present invention. Specific structures and characteristics that are common knowledge in the art are not described in detail here. A person of ordinary skill in the art knows all the ordinary technical knowledge in the field as of the filing date or the priority date, has access to all the prior art in the field, and is able to apply the routine experimental means available before that date; in light of the present application, such a person can complete and implement this scheme in combination with their own abilities, and typical known structures or known methods should not be an obstacle to implementing the present application. It should be noted that a person skilled in the art may make modifications and improvements without departing from the structure of the present invention; these should also be regarded as falling within the scope of protection of the present invention and do not affect the effect of implementing the invention or the utility of the patent.

Claims

1. A risk control method based on an isolation forest, characterized in that the method comprises the following steps:

s1, collecting historical data of a client, and determining effective characteristics of a data source according to the historical data;

s2, evaluating rejection proportion of the data source characteristics according to the passing rate of the current project and the number of the used data sources;

s3, feeding the effective feature values into the isolation forest algorithm, determining the proportion of anomalous points in combination with the rejection proportion, and dividing the samples into abnormal samples and normal samples;

s4, drawing box plots of the abnormal samples and the normal samples, and formulating a strategy that meets the expected rejection proportion according to the distributions shown in the box plots;

wherein, step S1 includes:

s11, extracting historical data of a client;

s12, extracting corresponding characteristics of the client data;

s13, converting the extracted feature information into numerical form and performing a clustering operation;

s14, calculating the spatial position distance between the clustering center point and other points;

s15, the calculated distance is presented as two-dimensional data, and a point far from the origin of coordinates is given a corresponding larger weight fraction;

wherein, step S3 includes:

s31, training a single tree;

s32, integrating the results of all the isolated trees;

s33, calculating an abnormal score S (x), and judging whether the test data x is an abnormal value or not according to the abnormal score S (x);

wherein, step S31 includes:

s31a, randomly selecting psi points from training data to serve as subsamples, and putting the subsamples into a root node of an isolated tree;

s31b, randomly designating a dimension, and randomly generating a cutting point P in the current node data range;

s31c, placing a point smaller than P in the currently selected dimension on the left branch of the current node, and placing a point larger than or equal to P on the right branch of the current node;

s31d, recursing steps S31b and S31c on the left branch and the right branch of the node, and continuously constructing new leaf nodes until a leaf node holds only one data point or the tree has grown to the set height;

wherein, step S32 includes:

s32a, repeatedly cutting from the beginning;

s32b, calculating an average value of each segmentation result;

wherein, step S33 includes:

s33a, calculating the depth h (x) of each tree;

s33b, calculating the average depth E (h (x)) of all the trees;

s33c, calculating abnormal scores;

s33d, judging whether the test data x is an abnormal value or not according to the abnormal score S (x) of the test data x.

2. A risk control system based on an isolation forest, characterized by comprising: an acquisition module for collecting historical customer data and determining the effective features of a data source from that data; an evaluation module for estimating the rejection proportion for the data source's features according to the pass rate of the current project and the number of data sources used; an algorithm module for feeding the effective feature values into the isolation forest algorithm, determining the proportion of anomalous points in combination with the rejection proportion, and dividing the samples into abnormal samples and normal samples; and a decision module for drawing box plots of the abnormal samples and the normal samples and formulating a strategy that meets the expected rejection proportion according to the distributions shown in the box plots.

3. The isolation forest based risk control system of claim 2, wherein the algorithm module comprises: a training unit for training a single tree; an integrating unit for integrating the results of all the isolation trees; and a judging unit for calculating the anomaly score s(x) and judging from s(x) whether the test data x is an outlier.

4. The isolation forest based risk control system of claim 3, wherein the step in which the training unit trains a single tree comprises: s51, randomly selecting ψ points from the training data as a sub-sample and placing them in the root node of an isolation tree; s52, choosing a dimension at random and randomly generating a cut point P within the data range of the current node; s53, placing points smaller than P in the currently selected dimension on the left branch of the current node, and points greater than or equal to P on the right branch of the current node; s54, recursively constructing new leaf nodes on the left and right branches of the node until a leaf holds only one data point or the tree has grown to the set height.

5. The isolation forest based risk control system of claim 4, wherein the step in which the judging unit calculates the anomaly score s(x) and judges from s(x) whether the test data x is an outlier includes: s61, calculating the depth h(x) of the test data on each tree; s62, calculating the average depth E(h(x)) over all the trees; s63, calculating the anomaly score; s64, judging from the anomaly score s(x) of the test data x whether x is an outlier.