CN117876119B

CN117876119B - Distributed risk control model construction method and system

Info

Publication number: CN117876119B
Application number: CN202410269620.9A
Authority: CN
Inventors: 王中健
Original assignee: Yaorongyun Digital Technology Chengdu Co ltd
Current assignee: Yaorongyun Digital Technology Chengdu Co ltd
Filing date: 2024-03-11
Publication date: 2024-06-04
Anticipated expiration: 2044-03-11

Abstract

The invention relates to the technical field of internet financial risk control, and in particular to a distributed risk control model construction method and system. The method comprises the following steps: storing a historical sample data set and a sample data set to be evaluated in respective storage nodes, and building a distributed storage framework from the storage nodes; constructing a distributed computing framework based on the distributed storage framework, and training the corresponding computing nodes with the historical sample data stored in the distributed storage framework; and generating a target risk control model by integrating the trained computing nodes, and evaluating the risk of the sample data to be evaluated with the target risk control model. The method can be used to construct a distributed risk control model that has strong comprehensive evaluation capability, processes large-scale data efficiently, and meets real-time monitoring requirements, thereby satisfying the risk control management needs of the internet finance field.

Description

Distributed risk control model construction method and system

Technical Field

The invention relates to the technical field of internet financial risk control, and in particular to a distributed risk control model construction method and system.

Background

With the rapid development of internet finance, risk control has become an indispensable task for internet financial institutions. However, conventional risk control models run into a series of problems when faced with large-scale data, real-time requirements, and increasing complexity:

1. The data scale in the internet finance field is huge, and traditional risk control models cannot process such volumes effectively, which results in low processing speed and high consumption of computing resources.

2. The internet finance industry has very high real-time requirements: risks must be monitored and handled in real time, and potential risk events must be identified and prevented promptly.

3. As internet financial services develop, risk models become increasingly complex, and many factors and features must be considered for a comprehensive evaluation. Traditional risk control models often cannot handle this complexity: they can neither fully mine the features nor establish an accurate model.

4. Traditional risk control models may adversely affect the user experience, for example by requiring complex authentication procedures or excessive security checks.

Therefore, a new risk control model is needed that effectively addresses the challenges of data scale, real-time performance, complexity, and security in the internet finance field.

Disclosure of Invention

In view of the shortcomings of existing risk control models and the demands of practical application, the invention provides a distributed risk control model construction method. It aims to cope effectively with the challenges of large-scale data, real-time requirements, and increasing complexity in the internet finance field, and to provide internet financial institutions with an accurate, efficient, and secure risk control model.

In a first aspect, the distributed risk control model construction method provided by the invention comprises the following steps: storing a historical sample data set and a sample data set to be evaluated in respective storage nodes, and building a distributed storage framework from the storage nodes; constructing a distributed computing framework based on the distributed storage framework, and training the corresponding computing nodes with the historical sample data stored in the distributed storage framework; and generating a target risk control model by integrating the trained computing nodes, and evaluating the risk of the sample data to be evaluated with the target risk control model.

Because the risk control model is built on a distributed storage framework and a distributed computing framework, it can effectively process the huge data volumes of the internet finance field. The model can use its internal distributed computing framework for real-time data processing and model training, so that risk assessments respond quickly and are updated in real time. The distributed computing framework also provides greater computing capacity and resources, so that more complex models and more features can be handled, which improves the accuracy and predictive capability of the model. In addition, a risk control model based on the distributed storage framework can provide data backup and redundancy mechanisms, which enhances the security and reliability of the data.

Optionally, storing the historical sample data set and the sample data set to be evaluated in respective storage nodes, and building a distributed storage framework from the storage nodes, comprises the following steps: acquiring a historical sample data set and a sample data set to be evaluated; dividing the historical sample data set and the sample data set to be evaluated respectively to generate one or more historical sample data subsets and one or more sample data subsets to be evaluated; storing each historical sample data subset and each sample data subset to be evaluated in a corresponding storage node; aggregating the storage nodes corresponding to all the historical sample data subsets to generate a historical sample distributed storage framework; and aggregating the storage nodes corresponding to all the sample data subsets to be evaluated to generate a to-be-evaluated sample distributed storage framework. By storing the two data sets separately and building each into a distributed storage framework, the invention improves data access efficiency, parallel processing capability, and system scalability, thereby raising the overall performance and processing capacity of the system and meeting the requirements of large-scale data and complex model construction.

Optionally, the amount of sample data contained in any sample data subset is less than or equal to Q, where Q denotes the sample data amount threshold for a sample data subset. By keeping the size of each sample data subset at or below this threshold, the invention improves the efficiency of distributed data processing and the scalability of the system, keeps the data volume on each storage node reasonable, and reduces the burden of data transmission and computation.

Optionally, acquiring the historical sample data set and the sample data set to be evaluated comprises the following steps: setting risk features and obtaining the corresponding data according to the risk features; performing feature engineering on the data; labeling the feature-engineered data to obtain a labeled historical sample data set; and collating the feature-engineered data to obtain the sample data set to be evaluated. By setting risk features and performing feature engineering and labeling, the invention improves the quality and accuracy of the data and provides a more reliable data basis for building the risk control model, and the collated sample data set to be evaluated is better suited to the requirements of model construction.

Optionally, constructing a distributed computing framework based on the distributed storage framework, and training the corresponding computing nodes with the historical sample data stored in the distributed storage framework, comprises the following steps: building a distributed computing framework that comprises two or more computing nodes; setting up a distributed computing environment by combining the distributed computing framework with the distributed storage framework, wherein each computing node in the distributed computing framework corresponds to a storage node in the distributed storage framework; and building a computing model on each computing node in the distributed computing environment, and training the computing model with the sample data on the storage node corresponding to that computing node. By training the computing nodes of the distributed computing framework on the data in the distributed storage framework, the invention achieves parallel computing and distributed training and improves computing efficiency and model construction speed. The distributed computing environment also makes full use of computing resources and provides greater computing capacity, which enhances the accuracy and predictive capability of the model.

Optionally, setting up a distributed computing environment by combining the distributed computing framework with the distributed storage framework comprises the following steps: configuring a corresponding computing node for the storage node of each historical sample data subset; fully connecting the storage node of each sample data subset to be evaluated with the storage nodes of all the historical sample data subsets, so that data is shared between any two connected storage nodes; loading the sample data to be evaluated from the sample data subsets to be evaluated through data sharing; and using the trained computing nodes to make a preliminary evaluation of the risk of the sample data to be evaluated. By configuring the correspondence between computing nodes and storage nodes and the data sharing among nodes, the invention loads sample data to be evaluated more efficiently in the distributed computing environment and uses the trained computing nodes for preliminary risk evaluation. This improves overall computing and evaluation efficiency and speeds up risk identification and decision-making for the sample data to be evaluated.

Optionally, building a computing model on any computing node in the distributed computing environment comprises the following steps: setting a corresponding initial prediction weight for each risk feature in the historical sample data set; constructing an initial prediction model from the risk features and the corresponding initial prediction weights; selecting any historical sample data from the historical sample data set as the initial sample data; applying the initial prediction model to the initial sample data to obtain the predicted label value of the initial sample data; combining the actual label value and the predicted label value of the initial sample data to obtain an initial prediction residual; constructing a secondary prediction model by combining the initial prediction residual with the initial prediction model; and iterating the secondary prediction model over the remaining historical sample data to obtain a final prediction model. By building the computing model in the distributed computing environment and optimizing it iteratively, the invention improves the predictive capability and accuracy of the model. Iterating on the initial prediction and its residuals gradually improves the prediction, so that the final prediction model is more accurate and reliable. This improves the accuracy and reliability of risk assessment and gives internet financial institutions more precise support for risk control decisions.
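The iterative steps above can be sketched as follows. This is a minimal illustration, not the patented formula: it assumes the prediction model is a weighted sum of risk features and that each residual adjusts the weights through a least-mean-squares step. The function names, the `learning_rate` parameter, and the `epochs` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

# One historical sample: (risk feature values, actual label value).
Sample = Tuple[Sequence[float], float]


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Predicted label value as a weighted sum of the risk features (assumed form)."""
    return sum(w * x for w, x in zip(weights, features))


def fit_by_residuals(
    samples: Sequence[Sample],
    initial_weights: Sequence[float],
    learning_rate: float = 0.1,  # hypothetical step size
    epochs: int = 1,             # hypothetical number of passes over the data
) -> List[float]:
    """Refine the prediction weights sample by sample using prediction residuals."""
    weights = list(initial_weights)
    for _ in range(epochs):
        for features, actual in samples:
            # Residual between the actual label and the current prediction.
            residual = actual - predict(weights, features)
            # Assumed update rule: shift each weight in proportion to the residual.
            weights = [w + learning_rate * residual * x
                       for w, x in zip(weights, features)]
    return weights
```

With several passes over consistent data, the weights move toward values that reproduce the labels. The update rule actually claimed may differ from this assumed one.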

Optionally, the final prediction model satisfies the following formula:

[formula not recoverable from the source text]

The formula is defined over the following quantities: the subset of the domain of initial prediction weights whose elements minimize the objective function; the total number of historical sample data in the historical sample data set; the total number of risk features in any group of historical sample data; the actual label value of the j-th group of historical sample data; the prediction weights corresponding to a given group of historical sample data; the i-th sample data of the j-th group of historical sample data; the predicted label value of a given group of sample data; the i-th sample data of the m-th group of historical sample data; and the prediction residual of a given group of historical sample data.

Optionally, the target risk control model satisfies the following formulas:

[formulas not recoverable from the source text]

The formulas are defined over the following quantities: the subset of the domain of initial prediction weights whose elements minimize the objective function; the prediction weight corresponding to the i-th sample data of the sample data to be evaluated; the i-th sample data of the j-th group of historical sample data; the number of final prediction models; the predicted label value that a given final prediction model assigns to the sample data to be evaluated; the i-th sample data of the sample data to be evaluated; the prediction residual of a given group of sample data during the training of a given final prediction model; the risk prediction value of the sample data to be evaluated; and the risk weight of a given final prediction model.

In a second aspect, to better implement the distributed risk control model construction method, the invention further provides a distributed risk control model construction system. The system comprises an input device, a processor, a memory, and an output device that are connected to one another, wherein the memory is used for storing a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the distributed risk control model construction method provided by the first aspect of the invention.

The structural and functional design of the distributed risk control model construction system makes the construction process more automated and easier to operate, improves the efficiency of model construction, and ensures the performance of the generated risk control model.

Drawings

FIG. 1 is a flowchart of a distributed risk control model construction method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a computing framework in a distributed computing environment according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a distributed risk control model construction system according to an embodiment of the present invention.

Detailed Description

Specific embodiments of the invention will be described in detail below, it being noted that the embodiments described herein are for illustration only and are not intended to limit the invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that: no such specific details are necessary to practice the invention. In other instances, well-known circuits, software, or methods have not been described in detail in order not to obscure the invention.

Throughout the specification, references to "one embodiment," "an embodiment," "one example," or "an example" mean: a particular feature, structure, or characteristic described in connection with the embodiment or example is included within at least one embodiment of the invention. Thus, the appearances of the phrases "in one embodiment," "in an embodiment," "one example," or "an example" in various places throughout this specification are not necessarily all referring to the same embodiment or example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Moreover, those of ordinary skill in the art will appreciate that the illustrations provided herein are for illustrative purposes and that the illustrations are not necessarily drawn to scale.

In an alternative embodiment, refer to FIG. 1, which is a flowchart of a distributed risk control model construction method according to an embodiment of the present invention. As shown in FIG. 1, the method comprises the following steps:

S01, storing the historical sample data set and the sample data set to be evaluated in respective storage nodes, and building a distributed storage framework from the storage nodes.

Distributed storage technology is a technology for storing large-scale data: it distributes the data across multiple physical nodes to provide highly reliable, highly scalable, and high-performance data storage and access. Distributed storage technologies that may be used in this embodiment include, but are not limited to, the Hadoop Distributed File System (HDFS), Apache Cassandra, or a distributed database.

Based on the distributed storage technology, in an optional embodiment, storing the historical sample data set and the sample data set to be evaluated in respective storage nodes and building a distributed storage framework from the storage nodes, as described in step S01, comprises the following steps:

S011, acquiring a historical sample data set and a sample data set to be evaluated.

It should be appreciated that the historical sample data set is the data set used for training and validating the risk control model, and the sample data set to be evaluated consists of the data samples whose risk is to be evaluated with the trained and validated risk control model. The way in which the two data sets are acquired can be chosen according to the actual situation, for example by database retrieval or by consulting files.

In an alternative embodiment, to help an online lending platform perform real-time credit evaluation of new loan applications, acquiring the historical sample data set and the sample data set to be evaluated in step S011 comprises the following steps:

S0111, setting risk features and obtaining the corresponding data according to the risk features.

In this embodiment, the following risk features are set for evaluating the credit risk of borrowers: age (reflects the maturity and stability of the borrower), income level (used to evaluate the borrower's repayment ability and financial status), liability ratio (the ratio of liabilities to income, which reflects the borrower's debt burden), credit history (including past loan records and repayment performance), employment status (used to evaluate whether the borrower has a stable source of income), and education level (reflects the borrower's educational background and potential income).

Based on these risk features, the corresponding historical sample data and sample data to be evaluated are obtained by querying the user database of the online lending platform. Part of the historical sample data is shown in Table 1 below:

Part of the sample data to be evaluated is shown in Table 2 below:

It should be appreciated that in other embodiments the risk features may or may not include those described above, and the specific risk features may be adjusted as appropriate to the risk assessment objective.

S0112, performing feature engineering on the data.

Feature engineering refers to processing the acquired data to extract useful information and construct more representative features. In this embodiment, feature engineering includes processing steps such as data cleaning, feature transformation, data standardization, missing value handling, and feature selection.

Feature transformation and data standardization are carried out as follows: the age feature is binned, dividing continuous age values into discrete age groups; the income level feature is normalized or standardized so that values share the same scale and range; the liability ratio feature is calculated by dividing the liability amount by the income amount, giving the proportion of liabilities relative to income; for the credit history feature, past lending records and repayment performance are converted into related indicators such as the number of overdue payments and the number of days of repayment delay; the employment status feature is converted into discrete categories such as employed, unemployed, and self-employed; and the education level feature is converted into discrete categories such as undergraduate, master's, and doctoral degrees.
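The transformations described above can be sketched in a few lines of Python. This is an illustrative sketch only: the age bin boundaries, the field names (`age`, `income`, `liabilities`), and the choice of min-max scaling are assumptions, not values taken from the embodiment.

```python
from typing import Dict, List, Sequence

# Hypothetical age bins: (inclusive lower bound, exclusive upper bound, label).
AGE_BINS = [(0, 25, "under_25"), (25, 35, "25_34"),
            (35, 50, "35_49"), (50, 200, "50_plus")]


def bin_age(age: float) -> str:
    """Map a continuous age value to a discrete age group."""
    for low, high, label in AGE_BINS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def liability_ratio(liabilities: float, income: float) -> float:
    """Liability amount divided by income amount."""
    if income <= 0:
        raise ValueError("income must be positive")
    return liabilities / income


def engineer_features(records: Sequence[Dict[str, float]]) -> List[Dict[str, object]]:
    """Apply binning, scaling and ratio computation to raw borrower records."""
    scaled_incomes = min_max_normalize([r["income"] for r in records])
    return [
        {
            "age_group": bin_age(r["age"]),
            "income_scaled": scaled,
            "liability_ratio": liability_ratio(r["liabilities"], r["income"]),
        }
        for r, scaled in zip(records, scaled_incomes)
    ]
```

Standardization to zero mean and unit variance could replace the min-max scaling used here; the embodiment permits either.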

S0113, labeling the feature-engineered data to obtain a labeled historical sample data set.

Each sample in the historical sample data set is labeled, that is, assigned a corresponding risk label. The risk labels may be set according to specific business needs and risk assessment rules, for example by classifying samples into high-risk, medium-risk, and low-risk classes.

Further, the labeling operation can be based on manual judgment, or can be performed automatically by using an existing risk assessment model. After the labeling is completed, the obtained historical sample data set contains characteristic data and corresponding labels.
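A rule-based version of automatic labeling might look like the sketch below. The thresholds and the two input indicators (`overdue_count`, `liability_ratio`) are hypothetical stand-ins for whatever business rules or existing risk assessment model a platform would actually use.

```python
from typing import Dict, List, Sequence


def assign_risk_label(overdue_count: int, liability_ratio: float) -> str:
    """Return 'high', 'medium' or 'low' using hypothetical rule thresholds."""
    if overdue_count >= 3 or liability_ratio >= 0.8:
        return "high"
    if overdue_count >= 1 or liability_ratio >= 0.5:
        return "medium"
    return "low"


def label_samples(samples: Sequence[Dict[str, float]]) -> List[Dict[str, object]]:
    """Attach a risk label to each feature-engineered sample without mutating it."""
    return [
        {**s, "risk_label": assign_risk_label(int(s["overdue_count"]),
                                              s["liability_ratio"])}
        for s in samples
    ]
```

Each returned record carries both the feature data and its label, which matches the structure of the labeled historical sample data set.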

S0114, collating the feature-engineered data to obtain the sample data set to be evaluated.

The feature-engineered data is collated into the sample data set to be evaluated, which contains only feature data and no labels. The sample data set to be evaluated is used for risk assessment and prediction on new samples.

In this embodiment, by setting risk features and performing feature engineering and labeling in steps S0111 to S0114, the quality and accuracy of the data are improved and a more reliable data basis is provided for building the risk control model, and the collated sample data set to be evaluated is better suited to the requirements of model construction.

S012, dividing the historical sample data set and the sample data set to be evaluated respectively, and generating one or more historical sample data subsets and one or more sample data subsets to be evaluated.

Since the original historical sample data set and sample data set to be evaluated can be very large, step S012 divides each of them into multiple smaller data subsets to support distributed training and evaluation through parallel processing in the distributed computing framework.

Further, the partitioning may be based on different policies, such as random partitioning, temporal partitioning, partitioning by user attributes, partitioning by feature attributes, and so forth. It should be appreciated that the manner of partitioning should be selected based on the specific traffic requirements and data characteristics.

S013, storing each historical sample data subset and each sample data subset to be evaluated in a corresponding storage node.

It will be appreciated that the storage nodes are child storage nodes based on distributed storage technology, and that any one of the storage nodes may be a separate server, database or storage system for storing and managing the corresponding sample data subsets. Regardless of the distributed storage technique selected, each storage node is responsible for storing and managing a respective subset of sample data.

In an alternative embodiment, the amount of sample data contained in the sample data subset stored on any storage node in step S013 is less than or equal to Q, where Q denotes the sample data amount threshold for a sample data subset. The specific value of the threshold Q should be weighed and adjusted according to the specific situation.

By keeping the size of each sample data subset at or below this threshold, this embodiment improves the efficiency of distributed data processing and the scalability of the system, keeps the data volume on each storage node reasonable, and reduces the burden of data transmission and computation.
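One way to honor the threshold Q is to shuffle the data set and cut it into consecutive chunks of at most Q samples. The sketch below assumes random partitioning; the function name and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_with_threshold(samples: Sequence[T], q: int, seed: int = 0) -> List[List[T]]:
    """Randomly split samples into subsets that each hold at most q items."""
    if q <= 0:
        raise ValueError("threshold Q must be a positive integer")
    shuffled = list(samples)
    # A seeded generator keeps the partition reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Consecutive slices of length q; only the last slice may be shorter.
    return [shuffled[i:i + q] for i in range(0, len(shuffled), q)]
```

Each resulting subset would then be written to its own storage node. Partitioning by time or by user attribute would replace the shuffle with a sort or a grouping step.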

S014, aggregating the storage nodes corresponding to all the historical sample data subsets to generate a historical sample distributed storage framework.

S015, aggregating the storage nodes corresponding to all the sample data subsets to be evaluated to generate a to-be-evaluated sample distributed storage framework.

In this embodiment, steps S011 to S015 store the historical sample data set and the sample data set to be evaluated separately and build each into a distributed storage framework. This improves data access efficiency, parallel processing capability, and system scalability, thereby raising the overall performance and processing capacity of the system and meeting the requirements of large-scale data and complex model construction.

S02, constructing a distributed computing framework based on the distributed storage framework, and training the corresponding computing nodes with the historical sample data stored in the distributed storage framework.

It should be understood that the distributed computing framework in step S02 may be built either on the historical sample distributed storage framework of the above embodiment alone, or on a storage framework obtained by combining the historical sample distributed storage framework with the to-be-evaluated sample distributed storage framework.

In an optional embodiment, to simplify the construction of the distributed computing framework and improve construction efficiency, the distributed computing framework is built on the historical sample distributed storage framework of the foregoing embodiment.

In this embodiment, constructing a distributed computing framework based on the distributed storage framework and training the corresponding computing nodes with the historical sample data stored in the distributed storage framework, as described in step S02, comprises the following steps:

S021, constructing a distributed computing framework, wherein the distributed computing framework comprises two or more computing nodes.

Further, the distributed computing framework can be built with software such as Apache Hadoop, Apache Spark, and Apache Flink. It should be appreciated that building the framework requires considering the nature, scale, and requirements of the computing tasks, as well as the scalability, fault tolerance, and performance of the framework.

S022, setting a distributed computing environment by combining the distributed computing framework and the distributed storage framework, wherein any computing node in the distributed computing framework corresponds to a storage node in the distributed storage framework.

The distributed computing environment is set up in step S022 to combine the selected distributed computing framework with the distributed storage framework to create an environment suitable for distributed computing. In the distributed computing environment, computing nodes may be connected to storage nodes through a network and communicate with a distributed storage framework through corresponding protocols and interfaces. In this way, the computing node is able to obtain historical sample data from the storage node for use in model training and evaluation. Specifically, the setting operation mainly includes operations such as cluster configuration, node connection, data distribution and sharing.

In an alternative embodiment, refer to FIG. 2, which is a schematic diagram of a computing framework in a distributed computing environment according to an embodiment of the present invention. Setting up a distributed computing environment by combining the distributed computing framework with the distributed storage framework, as described in step S022, comprises the following steps:

S0221, correspondingly configuring a computing node for the storage node of any one of the historical sample data subsets.

As shown in fig. 2, the history sample distributed storage framework for storing history sample data includes a plurality of storage nodes (storage nodes 11, 12, …, and 1K), each storage node includes a plurality of history sample data (i.e., each storage node stores a corresponding subset of history sample data), each storage node corresponds to a computing node, and any two computing nodes are independent of each other.

S0222, fully connecting the storage node of each sample data subset to be evaluated with the storage nodes of all the historical sample data subsets, so that data is shared between any two connected storage nodes.

As shown in FIG. 2, the to-be-evaluated sample distributed storage framework, which stores the sample data to be evaluated, also includes a plurality of storage nodes (storage nodes 21, 22, …, and 2K). Each of these storage nodes holds a plurality of sample data to be evaluated (that is, each stores a corresponding subset of the sample data to be evaluated) and is fully connected with the storage nodes of all the historical sample data subsets, which realizes data sharing among the storage nodes.
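The node layout of FIG. 2 can be modeled with two plain mappings. This is only a data-structure sketch: the node naming scheme and the function name are hypothetical, and a real deployment would express the same relationships through cluster configuration rather than dictionaries.

```python
from typing import Dict, List, Tuple


def build_topology(
    num_history_nodes: int, num_pending_nodes: int
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Return (compute node per historical storage node, sharing links per to-be-evaluated node)."""
    history = [f"history_storage_{i}" for i in range(1, num_history_nodes + 1)]
    pending = [f"pending_storage_{i}" for i in range(1, num_pending_nodes + 1)]
    # One independent computing node for each historical storage node.
    compute_for = {node: f"compute_{i}" for i, node in enumerate(history, start=1)}
    # Full connection: every to-be-evaluated storage node shares data
    # with every historical storage node.
    shares_with = {node: list(history) for node in pending}
    return compute_for, shares_with
```

Because every to-be-evaluated storage node links to every historical storage node, any computing node can reach any subset of the sample data to be evaluated through its own storage node.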

S0223, loading sample data to be evaluated in the sample data subset to be evaluated through data sharing.

As shown in FIG. 2, any computing node can load the corresponding sample data to be evaluated through the data sharing between the historical sample data storage nodes and the storage nodes of the sample data to be evaluated.

S0224, using the trained computing nodes to make a preliminary evaluation of the risk of the sample data to be evaluated.

As shown in FIG. 2, a computing node is first trained with the historical sample data on its historical sample data storage node. It then loads the corresponding sample data to be evaluated through the data sharing between the historical sample data storage nodes and the storage nodes of the sample data to be evaluated, and evaluates that data.

Further, the distributed computing framework also comprises an integrated computing node, which aggregates the preliminary evaluation results that the plurality of computing nodes produce for the sample data to be evaluated, and further evaluates the risk of the sample data to be evaluated according to those preliminary results.

In this embodiment, by configuring the correspondence between computing nodes and storage nodes and the data sharing among nodes, steps S0221 to S0224 load the sample data to be evaluated more efficiently in the distributed computing environment and use the trained computing nodes for preliminary risk evaluation. This improves overall computing and evaluation efficiency and speeds up risk identification and decision-making for the sample data to be evaluated.
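One plausible aggregation rule for the integrated computing node is a weighted average of the per-node preliminary scores. The sketch below assumes that form, with one risk weight per computing node and normalization by the weight sum. The function name and this normalization are assumptions, not the formula of the invention.

```python
from typing import Sequence


def integrate_scores(node_scores: Sequence[float], risk_weights: Sequence[float]) -> float:
    """Combine preliminary per-node risk scores into one value by weighted average."""
    if not node_scores or len(node_scores) != len(risk_weights):
        raise ValueError("need one risk weight per node score")
    total_weight = sum(risk_weights)
    if total_weight <= 0:
        raise ValueError("risk weights must sum to a positive value")
    # Weighted sum of the preliminary evaluations, normalized by total weight.
    return sum(s * w for s, w in zip(node_scores, risk_weights)) / total_weight
```

For example, two nodes scoring 0.2 and 0.6 with weights 1 and 3 would combine to (0.2 + 1.8) / 4 = 0.5.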

S023, building a calculation model on any calculation node in the distributed calculation environment, and training the calculation model by using sample data on a storage node corresponding to the calculation node.

Further, the computing models on any two computing nodes in the distributed computing environment may be the same or different, depending on the specific needs and goals of the distributed computing task.

When the distributed computing task needs to process large-scale data in parallel, the same computing model can be copied to each computing node, so that each computing node independently processes the data subset assigned to it for model training and prediction. When the distributed computing task involves several different computing tasks or algorithm models, different computing models can be deployed on different computing nodes, so that the characteristics and advantages of each computing node are fully utilized and both the computing efficiency and the diversity of the models are improved.
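The two deployment options above (one model replicated to every node, or a different model per node) can be sketched as follows. The classes `LinearScorer` and `ThresholdScorer` and the helper `assign_models` are hypothetical names introduced only for illustration; the source does not prescribe particular model types.

```python
class LinearScorer:
    """Hypothetical model: weighted sum of risk features."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))


class ThresholdScorer:
    """Hypothetical alternative: flags a sample whose first feature exceeds a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, x):
        return 1.0 if x[0] > self.cutoff else 0.0


def assign_models(node_ids, factories):
    """Give each node its own model instance.

    A single factory is replicated to all nodes; otherwise one factory
    per node is required.
    """
    if len(factories) == 1:
        factories = factories * len(node_ids)
    if len(factories) != len(node_ids):
        raise ValueError("need one factory, or one factory per node")
    return {node: make() for node, make in zip(node_ids, factories)}


# Same model on every node: suited to parallel processing of data subsets.
same = assign_models(["n1", "n2", "n3"], [lambda: LinearScorer(2)])

# Different models per node: suited to heterogeneous tasks or algorithms.
mixed = assign_models(["n1", "n2"],
                      [lambda: LinearScorer(2), lambda: ThresholdScorer(0.5)])
```

Using factories rather than shared instances keeps each node's model state independent, which matches the requirement that each node trains on its own data subset.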

In an alternative embodiment, building a computing model on any computing node in the distributed computing environment, as described in step S023, includes the following steps:

S0231, respectively setting corresponding initial prediction weights based on each risk characteristic in the historical sample data set.

Specifically, each risk feature $x_i$ corresponds to an initial prediction weight $w_i$, $i = 1, 2, \dots, n$. In this embodiment, any one item of historical sample data includes n risk features.

S0232, constructing an initial prediction model by utilizing the risk characteristics and the corresponding initial prediction weights.

In this embodiment, the initial prediction model satisfies the following formula:

$$\hat{y}(x) = \sum_{i=1}^{n} w_i x_i$$

where $\hat{y}(x)$ represents the predicted tag value of the initial sample data $x$, $n$ represents the number of risk features in the initial sample data $x$, $w_i$ represents the initial prediction weight corresponding to the i-th sample data of $x$, and $x_i$ represents the i-th sample data of $x$. The value range of the weights $w_i$ (i.e., their domain $W$) may be any set of real-number combinations during the initial prediction.

S0233, selecting any historical sample data from the historical sample data set as initial sample data.

It will be appreciated that the initial sample data selected for any one computing node is from a corresponding subset of historical sample data.

S0234, combining the initial sample data with the initial prediction model to obtain a prediction label value of the initial sample data.

The selected initial sample data $x$ is substituted into the initial prediction model to obtain the corresponding predicted tag value $\hat{y}(x)$.

S0235, combining the actual tag value and the predicted tag value of the initial sample data to obtain an initial predicted residual.

In this embodiment, the initial prediction residual satisfies the following formula:

$$r(x) = y(x) - \hat{y}(x)$$

where $r(x)$ represents the prediction residual of the initial sample data $x$, i.e., the initial prediction residual of the initial prediction model, $y(x)$ represents the actual tag value of the initial sample data $x$, and $\hat{y}(x)$ represents the predicted tag value of the initial sample data $x$.
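Steps S0231 to S0235 can be illustrated with a short sketch. It assumes that the predicted tag value is the weighted sum of the risk features and that the residual is the actual tag value minus the predicted tag value; the function names `predict` and `residual` and the numeric values are hypothetical.

```python
def predict(weights, features):
    """Predicted tag value: weighted sum of the risk features (assumed form)."""
    if len(weights) != len(features):
        raise ValueError("one weight per risk feature is required")
    return sum(w * x for w, x in zip(weights, features))


def residual(actual, predicted):
    """Prediction residual: actual tag value minus predicted tag value (assumed sign)."""
    return actual - predicted


# Illustrative values only: three risk features and their initial weights.
initial_weights = [0.5, 0.25, 0.0]
initial_sample = [1.0, 2.0, 4.0]
actual_tag = 1.5

predicted_tag = predict(initial_weights, initial_sample)
initial_residual = residual(actual_tag, predicted_tag)
```

The residual measures how far the initial model is from the actual tag value of the selected sample, which is the quantity the subsequent steps use to refine the model.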

S0236, constructing a secondary prediction model by combining the initial prediction residual error with the initial prediction model.

In this embodiment, the secondary prediction model is constructed from the initial prediction model $\hat{y}(x)$ and the initial prediction residual $r(x)$, and its prediction weights satisfy

$$w^{*} = \arg\min_{w \in W'} L(w)$$

where $w^{*}$ denotes the group of elements $w$ in a subset $W'$ of the domain $W$ of the initial prediction weights that minimizes the function $L$. The function $L$ is defined in terms of the following quantities: $y$, the actual tag value of the initial sample data $x$; $x_i$, the i-th sample data of the initial sample data $x$; and $\hat{y}_1$, the predicted tag value of the first group of sample data $x$.

S0237, continuously iterating the secondary prediction model by using the remaining historical sample data to obtain a final prediction model.

Based on this iteration, the prediction weights of the final prediction model satisfy

$$w^{*} = \arg\min_{w \in W'} L(w)$$

where $w^{*}$ denotes the group of elements $w$ in a subset $W'$ of the domain $W$ of the initial prediction weights that minimizes the function $L$. The function $L$ is defined in terms of the following quantities: $N$, the total number of historical sample data in the historical sample data set; $n$, the total number of risk features in any group of historical sample data; $y_j$, the actual tag value of the j-th group of historical sample data $x_j$; $w^{(m)}$, the prediction weights corresponding to the m-th group of historical sample data $x_m$; $x_{j,i}$, the i-th sample data of the j-th group of historical sample data $x_j$; $\hat{y}_m$, the predicted tag value of the m-th group of sample data $x_m$; $x_{m,i}$, the i-th sample data of the m-th group of historical sample data $x_m$; and $r_m$, the prediction residual of the m-th group of historical sample data $x_m$.

In this embodiment, steps S0231 to S0237 can effectively improve the prediction capability and accuracy of the model by constructing a calculation model in a distributed computing environment and optimizing the model in an iterative manner. The prediction effect of the model can be gradually optimized through the iterative processing of initial prediction and residual error, so that the final prediction model is more accurate and reliable. Therefore, the accuracy and reliability of risk assessment can be improved, and more accurate wind control decision support is provided for Internet financial institutions.
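The idea of refining a model sample by sample from its prediction residual can be sketched as follows. This is not the formula of the embodiment: it swaps in a plain least-mean-squares update (stochastic gradient descent on the squared residual) as a stand-in, and the names `fit_iteratively`, `lr`, and `epochs` as well as the data are assumptions made for illustration.

```python
def predict(weights, x):
    """Weighted sum of risk features (assumed linear model)."""
    return sum(w * xi for w, xi in zip(weights, x))


def fit_iteratively(samples, labels, n_features, lr=0.1, epochs=200):
    """Refine the weights one sample at a time using the prediction residual.

    Stand-in technique: least-mean-squares update, w <- w + lr * r * x.
    """
    weights = [0.0] * n_features  # initial prediction weights
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            r = y - predict(weights, x)  # residual for this sample
            weights = [w + lr * r * xi for w, xi in zip(weights, x)]
    return weights


# Illustrative historical subset held by one computing node.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [0.2, 0.7, 0.9]

final_weights = fit_iteratively(samples, labels, n_features=2)
```

In this toy data set the tags are exactly reproducible by the weights (0.2, 0.7), so the residuals shrink toward zero as the iteration proceeds; with real data the residual would only be reduced, not eliminated.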

In this embodiment, by configuring the correspondence between computing nodes and storage nodes and the data sharing between nodes, steps S021 to S023 load the sample data to be evaluated more efficiently in the distributed computing environment and perform the preliminary risk assessment with the trained computing nodes. This improves overall computation and evaluation efficiency and speeds up risk identification and decision making for the sample data to be evaluated.

S03, integrating the trained computing nodes to generate a target wind control model, and evaluating risks of sample data to be evaluated by using the target wind control model.

Because the models on each compute node are independently trained, they may have different characteristics and behaviors. By integrating these models, advantages and disadvantages between different models can be balanced, thereby improving the performance and accuracy of the overall model.

In an optional embodiment, in order to make better use of the large volume of data for model training and for obtaining the final prediction result, the data subsets in the distributed storage are divided according to the single-control-variable principle. That is, based on the feature engineering result, any one risk feature is selected as the variable risk feature while the other risk features are fixed within set ranges. The historical sample data matching these conditions are retrieved to generate a plurality of historical sample data subsets, and these subsets are used to train a plurality of corresponding computing nodes.
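The single-control-variable division can be sketched as a filter over the historical samples. The function names `single_variable_subset` and `build_subsets`, the `fixed_ranges` mapping, and the sample values are hypothetical; the source does not specify how the ranges are chosen.

```python
def single_variable_subset(samples, variable_idx, fixed_ranges):
    """Keep samples whose non-variable features all lie inside their fixed ranges.

    fixed_ranges maps a feature index to an inclusive (low, high) range.
    The feature at variable_idx is left unconstrained.
    """
    subset = []
    for x in samples:
        others_fixed = all(
            low <= x[i] <= high
            for i, (low, high) in fixed_ranges.items()
            if i != variable_idx
        )
        if others_fixed:
            subset.append(x)
    return subset


def build_subsets(samples, fixed_ranges):
    """Build one historical subset per choice of variable risk feature."""
    n_features = len(samples[0])
    return {i: single_variable_subset(samples, i, fixed_ranges)
            for i in range(n_features)}


samples = [
    [0.1, 0.5, 0.5],
    [0.9, 0.5, 0.4],
    [0.5, 0.9, 0.5],
    [0.5, 0.5, 0.1],
]
fixed_ranges = {0: (0.4, 0.6), 1: (0.4, 0.6), 2: (0.4, 0.6)}

subsets = build_subsets(samples, fixed_ranges)
```

Each resulting subset varies mainly in one feature, so a computing node trained on it mostly learns the effect of that feature. This is why the individual prediction results are feature-biased and need to be combined afterwards.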

Each final prediction result obtained in this way is strongly biased toward a single feature. Therefore, in this embodiment, the target wind control model generated by integrating the trained computing nodes in step S031 satisfies

$$w^{*} = \arg\min_{w \in W'} L(w)$$

$$R(x) = \sum_{k=1}^{K} \alpha_k \, \hat{y}_k(x)$$

where $w^{*}$ denotes the group of elements $w$ in a subset $W'$ of the domain $W$ of the initial prediction weights that minimizes the function $L$; $w_i$ represents the prediction weight corresponding to the i-th sample data of the sample data to be evaluated $x$; $x_{j,i}$ represents the i-th sample data of the j-th group of historical sample data $x_j$; $K$ represents the number of final prediction models; $\hat{y}_k(x)$ represents the predicted tag value that the k-th final prediction model gives for the sample data to be evaluated $x$; $x_i$ represents the i-th sample data of the sample data to be evaluated $x$; $r_{k,m}$ represents the prediction residual of the m-th group of sample data $x_m$ in the process of training the k-th final prediction model; $R(x)$ represents the risk prediction value of the sample data to be evaluated $x$; and $\alpha_k$ represents the risk weight of the k-th final prediction model.

The target wind control model provided in this embodiment weights the plurality of final prediction results by introducing risk weights, so that the contribution of each final prediction model to the risk prediction can be adjusted according to its performance. The combined effect of the plurality of final prediction models thus improves the accuracy and stability of the overall risk prediction.
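The risk-weighted combination can be sketched as follows. The weighted sum follows the description above; how the risk weights are derived is not stated in the source, so the inverse-error rule in `weights_from_errors` is purely an assumption, as are all names and numbers.

```python
def combine(predictions, risk_weights):
    """Risk prediction value: risk-weighted sum of the final models' predictions."""
    if len(predictions) != len(risk_weights):
        raise ValueError("one risk weight per final prediction model is required")
    return sum(a * p for a, p in zip(risk_weights, predictions))


def weights_from_errors(errors):
    """Assumed rule: weight each model by its inverse validation error, normalized to 1."""
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]


# Illustrative values: three final prediction models scoring one sample.
validation_errors = [0.1, 0.2, 0.4]
risk_weights = weights_from_errors(validation_errors)

model_predictions = [0.8, 0.6, 0.2]
risk_value = combine(model_predictions, risk_weights)
```

Under this assumed rule the model with the smallest validation error contributes most to the risk value, which is one way of adjusting contributions "according to performance".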

The wind control model constructed through the distributed storage framework and the distributed computing framework can effectively handle the very large data volumes of the internet finance field. In addition, the wind control model can use its internal distributed computing framework for real-time data processing and model training, so that risk assessments respond quickly and are updated in real time. The distributed computing framework also provides greater computing capacity and resources, so that more complex models and more features can be processed, which improves the accuracy and predictive ability of the model. Meanwhile, a wind control model based on the distributed storage framework can provide data backup and redundancy mechanisms, which enhances the security and reliability of the data.

To better implement the above distributed-based wind control model construction method, in another alternative embodiment the invention further provides a distributed-based wind control model building system. Fig. 3 is a schematic structural diagram of the distributed-based wind control model building system provided by the embodiment of the invention.

As shown in fig. 3, the distributed wind control model building system includes an input device, a processor, a memory, and an output device, where the input device, the processor, the memory, and the output device are connected to each other, and the memory is used to store a computer program, where the computer program includes program instructions, and the processor is configured to call the program instructions to execute the distributed wind control model building method provided by the present invention.

In this embodiment, the input device is used to receive and provide data to the system. In the wind control model building process, the input device may include a data source, a data acquisition device, etc. for acquiring a historical sample data set and a sample data set to be evaluated.

The processor is a central processing unit of the system for executing the computer program instructions. In the wind control model building system, the processor is configured to call program instructions in the distributed wind control model building method to realize various operations of wind control model building.

The memory is used for storing computer programs, data, intermediate results and other information. It should be appreciated that in a distributed-based wind control model building system, memory plays an important role for storing historical sample data sets, model parameters, intermediate calculation results, etc., to support distributed data access and calculation.

The output device is used for displaying and outputting results. In the wind control model building system, the output device may include a display, a printer, etc. for displaying the built wind control model or other relevant output information.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.

Claims

1. A distributed-based wind control model construction method, characterized by comprising the following steps:

Respectively storing a historical sample data set and a sample data set to be evaluated into storage nodes, and building a distributed storage frame through the storage nodes;

constructing a distributed computing frame based on the distributed storage frame, and training corresponding computing nodes by using historical sample data stored in the distributed storage frame;

generating a target wind control model by integrating the trained computing nodes, and evaluating the risk of sample data to be evaluated by using the target wind control model;

wherein respectively storing the historical sample data set and the sample data set to be evaluated into storage nodes and building the distributed storage frame through the storage nodes comprises the following steps:

Acquiring a historical sample data set and a sample data set to be evaluated;

dividing the historical sample data set and the sample data set to be evaluated respectively to generate one or more historical sample data subsets and one or more sample data subsets to be evaluated;

Storing the historical sample data subset and the sample data subset to be evaluated in corresponding storage nodes respectively;

Summarizing storage nodes corresponding to all the historical sample data subsets to generate a historical sample distributed storage frame;

summarizing storage nodes corresponding to all sample data subsets to be evaluated, and generating a sample distributed storage frame to be evaluated;

wherein constructing the distributed computing frame based on the distributed storage frame and training the corresponding computing nodes by using the historical sample data stored in the distributed storage frame comprises the following steps:

building a distributed computing framework, wherein the distributed computing framework comprises two or more computing nodes;

setting a distributed computing environment by combining the distributed computing framework and the distributed storage framework, wherein any computing node in the distributed computing framework corresponds to a storage node in the distributed storage framework;

Building a calculation model on any calculation node in the distributed calculation environment, and training the calculation model by using sample data on a storage node corresponding to the calculation node;

wherein setting the distributed computing environment by combining the distributed computing framework and the distributed storage framework comprises the following steps:

correspondingly configuring a computing node for the storage node of any one of the historical sample data subsets;

Fully connecting the storage nodes of any sample data subset to be evaluated with the storage nodes corresponding to all the historical sample data subsets, so that data sharing is realized between the two storage nodes connected with each other;

Loading sample data to be evaluated in a sample data subset to be evaluated through data sharing;

using the trained computing nodes to preliminarily evaluate the risk of the sample data to be evaluated;

wherein building a computing model on any computing node in the distributed computing environment comprises the following steps:

based on each risk characteristic in the historical sample data set, respectively setting corresponding initial prediction weights;

Constructing an initial prediction model by utilizing the risk characteristics and the corresponding initial prediction weights;

selecting any historical sample data from the historical sample data set as initial sample data;

Combining the initial sample data with an initial prediction model to obtain a prediction tag value of the initial sample data;

Combining the actual tag value and the predicted tag value of the initial sample data to obtain an initial predicted residual;

Constructing a secondary prediction model by combining the initial prediction residual error with an initial prediction model;

Continuously iterating the secondary prediction model by using the remaining historical sample data to obtain a final prediction model;

the prediction weights of the final prediction model satisfy

$$w^{*} = \arg\min_{w \in W'} L(w)$$

where $w^{*}$ denotes the group of elements $w$ in a subset $W'$ of the domain $W$ of the initial prediction weights that minimizes the function $L$, and the function $L$ is defined in terms of the following quantities: $N$, the total number of historical sample data in the historical sample data set; $n$, the total number of risk features in any group of historical sample data; $y_j$, the actual tag value of the j-th group of historical sample data $x_j$; $w^{(m)}$, the prediction weights corresponding to the m-th group of historical sample data $x_m$; $x_{j,i}$, the i-th sample data of the j-th group of historical sample data $x_j$; $\hat{y}_m$, the predicted tag value of the m-th group of sample data $x_m$; $x_{m,i}$, the i-th sample data of the m-th group of historical sample data $x_m$; and $r_m$, the prediction residual of the m-th group of historical sample data $x_m$;

the target wind control model satisfies

$$w^{*} = \arg\min_{w \in W'} L(w)$$

$$R(x) = \sum_{k=1}^{K} \alpha_k \, \hat{y}_k(x)$$

where $w^{*}$ denotes the group of elements $w$ in a subset $W'$ of the domain $W$ of the initial prediction weights that minimizes the function $L$; $w_i$ represents the prediction weight corresponding to the i-th sample data of the sample data to be evaluated $x$; $x_{j,i}$ represents the i-th sample data of the j-th group of historical sample data $x_j$; $K$ represents the number of final prediction models; $\hat{y}_k(x)$ represents the predicted tag value that the k-th final prediction model gives for the sample data to be evaluated $x$; $x_i$ represents the i-th sample data of the sample data to be evaluated $x$; $r_{k,m}$ represents the prediction residual of the m-th group of sample data $x_m$ in the process of training the k-th final prediction model; $R(x)$ represents the risk prediction value of the sample data to be evaluated $x$; and $\alpha_k$ represents the risk weight of the k-th final prediction model.

2. The distributed-based wind control model building method of claim 1, wherein any one of the sample data subsets contains an amount of sample data less than or equal to Q, wherein Q represents a sample data amount threshold in the sample data subset.

3. The distributed-based wind control model building method according to claim 1, wherein acquiring the historical sample data set and the sample data set to be evaluated comprises the following steps:

Setting risk characteristics and obtaining corresponding data according to the risk characteristics;

performing feature engineering on the data;

labeling the data after the feature engineering to obtain a history sample data set with a label;

sorting the data after the feature engineering to obtain the sample data set to be evaluated.

4. A distributed-based wind control model building system, comprising an input device, a processor, a memory, and an output device, the input device, the processor, the memory, and the output device being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the distributed-based wind control model building method of any of claims 1 to 3.