CN109582724B - Distributed automatic feature engineering system architecture - Google Patents


Info

Publication number
CN109582724B
CN109582724B (application CN201811493937.1A)
Authority
CN
China
Prior art keywords
algorithm
auc
cluster
feature
steps
Prior art date
Legal status
Active
Application number
CN201811493937.1A
Other languages
Chinese (zh)
Other versions
CN109582724A (en)
Inventor
施铭铮 (Shi Mingzheng)
刘占辉 (Liu Zhanhui)
Current Assignee
Xiamen Qianbitou Information Technology Co ltd
Original Assignee
Xiamen Qianbitou Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Qianbitou Information Technology Co ltd
Priority to CN201811493937.1A
Publication of CN109582724A
Application granted
Publication of CN109582724B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Feedback Control In General (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a distributed automatic feature engineering system architecture implemented in four parts: a distributed automatic feature computation cluster, a dimension reduction algorithm, model training, and hyper-parameter search. The architecture is rationally designed and can save an automobile finance leasing company a great deal of labor cost: work that would otherwise be done by a feature engineer and an experienced risk-control expert in this field is completed automatically by the process of the invention. The company only needs to provide raw, unprocessed business data; the invention automatically completes the whole feature engineering and model training process and outputs a final risk-control report.

Description

Distributed automatic feature engineering system architecture
Technical Field
The invention relates to a distributed automatic feature engineering system architecture, belonging to the technical field of automotive finance risk control.
Background
In the conventional automotive finance risk-control workflow, experts in the field draw up a series of risk-control rules. Each rule may contain a calculation formula evaluated over data related to the client applying for the loan, and each rule may require one or more items of client data (in this document, one item of client data is defined as a feature). If the score obtained by evaluating all the rules on the applicant's data is below the risk-control passing threshold, the client's loan application is rejected.
Feature engineering refers collectively to the process of extracting and generating features: a useful feature is often obtained by applying simple arithmetic or statistical calculations to several original features (i.e., the raw data). After a client submits loan-application data, this extraction and generation step is performed; the newly generated features are then combined with all existing features and fed into a machine learning algorithm for calculation.
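As an illustration of this kind of feature generation, the sketch below derives candidate features by simple arithmetic on hypothetical raw loan fields; every column name is illustrative only and not taken from the invention.

```python
import pandas as pd

# Hypothetical raw loan-application data; names and values are illustrative.
raw = pd.DataFrame({
    "monthly_income":   [8000.0, 12000.0, 5000.0],
    "monthly_debt":     [3000.0,  2000.0, 4000.0],
    "loan_amount":      [90000.0, 60000.0, 80000.0],
    "loan_term_months": [36, 24, 48],
})

# Simple arithmetic combinations of raw columns yield candidate features.
feats = pd.DataFrame({
    "debt_to_income":  raw["monthly_debt"] / raw["monthly_income"],
    "monthly_payment": raw["loan_amount"] / raw["loan_term_months"],
})
feats["payment_to_income"] = feats["monthly_payment"] / raw["monthly_income"]
print(feats.round(4))
```

In the invention such combinations are produced automatically rather than chosen by hand; the point here is only the arithmetic form a derived feature takes.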
Three problems arise. First, experts with enough risk-control experience to write good risk-control rules are a scarce resource. Second, even when an expert writes the rules, they are distilled from that expert's personal experience and do not represent the needs of the automotive finance industry as a whole. Third, manual extraction of new features by experts or feature engineers is time-consuming: one data set generally comes from several data sources, each source often comprises multiple data tables, and working through the permutations and combinations of the original data can take a feature engineer several weeks.
A framework capable of automatically completing the whole feature engineering and model training process and outputting the final risk-control report is therefore of great significance, and to this end the invention provides a distributed automatic feature engineering system architecture.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a distributed automatic feature engineering system architecture that solves the problems described in the background. The design is rational and can save an automobile finance leasing company a great deal of labor cost: work that would otherwise be done by a feature engineer and an experienced risk-control expert in this field is completed automatically by the process of the invention. The company only needs to provide raw, unprocessed business data; the invention automatically completes the whole feature engineering and model training process and outputs a final risk-control report.
To achieve this aim, the invention provides the following technical scheme. The distributed automatic feature engineering system architecture is realized in the following four parts:
Part one: the distributed automatic feature computation cluster. A computer cluster is required. Suppose the original data consists of one master table and n slave tables, a data form common in the automotive finance risk-control field. The cluster contains n computers, each deployed with HBase and Python, and each computer plays two roles: one is storing distributed data, the other is distributed computation. Each computer is assigned one of the n slave tables, and the master table is copied to every computer; computer n, for example, merges the master table with slave table n and generates new features. The generation of new features is controlled by a set of parameters, and different slave tables generally have different numbers of parameters: if slave table 1 has x1 parameters, slave table 2 has x2 parameters, and so on, then all n slave tables have x1 + x2 + … + xn parameters in total. The master table and all slave tables can also be cut vertically so that the work is spread more evenly across all computers of the cluster. All the features generated in part one are merged into one large table, which serves as the input data of part two. Because the feature extraction of part one generates a very large number of features, inputting all of them into the model would require a very large number of model parameters, so all features undergo dimension reduction before the data enters the model.
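The merge-and-aggregate work that part one assigns to each cluster node can be sketched as follows. The patent specifies HBase and Python but not the exact code, so this pandas-based sketch, the example tables, and the parameter list are illustrative assumptions: the "parameters" of a slave table are modeled as the column/aggregation pairs applied to it.

```python
import pandas as pd

# Master table: one row per loan application. Slave table: many rows per
# application (e.g. repayment records). All names are illustrative.
master = pd.DataFrame({"app_id": [1, 2, 3], "label": [0, 1, 0]})
slave = pd.DataFrame({
    "app_id": [1, 1, 2, 2, 2, 3],
    "amount": [100.0, 150.0, 80.0, 90.0, 70.0, 200.0],
})

# The "parameters" of this slave table: which aggregations to generate.
params = [("amount", "mean"), ("amount", "max"), ("amount", "count")]

# Aggregate the slave table per application, one new feature per parameter.
agg = slave.groupby("app_id").agg(
    **{f"{col}_{fn}": pd.NamedAgg(column=col, aggfunc=fn) for col, fn in params}
).reset_index()

# Merge back onto the master table: one feature row per application.
wide = master.merge(agg, on="app_id", how="left")
print(wide)
```

Each node would run this pattern against its own slave table; the resulting wide tables are then combined into the single large table that part two consumes.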
Part two: the dimension reduction algorithm. Its specific steps are:
(1) sort all features with a decision tree algorithm, or any other algorithm that can rank feature importance, to obtain a feature importance list; select the top x% (e.g. 10%) of the list as the base feature list, denoted L;
(2) take one feature f from the remaining features and add it to L;
(3) perform ridge regression, or a similar regression calculation, on L;
(4) compute the AUC;
(5) keep feature f if the AUC increases (i.e. accuracy improves), and remove it if the AUC does not increase;
(6) loop from step (2) to step (5) until all features have been processed; when the dimension reduction algorithm completes, the dimension-reduced feature set is obtained.
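Steps (1)-(6) above can be sketched as a greedy forward-selection loop. This is a minimal single-machine approximation using scikit-learn: a random forest stands in for the importance-ranking algorithm, a synthetic data set stands in for real risk-control data, and a simple train/test split stands in for whatever validation scheme the invention uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=30, n_informative=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# (1) rank features by importance; seed the base list L with the top 10%.
rank = np.argsort(RandomForestClassifier(random_state=0).fit(Xtr, ytr)
                  .feature_importances_)[::-1]
selected = list(rank[: max(1, len(rank) // 10)])

def auc_of(cols):
    # (3)-(4) ridge-regression scores on L, measured by AUC.
    scores = Ridge(alpha=1.0).fit(Xtr[:, cols], ytr).predict(Xte[:, cols])
    return roc_auc_score(yte, scores)

best = auc_of(selected)
# (2)+(5)+(6) try each remaining feature; keep it only if the AUC improves.
for f in rank[len(selected):]:
    trial = auc_of(selected + [f])
    if trial > best:
        best, selected = trial, selected + [f]
print(f"kept {len(selected)} of {X.shape[1]} features, AUC={best:.3f}")
```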
Part three: model training. The dimension-reduced feature set is fed into the models as input data. Training uses an ensemble (integration) scheme whose base algorithms include deep neural networks (via Python packages such as TensorFlow and Keras), gradient boosting machines (via LightGBM, xgboost, catboost) and random forests (via scikit-learn). The main steps of model training are:
(1) determine an algorithm set, e.g. a deep neural network, LightGBM, xgboost, catboost, random forest, etc.;
(2) for each algorithm in the set, input the dimension-reduced feature set, perform model evaluation with k-fold cross-validation, and finally output the AUC; each algorithm's output is a single column, namely its AUC values;
(3) merge the output columns of all algorithms into a table A; with 20 algorithms in the set, for example, each outputs one column and the final merged table A has 20 columns;
(4) determine an integration algorithm, for which logistic regression and neural networks are common choices. With logistic regression as the integration algorithm, table A is input to it and the final integrated AUC value is computed; it can be verified that the integrated AUC is more accurate than the AUC computed by any single algorithm. With 20 algorithms, the 20 algorithms are deployed respectively to 20 computers in the cluster, and the feature set output by the part-two dimension reduction algorithm is copied to all of them, so the 20 computers share the same input data while running different algorithms. The AUC columns obtained once all algorithms finish are gathered on one host for the integration calculation. The integration algorithm outputs one column; averaging its values gives the AUC mean (a single value). The AUC mean, the hyper-parameters of the part-three model training, and the x1 + x2 + … + xn parameters of the part-one feature engineering are taken as the input data of part four.
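The integration scheme of part three can be sketched as a standard stacking setup. The patent names LightGBM, xgboost and catboost, but those packages are not assumed installed here, so scikit-learn's GradientBoostingClassifier stands in for them; out-of-fold predictions from k-fold cross-validation form the per-algorithm columns merged into "table A", and logistic regression is the integration algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

# Each base algorithm contributes one out-of-fold prediction column (k-fold CV).
bases = [GradientBoostingClassifier(random_state=0),
         RandomForestClassifier(random_state=0)]
table_a = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in bases
])

# Per-algorithm AUC of each column of table A.
for col, m in zip(table_a.T, bases):
    print(type(m).__name__, "AUC:", round(roc_auc_score(y, col), 3))

# Logistic regression as the integration algorithm over table A.
blend = cross_val_predict(LogisticRegression(), table_a, y,
                          cv=5, method="predict_proba")[:, 1]
print("integrated AUC:", round(roc_auc_score(y, blend), 3))
```

In the invention each base algorithm would run on its own cluster machine and only its output column would be shipped to the integrating host; here everything runs in one process for clarity.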
Part four: hyper-parameter search. The search is treated as another round of model training whose inputs are all the hyper-parameters of part one and part three together with the AUC mean. All the hyper-parameters form a hyper-parameter space, and the goal of this training is to find the point in that space that maximizes the AUC. Bayesian optimization is chosen as the search algorithm. It is an iterative loop: after each evaluation completes, new hyper-parameter values are produced as feedback, those values are fed back into part one, and a new loop begins. Each loop (i.e. one pass from part one through part four) adds one point to the hyper-parameter space. As the number of loops grows, the value obtained by the Bayesian optimization algorithm converges; the loop stops once it has converged to within a preset threshold.
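The Bayesian-optimization loop of part four can be sketched with a Gaussian process surrogate and an expected-improvement acquisition, which is one common way to implement it (the patent does not fix the surrogate or acquisition). The one-dimensional toy objective below merely stands in for "run parts one to three and return the AUC mean"; its optimum at 0.7 is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy stand-in for the full pipeline: maps one hyper-parameter to an "AUC mean".
def pipeline_auc(h):
    return 0.9 - (h - 0.7) ** 2

rng = np.random.default_rng(0)
H = list(rng.uniform(0, 1, 3))            # a few initial random points
A = [pipeline_auc(h) for h in H]
cand = np.linspace(0, 1, 201).reshape(-1, 1)

for _ in range(10):                        # each loop = parts one..four once
    gp = GaussianProcessRegressor(normalize_y=True).fit(
        np.array(H).reshape(-1, 1), A)
    mu, sd = gp.predict(cand, return_std=True)
    sd = np.maximum(sd, 1e-9)
    imp = mu - max(A)                      # expected-improvement acquisition
    ei = imp * norm.cdf(imp / sd) + sd * norm.pdf(imp / sd)
    h_next = float(cand[np.argmax(ei), 0])  # feedback: the next point to try
    H.append(h_next)
    A.append(pipeline_auc(h_next))

print("best hyper-parameter:", round(H[int(np.argmax(A))], 2),
      "AUC:", round(max(A), 3))
```

Each appended (H, A) pair is one new point in the hyper-parameter space; the loop would stop once successive best values change by less than the preset threshold.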
In one embodiment, the concrete implementation steps of the distributed automatic feature engineering system architecture are as follows:
(1) first, a computer cluster is needed, e.g. on Alibaba Cloud (Aliyun);
(2) install the cluster editions of Apache Hadoop and Apache HBase, plus Apache Phoenix and Apache Thrift, on every computer in the cluster, so that the whole cluster functions as a distributed database;
(3) install MySQL on one computer and create a project database on it with the InnoDB engine; the project database manages the whole process, and all computers in the cluster can concurrently access it to obtain the process state data, which guarantees the consistency of the whole process;
(4) install the required Python program and Python packages on every computer of the cluster; the deployed Python code is identical on each computer;
(5) the Python programs on all computers start at the same time; parallel computation and concurrency control within the cluster are managed through the project database in MySQL;
(6) order query results are returned in JSON format, and the risk-control report is presented by the big data system.
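The coordination role of the project database in steps (3) and (5) can be sketched as a task-claiming table. SQLite (Python's stdlib) stands in for MySQL/InnoDB here, and the table and column names are hypothetical; the idea shown is only that a conditional UPDATE lets concurrent workers claim work without conflicts.

```python
import sqlite3

# Sketch of the "project database" coordinating the cluster; SQLite stands in
# for MySQL/InnoDB, and all names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tasks (
    slave_table TEXT PRIMARY KEY,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending / running / done
    worker      TEXT
)""")
db.executemany("INSERT INTO tasks (slave_table) VALUES (?)",
               [("slave_1",), ("slave_2",), ("slave_3",)])

def claim_task(worker):
    # Atomically claim one pending slave table for this worker; the status
    # check in the UPDATE's WHERE clause stops two workers taking the same one.
    with db:
        row = db.execute("SELECT slave_table FROM tasks WHERE status='pending' "
                         "ORDER BY slave_table LIMIT 1").fetchone()
        if row is None:
            return None
        n = db.execute("UPDATE tasks SET status='running', worker=? "
                       "WHERE slave_table=? AND status='pending'",
                       (worker, row[0])).rowcount
        return row[0] if n == 1 else None

print(claim_task("node_1"))  # slave_1
print(claim_task("node_2"))  # slave_2
```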
In one embodiment, the AUC in step (4) of part two is the accuracy metric commonly used in the risk-control field (the area under the ROC curve).
In one embodiment, the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, etc.
With this technical scheme, on the one hand, a great deal of labor cost can be saved for the automobile finance leasing company: work that would otherwise be done by a feature engineer and an experienced risk-control expert is completed automatically by the process of the invention, and the company only needs to provide raw, unprocessed business data. The invention automatically completes the whole feature engineering and model training process and outputs the final risk-control report. The company can choose a per-order charging mode, which offers small and medium-sized companies with low order volumes a pay-as-you-go option more flexible than hiring experts.
On the other hand, automatic feature engineering and model training are far more efficient than manual work. For a data volume of 100,000 orders, about 500 machine-hours are required; executed in parallel by a cluster of 20 computers, this takes roughly 25 hours. Work that would take a feature engineer and a risk-control expert several weeks can thus be finished in about a day once it is distributed across the cluster, and after the full data set has been analyzed, a subsequent per-order risk-control report query returns its result in under one second.
in addition, in the aspect of accuracy, because the distributed automatic feature engineering system framework uses a Bayesian optimization algorithm to search all possible feature combinations, new and useful features which are not found manually can be found, the automatic feature extraction by a computer can achieve the same effect as the manual feature extraction by a wind control expert, the computer calculation is based on data and objective, and the model does not depend on any artificial subjective generation rule, so that the inaccuracy of the model caused by errors of the experience or subjectivity of the feature engineer and the wind control expert can be avoided.
Drawings
FIG. 1 is a functional block diagram of a distributed automatic feature engineering system architecture of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a distributed automatic feature engineering system architecture, which is implemented as follows:
Part one through part four proceed exactly as set out in the Disclosure of Invention above: the distributed automatic feature computation cluster merges the master table with each assigned slave table and generates parameter-controlled features; the dimension reduction algorithm ranks the features and greedily keeps those that raise the AUC under ridge regression; model training evaluates each algorithm in the algorithm set with k-fold cross-validation and integrates their output columns with logistic regression to obtain the AUC mean; and Bayesian optimization iterates over the hyper-parameter space until convergence to a preset threshold.
In this embodiment, the concrete implementation steps (cluster setup, project database, Python deployment and report output), the definition of the AUC, the list of Python packages, and the advantages in labor cost, efficiency and accuracy are likewise as described above and are not repeated here.
Furthermore, although this description is organized by embodiments, not every embodiment contains only a single technical solution; the description is written this way merely for clarity. Those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be combined appropriately to form further embodiments understandable to those skilled in the art.

Claims (4)

1. A distributed automatic feature engineering system architecture, characterized in that the distributed automatic feature engineering system architecture is realized through the following steps:
Part one: a distributed automatic feature computation cluster; a computer cluster is required, and the original data consists of one master table and n slave tables, a data form common in the automotive finance risk-control field; n computers are distributed in the cluster, each deployed with HBase and Python, and each computer has two roles, one being storage of distributed data and the other being distributed computation; each computer is assigned one of the n slave tables, and the master table is copied to every computer; computer n merges the master table with slave table n and generates new features, the generation of which is controlled by a set of parameters; different slave tables have different numbers of parameters, slave table 1 having x1 parameters, slave table 2 having x2 parameters, and so on, so that all n slave tables have x1 + x2 + … + xn parameters in total; the master table and all slave tables can be cut vertically so as to be distributed more evenly to all computers of the cluster; all the features generated in part one are merged into one large table serving as the input data of part two; the feature extraction of part one generates a large number of features, and inputting all of them into the model would require a large number of model parameters, so all features undergo dimension reduction before the data enters the model;
step two: a dimension reduction algorithm; the dimension reduction algorithm comprises the following specific steps:
(1) sort all features with a decision tree algorithm to obtain a feature importance list, and select the top x% of features in the list as the basic feature list, denoted L;
(2) take one feature f from the remaining features and add it to L;
(3) perform a ridge regression calculation on L;
(4) calculate the AUC;
(5) retain feature f if the AUC increases, and remove feature f if the AUC does not increase;
(6) loop from step (2) to step (5) until all features are processed; when the dimension reduction algorithm completes, the feature set after dimension reduction is obtained;
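A minimal python sketch of the greedy dimension reduction loop above, using scikit-learn stand-ins (a random forest for the decision-tree ranking, `RidgeClassifier` for the ridge regression) on a synthetic dataset; the 25% cutoff for x% is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table produced by part one.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Step (1): rank features with a tree-based model, keep the top x%.
imp = RandomForestClassifier(random_state=0).fit(Xtr, ytr).feature_importances_
rank = np.argsort(imp)[::-1]
x_pct = 0.25                          # hypothetical choice of x%
base = list(rank[: int(len(rank) * x_pct)])

def auc_of(cols):
    """Ridge regression on the listed feature columns, scored by AUC."""
    m = RidgeClassifier().fit(Xtr[:, cols], ytr)
    return roc_auc_score(yte, m.decision_function(Xte[:, cols]))

# Steps (2)-(6): add each remaining feature f, keep it only if AUC rises.
best = auc_of(base)
for f in rank[len(base):]:
    trial = base + [f]
    a = auc_of(trial)
    if a > best:
        base, best = trial, a         # retain f
reduced_feature_set = base
```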
step three: model training; the feature set after dimension reduction is used as input data and put into a model for training; model training uses an ensemble algorithm that comprises a deep neural network, gradient boosting machines and a random forest algorithm, and mainly comprises the following steps:
(1) determine the algorithm set, comprising a deep neural network, LightGBM, xgboost, catboost and a random forest;
(2) for each algorithm in the algorithm set, input the feature set after dimension reduction, perform model evaluation with k-fold cross validation, and output a single column of scores per algorithm, from which that algorithm's AUC value is calculated;
(3) combine the output columns of all algorithms into a table A; each algorithm's output column is treated as a feature and serves as the input of the next-stage algorithm;
(4) determine the integration algorithm: logistic regression is used as the integration algorithm; table A is input into the logistic regression algorithm, and the final integrated AUC value is calculated; it can be verified that the integrated AUC value is more accurate than the AUC value calculated by any single algorithm; the algorithms are deployed on separate computers in the cluster, and the feature set output by the second-part dimension reduction algorithm is copied to all computers, so the input data of the computers are the same but the algorithms they run differ; the output columns obtained after all algorithm calculations complete are collected on a host computer for the integration calculation; the column calculated by the integration algorithm is averaged to obtain the AUC mean, and the AUC mean, the hyperparameters contained in the third-part model training, and the x1+x2+…+xn parameters of the first-part feature engineering are taken as the input data of the fourth part;
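The stacking scheme above can be sketched as follows; since LightGBM, xgboost, catboost and a deep neural network framework may not be available here, scikit-learn models are substituted as stand-ins for the base algorithm set, and the integration model is evaluated in-sample for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the feature set output by the dimension reduction.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step (1): the base algorithm set (sklearn stand-ins for DNN/GBM/RF).
base_models = {
    "dnn": MLPClassifier(max_iter=500, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
}

# Steps (2)-(3): each algorithm emits one out-of-fold score column via
# k-fold cross validation; the columns together form table A.
table_A = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models.values()
])

# Step (4): logistic regression as the integration algorithm.
meta = LogisticRegression().fit(table_A, y)
ensemble_auc = roc_auc_score(y, meta.predict_proba(table_A)[:, 1])
```

In the cluster, each base model would run on its own computer against a copy of the same input, and only the output columns would be collected on the host for the integration step.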
step four: hyperparameter search; the hyperparameter search is treated as another model training, whose input is all hyperparameters of the first and third parts together with the AUC mean; all hyperparameters form a hyperparameter space, and the goal of this training is to find the point in the hyperparameter space at which the AUC attains its maximum; Bayesian optimization is chosen as the search algorithm; the Bayesian optimization algorithm is an iterative loop in which, after each calculation completes, new hyperparameters are generated and fed back as input to the first part to start a new loop, each loop adding one point to the hyperparameter space; as the number of loops grows, the values obtained by the Bayesian optimization algorithm converge, and the loop stops once they converge to a preset threshold.
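A self-contained sketch of the Bayesian optimization loop of step four, assuming a single hyperparameter and a toy surrogate in place of rerunning parts one to three; a Gaussian-process model with an expected-improvement acquisition is one standard realization of Bayesian optimization, used here for illustration:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def pipeline_auc(h):
    """Toy surrogate for 'run parts one to three with hyperparameter h
    and return the AUC mean'; its maximum is at h = 0.3."""
    return 0.9 - 0.4 * (h - 0.3) ** 2

rng = np.random.default_rng(0)
H = rng.uniform(0.0, 1.0, 3)                 # initial hyperparameter points
A = np.array([pipeline_auc(h) for h in H])   # their AUC means
candidates = np.linspace(0.0, 1.0, 201)      # discretized hyperparameter space

for _ in range(15):                          # fixed budget in place of a threshold test
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
    gp.fit(H.reshape(-1, 1), A)
    mu, sd = gp.predict(candidates.reshape(-1, 1), return_std=True)
    z = (mu - A.max()) / np.maximum(sd, 1e-9)
    ei = (mu - A.max()) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    h_new = candidates[np.argmax(ei)]        # next point to try
    H = np.append(H, h_new)                  # feed back, start a new loop
    A = np.append(A, pipeline_auc(h_new))

best_h = H[np.argmax(A)]                     # hyperparameter with best AUC so far
```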
2. The distributed automated feature engineering system architecture of claim 1, wherein the distributed automatic feature engineering system architecture is implemented in the following steps:
(1) first, a computer cluster is required;
(2) the Apache Hadoop cluster edition, the Apache HBase cluster edition, Apache Phoenix and Apache Thrift are installed on each computer of the cluster, so that the whole cluster functions as a distributed database;
(3) MySQL is installed on one computer, and a project database is created on it with the InnoDB database engine; the project database manages the whole process, and all computers on the cluster can concurrently access it to obtain process state data, guaranteeing the consistency of the whole process;
(4) the required python program and python packages are installed on each computer of the cluster, the python code deployed on every computer being the same;
(5) the python programs on all computers start at the same time, and parallel computation and concurrency control within the cluster are managed through the project database in MySQL;
(6) the order query result is returned in json format, and a wind control report is presented by the big data system.
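The coordination pattern described above, a shared project database that concurrent workers update transactionally, with query results returned as json, can be sketched with sqlite3 standing in for MySQL/InnoDB; the table name, columns and worker names are hypothetical:

```python
import json
import sqlite3

# sqlite3 stands in for the MySQL/InnoDB project database of step (3).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_state (worker TEXT PRIMARY KEY, status TEXT)")

def set_status(worker, status):
    # Each cluster process updates its own row inside a transaction,
    # so concurrent readers always see a consistent process state.
    with db:
        db.execute(
            "INSERT INTO task_state VALUES (?, ?) "
            "ON CONFLICT(worker) DO UPDATE SET status = excluded.status",
            (worker, status),
        )

for w in ("node1", "node2", "node3"):
    set_status(w, "running")
set_status("node2", "done")

# The query result is returned in json format, as in step (6).
rows = db.execute("SELECT worker, status FROM task_state ORDER BY worker")
report = json.dumps(dict(rows))
```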
3. The distributed automated feature engineering system architecture of claim 1, wherein the AUC in step two is the accuracy metric commonly used in the field of wind control.
4. The distributed automated feature engineering system architecture of claim 2, wherein the python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost and catboost.
CN201811493937.1A 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture Active CN109582724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811493937.1A CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811493937.1A CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Publications (2)

Publication Number Publication Date
CN109582724A CN109582724A (en) 2019-04-05
CN109582724B true CN109582724B (en) 2022-04-08

Family

ID=65929000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811493937.1A Active CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Country Status (1)

Country Link
CN (1) CN109582724B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861047A (en) * 2019-04-08 2020-10-30 阿里巴巴集团控股有限公司 Data processing control method and computing device
CN112380205B (en) * 2020-11-17 2024-04-02 北京融七牛信息技术有限公司 Automatic feature generation method and system of distributed architecture
CN112801304A (en) * 2021-03-17 2021-05-14 中奥智能工业研究院(南京)有限公司 Automatic data analysis and modeling process

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106339608A (en) * 2016-11-09 2017-01-18 中国科学院软件研究所 Traffic accident rate predicting system based on online variational Bayesian support vector regression
CN107103332A (en) * 2017-04-07 2017-08-29 武汉理工大学 A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN107516135A (en) * 2017-07-14 2017-12-26 浙江大学 A kind of automation monitoring learning method for supporting multi-source data
CN108154430A (en) * 2017-12-28 2018-06-12 上海氪信信息技术有限公司 A kind of credit scoring construction method based on machine learning and big data technology
CN108304941A (en) * 2017-12-18 2018-07-20 中国软件与技术服务股份有限公司 A kind of failure prediction method based on machine learning

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US20140310221A1 (en) * 2013-04-12 2014-10-16 Nec Laboratories America, Inc. Interpretable sparse high-order boltzmann machines
US10380609B2 (en) * 2015-02-10 2019-08-13 EverString Innovation Technology Web crawling for use in providing leads generation and engagement recommendations
CN108566364B (en) * 2018-01-15 2021-01-12 中国人民解放军国防科技大学 Intrusion detection method based on neural network

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN106339608A (en) * 2016-11-09 2017-01-18 中国科学院软件研究所 Traffic accident rate predicting system based on online variational Bayesian support vector regression
CN107103332A (en) * 2017-04-07 2017-08-29 武汉理工大学 A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN107516135A (en) * 2017-07-14 2017-12-26 浙江大学 A kind of automation monitoring learning method for supporting multi-source data
CN108304941A (en) * 2017-12-18 2018-07-20 中国软件与技术服务股份有限公司 A kind of failure prediction method based on machine learning
CN108154430A (en) * 2017-12-28 2018-06-12 上海氪信信息技术有限公司 A kind of credit scoring construction method based on machine learning and big data technology

Non-Patent Citations (2)

Title
Research on feature selection and detection performance in fraudulent web page mining; Wang Jiaqing; China Masters' Theses Full-text Database, Information Science and Technology; 2018-10-15 (No. 10); I139-147 *
Research on automated feature engineering and parameter tuning algorithms; Zhang Hao; China Masters' Theses Full-text Database, Information Science and Technology; 2018-09-15 (No. 9); I138-188 *

Also Published As

Publication number Publication date
CN109582724A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN104881706B (en) A kind of power-system short-term load forecasting method based on big data technology
US10606862B2 (en) Method and apparatus for data processing in data modeling
Vinodh et al. Lean Six Sigma project selection using hybrid approach based on fuzzy DEMATEL–ANP–TOPSIS
CN109582724B (en) Distributed automatic feature engineering system architecture
US20170330078A1 (en) Method and system for automated model building
Pfahringer Semi-random model tree ensembles: An effective and scalable regression method
US11366806B2 (en) Automated feature generation for machine learning application
Akopov et al. A multi-agent genetic algorithm for multi-objective optimization
CN114722729B (en) Automatic cutter recommendation method and device, terminal and storage medium
US20170076307A1 (en) Price estimation device, price estimation method, and recording medium
CN111476274B (en) Big data predictive analysis method, system, device and storage medium
CN113032367A (en) Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system
US11775876B2 (en) Methods and systems for relating features with labels in electronics
CN116245019A (en) Load prediction method, system, device and storage medium based on Bagging sampling and improved random forest algorithm
Garciarena et al. Evolving imputation strategies for missing data in classification problems with TPOT
Telipenko et al. Results of research on development of an intellectual information system of bankruptcy risk assessment of the enterprise
CN115879824A (en) Method, device, equipment and medium for assisting expert decision based on ensemble learning
Campi et al. Parametric cost modelling of components for turbomachines: Preliminary study
CN111160048B (en) Translation engine optimization system and method based on cluster evolution
Zhigirev et al. Introducing a Novel Method for Smart Expansive Systems' Operation Risk Synthesis. Mathematics 2022, 10, 427
Yemets et al. Combinatorial optimization under uncertainty
Montevechi et al. Ensemble-Based Infill Search Simulation Optimization Framework
Mandel Expert-statistical data processing using the method of analogs
Khalyasmaa et al. Performance analysis of scientific and technical solutions evaluation based on machine learning
Krishnamoorthy et al. Optimizing stochastic temporal manufacturing processes with inventories: An efficient heuristic algorithm based on deterministic approximations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant