CN109582724B - Distributed automatic feature engineering system architecture - Google Patents


Info

Publication number
CN109582724B
CN109582724B (application CN201811493937.1A)
Authority
CN
China
Prior art keywords
algorithm
auc
cluster
feature
steps
Prior art date
Legal status
Active
Application number
CN201811493937.1A
Other languages
Chinese (zh)
Other versions
CN109582724A (en)
Inventor
施铭铮 (Shi Mingzheng)
刘占辉 (Liu Zhanhui)
Current Assignee
Xiamen Qianbitou Information Technology Co ltd
Original Assignee
Xiamen Qianbitou Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Qianbitou Information Technology Co ltd
Priority to CN201811493937.1A
Publication of CN109582724A
Application granted
Publication of CN109582724B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Feedback Control In General (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a distributed automatic feature engineering system architecture implemented in four parts: a distributed automatic feature computation cluster, a dimension reduction algorithm, model training, and hyper-parameter search. The architecture is rationally designed and can save an automobile finance leasing company a great deal of labor cost: work that would otherwise be done by a feature engineer and an experienced risk-control expert in this field is completed automatically by the process of the invention. The company only needs to provide raw, unprocessed business data; the invention automatically completes the whole feature engineering and model training process and outputs a final risk-control report.

Description

Distributed automatic feature engineering system architecture
Technical Field
The invention relates to a distributed automatic feature engineering system architecture, belonging to the technical field of automotive finance risk control.
Background
In the conventional automotive finance risk-control workflow, experts in the field draw up a series of risk-control rules. Each rule may contain a calculation formula evaluated over data related to the client applying for the loan, and each rule may require one or more items of client data (in this document, one item of client data is defined as a feature). If the score obtained by evaluating all the rules on the applicant's data is below the risk-control passing threshold, the client's loan application is rejected.
Feature engineering refers collectively to the process of extracting and generating features: a useful feature is often obtained by applying simple arithmetic or statistical calculations to several original features (i.e., the raw data). After a client submits loan-application data, this extraction and generation step is performed; the newly generated features are then combined with all existing features and fed into a machine learning algorithm for calculation.
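As an illustration of this kind of feature generation, the sketch below derives candidate features by simple arithmetic on hypothetical raw loan fields; every column name is illustrative only and not taken from the invention.

```python
import pandas as pd

# Hypothetical raw loan-application data; names and values are illustrative.
raw = pd.DataFrame({
    "monthly_income":   [8000.0, 12000.0, 5000.0],
    "monthly_debt":     [3000.0,  2000.0, 4000.0],
    "loan_amount":      [90000.0, 60000.0, 80000.0],
    "loan_term_months": [36, 24, 48],
})

# Simple arithmetic combinations of raw columns yield candidate features.
feats = pd.DataFrame({
    "debt_to_income":  raw["monthly_debt"] / raw["monthly_income"],
    "monthly_payment": raw["loan_amount"] / raw["loan_term_months"],
})
feats["payment_to_income"] = feats["monthly_payment"] / raw["monthly_income"]
print(feats.round(4))
```

In the invention such combinations are produced automatically rather than chosen by hand; the point here is only the arithmetic form a derived feature takes.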
Three problems arise. First, experts with enough risk-control experience to write good risk-control rules are a scarce resource. Second, even when an expert writes the rules, they are distilled from that expert's personal experience and do not represent the needs of the automotive finance industry as a whole. Third, manual extraction of new features by experts or feature engineers is time-consuming: one data set generally comes from several data sources, each source often comprises multiple data tables, and working through the permutations and combinations of the original data can take a feature engineer several weeks.
A framework capable of automatically completing the whole feature engineering and model training process and outputting the final risk-control report is therefore of great significance, and to this end the invention provides a distributed automatic feature engineering system architecture.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a distributed automatic feature engineering system architecture that solves the problems described in the background. The design is rational and can save an automobile finance leasing company a great deal of labor cost: work that would otherwise be done by a feature engineer and an experienced risk-control expert in this field is completed automatically by the process of the invention. The company only needs to provide raw, unprocessed business data; the invention automatically completes the whole feature engineering and model training process and outputs a final risk-control report.
To achieve this aim, the invention provides the following technical scheme. The distributed automatic feature engineering system architecture is realized in the following four parts:
Part one: the distributed automatic feature computation cluster. A computer cluster is required. Suppose the original data consists of one master table and n slave tables, a data form common in the automotive finance risk-control field. The cluster contains n computers, each deployed with HBase and Python, and each computer plays two roles: one is storing distributed data, the other is distributed computation. Each computer is assigned one of the n slave tables, and the master table is copied to every computer; computer n, for example, merges the master table with slave table n and generates new features. The generation of new features is controlled by a set of parameters, and different slave tables generally have different numbers of parameters: if slave table 1 has x1 parameters, slave table 2 has x2 parameters, and so on, then all n slave tables have x1 + x2 + … + xn parameters in total. The master table and all slave tables can also be cut vertically so that the work is spread more evenly across all computers of the cluster. All the features generated in part one are merged into one large table, which serves as the input data of part two. Because the feature extraction of part one generates a very large number of features, inputting all of them into the model would require a very large number of model parameters, so all features undergo dimension reduction before the data enters the model.
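The merge-and-aggregate work that part one assigns to each cluster node can be sketched as follows. The patent specifies HBase and Python but not the exact code, so this pandas-based sketch, the example tables, and the parameter list are illustrative assumptions: the "parameters" of a slave table are modeled as the column/aggregation pairs applied to it.

```python
import pandas as pd

# Master table: one row per loan application. Slave table: many rows per
# application (e.g. repayment records). All names are illustrative.
master = pd.DataFrame({"app_id": [1, 2, 3], "label": [0, 1, 0]})
slave = pd.DataFrame({
    "app_id": [1, 1, 2, 2, 2, 3],
    "amount": [100.0, 150.0, 80.0, 90.0, 70.0, 200.0],
})

# The "parameters" of this slave table: which aggregations to generate.
params = [("amount", "mean"), ("amount", "max"), ("amount", "count")]

# Aggregate the slave table per application, one new feature per parameter.
agg = slave.groupby("app_id").agg(
    **{f"{col}_{fn}": pd.NamedAgg(column=col, aggfunc=fn) for col, fn in params}
).reset_index()

# Merge back onto the master table: one feature row per application.
wide = master.merge(agg, on="app_id", how="left")
print(wide)
```

Each node would run this pattern against its own slave table; the resulting wide tables are then combined into the single large table that part two consumes.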
Part two: the dimension reduction algorithm. Its specific steps are:
(1) sort all features with a decision tree algorithm, or any other algorithm that can rank feature importance, to obtain a feature importance list; select the top x% (e.g. 10%) of the list as the base feature list, denoted L;
(2) take one feature f from the remaining features and add it to L;
(3) perform ridge regression, or a similar regression calculation, on L;
(4) compute the AUC;
(5) keep feature f if the AUC increases (i.e. accuracy improves), and remove it if the AUC does not increase;
(6) loop from step (2) to step (5) until all features have been processed; when the dimension reduction algorithm completes, the dimension-reduced feature set is obtained.
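Steps (1)-(6) above can be sketched as a greedy forward-selection loop. This is a minimal single-machine approximation using scikit-learn: a random forest stands in for the importance-ranking algorithm, a synthetic data set stands in for real risk-control data, and a simple train/test split stands in for whatever validation scheme the invention uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=30, n_informative=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# (1) rank features by importance; seed the base list L with the top 10%.
rank = np.argsort(RandomForestClassifier(random_state=0).fit(Xtr, ytr)
                  .feature_importances_)[::-1]
selected = list(rank[: max(1, len(rank) // 10)])

def auc_of(cols):
    # (3)-(4) ridge-regression scores on L, measured by AUC.
    scores = Ridge(alpha=1.0).fit(Xtr[:, cols], ytr).predict(Xte[:, cols])
    return roc_auc_score(yte, scores)

best = auc_of(selected)
# (2)+(5)+(6) try each remaining feature; keep it only if the AUC improves.
for f in rank[len(selected):]:
    trial = auc_of(selected + [f])
    if trial > best:
        best, selected = trial, selected + [f]
print(f"kept {len(selected)} of {X.shape[1]} features, AUC={best:.3f}")
```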
Part three: model training. The dimension-reduced feature set is fed into the models as input data. Training uses an ensemble (integration) scheme whose base algorithms include deep neural networks (via Python packages such as TensorFlow and Keras), gradient boosting machines (via LightGBM, xgboost, catboost) and random forests (via scikit-learn). The main steps of model training are:
(1) determine an algorithm set, e.g. a deep neural network, LightGBM, xgboost, catboost, random forest, etc.;
(2) for each algorithm in the set, input the dimension-reduced feature set, perform model evaluation with k-fold cross-validation, and finally output the AUC; each algorithm's output is a single column, namely its AUC values;
(3) merge the output columns of all algorithms into a table A; with 20 algorithms in the set, for example, each outputs one column and the final merged table A has 20 columns;
(4) determine an integration algorithm, for which logistic regression and neural networks are common choices. With logistic regression as the integration algorithm, table A is input to it and the final integrated AUC value is computed; it can be verified that the integrated AUC is more accurate than the AUC computed by any single algorithm. With 20 algorithms, the 20 algorithms are deployed respectively to 20 computers in the cluster, and the feature set output by the part-two dimension reduction algorithm is copied to all of them, so the 20 computers share the same input data while running different algorithms. The AUC columns obtained once all algorithms finish are gathered on one host for the integration calculation. The integration algorithm outputs one column; averaging its values gives the AUC mean (a single value). The AUC mean, the hyper-parameters of the part-three model training, and the x1 + x2 + … + xn parameters of the part-one feature engineering are taken as the input data of part four.
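The integration scheme of part three can be sketched as a standard stacking setup. The patent names LightGBM, xgboost and catboost, but those packages are not assumed installed here, so scikit-learn's GradientBoostingClassifier stands in for them; out-of-fold predictions from k-fold cross-validation form the per-algorithm columns merged into "table A", and logistic regression is the integration algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

# Each base algorithm contributes one out-of-fold prediction column (k-fold CV).
bases = [GradientBoostingClassifier(random_state=0),
         RandomForestClassifier(random_state=0)]
table_a = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in bases
])

# Per-algorithm AUC of each column of table A.
for col, m in zip(table_a.T, bases):
    print(type(m).__name__, "AUC:", round(roc_auc_score(y, col), 3))

# Logistic regression as the integration algorithm over table A.
blend = cross_val_predict(LogisticRegression(), table_a, y,
                          cv=5, method="predict_proba")[:, 1]
print("integrated AUC:", round(roc_auc_score(y, blend), 3))
```

In the invention each base algorithm would run on its own cluster machine and only its output column would be shipped to the integrating host; here everything runs in one process for clarity.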
Part four: hyper-parameter search. The search is treated as another round of model training whose inputs are all the hyper-parameters of part one and part three together with the AUC mean. All the hyper-parameters form a hyper-parameter space, and the goal of this training is to find the point in that space that maximizes the AUC. Bayesian optimization is chosen as the search algorithm. It is an iterative loop: after each evaluation completes, new hyper-parameter values are produced as feedback, those values are fed back into part one, and a new loop begins. Each loop (i.e. one pass from part one through part four) adds one point to the hyper-parameter space. As the number of loops grows, the value obtained by the Bayesian optimization algorithm converges; the loop stops once it has converged to within a preset threshold.
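The Bayesian-optimization loop of part four can be sketched with a Gaussian process surrogate and an expected-improvement acquisition, which is one common way to implement it (the patent does not fix the surrogate or acquisition). The one-dimensional toy objective below merely stands in for "run parts one to three and return the AUC mean"; its optimum at 0.7 is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy stand-in for the full pipeline: maps one hyper-parameter to an "AUC mean".
def pipeline_auc(h):
    return 0.9 - (h - 0.7) ** 2

rng = np.random.default_rng(0)
H = list(rng.uniform(0, 1, 3))            # a few initial random points
A = [pipeline_auc(h) for h in H]
cand = np.linspace(0, 1, 201).reshape(-1, 1)

for _ in range(10):                        # each loop = parts one..four once
    gp = GaussianProcessRegressor(normalize_y=True).fit(
        np.array(H).reshape(-1, 1), A)
    mu, sd = gp.predict(cand, return_std=True)
    sd = np.maximum(sd, 1e-9)
    imp = mu - max(A)                      # expected-improvement acquisition
    ei = imp * norm.cdf(imp / sd) + sd * norm.pdf(imp / sd)
    h_next = float(cand[np.argmax(ei), 0])  # feedback: the next point to try
    H.append(h_next)
    A.append(pipeline_auc(h_next))

print("best hyper-parameter:", round(H[int(np.argmax(A))], 2),
      "AUC:", round(max(A), 3))
```

Each appended (H, A) pair is one new point in the hyper-parameter space; the loop would stop once successive best values change by less than the preset threshold.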
In one embodiment, the concrete implementation steps of the distributed automatic feature engineering system architecture are as follows:
(1) first, a computer cluster is needed, e.g. on Alibaba Cloud (Aliyun);
(2) install the cluster editions of Apache Hadoop and Apache HBase, plus Apache Phoenix and Apache Thrift, on every computer in the cluster, so that the whole cluster functions as a distributed database;
(3) install MySQL on one computer and create a project database on it with the InnoDB engine; the project database manages the whole process, and all computers in the cluster can concurrently access it to obtain the process state data, which guarantees the consistency of the whole process;
(4) install the required Python program and Python packages on every computer of the cluster; the deployed Python code is identical on each computer;
(5) the Python programs on all computers start at the same time; parallel computation and concurrency control within the cluster are managed through the project database in MySQL;
(6) order query results are returned in JSON format, and the risk-control report is presented by the big data system.
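The coordination role of the project database in steps (3) and (5) can be sketched as a task-claiming table. SQLite (Python's stdlib) stands in for MySQL/InnoDB here, and the table and column names are hypothetical; the idea shown is only that a conditional UPDATE lets concurrent workers claim work without conflicts.

```python
import sqlite3

# Sketch of the "project database" coordinating the cluster; SQLite stands in
# for MySQL/InnoDB, and all names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tasks (
    slave_table TEXT PRIMARY KEY,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending / running / done
    worker      TEXT
)""")
db.executemany("INSERT INTO tasks (slave_table) VALUES (?)",
               [("slave_1",), ("slave_2",), ("slave_3",)])

def claim_task(worker):
    # Atomically claim one pending slave table for this worker; the status
    # check in the UPDATE's WHERE clause stops two workers taking the same one.
    with db:
        row = db.execute("SELECT slave_table FROM tasks WHERE status='pending' "
                         "ORDER BY slave_table LIMIT 1").fetchone()
        if row is None:
            return None
        n = db.execute("UPDATE tasks SET status='running', worker=? "
                       "WHERE slave_table=? AND status='pending'",
                       (worker, row[0])).rowcount
        return row[0] if n == 1 else None

print(claim_task("node_1"))  # slave_1
print(claim_task("node_2"))  # slave_2
```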
In one embodiment, the AUC in step (4) of part two is the accuracy metric commonly used in the risk-control field (the area under the ROC curve).
In one embodiment, the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, etc.
With this technical scheme, on the one hand, a great deal of labor cost can be saved for the automobile finance leasing company: work that would otherwise be done by a feature engineer and an experienced risk-control expert is completed automatically by the process of the invention, and the company only needs to provide raw, unprocessed business data. The invention automatically completes the whole feature engineering and model training process and outputs the final risk-control report. The company can choose a per-order charging mode, which offers small and medium-sized companies with low order volumes a pay-as-you-go option more flexible than hiring experts.
On the other hand, automatic feature engineering and model training are far more efficient than manual work. For a data volume of 100,000 orders, about 500 machine-hours are required; executed in parallel by a cluster of 20 computers, this takes roughly 25 hours. Work that would take a feature engineer and a risk-control expert several weeks can thus be finished in about a day once it is distributed across the cluster, and after the full data set has been analyzed, a subsequent per-order risk-control report query returns its result in under one second.
in addition, in the aspect of accuracy, because the distributed automatic feature engineering system framework uses a Bayesian optimization algorithm to search all possible feature combinations, new and useful features which are not found manually can be found, the automatic feature extraction by a computer can achieve the same effect as the manual feature extraction by a wind control expert, the computer calculation is based on data and objective, and the model does not depend on any artificial subjective generation rule, so that the inaccuracy of the model caused by errors of the experience or subjectivity of the feature engineer and the wind control expert can be avoided.
Drawings
FIG. 1 is a functional block diagram of a distributed automatic feature engineering system architecture of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a distributed automatic feature engineering system architecture, which is implemented as follows:
Part one through part four proceed exactly as set out in the Disclosure of Invention above: the distributed automatic feature computation cluster merges the master table with each assigned slave table and generates parameter-controlled features; the dimension reduction algorithm ranks the features and greedily keeps those that raise the AUC under ridge regression; model training evaluates each algorithm in the algorithm set with k-fold cross-validation and integrates their output columns with logistic regression to obtain the AUC mean; and Bayesian optimization iterates over the hyper-parameter space until convergence to a preset threshold.
In this embodiment, the concrete implementation steps (cluster setup, project database, Python deployment and report output), the definition of the AUC, the list of Python packages, and the advantages in labor cost, efficiency and accuracy are likewise as described above and are not repeated here.
Furthermore, although this description is organized by embodiments, not every embodiment contains only a single technical solution; the description is written this way merely for clarity. Those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be combined appropriately to form further embodiments understandable to those skilled in the art.

Claims (4)

1. A distributed automatic feature engineering system architecture, characterized in that the distributed automatic feature engineering system architecture is realized through the following steps:
Part one: a distributed automatic feature computation cluster; a computer cluster is required, and the original data consists of one master table and n slave tables, a data form common in the automotive finance risk-control field; n computers are distributed in the cluster, each deployed with HBase and Python, and each computer has two roles, one being storage of distributed data and the other being distributed computation; each computer is assigned one of the n slave tables, and the master table is copied to every computer; computer n merges the master table with slave table n and generates new features, the generation of which is controlled by a set of parameters; different slave tables have different numbers of parameters, slave table 1 having x1 parameters, slave table 2 having x2 parameters, and so on, so that all n slave tables have x1 + x2 + … + xn parameters in total; the master table and all slave tables can be cut vertically so as to be distributed more evenly to all computers of the cluster; all the features generated in part one are merged into one large table serving as the input data of part two; the feature extraction of part one generates a large number of features, and inputting all of them into the model would require a large number of model parameters, so all features undergo dimension reduction before the data enters the model;
step two: a dimension reduction algorithm; the dimension reduction algorithm comprises the following specific steps:
(1) sort all features with a decision tree algorithm to obtain a feature importance list, and select the top x% of features in the list as the basic feature list, denoted L;
(2) take one feature f from the remaining features and add it to L;
(3) perform a ridge regression calculation on L;
(4) calculate the AUC;
(5) retain feature f if the AUC increases, and remove feature f if the AUC does not increase;
(6) loop from step (2) to step (5) until all features are processed; when the dimension reduction algorithm completes, the feature set after dimension reduction is obtained;
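A minimal python sketch of the greedy dimension reduction loop above, using scikit-learn stand-ins (a random forest for the decision-tree ranking, `RidgeClassifier` for the ridge regression) on a synthetic dataset; the 25% cutoff for x% is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table produced by part one.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Step (1): rank features with a tree-based model, keep the top x%.
imp = RandomForestClassifier(random_state=0).fit(Xtr, ytr).feature_importances_
rank = np.argsort(imp)[::-1]
x_pct = 0.25                          # hypothetical choice of x%
base = list(rank[: int(len(rank) * x_pct)])

def auc_of(cols):
    """Ridge regression on the listed feature columns, scored by AUC."""
    m = RidgeClassifier().fit(Xtr[:, cols], ytr)
    return roc_auc_score(yte, m.decision_function(Xte[:, cols]))

# Steps (2)-(6): add each remaining feature f, keep it only if AUC rises.
best = auc_of(base)
for f in rank[len(base):]:
    trial = base + [f]
    a = auc_of(trial)
    if a > best:
        base, best = trial, a         # retain f
reduced_feature_set = base
```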
step three: model training; the feature set after dimension reduction is used as input data and put into a model for training; model training uses an ensemble algorithm that comprises a deep neural network, gradient boosting machines and a random forest algorithm, and mainly comprises the following steps:
(1) determine the algorithm set, comprising a deep neural network, LightGBM, xgboost, catboost and a random forest;
(2) for each algorithm in the algorithm set, input the feature set after dimension reduction, perform model evaluation with k-fold cross validation, and output a single column of scores per algorithm, from which that algorithm's AUC value is calculated;
(3) combine the output columns of all algorithms into a table A; each algorithm's output column is treated as a feature and serves as the input of the next-stage algorithm;
(4) determine the integration algorithm: logistic regression is used as the integration algorithm; table A is input into the logistic regression algorithm, and the final integrated AUC value is calculated; it can be verified that the integrated AUC value is more accurate than the AUC value calculated by any single algorithm; the algorithms are deployed on separate computers in the cluster, and the feature set output by the second-part dimension reduction algorithm is copied to all computers, so the input data of the computers are the same but the algorithms they run differ; the output columns obtained after all algorithm calculations complete are collected on a host computer for the integration calculation; the column calculated by the integration algorithm is averaged to obtain the AUC mean, and the AUC mean, the hyperparameters contained in the third-part model training, and the x1+x2+…+xn parameters of the first-part feature engineering are taken as the input data of the fourth part;
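The stacking scheme above can be sketched as follows; since LightGBM, xgboost, catboost and a deep neural network framework may not be available here, scikit-learn models are substituted as stand-ins for the base algorithm set, and the integration model is evaluated in-sample for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the feature set output by the dimension reduction.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step (1): the base algorithm set (sklearn stand-ins for DNN/GBM/RF).
base_models = {
    "dnn": MLPClassifier(max_iter=500, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
}

# Steps (2)-(3): each algorithm emits one out-of-fold score column via
# k-fold cross validation; the columns together form table A.
table_A = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models.values()
])

# Step (4): logistic regression as the integration algorithm.
meta = LogisticRegression().fit(table_A, y)
ensemble_auc = roc_auc_score(y, meta.predict_proba(table_A)[:, 1])
```

In the cluster, each base model would run on its own computer against a copy of the same input, and only the output columns would be collected on the host for the integration step.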
step four: hyperparameter search; the hyperparameter search is treated as another model training, whose input is all hyperparameters of the first and third parts together with the AUC mean; all hyperparameters form a hyperparameter space, and the goal of this training is to find the point in the hyperparameter space at which the AUC attains its maximum; Bayesian optimization is chosen as the search algorithm; the Bayesian optimization algorithm is an iterative loop in which, after each calculation completes, new hyperparameters are generated and fed back as input to the first part to start a new loop, each loop adding one point to the hyperparameter space; as the number of loops grows, the values obtained by the Bayesian optimization algorithm converge, and the loop stops once they converge to a preset threshold.
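A self-contained sketch of the Bayesian optimization loop of step four, assuming a single hyperparameter and a toy surrogate in place of rerunning parts one to three; a Gaussian-process model with an expected-improvement acquisition is one standard realization of Bayesian optimization, used here for illustration:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def pipeline_auc(h):
    """Toy surrogate for 'run parts one to three with hyperparameter h
    and return the AUC mean'; its maximum is at h = 0.3."""
    return 0.9 - 0.4 * (h - 0.3) ** 2

rng = np.random.default_rng(0)
H = rng.uniform(0.0, 1.0, 3)                 # initial hyperparameter points
A = np.array([pipeline_auc(h) for h in H])   # their AUC means
candidates = np.linspace(0.0, 1.0, 201)      # discretized hyperparameter space

for _ in range(15):                          # fixed budget in place of a threshold test
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
    gp.fit(H.reshape(-1, 1), A)
    mu, sd = gp.predict(candidates.reshape(-1, 1), return_std=True)
    z = (mu - A.max()) / np.maximum(sd, 1e-9)
    ei = (mu - A.max()) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    h_new = candidates[np.argmax(ei)]        # next point to try
    H = np.append(H, h_new)                  # feed back, start a new loop
    A = np.append(A, pipeline_auc(h_new))

best_h = H[np.argmax(A)]                     # hyperparameter with best AUC so far
```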
2. The distributed automated feature engineering system architecture of claim 1, wherein the distributed automatic feature engineering system architecture is implemented in the following steps:
(1) first, a computer cluster is required;
(2) the Apache Hadoop cluster edition, the Apache HBase cluster edition, Apache Phoenix and Apache Thrift are installed on each computer of the cluster, so that the whole cluster functions as a distributed database;
(3) MySQL is installed on one computer, and a project database is created on it with the InnoDB database engine; the project database manages the whole process, and all computers on the cluster can concurrently access it to obtain process state data, guaranteeing the consistency of the whole process;
(4) the required python program and python packages are installed on each computer of the cluster, the python code deployed on every computer being the same;
(5) the python programs on all computers start at the same time, and parallel computation and concurrency control within the cluster are managed through the project database in MySQL;
(6) the order query result is returned in json format, and a wind control report is presented by the big data system.
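The coordination pattern described above, a shared project database that concurrent workers update transactionally, with query results returned as json, can be sketched with sqlite3 standing in for MySQL/InnoDB; the table name, columns and worker names are hypothetical:

```python
import json
import sqlite3

# sqlite3 stands in for the MySQL/InnoDB project database of step (3).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_state (worker TEXT PRIMARY KEY, status TEXT)")

def set_status(worker, status):
    # Each cluster process updates its own row inside a transaction,
    # so concurrent readers always see a consistent process state.
    with db:
        db.execute(
            "INSERT INTO task_state VALUES (?, ?) "
            "ON CONFLICT(worker) DO UPDATE SET status = excluded.status",
            (worker, status),
        )

for w in ("node1", "node2", "node3"):
    set_status(w, "running")
set_status("node2", "done")

# The query result is returned in json format, as in step (6).
rows = db.execute("SELECT worker, status FROM task_state ORDER BY worker")
report = json.dumps(dict(rows))
```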
3. The distributed automated feature engineering system architecture of claim 1, wherein the AUC in step two is the accuracy metric commonly used in the field of wind control.
4. The distributed automated feature engineering system architecture of claim 2, wherein the python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost and catboost.
CN201811493937.1A 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture Active CN109582724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811493937.1A CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811493937.1A CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Publications (2)

Publication Number Publication Date
CN109582724A CN109582724A (en) 2019-04-05
CN109582724B true CN109582724B (en) 2022-04-08

Family

ID=65929000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811493937.1A Active CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Country Status (1)

Country Link
CN (1) CN109582724B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861047A (en) * 2019-04-08 2020-10-30 阿里巴巴集团控股有限公司 Data processing control method and computing device
CN112380205B (en) * 2020-11-17 2024-04-02 北京融七牛信息技术有限公司 Automatic feature generation method and system of distributed architecture
CN112801304A (en) * 2021-03-17 2021-05-14 中奥智能工业研究院(南京)有限公司 Automatic data analysis and modeling process

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106339608A (en) * 2016-11-09 2017-01-18 中国科学院软件研究所 Traffic accident rate predicting system based on online variational Bayesian support vector regression
CN107103332A (en) * 2017-04-07 2017-08-29 武汉理工大学 A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN107516135A (en) * 2017-07-14 2017-12-26 浙江大学 A kind of automation monitoring learning method for supporting multi-source data
CN108154430A (en) * 2017-12-28 2018-06-12 上海氪信信息技术有限公司 A kind of credit scoring construction method based on machine learning and big data technology
CN108304941A (en) * 2017-12-18 2018-07-20 中国软件与技术服务股份有限公司 A kind of failure prediction method based on machine learning

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US20140310221A1 (en) * 2013-04-12 2014-10-16 Nec Laboratories America, Inc. Interpretable sparse high-order boltzmann machines
US10380609B2 (en) * 2015-02-10 2019-08-13 EverString Innovation Technology Web crawling for use in providing leads generation and engagement recommendations
CN108566364B (en) * 2018-01-15 2021-01-12 中国人民解放军国防科技大学 Intrusion detection method based on neural network

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN106339608A (en) * 2016-11-09 2017-01-18 中国科学院软件研究所 Traffic accident rate predicting system based on online variational Bayesian support vector regression
CN107103332A (en) * 2017-04-07 2017-08-29 武汉理工大学 A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN107516135A (en) * 2017-07-14 2017-12-26 浙江大学 A kind of automation monitoring learning method for supporting multi-source data
CN108304941A (en) * 2017-12-18 2018-07-20 中国软件与技术服务股份有限公司 A kind of failure prediction method based on machine learning
CN108154430A (en) * 2017-12-28 2018-06-12 上海氪信信息技术有限公司 A kind of credit scoring construction method based on machine learning and big data technology

Non-Patent Citations (2)

Title
Research on feature selection and detection performance in fraudulent web page mining; Wang Jiaqing; China Masters' Theses Full-text Database, Information Science and Technology; 2018-10-15 (No. 10); I139-147 *
Research on automated feature engineering and parameter tuning algorithms; Zhang Hao; China Masters' Theses Full-text Database, Information Science and Technology; 2018-09-15 (No. 9); I138-188 *

Also Published As

Publication number Publication date
CN109582724A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN104881706B (en) A kind of power-system short-term load forecasting method based on big data technology
US10606862B2 (en) Method and apparatus for data processing in data modeling
Vinodh et al. Lean Six Sigma project selection using hybrid approach based on fuzzy DEMATEL–ANP–TOPSIS
CN109582724B (en) Distributed automatic feature engineering system architecture
US20170330078A1 (en) Method and system for automated model building
Pfahringer Semi-random model tree ensembles: An effective and scalable regression method
US11366806B2 (en) Automated feature generation for machine learning application
Akopov et al. A multi-agent genetic algorithm for multi-objective optimization
CN114722729B (en) Automatic cutter recommendation method and device, terminal and storage medium
US20170076307A1 (en) Price estimation device, price estimation method, and recording medium
CN111476274B (en) Big data predictive analysis method, system, device and storage medium
CN113032367A (en) Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system
US11775876B2 (en) Methods and systems for relating features with labels in electronics
CN116245019A (en) Load prediction method, system, device and storage medium based on Bagging sampling and improved random forest algorithm
Garciarena et al. Evolving imputation strategies for missing data in classification problems with TPOT
Telipenko et al. Results of research on development of an intellectual information system of bankruptcy risk assessment of the enterprise
CN115879824A (en) Method, device, equipment and medium for assisting expert decision based on ensemble learning
Campi et al. Parametric cost modelling of components for turbomachines: Preliminary study
CN111160048B (en) Translation engine optimization system and method based on cluster evolution
Zhigirev et al. Introducing a Novel Method for Smart Expansive Systems' Operation Risk Synthesis. Mathematics 2022, 10, 427
Yemets et al. Combinatorial optimization under uncertainty
Montevechi et al. Ensemble-Based Infill Search Simulation Optimization Framework
Mandel Expert-statistical data processing using the method of analogs
Khalyasmaa et al. Performance analysis of scientific and technical solutions evaluation based on machine learning
Krishnamoorthy et al. Optimizing stochastic temporal manufacturing processes with inventories: An efficient heuristic algorithm based on deterministic approximations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant