CN113191877A - Data feature acquisition method and system and electronic equipment

Info

Publication number
CN113191877A
Authority
CN
China
Prior art keywords
screening
feature
financial data
characteristic
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110487432.XA
Other languages
Chinese (zh)
Inventor
蔡鹏
常宏达
陈树华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Top Elephant Technology Co ltd
Original Assignee
Top Elephant Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Top Elephant Technology Co ltd filed Critical Top Elephant Technology Co ltd
Priority to CN202110487432.XA priority Critical patent/CN113191877A/en
Publication of CN113191877A publication Critical patent/CN113191877A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Technology Law (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a data characteristic acquisition method, a system and electronic equipment, relating to the field of financial data processing. The method first acquires financial data and configures a financial characteristic field and a marking field according to the financial data; the financial data is then input into a trained screening model for feature screening, wherein the screening model includes a training set screening unit and a verification set screening unit: the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set, and the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set. The characteristic contribution degree of the financial data is then determined according to the training set and/or the verification set output by the screening model, and the characteristics of the financial data whose characteristic contribution degree reaches a preset threshold value are acquired. The method can synchronously acquire the importance indexes of the features, improves the model performance while reducing the number of features, and is beneficial to improving the extraction precision of the related features of the financial data.

Description

Data feature acquisition method and system and electronic equipment
Technical Field
The invention relates to the field of financial data processing, in particular to a data feature acquisition method, a data feature acquisition system and electronic equipment.
Background
In the credit assessment performed before granting credit to a user, a financial institution evaluates the user's credit accurately through a large number of data sources, such as the personal credit report, the user's historical behavior data at the financial institution, and third-party credit reports on the user. However, the credit report from the People's Bank of China alone contains thousands of fields, and the full feature table is generally 3000-5000 dimensions or even higher. It can be seen that in existing credit evaluation, feature screening needs to be performed on the data sources to select as little data as possible that still characterizes the user's credit. In specific implementations, an artificial intelligence model is usually used to screen the personal credit data by feature importance, but the existing artificial intelligence models have low model performance and complex model structures.
In summary, the models used in the screening of personal credit data in the prior art have the problems of low performance and complex structure.
Disclosure of Invention
In view of this, the present invention provides a data feature obtaining method, a system and an electronic device, in which a training set screening unit and a verification set screening unit of a screening model are used in a feature obtaining process, so that importance indexes of features can be synchronously obtained, the model performance is improved while the number of features is reduced, and the method and the system are beneficial to improving the extraction accuracy of related features of financial data.
In a first aspect, an embodiment of the present invention provides a data feature obtaining method, where the data feature obtaining method is applied to feature extraction of financial data, and includes:
acquiring financial data, and configuring a financial characteristic field and a marking field according to the financial data;
inputting the financial data into a trained screening model for feature screening; wherein the screening model includes: a training set screening unit and a verification set screening unit; the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set; and the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set;
and determining the characteristic contribution degree of the financial data according to the training set and/or the verification set output by the screening model, and acquiring the characteristics of the financial data of which the characteristic contribution degree reaches a preset threshold value.
In some embodiments, the obtaining of the training set includes:
setting a feature screening proportion and the number of features;
inputting the financial data into a training set screening unit, and randomly screening financial characteristic fields of the financial data by the training set screening unit according to a characteristic screening proportion;
the training set screening unit generates a training set after carrying out iterative screening on financial characteristic fields of financial data for multiple times; the number of times of iterative screening is the same as the number of features.
In some embodiments, the obtaining of the verification set includes:
setting a feature screening proportion and the number of features;
inputting the financial data into a verification set screening unit, and randomly screening the mark fields of the financial data by the verification set screening unit according to the characteristic screening proportion;
the verification set screening unit generates a verification set after performing multiple iterative screening on the marking fields of the financial data; the number of times of iterative screening is the same as the number of features.
In some embodiments, the determining the feature contribution of the financial data according to the training set and the verification set output by the screening model includes:
acquiring characteristics of the financial data in a training set and a verification set;
calculating the contribution value and the weight value of the features in the screening model; wherein the contribution value is used for representing the contribution degree of the characteristic in the financial data; the weight value is used for representing the importance of the feature in the financial data;
and determining the characteristic contribution degree of the financial data according to the contribution value and the weight value.
In some embodiments, the above contribution value is calculated by the following formula:
G = avg(S1~St) - avg(St+1~Sk),
wherein G is the contribution value of the feature; avg is a mean function; t is the number of training/verification sets in which the feature is selected; k is the total number of training/verification sets used when the screening model is trained; S1~St are the model performance index results of the models whose training/verification sets contain the feature; and St+1~Sk are those of the models that do not contain the feature.
In some embodiments, the weighted value is calculated by the following formula:
I=avg(i1~it),
wherein I is the weight value of the feature; avg is a mean function; t is the number of training/verification sets in which the feature is selected; and i1~it are the feature importance results given to the feature by the corresponding t models.
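For illustration only, a minimal Python sketch of how the contribution value G and the weight value I defined above might be computed is given below, assuming that for each of the k training/verification sets it is already recorded whether the feature was selected, what performance index the corresponding model achieved, and what importance the model assigned to the feature; the function and variable names are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def contribution_and_weight(selected, scores, importances):
    """Compute the contribution value G and the weight value I for one feature.

    selected    : boolean array of length k, True where the feature was selected
                  into the corresponding training/verification set
    scores      : array of length k, model performance index (e.g. AUC) of each model
    importances : array of length k, importance the model assigned to the feature
                  (ignored for models that did not select the feature)
    """
    selected = np.asarray(selected, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    importances = np.asarray(importances, dtype=float)

    # G = avg(S over models containing the feature) - avg(S over models without it)
    g = scores[selected].mean() - scores[~selected].mean()
    # I = avg(importance over the t models that contain the feature)
    i = importances[selected].mean()
    return g, i
```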
In some embodiments, the training process of the screening model includes:
setting a feature screening proportion and a feature number threshold;
randomly screening the financial data according to the feature screening proportion to generate training sets and verification sets equal in number to the feature number threshold;
training the initialized convolutional neural network model by utilizing a training set and a verification set, and evaluating the performance of the convolutional neural network model by utilizing the verification set;
and stopping training when the performance of the convolutional neural network model meets a preset performance threshold value to obtain a screening model.
In a second aspect, an embodiment of the present invention provides a data feature acquisition system, where the data feature acquisition system is applied to feature extraction of financial data, and the system includes:
the data configuration module is used for acquiring financial data and configuring financial characteristic fields and marking fields according to the financial data;
the characteristic screening module is used for inputting the financial data into the trained screening model for characteristic screening; wherein the screening model includes: a training set screening unit and a verification set screening unit; the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set; and the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set;
and the characteristic acquisition module is used for determining the characteristic contribution degree of the financial data according to the training set and/or the verification set output by the screening model and acquiring the characteristics of the financial data of which the characteristic contribution degree reaches a preset threshold value.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program that is executable on the processor, and when the processor executes the computer program, the steps of the data feature obtaining method mentioned in the first aspect are implemented.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable medium having a non-volatile program code executable by a processor, where the program code causes the processor to execute the steps of the data feature acquisition method mentioned in the first aspect.
The embodiment of the invention has the following beneficial effects:
the invention provides a data feature acquisition method, a system and electronic equipment, which are applied to feature extraction of financial data, and the method comprises the steps of firstly acquiring financial data, and configuring financial feature fields and marking fields according to the financial data; inputting the financial data into the trained screening model for feature screening; wherein, screening the model includes: a training set screening unit and a verification set screening unit; the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set; the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set; and then determining the characteristic contribution degree of the financial data according to the training set and/or the verification set output by the screening model, and acquiring the characteristics of the financial data of which the characteristic contribution degree reaches a preset threshold value. According to the method, the training set screening unit and the verification set screening unit of the screening model are utilized in the feature obtaining process, the importance indexes of the features can be synchronously obtained, the number of the features is reduced, the performance of the model is improved, and the method is beneficial to improving the extraction precision of the related features of the financial data.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a data feature obtaining method according to an embodiment of the present invention;
fig. 2 is a flowchart of a training set obtaining process in the data feature obtaining method according to the embodiment of the present invention;
fig. 3 is a flowchart of an acquisition process of a verification set in the data feature acquisition method according to the embodiment of the present invention;
fig. 4 is a flowchart of determining a feature contribution degree of financial data according to a training set and a verification set output by a screening model in the data feature obtaining method according to the embodiment of the present invention;
fig. 5 is a flowchart of a training process of a screening model in the data feature obtaining method according to the embodiment of the present invention;
fig. 6 is a flowchart of another training process of a screening model in the data feature obtaining method according to the embodiment of the present invention;
fig. 7 is a schematic structural diagram of a data feature obtaining system according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Reference numerals:
710-a data configuration module; 720-feature screening module; 730-a feature acquisition module; 101-a processor; 102-a memory; 103-a bus; 104-communication interface.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the credit assessment performed before granting credit to a user, a financial institution evaluates the user's credit accurately through a large number of data sources, such as the personal credit report, the user's historical behavior data at the financial institution, and third-party credit reports on the user. However, the credit report from the People's Bank of China alone contains thousands of fields, and the full feature table is generally 3000-5000 dimensions or even higher. It can be seen that in existing credit evaluation, feature screening needs to be performed on the data sources to select as little data as possible that still characterizes the user's credit. In specific implementations, artificial intelligence models are often used to screen personal credit data by feature importance; the goal of feature selection is to reduce the complexity of induction systems by eliminating irrelevant and redundant features. In the field of machine learning, this technique is becoming increasingly important in order to reduce computational cost and memory and to improve prediction accuracy.
Described in mathematical language, feature selection is the problem of searching for an optimal feature subset under a given loss function and a given test set. This is an NP-hard problem, and no method for an exact solution currently exists. The Embedded method is a one-step approximate solution to this problem; the Wrapper method is a closer approximation, which adopts a heuristic search algorithm to solve it. The simplest heuristic is hill climbing, whose corresponding algorithm is recursive feature elimination (RFE); it is equivalent to applying an Embedded method in a loop, except that only n features are deleted in each round, the remaining features are kept, and the model is retrained. Many further methods can be derived by using other heuristic search algorithms (simulated annealing, genetic algorithms, ant colony algorithms, particle swarm algorithms, and so on).
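For context, a minimal sketch of the recursive feature elimination loop described above is given below, written against a gradient-boosting classifier like the one used in the worked example later in this description; it assumes the features are held in a pandas DataFrame, and the function name and hyperparameters are illustrative. It is only a sketch of the prior-art procedure, not of the claimed method.

```python
import numpy as np
import lightgbm as lgb

def recursive_feature_elimination(X, y, n_keep, n_remove=1):
    """Retrain, drop the n_remove least important features, and repeat
    until only n_keep features remain. X is assumed to be a pandas DataFrame."""
    features = list(X.columns)
    while len(features) > n_keep:
        model = lgb.LGBMClassifier(n_estimators=200)
        model.fit(X[features], y)
        # rank the current features by the importance reported during training
        order = np.argsort(model.feature_importances_)
        n_drop = min(n_remove, len(features) - n_keep)
        features = [features[j] for j in order[n_drop:]]
    return features
```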
However, these methods all require repeated model training. To guarantee the effect of feature screening, at least tens of iteration rounds are needed, and the number of models trained per round is tied to the number of features in the data set; with thousands of features, a single round may require training thousands of models. In business scenarios with large data volumes, this computational cost is unacceptable.
Although machine learning model training does report a feature importance, this importance is the information gain of the feature during training and is not equal to the feature's contribution to model accuracy; it can well happen that a feature has very high importance yet the model trained after deleting that feature becomes more accurate. Therefore, the most accurate way to compute feature A's contribution to a model is to compare the performance indexes of model B0, trained on feature A together with the other features, and model B1, trained on the other features only. However, this method trains N+1 models to compute the importance of N features. Because of the nonlinearity of machine learning methods, only one feature can be deleted per round of calculation, so deleting m features takes m rounds of calculation; this is the basic idea of recursive feature elimination (RFE), and even with a heuristic search algorithm the amount of computation remains unacceptable in many situations.
In the credit evaluation scenario before granting personal credit, a financial institution can obtain a large number of data sources, including the personal credit report, the customer's historical behavior data at the financial institution, third-party credit reports on the customer, and so on. The personal credit report alone has thousands of fields, and the full feature wide table is generally 3000-5000 dimensions. The final in-model variables, however, generally need to be controlled to within 300. This requires a round of feature screening to compress the original 3000-5000 dimensional wide table to within 300 dimensions.
The simplest method is to screen features according to the feature importance reported during model training; it is very simple, and in the simplest case only one model needs to be trained, after which features are selected by importance ranking. However, when the 300-dimensional features obtained by such screening are used to train a model, its performance indexes (KS, AUC and the like) usually drop sharply, and in severe cases the whole model fails.
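For reference, the KS and AUC indexes mentioned above can be computed from model scores as follows; this is a standard calculation (KS is the maximum gap between the cumulative score distributions of the two classes, i.e. max(TPR - FPR)), shown here only as a sketch with an illustrative function name.

```python
from sklearn.metrics import roc_auc_score, roc_curve

def ks_and_auc(y_true, y_score):
    """KS is the maximum gap between the cumulative score distributions of the
    two classes, i.e. max(TPR - FPR); AUC is the area under the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float((tpr - fpr).max()), float(roc_auc_score(y_true, y_score))
```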
The recursive feature elimination method (RFE) has high precision and basically ensures that the performance index of a model trained on the 300-dimensional features obtained after screening drops only slightly compared with a model trained on the original thousands of dimensions, while the model complexity drops greatly. However, if only 1 feature is deleted per iteration, thousands of iterations are needed and more than a million models are trained in total, which is unacceptable in engineering terms. Some traditional optimization methods can reduce the amount of computation by only 1-2 orders of magnitude, which is still not ideal.
In summary, the models used in the screening of personal credit data in the prior art have the problems of low performance and complex structure.
Based on this, according to the data feature acquisition method, the data feature acquisition system and the electronic device provided by the embodiment of the invention, the training set screening unit and the verification set screening unit of the screening model are utilized in the feature acquisition process, so that the importance indexes of the features can be synchronously acquired, the model performance is improved while the number of the features is reduced, and the extraction precision of the related features of the financial data is favorably improved.
To facilitate understanding of the embodiment, a detailed description is first given of a data feature acquisition method disclosed in the embodiment of the present invention.
Referring to a flowchart of a data feature obtaining method shown in fig. 1, the data feature obtaining method is applied to feature extraction of financial data, and specifically includes:
and step S101, acquiring financial data, and configuring a financial characteristic field and a marking field according to the financial data.
In this step, the financial data is acquired according to the actual business scenario. Taking a personal credit evaluation scenario as an example, the financial data used in this step may include: the personal credit report, the customer's historical behavior data at the financial institution, third-party credit reports on the customer, and the like. Corresponding characteristic fields and marking fields need to be configured for the financial data, and the content and number of the characteristic fields and marking fields are used as configuration parameters in the subsequent steps.
Step S102, inputting the financial data into the trained screening model for feature screening; wherein the screening model includes: a training set screening unit and a verification set screening unit; the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set; and the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set.
After the screening model in the step is trained, financial data is input into the screening model, and a training set and a verification set are generated through a training set screening unit and a verification set screening unit in the screening model. After the training set and the verification set are obtained, feature screening can be carried out according to actual requirements, and feature verification is carried out by utilizing the verification set in the screening process; the training set is used for generating the features and training the related submodels.
Step S103, determining the characteristic contribution degree of the financial data according to the training set and/or the verification set output by the screening model, and acquiring the characteristics of the financial data of which the characteristic contribution degree reaches a preset threshold value.
After a training set and a validation set of financial data are obtained, feature contribution degrees of features in the training set and the validation set are calculated. The feature contribution degree represents the importance degree of the feature and is related to the corresponding model performance index of the feature in the model. The larger the model performance index value corresponding to the feature is, the higher the importance degree of the feature is; conversely, the lower the model performance index value corresponding to the feature is, the lower the importance degree of the feature is.
Whether a characteristic of the financial data is acquired is judged according to its characteristic contribution degree: when the characteristic contribution degree is higher than the preset threshold value, the characteristic of the financial data is acquired; when the characteristic contribution degree is not higher than the preset threshold value, the characteristic is ignored. Feature extraction of the financial data is thus achieved.
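A minimal sketch of this thresholding step, assuming the characteristic contribution degree of each feature has already been determined as in steps S402/S403 below; the function name and the example values are illustrative only.

```python
def acquire_features(contribution_by_feature, threshold):
    """Keep only the features whose contribution degree reaches the preset threshold."""
    return [name for name, degree in contribution_by_feature.items() if degree >= threshold]

# example: acquire_features({"f1": 0.012, "f2": -0.003, "f3": 0.001}, threshold=0.0)
# would keep "f1" and "f3" and ignore "f2"
```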
According to the data feature acquisition method in the embodiment, the training set screening unit and the verification set screening unit of the screening model are utilized in the feature acquisition process, the importance indexes of the features can be synchronously acquired, the number of the features is reduced, the model performance is improved, and the extraction accuracy of the related features of the financial data is improved.
In the specific implementation process, corresponding feature screening proportion and feature number need to be set for constraining the training set and the verification set. Specifically, the process of acquiring the training set, as shown in fig. 2, includes:
in step S201, a feature screening ratio and the number of features are set.
The feature screening proportion is a percentage parameter, usually set between 50% and 75%; the larger the value, the more features are selected in each screening, and the smaller the value, the fewer. The number of features is usually set to 20-100 and is also related to the training process of the screening model.
Step S202, inputting the financial data into a training set screening unit, and randomly screening the financial characteristic fields of the financial data by the training set screening unit according to the characteristic screening proportion.
The training set screening unit adopts a random selection mode according to the screening proportion set in the step S201.
Step S203, a training set screening unit carries out a plurality of times of iterative screening on financial characteristic fields of financial data to generate a training set; the number of times of iterative screening is the same as the number of features.
A training set corresponding to the characteristics of the financial data is obtained after the multiple iterative screenings performed by the training set screening unit.
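A minimal sketch of steps S201 to S203, assuming the financial characteristic fields are available as a list of column names; the screening proportion, iteration count and optional must-keep set mirror the parameters described later in the training process, and the function name and defaults are illustrative assumptions.

```python
import random

def build_feature_subsets(feature_fields, screening_ratio=0.75, n_iterations=50, must_keep=()):
    """Randomly select round(p * m) of the characteristic fields, repeated
    n_iterations times; each subset defines one screened training (or verification) set."""
    n_selected = max(1, round(screening_ratio * len(feature_fields)))
    subsets = []
    for _ in range(n_iterations):
        chosen = set(random.sample(list(feature_fields), n_selected))
        chosen.update(must_keep)  # features configured as "must be preserved"
        subsets.append(sorted(chosen))
    return subsets
```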
Specifically, the process of acquiring the verification set, as shown in fig. 3, includes:
step S301, setting a feature screening proportion and the number of features;
step S302, inputting the financial data into a verification set screening unit, and randomly screening the marking fields of the financial data by the verification set screening unit according to the characteristic screening proportion;
step S303, after the verification set screening unit carries out a plurality of times of iterative screening on the marking field of the financial data, a verification set is generated; the number of times of iterative screening is the same as the number of features.
The acquisition process of the verification set is similar to that of the training set, and is not described herein again.
In some embodiments, the determining the feature contribution of the financial data according to the training set and the validation set output by the screening model, as shown in fig. 4, includes:
step S401, obtaining the characteristics of the financial data in the training set and the verification set.
It should be noted that the characteristic contribution degree is calculated separately for the features of the financial data in the training set and in the verification set; that is, the contribution value and the weight value in the screening model are calculated in the subsequent step for all features involved in the financial data of the training set and the verification set. See step S402 for details.
Step S402, calculating the contribution value and the weight value of the feature in the screening model; wherein the contribution value is used for representing the contribution degree of the characteristic in the financial data; the weight values are used to represent the importance of the features in the financial data.
In some embodiments, the contribution is calculated by the equation:
G = avg(S1~St) - avg(St+1~Sk),
wherein G is the contribution value of the feature; avg is a mean function; t is the number of training/verification sets in which the feature is selected; k is the total number of training/verification sets used when the screening model is trained; S1~St are the model performance index results of the models whose training/verification sets contain the feature; and St+1~Sk are those of the models that do not contain the feature.
The weighted value is calculated by the formula:
I=avg(i1~it),
wherein I is the weight value of the feature; avg is a mean function; t is the number of training/verification sets in which the feature is selected; and i1~it are the feature importance results given to the feature by the corresponding t models.
Step S403, determining a characteristic contribution degree of the financial data according to the contribution value and the weight value.
The following describes the training process of the screening model, which is shown in fig. 5 and includes:
step S501, a feature screening ratio and a feature number threshold are set.
In the process of training the screening model, input interfaces for a training set T1 and a verification set T2 need to be configured; if there is only a training set, it needs to be divided in advance into two parts, a training set and a verification set. In this part, the characteristic fields and the marking fields in the training set also need to be specified. The system automatically counts the number m of characteristic fields, denoted f1~fm. The number n of final in-model variables is configured; features that must be preserved may also be configured at this step. A feature screening proportion p is selected (p is 0.5-0.75; 0.75 is recommended), along with the number k of models trained in each round of calculation (the recommended value is 20-100; the more features there are, the larger k should be, and if the model-training cost becomes too large, k is reduced appropriately).
Step S502, randomly screening the financial data according to the feature screening proportion to generate training sets and verification sets equal in number to the feature number threshold.
According to the feature screening proportion p, the original m features of the training set T1 and the verification set T2 are randomly screened; this is repeated k times, finally yielding k training sets and verification sets, named T11~T1k and T21~T2k respectively. Each training/verification set combination contains about m × p features.
And S503, training the initialized convolutional neural network model by using the training set and the verification set, and performing performance evaluation on the convolutional neural network model by using the verification set.
Models are trained separately for the k training/verification set combinations, and the verification sets are used to evaluate model performance, yielding k models and corresponding model performance indexes S1~Sk; the performance index of a model generally takes the value of the AUC (area under the curve).
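A minimal sketch of one round of step S503, assuming the training set T1 and verification set T2 are pandas DataFrames, the label column is given by label_field, and a lightgbm classifier is used as in the worked example below; the function name and hyperparameters are illustrative, not prescribed by the method.

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def train_round(T1, T2, label_field, feature_subsets):
    """Train one model per feature subset, score it on the verification set T2,
    and record the per-feature importances reported by each model."""
    scores, importances = [], []
    for subset in feature_subsets:
        model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
        model.fit(T1[subset], T1[label_field])
        pred = model.predict_proba(T2[subset])[:, 1]
        scores.append(roc_auc_score(T2[label_field], pred))            # S1..Sk
        importances.append(dict(zip(subset, model.feature_importances_)))
    return scores, importances
```

The returned scores and importances correspond to the quantities S1~Sk and i1~it used in the formulas below.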
For a feature fa, suppose it is selected by t training/verification sets (t is equal to k × p); let fa be selected in T11~T1t while T1(t+1)~T1k do not include fa. The comprehensive contribution of fa is then obtained as G = avg(S1~St) - avg(St+1~Sk). Meanwhile, the feature importance given to fa by the t models is i1~it respectively, so the comprehensive importance of fa is I = avg(i1~it).
And step S504, stopping training when the performance of the convolutional neural network model meets a preset performance threshold value, and obtaining a screening model.
Normal distribution fitting is performed on the G values of all features to obtain the distribution mean and standard deviation, and features whose G value is smaller than (mean - 2 × standard deviation) are removed. At the same time, normal distribution fitting is performed on the I values of all features to obtain the mean and standard deviation of that distribution, and features whose I value is smaller than (mean - 2 × standard deviation) are removed. The remaining features then form a new training set T1' and verification set T2'. Steps S502 to S503 are repeated until the number of features is smaller than the feature number threshold, at which point training stops and the screening model is obtained.
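A minimal sketch of the removal rule and of combining the G-value and I-value screens described in this step; the per-feature G and I values are assumed to have been computed as above, and the helper names are illustrative.

```python
import numpy as np

def drop_low_tail(values_by_feature):
    """Fit a normal distribution to the values and keep only the features whose
    value is at least (mean - 2 * standard deviation)."""
    values = np.array(list(values_by_feature.values()), dtype=float)
    cutoff = values.mean() - 2 * values.std()
    return {f for f, v in values_by_feature.items() if v >= cutoff}

def screen_features(g_by_feature, i_by_feature):
    """Keep only the features that survive both the G-value and the I-value rules;
    the survivors form the new training set T1' and verification set T2'."""
    return sorted(drop_low_tail(g_by_feature) & drop_low_tail(i_by_feature))
```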
Another training process flow diagram of the screening model is shown in fig. 6, and includes:
step S601, configuring parameters.
The parameter configuration in this step includes configuring the input interfaces of the training set T1 and the verification set T2; if there is only a training set, it needs to be divided in advance into a training set and a verification set. The characteristic fields and marking fields in the training set also need to be specified, along with the number m of characteristic fields, denoted f1~fm. The number n of final in-model variables is configured, and features that must be preserved may also be configured at this step.
Meanwhile, the feature screening proportion p (p is 0.5-0.75; 0.75 is recommended) and the number k of models trained in each calculation are set (the recommended value is 20-100; the more features there are, the larger k should be, and if the model-training cost becomes too large, k is reduced appropriately).
Step S602, re-sampling the features, and constructing k training sets and/or verification sets.
According to the feature screening proportion p, the original m features of the training set T1 and the verification set T2 are randomly screened; this is repeated k times, finally yielding k training/verification sets, named T11~T1k and T21~T2k respectively. Each training/verification set combination contains about m × p features.
Step S603, training k models.
Models are trained separately for the k training/verification set combinations, and the verification sets are used to evaluate model performance, yielding k models and corresponding model performance indexes S1~Sk; the performance index of a model generally takes the value of the AUC (area under the curve), a model evaluation index.
Step S604, calculating the G value and I value of each feature according to the results of the k models.
For a certain feature fa, suppose it is selected by t training/verification sets (t is equal to k × p); without loss of generality, assume fa is selected in T11~T1t while T1(t+1)~T1k do not include fa. The comprehensive contribution of fa is then obtained as G = avg(S1~St) - avg(St+1~Sk). Meanwhile, the feature importance given to fa by the t models is i1~it respectively, so the comprehensive importance of fa is I = avg(i1~it).
And step S605, performing feature screening according to the G value.
Normal distribution fitting is performed on the G values of all features to obtain the distribution mean and standard deviation, and features whose G value is smaller than (mean - 2 × standard deviation) are removed.
And step S606, performing feature screening according to the I value.
Normal distribution fitting is performed on the I values of all features to obtain the distribution mean and standard deviation, and features whose I value is smaller than (mean - 2 × standard deviation) are removed.
Step S607, forming a new training set and verification set from the remaining features.
The remaining features form a new training set T1' and verification set T2'.
Step S608, determine whether the feature quantity meets the requirement.
If yes, the screening model is obtained; if not, the process returns to step S602 and repeats until the number of features is less than n.
In the prior art, m+1 models need to be trained to evaluate m features, and when m is large the computational overhead is high. As can be seen from the training process of the screening model described in this embodiment, each model is trained after removing about 1/4 of the features, so the result it yields only reflects the influence of the removed feature combination on model accuracy. To compute the influence of each individual feature on model accuracy, k models are trained using the resampling method, and the influence on model accuracy is computed from the comprehensive performance of each feature across the k models. Thus each iteration only needs to train k+1 models, far fewer than m+1; moreover, each iteration can remove more than one feature, so the total number of iterations is controlled within an acceptable range.
The training process of the screening model and the data feature acquisition process are described below using a pre-loan credit evaluation case for personal online lending at a consumer finance company. The data sources used for modeling include the personal credit report, the company's own data and some third-party data products; the total dimensionality of the feature library reaches 5500, and severe multicollinearity exists among a large number of features. To control resource usage and routine maintenance management when the model is actually deployed, the features used must not exceed 500 dimensions, so the original 5500-dimensional features need to be screened. Because the multicollinearity is serious, methods other than heuristic search hardly work.
When screening with the screening model described in the above embodiments, 100 lightgbm models are trained in the first round of screening; the G values and I values of all features are then calculated from the model prediction results, and finally 21 dimensions of features are removed according to the G-value rule and 1600 dimensions according to the I-value rule.
After the first round of screening, 3900 feature dimensions remain; a model built with these 3900-dimensional features shows no reduction at all in the performance index KS compared with the model using all 5500 dimensions. A second round of screening is then carried out, and so on; as the number of iterations increases, the features removed by the G-value rule gradually increase in each round while those removed by the I-value rule decrease markedly. After 14 iterations the features are reduced to 370 dimensions, and the model performance index KS drops from the initial 0.39 to around 0.38, within an acceptable range.
According to the data feature acquisition method and the screening model training process in this embodiment: 1. the method repeatedly resamples the features by random sampling, so the influence of the features on model accuracy can be approximately evaluated with a small number of model trainings; 2. feature importance is also considered in the feature screening process, so invalid features can be removed, further reducing the amount of computation. By utilizing the training set screening unit and the verification set screening unit of the screening model in the feature acquisition process, the importance indexes of the features can be obtained synchronously, the number of features is reduced while model performance is improved, and the extraction precision of the relevant features of the financial data is improved.
In a second aspect, an embodiment of the present invention provides a data feature acquiring system, which is applied to feature extraction of financial data, and as shown in fig. 7, the system includes:
the data configuration module 710 is used for acquiring financial data and configuring financial characteristic fields and marking fields according to the financial data;
the feature screening module 720 is used for inputting the financial data into the trained screening model for feature screening; wherein the screening model includes: a training set screening unit and a verification set screening unit; the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set; and the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set;
the feature obtaining module 730 is configured to determine a feature contribution degree of the financial data according to the training set and/or the verification set output by the screening model, and obtain a feature of the financial data of which the feature contribution degree reaches a preset threshold.
The data feature acquisition system provided by the embodiment of the invention has the same technical features as the data feature acquisition method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved. For the sake of brevity, where not mentioned in the section of the embodiments, reference may be made to the corresponding matters in the embodiments of the data feature acquisition method described above.
The embodiment also provides an electronic device, a schematic structural diagram of which is shown in fig. 8, and the electronic device includes a processor 101 and a memory 102; the memory 102 is used for storing one or more computer instructions, which are executed by the processor to implement the data feature obtaining method.
The electronic device shown in fig. 8 further comprises a bus 103 and a communication interface 104, and the processor 101, the communication interface 104 and the memory 102 are connected through the bus 103.
The Memory 102 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Bus 103 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
The communication interface 104 is configured to connect with at least one user terminal and other network units through a network interface, and to send the packaged IPv4 messages to the user terminal through the network interface.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 101. The Processor 101 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 102, and the processor 101 reads the information in the memory 102 and completes the steps of the method of the foregoing embodiment in combination with the hardware thereof.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the method of the foregoing embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention or a part thereof, which essentially contributes to the prior art, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A data feature acquisition method is applied to feature extraction of financial data and comprises the following steps:
acquiring financial data, and configuring a financial characteristic field and a marking field according to the financial data;
inputting the financial data into a trained screening model for feature screening; wherein the screening model comprises: a training set screening unit and a verification set screening unit; the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set; the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set;
and determining the characteristic contribution degree of the financial data according to the training set and/or the verification set output by the screening model, and acquiring the characteristics of the financial data of which the characteristic contribution degree reaches a preset threshold value.
2. The method according to claim 1, wherein the training set acquisition process comprises:
setting a feature screening proportion and the number of features;
inputting the financial data into the training set screening unit, and randomly screening financial characteristic fields of the financial data by the training set screening unit according to the characteristic screening proportion;
the training set screening unit generates the training set after performing multiple iterative screening on the financial characteristic fields of the financial data; and the number of times of iterative screening is the same as the number of the characteristics.
3. The data feature obtaining method according to claim 1, wherein the obtaining process of the verification set includes:
setting a feature screening proportion and the number of features;
inputting the financial data into the verification set screening unit, and randomly screening the marking fields of the financial data by the verification set screening unit according to the characteristic screening proportion;
the verification set screening unit generates the verification set after performing multiple iterative screening on the mark fields of the financial data; and the number of times of iterative screening is the same as the number of the characteristics.
4. The method of claim 1, wherein determining the feature contribution of the financial data based on the training set and the validation set output by the screening model comprises:
obtaining characteristics of the financial data in the training set and the verification set;
calculating the contribution value and the weight value of the features in the screening model; wherein the contribution value is used to represent a degree of contribution of the feature in the financial data; the weight value is used to represent the importance of the feature in the financial data;
and determining the characteristic contribution degree of the financial data according to the contribution value and the weight value.
5. The data feature acquisition method according to claim 4, wherein the contribution value is calculated by the following equation:
G = avg(S1~St) - avg(St+1~Sk),
wherein G is the contribution value of the feature; avg is a mean function; t is the number of training/verification sets in which the feature is selected; k is the total number of training sets or verification sets used for training the screening model; S1~St are the model performance index results of the models whose training/verification sets contain the feature; and St+1~Sk are those of the models that do not contain the feature.
6. The data feature obtaining method according to claim 4, wherein the weighted value is calculated by the following formula:
I=avg(i1~it),
wherein I is the weight value of the feature; avg is a mean function; t is the number of training/verification sets in which the feature is selected; and i1~it are the feature importance results given to the feature by the corresponding models.
7. The method according to claim 1, wherein the training process of the screening model includes:
setting a feature screening proportion and a feature number threshold;
randomly screening the financial data according to the characteristic screening proportion to generate training sets and verification sets equal in number to the feature number threshold;
training the initialized convolutional neural network model by using the training set and the verification set, and performing performance evaluation on the convolutional neural network model by using the verification set;
and stopping training when the performance of the convolutional neural network model meets a preset performance threshold value to obtain the screening model.
8. A data feature acquisition system for feature extraction of financial data, the system comprising:
the data configuration module is used for acquiring financial data and configuring financial characteristic fields and marking fields according to the financial data;
the characteristic screening module is used for inputting the financial data into the trained screening model for characteristic screening; wherein the screening model comprises: a training set screening unit and a verification set screening unit; the training set screening unit screens the characteristics of the financial data according to the financial characteristic field to generate a training set; the verification set screening unit screens the characteristics of the financial data according to the marking field to generate a verification set;
and the characteristic acquisition module is used for determining the characteristic contribution degree of the financial data according to the training set and/or the verification set output by the screening model and acquiring the characteristics of the financial data of which the characteristic contribution degree reaches a preset threshold value.
9. An electronic device, comprising: a processor and a storage device; the storage means has stored thereon a computer program which, when executed by the processor, carries out the steps of the data feature acquisition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data feature acquisition method according to any one of the preceding claims 1 to 7.
CN202110487432.XA 2021-04-30 2021-04-30 Data feature acquisition method and system and electronic equipment Pending CN113191877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110487432.XA CN113191877A (en) 2021-04-30 2021-04-30 Data feature acquisition method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110487432.XA CN113191877A (en) 2021-04-30 2021-04-30 Data feature acquisition method and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN113191877A (en) 2021-07-30

Family

ID=76983517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110487432.XA Pending CN113191877A (en) 2021-04-30 2021-04-30 Data feature acquisition method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN113191877A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972379A (en) * 2023-12-11 2024-05-03 南通先进通信技术研究院有限公司 Microwave antenna beam regulation and control method and system based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210730