CN112818028B - Data index screening method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN112818028B
Authority
CN
China
Prior art keywords
data
index
user
key index
characteristic data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110037835.4A
Other languages
Chinese (zh)
Other versions
CN112818028A (en)
Inventor
牛犇
张莉
陈弘
吴志成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110037835.4A
Publication of CN112818028A
Application granted
Publication of CN112818028B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and provides a data index screening method and device, computer equipment and a storage medium. The data index screening method comprises the following steps: carrying out standardization processing on user data under the corresponding data indexes according to data distribution; generating a standard data vector of each user according to the standard data obtained by standardization, and extracting a plurality of index feature data from the standard data vector of each user; screening a plurality of first key index feature data according to the relevance indexes of the plurality of index feature data; extracting a plurality of second key index feature data and the index weight of each second key index feature data by adopting a least absolute shrinkage and selection operator (Lasso); and, according to the index weights, performing simulation training on a user grade prediction model multiple times by using a Monte Carlo simulation method, and screening out a plurality of target key index feature data from the second key index feature data according to the prediction results corresponding to the simulation training. The method can screen out the optimal data indexes with high screening efficiency.

Description

Data index screening method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data index screening method and device, computer equipment and a storage medium.
Background
When a large number of data indexes are involved, manual screening or traditional correlation analysis easily misses data indexes that have low saturation but are very important, and errors in selecting data indexes make the subsequently constructed system or model inaccurate.
In particular, when the number of data indexes grows geometrically, manual screening or traditional correlation analysis can hardly uncover the potential relevance among a large number of data indexes, and the screened data indexes are mixed with many useless ones. As a result, the effect of the model cannot be improved, and the large number of screened data indexes tends to make model construction inefficient.
Disclosure of Invention
In view of the above, it is necessary to provide a data index screening method, device, computer device and storage medium, which can improve the screening efficiency of data indexes, and the screened data indexes contribute to improving the effect of the model.
The first aspect of the present invention provides a data index screening method, including:
determining data distribution of a plurality of user data under each data index, and carrying out standardization processing on the plurality of user data under the corresponding data index according to the data distribution;
generating a standard data vector of each user according to the standard data obtained by standardization, and performing feature extraction on the standard data vector of each user to obtain a plurality of index feature data;
calculating the relevance indexes of the index characteristic data, and screening a plurality of first key index characteristic data from the index characteristic data according to the relevance indexes;
extracting a plurality of second key index feature data and the index weight of each second key index feature data by adopting a least absolute shrinkage and selection operator (Lasso) based on the plurality of first key index feature data;
and according to the index weight of each second key index characteristic data, simulating and training a user grade prediction model for multiple times by using a Monte Carlo simulation method, and screening out a plurality of target key index characteristic data from the plurality of second key index characteristic data according to a prediction result corresponding to the simulation training.
In an optional embodiment, the calculating a relevance indicator of the plurality of indicator feature data, and the screening a plurality of first key indicator feature data from the plurality of indicator feature data according to the relevance indicator includes:
calculating group stability index values and information values of the index feature data;
and screening a plurality of first key index characteristic data from the plurality of index characteristic data according to the group stability index value and the information value.
In an optional embodiment, the screening the plurality of first key index feature data from the plurality of index feature data according to the population stability index value and the information value comprises:
acquiring first candidate index characteristic data corresponding to a group stability index value smaller than a preset group stability index threshold in the plurality of index characteristic data;
sorting the information values of the first candidate index feature data;
acquiring second candidate index characteristic data corresponding to a preset number of top-ranked information values after sorting;
determining the second candidate index feature data as the plurality of first key index feature data.
In an optional embodiment, the performing simulation training on the user grade prediction model multiple times by using a Monte Carlo simulation method according to the index weight of each second key index feature data, and the screening of a plurality of target key index feature data from the plurality of second key index feature data according to the prediction result corresponding to the simulation training includes:
sorting the plurality of second key index feature data in descending order of index weight to obtain a key index feature data sequence;
for the first time, selecting a first preset proportion of second key index feature data from the key index feature data sequence, starting from the first item in the sequence, as first training data;
training a first user grade prediction model based on the first training data, and calculating a first prediction accuracy of the first user grade prediction model;
for the second time, selecting a second preset proportion of second key index feature data from the key index feature data sequence, starting from the first item in the sequence, as second training data;
training a second user grade prediction model based on the second training data, and calculating a second prediction accuracy of the second user grade prediction model;
determining whether the second prediction accuracy is greater than the first prediction accuracy;
when the second prediction accuracy is higher than the first prediction accuracy, for the third time, selecting a third preset proportion of second key index feature data from the key index feature data sequence, starting from the first item in the sequence, as third training data;
training a third user grade prediction model based on the third training data, and calculating a third prediction accuracy of the third user grade prediction model;
determining whether the third prediction accuracy is greater than the second prediction accuracy;
and when the third prediction accuracy is not higher than the second prediction accuracy, determining the second key index feature data selected at the second preset proportion as the plurality of target key index feature data.
In an optional embodiment, the normalizing, according to the data distribution, the user data under the corresponding data index includes:
when the data distribution of first user data under a first data index is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data index;
and when the data distribution of the second user data under the second data index is continuous data distribution, performing power operation on the second user data by using a preset power function to obtain a plurality of standard data under the second data index.
In an optional embodiment, the extracting the features of the standard data vector of each user to obtain a plurality of index feature data includes:
randomly initializing an initial high-dimensional feature vector for the standard data vector of each user;
taking a plurality of standard data vectors as training data and corresponding high-dimensional feature vectors as training targets, and iteratively training a neural network model;
and acquiring a plurality of index characteristic data in the trained neural network model.
In an optional embodiment, the method further comprises:
training a user grade prediction model based on the plurality of target key index feature data;
obtaining prediction data of a user to be tested according to the plurality of target key index feature data;
and performing prediction with the user grade prediction model based on the prediction data of the user to be tested, to obtain the grade of the user to be tested.
A second aspect of the present invention provides a data index screening apparatus, comprising:
the data processing module is used for determining data distribution of the user data under each data index and carrying out standardization processing on the user data under the corresponding data index according to the data distribution;
the characteristic extraction module is used for generating a standard data vector of each user according to the standard data obtained by the standardization processing, and extracting the characteristics of the standard data vector of each user to obtain a plurality of index characteristic data;
the first screening module is used for calculating the relevance indexes of the index characteristic data and screening a plurality of first key index characteristic data from the index characteristic data according to the relevance indexes;
the second screening module is used for extracting a plurality of second key index feature data and the index weight of each second key index feature data by adopting a least absolute shrinkage and selection operator (Lasso) based on the plurality of first key index feature data;
and the third screening module is used for simulating and training the user grade prediction model for multiple times by using a Monte Carlo simulation method according to the index weight of each second key index characteristic data, and screening out a plurality of target key index characteristic data from the plurality of second key index characteristic data according to the prediction result corresponding to the simulated training.
A third aspect of the invention provides a computer apparatus comprising a processor for implementing the data index screening method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data index screening method.
In summary, the data index screening method, the data index screening device, the computer device and the storage medium of the present invention standardize the plurality of user data under the corresponding data indexes according to the data distribution, so that user data under different data indexes are processed in a differentiated manner. A standard data vector of each user is generated according to the standard data obtained by standardization, and a plurality of index feature data are extracted from the standard data vector of each user. A plurality of first key index feature data are screened out for the first time according to the relevance indexes of the plurality of index feature data. A plurality of second key index feature data and the index weight of each second key index feature data are then extracted by adopting a least absolute shrinkage and selection operator (Lasso); a user grade prediction model is trained in simulation multiple times by using a Monte Carlo simulation method according to the index weights; and a plurality of target key index feature data are screened out from the plurality of second key index feature data for the second time according to the prediction results corresponding to the simulation training. Through these two rounds of screening, the selected target key index feature data are the optimal choice, so that the model trained on the basis of the plurality of target key index feature data achieves the best effect.
Drawings
Fig. 1 is a flowchart of a data index screening method according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a data index screening apparatus according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The data index screening method provided by the embodiment of the invention is executed by computer equipment, and correspondingly, the data index screening device runs in the computer equipment.
Fig. 1 is a flowchart of a data index screening method according to an embodiment of the present invention. The data index screening method specifically comprises the following steps, and the sequence of the steps in the flow chart can be changed and some steps can be omitted according to different requirements.
S11, determining the data distribution of the user data under each data index, and standardizing the user data under the corresponding data index according to the data distribution.
The user data of each user may include, but is not limited to, local economic data, basic data, business data, and the like. The local economic data may include GDP, population size, etc.; the basic data may include educational background, years of work experience, etc.; and the business data may include average daily turnover, average daily increase in headcount, etc. These examples of user data are illustrative only and do not limit the present invention; the user data may be determined according to the actual application scenario.
The user data may be extracted from a database internal to the enterprise.
The user data of different data indexes have different data distributions, and the user data under the corresponding data indexes are subjected to standardized processing according to the data distributions, so that the user data under different data indexes can be subjected to differentiated processing.
In an optional embodiment, the normalizing, according to the data distribution, the user data under the corresponding data index includes:
when the data distribution of first user data under a first data index is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data index;
and when the data distribution of the second user data under the second data index is continuous data distribution, performing power operation on the second user data by using a preset power function to obtain a plurality of standard data under the second data index.
Since the user data under some data indexes may be discrete (e.g., educational background, age) while the user data under other data indexes may be continuous (e.g., average daily turnover, average daily increase in headcount), the user data needs to be standardized according to the data distribution.
In specific implementation, the computer device firstly acquires user data of a plurality of users belonging to the same data index, then determines data distribution of the plurality of user data under each data index, and adopts different standardized processing modes for the plurality of user data under the data index according to the data distribution.
To address missing discrete data, when the data distribution of the user data corresponding to a certain data index is a discrete data distribution, function fitting is performed on the existing user data, and the fitting function is then used to estimate the missing data, which is filled in to ensure the integrity of the user data. For example, for the data index of age, suppose age data are collected for 10 users, the age data of the 5th user is missing, and the other nine values are 21, 22, 23, 24, 26, 27, 28, 29 and 30. Function fitting is performed on the nine (user position, age) pairs (1, 21), (2, 22), (3, 23), (4, 24), (6, 26), (7, 27), (8, 28), (9, 29) and (10, 30), and the argument 5 is then substituted into the fitted function to obtain the missing value 25.
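The age example can be reproduced with a least-squares straight-line fit. The following is a minimal sketch only: the text does not say which family of fitting function is used, so the linear form and the helper names `fit_line` and `fill_missing` are assumptions made for illustration.

```python
def fit_line(points):
    """Least-squares fit of y = a * x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def fill_missing(values):
    """Replace None entries with the value predicted by the fitted line.

    Positions are 1-based, matching the (position, value) pairs in the text.
    """
    known = [(i, v) for i, v in enumerate(values, start=1) if v is not None]
    a, b = fit_line(known)
    return [v if v is not None else a * i + b
            for i, v in enumerate(values, start=1)]


# Age data of 10 users; the 5th user's age is missing.
ages = [21, 22, 23, 24, None, 26, 27, 28, 29, 30]
filled = fill_missing(ages)
```

Because the nine known points lie on a straight line, the fitted value at position 5 is 25 up to floating-point rounding; for data that are not linear, a different fitting function would be needed.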
When the data distribution of the user data corresponding to a certain data index is a continuous data distribution, a power function, for example, a log function, may be used to perform a power operation on each user data under the data index, so as to achieve the purpose of discretizing the continuous data.
In other embodiments, when the data distribution of the user data corresponding to a certain data index is a continuous data distribution, the user data under the data index may be subjected to binning operation, so as to achieve the purpose of discretizing the continuous data. Wherein the binning operation may include chi-square binning, equidistant binning, equal frequency binning, and the like.
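The two treatments of continuous data described above can be sketched as follows. This is an illustration under stated assumptions: `log1p` stands in for the "log function" mentioned in the text, only equidistant and equal-frequency binning are shown (chi-square binning is omitted), and the function names and sample values are hypothetical.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to non-negative continuous values."""
    return [math.log1p(v) for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Map each value to a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Made-up average daily turnover values, for illustration only.
turnover = [120.0, 95.5, 300.2, 80.0, 150.7, 1000.0]
turnover_log = log_transform(turnover)
turnover_bins = equal_frequency_bins(turnover, 3)
```

The log transform compresses the range of skewed values, while the binning functions return integer bin labels, which is what actually turns a continuous index into a discrete one.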
In this optional embodiment, because the data distribution of the plurality of user data differs across data indexes, a different standardization process is applied to the user data under each data index according to the type of its data distribution. Processing the user data in this differentiated way gives a better processing effect and facilitates the subsequent extraction of the plurality of index feature data and the training of the user grade prediction model.
S12, generating a standard data vector of each user according to the standard data obtained by the standardization processing, and extracting the features of the standard data vector of each user to obtain a plurality of index feature data.
If there are M users and n data indexes, each user has n items of user data, each data index has M items of user data, and the M users have M × n items of user data in total.
Since the plurality of user data under each data index are standardized, the standard data under all data indexes belonging to the same user are spliced to obtain the standard data vector of the user. The standard data vector of the ith user is recorded as (Mi1, Mi2, …, Min), and Min represents the user data of the ith user under the nth data index.
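The splicing step can be sketched as a simple transpose from per-index lists to per-user vectors. The function name `build_standard_vectors`, the dictionary layout and the sample values are hypothetical and serve only to illustrate the M × n arrangement.

```python
def build_standard_vectors(standard_by_index, index_order):
    """Return one vector per user by concatenating that user's standard data
    across all data indexes, in the given index order."""
    n_users = len(standard_by_index[index_order[0]])
    return [[standard_by_index[name][i] for name in index_order]
            for i in range(n_users)]


# Two data indexes (n = 2) and three users (M = 3); values are made up.
standard_by_index = {
    "age": [0.1, 0.5, 0.9],
    "turnover": [0.3, 0.2, 0.8],
}
vectors = build_standard_vectors(standard_by_index, ["age", "turnover"])
```

Each inner list is the standard data vector of one user, so `vectors[i][j]` corresponds to the standard data of user i under data index j.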
In an optional embodiment, the extracting the features of the standard data vector of each user to obtain a plurality of index feature data includes:
randomly initializing an initial high-dimensional feature vector for the standard data vector of each user;
taking a plurality of standard data vectors as training data and corresponding high-dimensional feature vectors as training targets, and iteratively training a neural network model;
and acquiring a plurality of index characteristic data in the trained neural network model.
Because the combined features of different users differ, a user grade prediction model trained directly on these combined features may fail to converge when the differences are large. Therefore, so that the user grade prediction model converges quickly in subsequent training, feature extraction is performed on the standard data vector of each user to obtain a plurality of index feature data.
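One possible reading of the feature extraction steps is sketched below. The text does not specify the network architecture, the loss, or which part of the trained network supplies the index feature data, so everything here is an assumption: a single tanh hidden layer, a mean squared error loss, full-batch gradient descent, and the hidden-layer activations taken as the extracted features. All names and default values are hypothetical.

```python
import math
import random


def train_feature_extractor(vectors, hidden_dim=4, target_dim=8,
                            epochs=300, lr=0.01, seed=0):
    """Train a tiny one-hidden-layer network by full-batch gradient descent.

    Returns (hidden_features, loss_history); hidden_features[i] is the
    hidden-layer activation vector of user i after training.
    """
    rng = random.Random(seed)
    n, d = len(vectors), len(vectors[0])
    # Randomly initialised high-dimensional target vector for each user.
    targets = [[rng.gauss(0, 1) for _ in range(target_dim)] for _ in range(n)]
    w1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]
          for _ in range(target_dim)]

    def hidden(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]

    losses = []
    for _ in range(epochs):
        g1 = [[0.0] * d for _ in range(hidden_dim)]
        g2 = [[0.0] * hidden_dim for _ in range(target_dim)]
        loss = 0.0
        for x, t in zip(vectors, targets):
            h = hidden(x)
            out = [sum(w * hj for w, hj in zip(row, h)) for row in w2]
            err = [o - ti for o, ti in zip(out, t)]
            loss += sum(e * e for e in err) / n
            # Gradient of the squared error with respect to the output layer.
            for k in range(target_dim):
                for j in range(hidden_dim):
                    g2[k][j] += 2 * err[k] * h[j] / n
            # Backpropagate through tanh to the input layer.
            for j in range(hidden_dim):
                back = sum(err[k] * w2[k][j] for k in range(target_dim))
                dh = 2 * back * (1 - h[j] ** 2) / n
                for i in range(d):
                    g1[j][i] += dh * x[i]
        losses.append(loss)
        for k in range(target_dim):
            for j in range(hidden_dim):
                w2[k][j] -= lr * g2[k][j]
        for j in range(hidden_dim):
            for i in range(d):
                w1[j][i] -= lr * g1[j][i]
    return [hidden(x) for x in vectors], losses
```

A call such as `train_feature_extractor([[0.1, 0.5, 0.9], [0.3, 0.2, 0.8]])` returns one feature vector of length `hidden_dim` per user. In practice a deep learning framework would replace the hand-written gradient code.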
S13, calculating the relevance indexes of the index characteristic data, and screening a plurality of first key index characteristic data from the index characteristic data according to the relevance indexes.
Through correlation analysis, a plurality of first key index characteristic data can be extracted from the index characteristic data, so that when the model is trained based on the first key index characteristic data, the training effect of the model can be improved.
In an optional embodiment, the calculating a relevance indicator of the plurality of indicator feature data, and the screening a plurality of first key indicator feature data from the plurality of indicator feature data according to the relevance indicator includes:
calculating group stability index values and information values of the index feature data;
and screening a plurality of first key index characteristic data from the plurality of index characteristic data according to the group stability index value and the information value.
In financial forecasting scenarios, stability takes precedence over everything else. The reason is that, once a set of prediction models formally goes online, it often takes a long time (usually more than one year) before the models are taken offline and replaced. An unstable model is an uncontrollable model, which is an uncertainty risk for the business itself and directly affects the rationality of decisions.
The Population Stability Index (PSI) and the Information Value (IV) are used to evaluate the index feature data. The smaller the PSI, the better the stability of the index feature data, and the larger the PSI, the worse the stability. IV is conventionally a measure of how well a feature separates the classes of the target: the larger the IV, the stronger the index feature data, and the smaller the IV, the weaker. The calculation of PSI and IV is prior art and is not described in detail herein.
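For reference, the commonly used definitions of PSI and IV over binned data can be sketched as follows. The text treats both calculations as prior art and gives no formula, so the definitions below are the conventional ones rather than anything confirmed by the source; the function names and the small `eps` guard against empty bins are assumptions.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_counts: per-bin counts in the reference period.
    actual_counts: per-bin counts in the period being compared.
    """
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


def information_value(good_counts, bad_counts, eps=1e-6):
    """Information Value of a binned feature against a binary label."""
    g_total, b_total = sum(good_counts), sum(bad_counts)
    total = 0.0
    for g, b in zip(good_counts, bad_counts):
        g_pct = max(g / g_total, eps)
        b_pct = max(b / b_total, eps)
        total += (g_pct - b_pct) * math.log(g_pct / b_pct)
    return total
```

Both quantities are zero when the two distributions are identical and grow as they diverge, which is why a small PSI indicates a stable feature and a large IV indicates a feature that separates the two classes well.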
In an optional embodiment, the screening the plurality of first key index feature data from the plurality of index feature data according to the population stability index value and the information value comprises:
acquiring first candidate index characteristic data corresponding to a group stability index value smaller than a preset group stability index threshold in the plurality of index characteristic data;
sorting the information values of the first candidate index feature data;
acquiring second candidate index characteristic data corresponding to a preset number of top-ranked information values after sorting;
determining the second candidate index feature data as the plurality of first key index feature data.
Because most data are recorded on a monthly basis, the PSI and IV values of the plurality of index feature data are compared month by month; the more stable index feature data are selected and the unstable ones are eliminated. Training the model on the stable index feature data effectively ensures the stability of the model.
S14, extracting a plurality of second key index feature data and the index weight of each second key index feature data by adopting a least absolute shrinkage and selection operator (Lasso) based on the plurality of first key index feature data.
The Least Absolute Shrinkage and Selection Operator (Lasso) is a linear regression method that uses L1 regularization. L1 regularization drives some of the learned feature weights to exactly 0, thereby achieving sparsification and feature selection.
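The two-stage screening rule (PSI threshold first, then the top-ranked IV values) can be sketched as below. The function name, the input layout and the sample threshold and values are hypothetical; the text only says that the threshold and the number of features kept are preset.

```python
def screen_first_key_features(stats, psi_threshold, top_k):
    """Select first key index feature data.

    stats maps a feature name to a (psi_value, iv_value) pair.
    Features with PSI below psi_threshold are kept as first candidates,
    then the top_k of them by IV are returned in descending IV order.
    """
    candidates = [(name, iv) for name, (p, iv) in stats.items()
                  if p < psi_threshold]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in candidates[:top_k]]


# Made-up PSI and IV values, for illustration only.
stats = {
    "a": (0.05, 0.3),
    "b": (0.30, 0.9),  # high IV but unstable, so it is dropped
    "c": (0.02, 0.1),
    "d": (0.08, 0.5),
}
selected = screen_first_key_features(stats, psi_threshold=0.1, top_k=2)
```

With these sample values the call keeps `d` and `a`: feature `b` has the highest IV but fails the PSI threshold, and `c` is stable but ranks third by IV.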
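How L1 regularization produces exact zeros can be seen in a small coordinate descent solver. This is a teaching sketch, not the procedure of the source, which does not say how the Lasso problem is solved: the objective omits an intercept, the function names are hypothetical, and taking the absolute value of each coefficient as the "index weight" is an assumption.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + lam * ||w||_1 (no intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_wo_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
            rho /= n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w


# Feature 0 drives the target; feature 1 is small noise (made-up data).
X = [[1.0, 0.1], [2.0, -0.1], [3.0, 0.1], [4.0, -0.1]]
y = [2.0, 4.0, 6.0, 8.0]
weights = lasso_coordinate_descent(X, y, lam=0.1)
kept = [j for j, wj in enumerate(weights) if wj != 0.0]
```

Here the coefficient of the noise feature is set to exactly zero by the soft-thresholding step, so only feature 0 is kept; the surviving coefficients, for example in absolute value, can then serve as the index weights used in the next step.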
S15, according to the index weight of each second key index feature data, performing simulation training on the user grade prediction model multiple times by using a Monte Carlo simulation method, and screening out a plurality of target key index feature data from the plurality of second key index feature data according to the prediction results corresponding to the simulation training.
After the index weight of each second key index feature data is obtained, the target key index feature data can be determined according to the index weights. However, it is not known in advance how many key index feature data should be selected for the trained model to perform best. The Monte Carlo simulation method is therefore used for simulation training, and an appropriate number of target key index feature data are selected according to the simulation training results.
In an optional embodiment, the performing simulation training on the user grade prediction model multiple times by using a Monte Carlo simulation method according to the index weight of each second key index feature data, and the screening of a plurality of target key index feature data from the plurality of second key index feature data according to the prediction result corresponding to the simulation training includes:
sorting the plurality of second key index feature data in descending order of index weight to obtain a key index feature data sequence;
for the first time, selecting a first preset proportion of second key index feature data from the key index feature data sequence, starting from the first item in the sequence, as first training data;
training a first user grade prediction model based on the first training data, and calculating a first prediction accuracy of the first user grade prediction model;
for the second time, selecting a second preset proportion of second key index feature data from the key index feature data sequence, starting from the first item in the sequence, as second training data;
training a second user grade prediction model based on the second training data, and calculating a second prediction accuracy of the second user grade prediction model;
determining whether the second prediction accuracy is greater than the first prediction accuracy;
when the second prediction accuracy is higher than the first prediction accuracy, for the third time, selecting a third preset proportion of second key index feature data from the key index feature data sequence, starting from the first item in the sequence, as third training data;
training a third user grade prediction model based on the third training data, and calculating a third prediction accuracy of the third user grade prediction model;
determining whether the third prediction accuracy is greater than the second prediction accuracy;
and when the third prediction accuracy is not higher than the second prediction accuracy, determining the second key index feature data selected at the second preset proportion as the plurality of target key index feature data.
The plurality of second key index characteristic data are sorted in reverse (descending) order of index weight to obtain the key index characteristic data sequence: the larger the index weight, the earlier the corresponding second key index characteristic data appears in the sequence, and the smaller the index weight, the later it appears. The index weight represents the importance of the corresponding second key index characteristic data; data with a larger weight plays a larger role in training the model.
After the reverse ordering, second key index characteristic data of a first preset proportion are selected for the first time, starting from the first second key index characteristic data in the sequence and moving onward, to serve as the first training data. A first user level prediction model is trained based on the first training data, and the first prediction accuracy of the first user level prediction model is calculated, where the prediction accuracy is the ratio of the number of samples whose predicted value is correct to the number of samples that have a true value.
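As an illustration only, the reverse ordering, the first selection, and the accuracy ratio described above can be sketched in Python. The feature names, the weights, and the 0.4 proportion are hypothetical, and rounding the proportion to a whole number of features is an assumption that the text does not specify.

```python
def rank_by_weight(weights):
    """Return feature names sorted by index weight, largest weight first."""
    return sorted(weights, key=lambda name: weights[name], reverse=True)


def take_leading_proportion(ranked, proportion):
    """Select the leading share of the ranked sequence, starting from the first item."""
    # Assumption: round to a whole number of features and keep at least one.
    count = max(1, round(proportion * len(ranked)))
    return ranked[:count]


def prediction_accuracy(predicted, actual):
    """Ratio of correctly predicted samples to samples that have a true value."""
    if not actual:
        raise ValueError("no true values to compare against")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical index weights for five second key index features.
weights = {"turnover": 0.9, "tenure": 0.4, "gdp": 0.7, "age": 0.1, "population": 0.2}
ranked = rank_by_weight(weights)
first_training_features = take_leading_proportion(ranked, 0.4)
```

With these made-up weights, the first selection keeps the two highest-weighted features; a larger proportion would extend the same prefix of the sequence.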
For the second time, second key index characteristic data of a second preset proportion are selected, again starting from the first second key index characteristic data in the sorted sequence, to serve as the second training data; a second user grade prediction model is trained based on the second training data, and its second prediction accuracy is calculated.
The second preset proportion is greater than the first preset proportion. That is, more training data are selected the second time than the first time.
When the second prediction accuracy is higher than the first prediction accuracy, the selected second training data contributes positively to model training; that is, the second key index characteristic data selected the second time in addition to those selected the first time have a positive effect on model training.
When the second prediction accuracy is not higher than the first prediction accuracy, the selected second training data contributes negatively to model training; that is, the second key index characteristic data selected the second time in addition to those selected the first time have a negative effect on model training.
As long as the user grade prediction model trained on the training data selected in a given round has a higher prediction accuracy than the model trained on the training data selected in the previous round, the above process is repeated. The Monte Carlo simulation process ends when the prediction accuracy of the model trained in some round is less than or equal to that of the model trained in the previous round.
In this optional embodiment, after the second key index feature data are sorted in reverse order by index weight to obtain the key index feature data sequence, the user grade prediction model is trained in simulation multiple times using the Monte Carlo simulation method. Whether to continue selecting target key index feature data from the second key index feature data is decided by comparing the prediction results of the two most recently trained user grade prediction models. The target key index feature data selected in this way effectively improve the prediction accuracy of the user grade prediction model, so that the model reaches its best prediction accuracy with the fewest target key index feature data.
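The stopping rule described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `screen_until_no_gain`, the `evaluate` callback (which stands in for training a user grade prediction model and returning its accuracy), and the accuracy table are all hypothetical, and any random element of the Monte Carlo procedure is not modeled because the text does not specify it.

```python
def screen_until_no_gain(ranked_features, proportions, evaluate):
    """Grow the selected prefix of the ranked sequence until accuracy stops improving.

    ranked_features: features sorted by index weight, largest first.
    proportions: increasing preset proportions, one per round.
    evaluate: callable that trains a model on a feature subset and returns its accuracy.
    """
    if not ranked_features:
        raise ValueError("no features to select from")
    best_subset, best_accuracy = None, float("-inf")
    for proportion in proportions:
        count = max(1, round(proportion * len(ranked_features)))
        subset = ranked_features[:count]
        accuracy = evaluate(subset)
        # Stop as soon as a round is no better than the previous one.
        if accuracy <= best_accuracy:
            break
        best_subset, best_accuracy = subset, accuracy
    return best_subset, best_accuracy


# Hypothetical stand-in for model training: accuracy looked up by subset size.
ranked = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10"]
simulated_accuracy = {2: 0.70, 4: 0.80, 6: 0.78, 8: 0.90}
target_features, target_accuracy = screen_until_no_gain(
    ranked, [0.2, 0.4, 0.6, 0.8], lambda subset: simulated_accuracy[len(subset)]
)
```

In this toy run the third round is worse than the second, so the loop stops and keeps the features selected in the second round; the fourth proportion is never evaluated.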
In an optional embodiment, the method further comprises:
training a user grade prediction model based on the plurality of target key index feature data;
obtaining prediction data of a user to be tested according to the target key index characteristic data;
and predicting, by using the user grade prediction model and based on the prediction data of the user to be tested, the grade of the user to be tested.
After the plurality of target key index feature data are selected, a user grade prediction model can be trained on them and then used to predict the grade of the user to be tested.
A plurality of target key indexes are determined according to the plurality of target key index feature data; a plurality of prediction key index feature data corresponding to those target key indexes are then obtained from the user data of the user to be tested and taken as the prediction data. The prediction data are input into the user grade prediction model, which outputs a plurality of grades and the prediction probability corresponding to each grade. The grade with the maximum prediction probability is determined as the grade of the user to be tested.
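The last two steps above, picking the prediction data for the target key indexes and choosing the grade with the largest probability, can be sketched as follows. The index names, grade labels, and probabilities are hypothetical, and the model itself is left out.

```python
def extract_prediction_data(user_data, target_indexes):
    """Pick the values of the target key indexes from one user's data, in order."""
    return [user_data[name] for name in target_indexes]


def predict_grade(grade_probabilities):
    """Return the grade whose predicted probability is the largest."""
    if not grade_probabilities:
        raise ValueError("the model returned no grades")
    return max(grade_probabilities, key=grade_probabilities.get)


# Hypothetical user data and model output.
user_data = {"gdp": 1.0, "age": 30, "turnover": 5.5}
prediction_data = extract_prediction_data(user_data, ["turnover", "gdp"])
grade = predict_grade({"A": 0.2, "B": 0.7, "C": 0.1})
```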
In this optional embodiment, because the selected target key index feature data are the optimal selection, the user level prediction model trained on them has the highest prediction accuracy, so the grade of the user to be tested is predicted more accurately.
It is emphasized that, to further ensure the privacy and security of the plurality of target key indicator characteristic data, the plurality of target key indicator characteristic data may be stored in the nodes of the blockchain.
Fig. 2 is a structural diagram of a data index screening apparatus according to a second embodiment of the present invention.
In some embodiments, the data index screening apparatus 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the data index screening apparatus 20 may be stored in a memory of a computer device and executed by at least one processor to perform the data index screening function (described in detail with reference to Fig. 1).
In this embodiment, the data index screening apparatus 20 may be divided into a plurality of functional modules according to the functions performed by the apparatus. The functional module may include: a data processing module 201, a feature extraction module 202, a first screening module 203, a second screening module 204, a third screening module 205, and a rank prediction module 206. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The data processing module 201 is configured to determine data distribution of the plurality of user data under each data index, and perform normalization processing on the plurality of user data under the corresponding data index according to the data distribution.
The user data of each user may include, but is not limited to, local economic data, basic data, business data, and the like. The local economic data may include GDP, population size, and the like; the basic data may include educational background, years of work, and the like; and the business data may include average daily turnover, average daily increase in staff, and the like. The user data described here is only an example and does not limit the present invention; the user data may be determined according to the actual application scenario.
The user data may be extracted from a database internal to the enterprise.
The user data of different data indexes have different data distributions, and the user data under the corresponding data indexes are subjected to standardized processing according to the data distributions, so that the user data under different data indexes can be subjected to differentiated processing.
In an optional embodiment, the data processing module 201 performing normalization processing on the user data under the corresponding data index according to the data distribution includes:
when the data distribution of first user data under a first data index is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data index;
and when the data distribution of the second user data under the second data index is continuous data distribution, performing power operation on the second user data by using a preset power function to obtain a plurality of standard data under the second data index.
Because the user data under some data indexes may be discrete (e.g., educational background, age) while the user data under other data indexes may be continuous (e.g., average daily turnover, average daily increase in staff), the user data needs to be standardized according to its data distribution.
In specific implementation, the computer device firstly acquires user data of a plurality of users belonging to the same data index, then determines data distribution of the plurality of user data under each data index, and adopts different standardized processing modes for the plurality of user data under the data index according to the data distribution.
To address missing discrete data, when the data distribution of the user data corresponding to a certain data index is a discrete data distribution, function fitting is performed on the existing user data, and the fitted function is then used to estimate the missing data; the estimated values are filled in so that the user data are complete. For example, for the data index of age, the age data of 10 users are collected; nine values are present (21, 22, 23, 24, 26, 27, 28, 29, 30) and the age of the 5th user is missing. Function fitting is performed on the nine data points (1, 21), (2, 22), (3, 23), (4, 24), (6, 26), (7, 27), (8, 28), (9, 29), and (10, 30), and the argument 5 is then substituted into the fitted function to obtain the missing value 25.
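The age example can be reproduced with a small sketch. The text does not say which family of functions is fitted; a straight line fitted by ordinary least squares is assumed here, and the function names are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct positions to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_missing(values):
    """Replace None entries with values of the fitted line; positions are 1-based."""
    known = [(i, v) for i, v in enumerate(values, start=1) if v is not None]
    slope, intercept = fit_line(known)
    return [
        v if v is not None else slope * i + intercept
        for i, v in enumerate(values, start=1)
    ]


# Ages of 10 users; the 5th user's age is missing.
ages = [21, 22, 23, 24, None, 26, 27, 28, 29, 30]
filled = fill_missing(ages)
```

Because the nine known points lie on a straight line, the value filled in at position 5 is 25 up to floating-point rounding, matching the example in the text.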
When the data distribution of the user data corresponding to a certain data index is a continuous data distribution, a power function, for example, a log function, may be used to perform a power operation on each user data under the data index, so as to achieve the purpose of discretizing the continuous data.
In other embodiments, when the data distribution of the user data corresponding to a certain data index is a continuous data distribution, the user data under the data index may be subjected to binning operation, so as to achieve the purpose of discretizing the continuous data. Wherein the binning operation may include chi-square binning, equidistant binning, equal frequency binning, and the like.
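A minimal sketch of the two treatments of continuous data mentioned above: a log transform, and equidistant and equal-frequency binning. Using `log(1 + v)` so that zero is handled is an assumption, as are the function names; chi-square binning is not shown.

```python
import math


def log_transform(values):
    """Apply log(1 + v); assumes every value is non-negative."""
    return [math.log1p(v) for v in values]


def equal_width_bins(values, n_bins):
    """Equidistant binning: split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Equal-frequency binning: each bin receives about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


width_bins = equal_width_bins([0, 5, 10], 2)
frequency_bins = equal_frequency_bins([40, 10, 30, 20], 2)
```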
In the optional embodiment, the data distribution of the plurality of user data under different data indexes is different, and different standardized processing is performed on the plurality of user data under the corresponding data indexes according to the type of the data distribution, so that different user data are processed in a differentiated mode, the processing effect of the user data is good, the plurality of index feature data can be conveniently extracted subsequently, and the integrated prediction model can be trained conveniently.
The feature extraction module 202 is configured to generate a standard data vector of each user according to the standard data obtained through the standardization process, and perform feature extraction on the standard data vector of each user to obtain a plurality of index feature data.
If there are M users and n data indexes, each user has n user data, there are M user data under each data index, and the M users have M × n user data in total.
Since the plurality of user data under each data index are standardized, the standard data under all data indexes belonging to the same user are spliced to obtain the standard data vector of the user. The standard data vector of the ith user is recorded as (Mi1, Mi2, …, Min), and Min represents the user data of the ith user under the nth data index.
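The splicing step can be sketched as a transpose from per-index columns to per-user vectors. The dictionary layout and the index names are hypothetical; it is assumed that every column lists the users in the same order.

```python
def build_standard_vectors(standard_data):
    """Splice the standard data of each user across all data indexes.

    standard_data maps a data index name to a list of M standardized values,
    one per user, with users in the same order in every list.
    """
    index_names = list(standard_data)
    lengths = {len(column) for column in standard_data.values()}
    if len(lengths) != 1:
        raise ValueError("every data index must hold exactly one value per user")
    user_count = lengths.pop()
    return [
        tuple(standard_data[name][i] for name in index_names)
        for i in range(user_count)
    ]


# Hypothetical standard data: M = 3 users, n = 2 data indexes.
standard_data = {"age": [0.1, 0.2, 0.3], "turnover": [1.0, 2.0, 3.0]}
vectors = build_standard_vectors(standard_data)
```

Each resulting tuple is one user's standard data vector, and the vectors together hold M × n values.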
In an optional embodiment, the feature extraction module 202 performing feature extraction on the standard data vector of each user to obtain a plurality of index feature data includes:
randomly initializing an initial high-dimensional feature vector for the standard data vector of each user;
taking a plurality of standard data vectors as training data and corresponding high-dimensional feature vectors as training targets, and iteratively training a neural network model;
and acquiring a plurality of index characteristic data in the trained neural network model.
Because the combination features of different users differ, the integrated prediction model cannot converge when it is trained on those combination features if the differences are large. Therefore, so that the user level prediction model converges quickly in subsequent training, feature extraction is performed on the standard data vector of each user to obtain a plurality of index feature data.
The first screening module 203 is configured to calculate a relevance index of the plurality of index feature data, and screen a plurality of first key index feature data from the plurality of index feature data according to the relevance index.
Through correlation analysis, a plurality of first key index characteristic data can be extracted from the index characteristic data, so that when the model is trained based on the first key index characteristic data, the training effect of the model can be improved.
In an optional embodiment, the first screening module 203 calculating a relevance index of the plurality of index feature data and screening a plurality of first key index feature data from the plurality of index feature data according to the relevance index includes:
calculating group stability index values and information values of the index feature data;
and screening a plurality of first key index characteristic data from the plurality of index characteristic data according to the group stability index value and the information value.
In financial forecasting scenarios, stability takes priority over everything else. The reason is that, once a set of prediction models formally goes online, it often takes a long time (usually more than one year) before it is taken offline and replaced. An unstable model is an uncontrollable model, which poses an uncertainty risk to the business itself and directly affects the rationality of decisions.
The Population Stability Index (PSI) and Information Value (IV) reflect the stability of the data. The smaller the PSI, the better the stability of the index characteristic data, and the larger the PSI, the worse the stability of the index characteristic data. The larger the IV, the better the stability of the index characteristic data, and the smaller the IV, the worse the stability of the index characteristic data. The calculation of the PSI and IV is prior art and is not described in detail herein.
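For reference, the commonly used definitions of the two quantities can be sketched as follows; the text itself does not give formulas, so this reflects general practice rather than the specific calculation of the invention. Both are computed per bin and summed, and the small floor that avoids taking the logarithm of zero is an assumption.

```python
import math


def _shares(counts, floor=1e-6):
    """Convert bin counts to shares, floored to keep the logarithm finite."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("bin counts must sum to a positive number")
    return [max(c / total, floor) for c in counts]


def population_stability_index(expected_counts, actual_counts):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = _shares(expected_counts)
    actual = _shares(actual_counts)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def information_value(good_counts, bad_counts):
    """IV = sum over bins of (good share - bad share) * ln(good share / bad share)."""
    good = _shares(good_counts)
    bad = _shares(bad_counts)
    return sum((g - b) * math.log(g / b) for g, b in zip(good, bad))


# Hypothetical bin counts for one index feature in two different months.
psi_same = population_stability_index([50, 50], [50, 50])
psi_shifted = population_stability_index([50, 50], [40, 60])
iv_example = information_value([80, 20], [20, 80])
```

Identical distributions give a PSI of zero, and the PSI grows as the distribution of the later month drifts away from that of the earlier one.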
In an optional embodiment, the screening the plurality of first key index feature data from the plurality of index feature data according to the population stability index value and the information value comprises:
acquiring first candidate index characteristic data corresponding to a group stability index value smaller than a preset group stability index threshold in the plurality of index characteristic data;
sorting the information values of the first candidate index feature data;
acquiring second candidate index characteristic data corresponding to a preset number of top-ranked information values after the sorting;
determining the second candidate index feature data as the plurality of first key index feature data.
Because most of the data are organized by month, the PSI and IV values of the plurality of index characteristic data are compared month by month; the more stable index characteristic data are kept and the unstable index characteristic data are eliminated. Training the model on stable index characteristic data effectively ensures the stability of the model.
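The two-stage screening listed above, a PSI threshold followed by the top-ranked information values, can be sketched as below. The feature names, threshold, and statistics are hypothetical, and sorting the information values from largest to smallest is an assumption consistent with keeping the top-ranked ones.

```python
def screen_first_key_features(stats, psi_threshold, top_k):
    """Keep features with PSI below the threshold, then the top_k by IV.

    stats maps a feature name to a (psi, iv) pair.
    """
    candidates = [name for name, (psi, _) in stats.items() if psi < psi_threshold]
    # Assumption: larger information values rank first.
    candidates.sort(key=lambda name: stats[name][1], reverse=True)
    return candidates[:top_k]


# Hypothetical (PSI, IV) values for four index features.
stats = {"a": (0.05, 0.30), "b": (0.30, 0.90), "c": (0.08, 0.50), "d": (0.02, 0.10)}
first_key_features = screen_first_key_features(stats, psi_threshold=0.1, top_k=2)
```

Feature "b" has the largest IV but is dropped first because its PSI exceeds the threshold.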
The second screening module 204 is configured to extract, by using a least absolute shrinkage and selection operator, a plurality of second key index feature data and an index weight of each second key index feature data based on the plurality of first key index feature data.
The Least Absolute Shrinkage and Selection Operator (Lasso) is a linear regression method that uses L1 regularization. L1 regularization drives some of the learned feature weights to exactly 0, which achieves sparsification and feature selection.
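To show how L1 regularization zeroes out some weights, here is a small coordinate-descent Lasso in plain Python. It is a sketch under stated assumptions, not the implementation of the invention: the columns are assumed to be centered so that no intercept is fitted, the data are made up, and taking the absolute value of each coefficient as the index weight is an assumption.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, targets, alpha, n_iter=100):
    """Minimize (1 / (2n)) * sum((y - X w)^2) + alpha * sum(|w|)."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho, z = 0.0, 0.0
            for x, y in zip(rows, targets):
                prediction = sum(w[k] * x[k] for k in range(d))
                # Residual with the contribution of feature j added back.
                partial_residual = y - prediction + w[j] * x[j]
                rho += x[j] * partial_residual
                z += x[j] * x[j]
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w


# Made-up data: the target depends only on the first feature (y = 3 * x0).
rows = [[-2.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [2.0, 1.0]]
targets = [-6.0, -3.0, 3.0, 6.0]
coefficients = lasso_coordinate_descent(rows, targets, alpha=0.5)
index_weights = [abs(c) for c in coefficients]
selected = [j for j, c in enumerate(coefficients) if c != 0.0]
```

On this toy data the coefficient of the irrelevant second feature is exactly zero, so only the first feature is kept, and its coefficient is shrunk from 3 to about 2.8 by the penalty.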
The third screening module 205 is configured to perform simulation training on the user level prediction model multiple times by using a Monte Carlo simulation method according to the index weight of each second key index feature data, and to screen out a plurality of target key index feature data from the plurality of second key index feature data according to the prediction results of the simulation training.
After the index weight of each second key index feature data is obtained, the target key index feature data can be determined according to the index weight. However, it is uncertain how many target key index feature data should be selected for the trained model to perform best, so simulation training is performed using the Monte Carlo simulation method, and an appropriate number of target key index feature data is selected according to the simulation training results.
In an optional embodiment, the third screening module 205 performing simulation training on the user level prediction model multiple times by using a Monte Carlo simulation method according to the index weight of each second key index feature data, and screening out a plurality of target key index feature data from the plurality of second key index feature data according to the prediction results of the simulation training, includes:
according to the index weight, performing reverse ordering on the plurality of second key index feature data to obtain a key index feature data sequence;
selecting, for the first time and starting from the first second key index characteristic data, second key index characteristic data of a first preset proportion from the key index characteristic data sequence as first training data;
training a first user level prediction model based on the first training data and calculating a first prediction accuracy of the first user level prediction model;
selecting, for the second time and starting from the first second key index characteristic data, second key index characteristic data of a second preset proportion from the key index characteristic data sequence as second training data;
training a second user level prediction model based on the second training data, and calculating a second prediction accuracy of the second user level prediction model;
determining whether the second prediction accuracy is greater than the first prediction accuracy;
when the second prediction accuracy is higher than the first prediction accuracy, selecting, for the third time and starting from the first second key index characteristic data, second key index characteristic data of a third preset proportion from the key index characteristic data sequence as third training data;
training a third user grade prediction model based on the third training data, and calculating a third prediction accuracy of the third user grade prediction model;
determining whether the third prediction accuracy is greater than the second prediction accuracy;
and when the third prediction accuracy is not higher than the second prediction accuracy, determining the selected second key index characteristic data of the second preset proportion as the plurality of target key index characteristic data.
The plurality of second key index characteristic data are sorted in reverse (descending) order of index weight to obtain the key index characteristic data sequence: the larger the index weight, the earlier the corresponding second key index characteristic data appears in the sequence, and the smaller the index weight, the later it appears. The index weight represents the importance of the corresponding second key index characteristic data; data with a larger weight plays a larger role in training the model.
After the reverse ordering, second key index characteristic data of a first preset proportion are selected for the first time, starting from the first second key index characteristic data in the sequence and moving onward, to serve as the first training data. A first user level prediction model is trained based on the first training data, and the first prediction accuracy of the first user level prediction model is calculated, where the prediction accuracy is the ratio of the number of samples whose predicted value is correct to the number of samples that have a true value.
For the second time, second key index characteristic data of a second preset proportion are selected, again starting from the first second key index characteristic data in the sorted sequence, to serve as the second training data; a second user grade prediction model is trained based on the second training data, and its second prediction accuracy is calculated.
The second preset proportion is greater than the first preset proportion. That is, more training data are selected the second time than the first time.
When the second prediction accuracy is higher than the first prediction accuracy, the selected second training data contributes positively to model training; that is, the second key index characteristic data selected the second time in addition to those selected the first time have a positive effect on model training.
When the second prediction accuracy is not higher than the first prediction accuracy, the selected second training data contributes negatively to model training; that is, the second key index characteristic data selected the second time in addition to those selected the first time have a negative effect on model training.
As long as the user grade prediction model trained on the training data selected in a given round has a higher prediction accuracy than the model trained on the training data selected in the previous round, the above process is repeated. The Monte Carlo simulation process ends when the prediction accuracy of the model trained in some round is less than or equal to that of the model trained in the previous round.
In this optional embodiment, after the second key index feature data are sorted in reverse order by index weight to obtain the key index feature data sequence, the user grade prediction model is trained in simulation multiple times using the Monte Carlo simulation method. Whether to continue selecting target key index feature data from the second key index feature data is decided by comparing the prediction results of the two most recently trained user grade prediction models. The target key index feature data selected in this way effectively improve the prediction accuracy of the user grade prediction model, so that the model reaches its best prediction accuracy with the fewest target key index feature data.
The grade prediction module 206 is configured to train a user grade prediction model based on the plurality of target key index feature data; obtain prediction data of a user to be tested according to the target key index feature data; and predict, by using the user grade prediction model and based on the prediction data of the user to be tested, the grade of the user to be tested.
After the plurality of target key index feature data are selected, a user grade prediction model can be trained on them and then used to predict the grade of the user to be tested.
A plurality of target key indexes are determined according to the plurality of target key index feature data; a plurality of prediction key index feature data corresponding to those target key indexes are then obtained from the user data of the user to be tested and taken as the prediction data. The prediction data are input into the user grade prediction model, which outputs a plurality of grades and the prediction probability corresponding to each grade. The grade with the maximum prediction probability is determined as the grade of the user to be tested.
In this optional embodiment, because the selected target key index feature data are the optimal selection, the user level prediction model trained on them has the highest prediction accuracy, so the grade of the user to be tested is predicted more accurately.
It is emphasized that, to further ensure the privacy and security of the plurality of target key indicator characteristic data, the plurality of target key indicator characteristic data may be stored in the nodes of the blockchain.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not constitute a limitation of the embodiments of the present invention; the configuration may be a bus-type configuration or a star-type configuration, and the computer device 3 may include more or fewer hardware or software components than those shown, or a different arrangement of components.
In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 3 is only an example; other existing or future electronic products that can be adapted to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 has stored therein a computer program that, when executed by the at least one processor 32, performs all or part of the steps of the data index screening method described above. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
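The idea of data blocks linked by a cryptographic method can be illustrated with a toy hash chain. This is only a sketch of the linking principle, with hypothetical function names and payloads; it has no consensus mechanism, networking, or encryption, and it is not the blockchain platform referred to in the text.

```python
import hashlib
import json


def block_digest(payload, prev_hash):
    """SHA-256 digest over the payload and the previous block's hash."""
    body = json.dumps({"payload": payload, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a block whose hash covers its payload and the previous block's hash."""
    prev_hash = chain[-1]["hash"] if chain else ""
    chain.append(
        {"payload": payload, "prev_hash": prev_hash, "hash": block_digest(payload, prev_hash)}
    )


def verify_chain(chain):
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        if block["hash"] != block_digest(block["payload"], block["prev_hash"]):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True


# Hypothetical payloads: names of target key index feature data.
chain = []
append_block(chain, {"features": ["turnover", "gdp"]})
append_block(chain, {"features": ["tenure"]})
valid_before = verify_chain(chain)
chain[0]["payload"]["features"].append("tampered")
valid_after = verify_chain(chain)
```

Changing the payload of an earlier block invalidates its stored hash, which is how tampering is detected.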
In some embodiments, the at least one processor 32 is the control unit of the computer device 3; it connects the various components of the entire computer device 3 by using various interfaces and lines, and executes the various functions of the computer device 3 and processes its data by running or executing the programs or modules stored in the memory 31 and calling the data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or part of the steps of the data index screening method described in the embodiments of the present invention, or implements all or part of the functions of the data index screening apparatus. The at least one processor 32 may be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer or a network device) or a processor to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the present invention can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (9)

1. A method for screening data indexes is characterized by comprising the following steps:
determining data distribution of a plurality of user data under each data index, and carrying out standardization processing on the plurality of user data under the corresponding data index according to the data distribution;
generating a standard data vector of each user according to the standard data obtained by standardization, and performing feature extraction on the standard data vector of each user to obtain a plurality of index feature data;
calculating the relevance indexes of the index characteristic data, and screening a plurality of first key index characteristic data from the index characteristic data according to the relevance indexes;
extracting a plurality of second key index feature data and the index weight of each second key index feature data by adopting a least absolute shrinkage and selection operator (LASSO) based on the plurality of first key index feature data;
simulating training of a user grade prediction model multiple times by using a Monte Carlo simulation method according to the index weight of each second key index feature data, and screening a plurality of target key index feature data from the plurality of second key index feature data according to the prediction results corresponding to the simulated training, comprising:
sorting the plurality of second key index feature data in reverse order according to the index weights to obtain a key index feature data sequence;
for the first time, selecting from the key index feature data sequence, starting from the first second key index feature data, second key index feature data of a first preset proportion as first training data;
training a first user grade prediction model based on the first training data, and calculating a first prediction accuracy of the first user grade prediction model;
for the second time, selecting from the key index feature data sequence, starting from the first second key index feature data, second key index feature data of a second preset proportion as second training data;
training a second user grade prediction model based on the second training data, and calculating a second prediction accuracy of the second user grade prediction model;
determining whether the second prediction accuracy is greater than the first prediction accuracy;
when the second prediction accuracy is greater than the first prediction accuracy, for the third time, selecting from the key index feature data sequence, starting from the first second key index feature data, third key index feature data of a third preset proportion as third training data;
training a third user grade prediction model based on the third training data, and obtaining a third prediction accuracy of the third user grade prediction model;
determining whether the third prediction accuracy is greater than the second prediction accuracy;
and when the third prediction accuracy is higher than the second prediction accuracy, determining the selected second key index feature data of the second preset proportion as the plurality of target key index feature data.
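The proportion-by-proportion selection recited in claim 1 can be illustrated with a minimal sketch. Everything named here is hypothetical: `rank_by_weight`, `leading_share`, `screen_features`, the toy scorer, the feature names, and the proportions are not taken from the patent. Two readings are assumed: "reverse ordering" is taken to mean largest absolute weight first, and the Monte Carlo step is taken to mean averaging the accuracy over repeated randomised trainings. The stopping rule (stop at the first share that does not improve accuracy and keep the previous share) is a simplification and not the literal claim wording.

```python
import math
import random


def rank_by_weight(weights):
    """Order feature names by absolute index weight, largest first (assumed reading)."""
    return sorted(weights, key=lambda name: abs(weights[name]), reverse=True)


def leading_share(ranked, proportion):
    """Take the leading `proportion` of the sequence, always at least one feature."""
    return ranked[:max(1, math.ceil(proportion * len(ranked)))]


def screen_features(weights, proportions, train_and_score, rounds=20, seed=0):
    """Grow the selected share while the averaged accuracy keeps improving."""
    ranked = rank_by_weight(weights)
    rng = random.Random(seed)
    best_features, best_accuracy = [], float("-inf")
    for proportion in proportions:
        candidate = leading_share(ranked, proportion)
        # Monte Carlo step: repeat the randomised training and average the accuracy.
        accuracy = sum(train_and_score(candidate, rng) for _ in range(rounds)) / rounds
        if accuracy <= best_accuracy:
            break  # the larger share did not help, so keep the previous selection
        best_features, best_accuracy = candidate, accuracy
    return best_features, best_accuracy


def toy_train_and_score(features, rng):
    """Hypothetical stand-in for training a user grade model and scoring it."""
    informative = {"income", "tenure"}
    hits = len(informative.intersection(features))
    extras = len(features) - hits
    # Informative features raise accuracy, extra ones lower it, plus small jitter.
    return 0.5 + 0.2 * hits - 0.05 * extras + rng.uniform(-0.01, 0.01)


weights = {"income": 0.9, "tenure": -0.6, "clicks": 0.2, "age": 0.05}
selected, accuracy = screen_features(weights, [0.25, 0.5, 0.75], toy_train_and_score)
print(selected, round(accuracy, 2))
```

In a real pipeline the toy scorer would be replaced by a function that trains and evaluates an actual user grade prediction model on the selected features.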
2. The data index screening method of claim 1, wherein the calculating the relevance indexes of the plurality of index feature data, and the screening a plurality of first key index feature data from the plurality of index feature data according to the relevance indexes comprises:
calculating population stability index values and information values of the plurality of index feature data;
and screening a plurality of first key index feature data from the plurality of index feature data according to the population stability index values and the information values.
3. The data index screening method of claim 2, wherein the screening a plurality of first key index feature data from the plurality of index feature data according to the population stability index value and the information value comprises:
acquiring, from the plurality of index feature data, first candidate index feature data whose population stability index value is smaller than a preset population stability index threshold;
sorting the information values of the first candidate index feature data;
acquiring second candidate index feature data corresponding to a preset number of top-ranked information values after the sorting;
determining the second candidate index feature data as the plurality of first key index feature data.
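The screening in claims 2 and 3 can be sketched as follows, using the commonly used definitions of the population stability index (PSI) and the information value (IV) over binned counts. The function names, the 0.1 threshold, the `top_k` value, the descending sort, and the sample bin counts are illustrative assumptions; the claims only require a preset threshold and a preset number.

```python
import math


def _proportions(counts, eps=1e-6):
    """Turn bin counts into proportions, floored at eps to avoid log(0)."""
    total = sum(counts)
    return [max(count / total, eps) for count in counts]


def psi(expected_counts, actual_counts):
    """Population stability index between a baseline and a current binning."""
    expected = _proportions(expected_counts)
    actual = _proportions(actual_counts)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def information_value(good_counts, bad_counts):
    """Information value of a binned feature against a binary label."""
    good = _proportions(good_counts)
    bad = _proportions(bad_counts)
    return sum((g - b) * math.log(g / b) for g, b in zip(good, bad))


def screen(features, psi_threshold=0.1, top_k=2):
    """Keep stable features (PSI below threshold), then the top_k by IV."""
    stable = [f for f in features if psi(f["expected"], f["actual"]) < psi_threshold]
    stable.sort(key=lambda f: information_value(f["good"], f["bad"]), reverse=True)
    return [f["name"] for f in stable[:top_k]]


features = [
    {"name": "income", "expected": [50, 30, 20], "actual": [48, 32, 20],
     "good": [10, 30, 60], "bad": [60, 30, 10]},
    {"name": "clicks", "expected": [50, 30, 20], "actual": [10, 30, 60],
     "good": [30, 40, 30], "bad": [35, 35, 30]},  # distribution shifted: high PSI
    {"name": "age", "expected": [40, 40, 20], "actual": [41, 39, 20],
     "good": [30, 40, 30], "bad": [35, 35, 30]},  # stable but weakly predictive
    {"name": "tenure", "expected": [30, 40, 30], "actual": [30, 41, 29],
     "good": [20, 30, 50], "bad": [40, 30, 30]},
]
print(screen(features))
```

Here "clicks" is dropped by the PSI filter because its distribution has shifted, and "age" is dropped by the IV ranking because it separates the two label groups only weakly.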
4. The method for screening data indexes according to claim 3, wherein the step of standardizing the user data under the corresponding data indexes according to the data distribution comprises:
when the data distribution of first user data under a first data index is discrete data distribution, performing function fitting on the first user data to obtain a fitting function, fitting missing data by using the fitting function, and filling the missing data to the corresponding position of the first user data to obtain a plurality of standard data under the first data index;
and when the data distribution of the second user data under the second data index is continuous data distribution, performing power operation on the second user data by using a preset power function to obtain a plurality of standard data under the second data index.
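The two standardization branches of claim 4 can be sketched as below. The claim does not name the fitting function or the power function, so a least-squares straight line over record positions and a square-root transform are used purely as assumptions; `fit_line`, `fill_missing`, and `power_transform` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the known points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def fill_missing(values):
    """Discrete branch: fit on the known entries and fill each None in place."""
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    slope, intercept = fit_line([i for i, _ in known], [v for _, v in known])
    return [v if v is not None else slope * i + intercept
            for i, v in enumerate(values)]


def power_transform(values, exponent=0.5):
    """Continuous branch: apply a preset power function (square root assumed here)."""
    return [v ** exponent for v in values]


filled = fill_missing([1.0, None, 3.0, 4.0])
scaled = power_transform([4.0, 9.0, 16.0])
print(filled, scaled)
```

A power transform with an exponent below 1 compresses large values, which is one common reason to apply it to right-skewed continuous data.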
5. The data index screening method of claim 4, wherein the extracting the features of the standard data vector of each user to obtain a plurality of index feature data comprises:
randomly initializing an initial high-dimensional feature vector for the standard data vector of each user;
taking a plurality of standard data vectors as training data and corresponding high-dimensional feature vectors as training targets, and iteratively training a neural network model;
and acquiring a plurality of index characteristic data in the trained neural network model.
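One possible reading of claim 5 is sketched below: each user's standard data vector is paired with a randomly initialized higher-dimensional target vector, a small model is trained iteratively to map one to the other, and the model outputs are then taken as the index feature data. The claim does not specify the network, so a single linear layer trained by stochastic gradient descent stands in for the neural network; all names, dimensions, and hyperparameters are assumptions.

```python
import random


def forward(weights, x):
    """Apply the linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def train_linear_layer(inputs, targets, epochs=200, lr=0.05, seed=0):
    """Fit the layer with per-sample gradient steps on squared error."""
    rng = random.Random(seed)
    n_in, n_out = len(inputs[0]), len(targets[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, t in zip(inputs, targets):
            errors = [y - ti for y, ti in zip(forward(weights, x), t)]
            total += sum(e * e for e in errors)
            for j, row in enumerate(weights):
                for i in range(n_in):
                    row[i] -= lr * errors[j] * x[i]
        losses.append(total / len(inputs))  # mean squared error for this epoch
    return weights, losses


# Three users, two standardized indexes each (made-up values).
standard_vectors = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
target_rng = random.Random(42)
# Randomly initialized 4-dimensional target vector per user.
targets = [[target_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in standard_vectors]

layer, losses = train_linear_layer(standard_vectors, targets)
index_features = [forward(layer, x) for x in standard_vectors]
print(len(index_features), len(index_features[0]))
```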
6. The method of any of claims 1 to 5, wherein the method further comprises:
training a user grade prediction model based on the plurality of target key index feature data;
obtaining prediction data of a user to be tested according to the plurality of target key index feature data;
and performing prediction with the user grade prediction model based on the prediction data of the user to be tested to obtain the grade of the user to be tested.
7. A data index screening apparatus, the apparatus comprising:
the data processing module is used for determining data distribution of the user data under each data index and carrying out standardization processing on the user data under the corresponding data index according to the data distribution;
the characteristic extraction module is used for generating a standard data vector of each user according to the standard data obtained by the standardization processing, and extracting the characteristics of the standard data vector of each user to obtain a plurality of index characteristic data;
the first screening module is used for calculating the relevance indexes of the index characteristic data and screening a plurality of first key index characteristic data from the index characteristic data according to the relevance indexes;
the second screening module is used for extracting a plurality of second key index feature data and the index weight of each second key index feature data by adopting a least absolute shrinkage and selection operator (LASSO) based on the plurality of first key index feature data;
the third screening module is used for simulating training of a user grade prediction model multiple times by using a Monte Carlo simulation method according to the index weight of each second key index feature data, and screening a plurality of target key index feature data from the plurality of second key index feature data according to the prediction results corresponding to the simulated training, which comprises:
sorting the plurality of second key index feature data in reverse order according to the index weights to obtain a key index feature data sequence;
for the first time, selecting from the key index feature data sequence, starting from the first second key index feature data, second key index feature data of a first preset proportion as first training data;
training a first user grade prediction model based on the first training data, and calculating a first prediction accuracy of the first user grade prediction model;
for the second time, selecting from the key index feature data sequence, starting from the first second key index feature data, second key index feature data of a second preset proportion as second training data;
training a second user grade prediction model based on the second training data, and calculating a second prediction accuracy of the second user grade prediction model;
determining whether the second prediction accuracy is greater than the first prediction accuracy;
when the second prediction accuracy is greater than the first prediction accuracy, for the third time, selecting from the key index feature data sequence, starting from the first second key index feature data, third key index feature data of a third preset proportion as third training data;
training a third user grade prediction model based on the third training data, and obtaining a third prediction accuracy of the third user grade prediction model;
determining whether the third prediction accuracy is greater than the second prediction accuracy;
and when the third prediction accuracy is higher than the second prediction accuracy, determining the selected second key index feature data of the second preset proportion as the plurality of target key index feature data.
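The operator used by the second screening module can be illustrated with a small coordinate-descent sketch that minimizes (1/(2n))·||y − Xw||² + α·||w||₁. Features whose weight is driven to exactly zero are discarded, and the surviving weights serve as index weights. The function names, the penalty `alpha`, the feature names, and the toy design matrix are assumptions made for illustration; the patent does not state which solver is used.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, sweeps=100):
    """Cyclic coordinate descent for the LASSO objective described above."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / scale if scale else 0.0
    return w


names = ["income", "tenure", "clicks"]
# Toy design with mutually orthogonal columns; y = 2*income - 1*tenure + 0.05*clicks.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [1.05, 2.95, -3.05, -0.95]

w = lasso_coordinate_descent(X, y, alpha=0.1)
second_key = {name: weight for name, weight in zip(names, w) if weight != 0.0}
print(second_key)
```

With this penalty the weak "clicks" coefficient falls inside the shrinkage band and is set to zero, so only "income" and "tenure" remain, each shrunk by `alpha` toward zero.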
8. A computer device, characterized in that the computer device comprises a processor for implementing the data index screening method according to any one of claims 1 to 6 when executing a computer program stored in a memory.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a data index screening method according to any one of claims 1 to 6.
CN202110037835.4A 2021-01-12 2021-01-12 Data index screening method and device, computer equipment and storage medium Active CN112818028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110037835.4A CN112818028B (en) 2021-01-12 2021-01-12 Data index screening method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110037835.4A CN112818028B (en) 2021-01-12 2021-01-12 Data index screening method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112818028A (en) 2021-05-18
CN112818028B (en) 2021-09-17

Family

ID=75868875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110037835.4A Active CN112818028B (en) 2021-01-12 2021-01-12 Data index screening method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112818028B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298448B (en) * 2021-07-26 2021-12-03 广东新禾道信息科技有限公司 Lease index analysis method and system based on Internet and cloud platform
CN116757334B (en) * 2023-08-16 2023-11-24 江西科技学院 Financial data processing method, system, readable storage medium and computer

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105634787A (en) * 2014-11-26 2016-06-01 华为技术有限公司 Evaluation method, prediction method and device and system for network key indicator
CN106269573A (en) * 2016-08-19 2017-01-04 广东溢达纺织有限公司 Knitting needle screening technique
CN106778032A (en) * 2016-12-14 2017-05-31 南京邮电大学 Ligand molecular magnanimity Feature Selection method in drug design
CN107315775A (en) * 2017-05-27 2017-11-03 国信优易数据有限公司 A kind of index calculating platform and method
CN107704880A (en) * 2017-10-12 2018-02-16 安徽大学 Crop disease identification method based on feature selection
CN111291816A (en) * 2020-02-17 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for carrying out feature processing aiming at user classification model
CN112102011A (en) * 2020-10-13 2020-12-18 平安科技(深圳)有限公司 User grade prediction method, device, terminal and medium based on artificial intelligence

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5843104B2 (en) * 2012-05-11 2016-01-13 ソニー株式会社 Information processing apparatus, information processing method, and program
US9430532B2 (en) * 2013-07-30 2016-08-30 NETFLIX Inc. Media content rankings for discovery of novel content
CN103871246B (en) * 2014-02-10 2016-05-04 南京大学 Based on the Short-time Traffic Flow Forecasting Methods of road network spatial relation constraint Lasso
CN109389310B (en) * 2018-10-12 2021-08-27 合肥工业大学 Electric vehicle charging facility maturity evaluation method based on Monte Carlo simulation
CN110059763A (en) * 2019-04-26 2019-07-26 数景智能科技(宁波)有限公司 A kind of Feature Selection method and device
CN111614491B (en) * 2020-05-06 2022-10-04 国网电力科学研究院有限公司 Power monitoring system oriented safety situation assessment index selection method and system
CN112199559B (en) * 2020-12-07 2021-02-19 上海冰鉴信息科技有限公司 Data feature screening method and device and computer equipment

Also Published As

Publication number Publication date
CN112818028A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN111950738B (en) Machine learning model optimization effect evaluation method, device, terminal and storage medium
CN109993233B (en) Method and system for predicting data auditing objective based on machine learning
CN112818028B (en) Data index screening method and device, computer equipment and storage medium
CN111950625A (en) Risk identification method and device based on artificial intelligence, computer equipment and medium
CN114997263B (en) Method, device, equipment and storage medium for analyzing training rate based on machine learning
CN112016905B (en) Information display method and device based on approval process, electronic equipment and medium
CN112199417B (en) Data processing method, device, terminal and storage medium based on artificial intelligence
CN112948275A (en) Test data generation method, device, equipment and storage medium
CN111738778B (en) User portrait generation method and device, computer equipment and storage medium
CN112328646B (en) Multitask course recommendation method and device, computer equipment and storage medium
WO2021139432A1 (en) Artificial intelligence-based user rating prediction method and apparatus, terminal, and medium
CN113435998A (en) Loan overdue prediction method and device, electronic equipment and storage medium
CN111639706A (en) Personal risk portrait generation method based on image set and related equipment
CN114398669A (en) Joint credit scoring method and device based on privacy protection calculation and cross-organization
CN111984898A (en) Label pushing method and device based on big data, electronic equipment and storage medium
CN112632179A (en) Model construction method and device, storage medium and equipment
CN112365051A (en) Agent retention prediction method and device, computer equipment and storage medium
CN117194382A (en) Middle-stage data processing method and device, electronic equipment and storage medium
Shino et al. Implementation of Data Mining with Naive Bayes Algorithm for Eligibility Classification of Basic Food Aid Recipients
CN111651452A (en) Data storage method and device, computer equipment and storage medium
CN116108276A (en) Information recommendation method and device based on artificial intelligence and related equipment
CN115860562A (en) Software workload rationality evaluation method, device and equipment
CN114490590A (en) Data warehouse quality evaluation method and device, electronic equipment and storage medium
CN115204501A (en) Enterprise evaluation method and device, computer equipment and storage medium
CN114968336A (en) Application gray level publishing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant