CN115713345A

CN115713345A - Variable screening method and device, nonvolatile storage medium and processor

Info

Publication number: CN115713345A
Application number: CN202211413169.0A
Authority: CN
Inventors: 槐正; 徐冬冬; 张涛; 姜承祥; 付迎鑫; 张哲�; 姬照中; 徐锐; 王健; 徐蕾
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-11-11
Filing date: 2022-11-11
Publication date: 2023-02-24

Abstract

The invention discloses a variable screening method and device, a nonvolatile storage medium and a processor. The method comprises: obtaining independent variables for evaluating a target variable, wherein the target variable corresponds to one or more independent variables; evaluating the linear relationship between each independent variable and the target variable by using a preset Pearson correlation coefficient model, and determining a linear correlation degree; and, in the case that the target variable corresponds to a plurality of independent variables whose linear correlation degrees are higher than a preset correlation threshold, selecting the independent variable with the highest linear correlation degree as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing independent variables to determine a predicted variable. The invention solves the technical problem that consumer portraits are determined inefficiently because the independent variables used to determine them cannot be screened.

Description

Variable screening method and device, nonvolatile storage medium and processor

Technical Field

The invention relates to the field of computers, in particular to a variable screening method and device, a nonvolatile storage medium and a processor.

Background

In recent years, with the rise of big data, research on consumer behavior analysis has flourished, and scholars from many fields, such as databases and data mining, information systems and information management, image processing and computer vision, social network analysis, and electronic commerce, have joined the consumer behavior research community. Meanwhile, this research field has also attracted close attention from digital-economy enterprises such as e-commerce and social network companies, which regard consumer behavior analysis as an effective means of understanding consumers and carrying out marketing activities. In these new fields, consumer behavior research is called consumer portrait research, and it is also important in research fields such as social computing.

However, the prior art does not screen the independent variables used to determine a consumer portrait, so the choice of independent variables affects the efficiency of determining the consumer portrait.

No effective solution has yet been proposed for the problem that consumer portraits are determined inefficiently because the independent variables used to determine them cannot be screened.

Disclosure of Invention

The embodiment of the invention provides a variable screening method and device, a nonvolatile storage medium and a processor, which at least solve the technical problem of low efficiency of determining a consumer portrait due to the fact that independent variables for determining the consumer portrait cannot be screened.

According to an aspect of an embodiment of the present invention, there is provided a variable screening method, including: obtaining independent variables for evaluating target variables, wherein the target variables correspond to one or more independent variables; evaluating the linear relation between the independent variable and the target variable by using a preset Pearson correlation coefficient model, and determining the linear correlation degree; and under the condition that the target variable corresponds to a plurality of independent variables with linear correlation degrees higher than a preset correlation degree threshold value, selecting the independent variable with the highest linear correlation degree as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing and determining a prediction variable for the independent variables.

Optionally, after selecting the independent variable with the highest linear correlation degree as the sample variable of the target variable, the method further includes: identifying a variable type of the independent variable, wherein the variable type comprises at least: proportional-type variables, interval-type variables, category-type variables and binary variables; in the case that the independent variable is a category-type variable, testing the association between the independent variable and the predicted variable of the target prediction model by using a preset chi-square test model; and in the case that the independent variable is not a category-type variable, testing the association between the independent variable and the predicted variable of the target prediction model by using a preset regression model.

Optionally, obtaining the independent variable for evaluating the target variable comprises: acquiring an attribute value of the independent variable; analyzing the attribute values of the independent variables by using a preset evaluation algorithm, and determining the predicted values of the independent variables, wherein the predicted values are used for expressing the coincidence degree of the predicted variables determined according to the independent variables and the target variables corresponding to the independent variables; and selecting the independent variable with the predicted value higher than a preset value threshold value as the independent variable for evaluating the target variable.

Optionally, analyzing the attribute values of the independent variables by using a preset evaluation algorithm, and determining the predicted values of the independent variables includes: identifying a variable type of the independent variable, wherein the variable type comprises at least: proportional-type variables, interval-type variables, category-type variables and binary variables; in the case that the independent variable is an interval-type variable, binning the independent variable into a plurality of interval variables; and analyzing the attribute value of each interval variable by using the preset evaluation algorithm to determine the predicted value of the interval variable.

Optionally, analyzing the attribute value of each interval variable by using the preset evaluation algorithm, and determining the predicted value of the interval variable includes: analyzing each interval variable by using a preset evidence weight algorithm, and determining an evidence weight of each interval variable, wherein the evidence weight is used for expressing the logarithm of the ratio of the good variable proportion and the bad variable proportion of each interval variable, the good variable proportion is the proportion of the good variable in each interval variable relative to the good variable in all interval variables, and the bad variable proportion is the proportion of the bad variable in each interval variable relative to the bad variable in all interval variables; and analyzing the attribute value of each interval variable by using the preset evaluation algorithm to determine the prediction value of the interval variable.

Optionally, analyzing the attribute value of each interval variable by using the preset evaluation algorithm, and determining the predicted value of the interval variable includes: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, determining the information value of each interval variable, and taking the information value as the predicted value, wherein the information value represents the capability of the independent variable in distinguishing events and non-events in the target variable; or analyzing the evidence weight of each interval variable by using a preset Gini index model, determining the Gini index of each interval variable, and taking the Gini index as the predicted value, wherein the Gini index is used for evaluating the impurity of the interval variable.

Optionally, analyzing the attribute value of each interval variable by using the preset evaluation algorithm, and determining the predicted value of the interval variable includes: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, and determining the information value of each interval variable, wherein the information value represents the capability of the independent variable in distinguishing events and non-events in the target variable; analyzing the evidence weight of each interval variable by using a preset Gini index model, and determining the Gini index of each interval variable, wherein the Gini index is used for evaluating the impurity of the interval variable; determining the product of the information value and a first preset weight as a first value; determining the product of the Gini index and a second preset weight as a second value; and determining the predicted value based on the sum of the first value and the second value.

According to another aspect of the embodiments of the present invention, there is also provided a variable screening apparatus, including: an acquisition module, configured to obtain independent variables for evaluating a target variable, wherein the target variable corresponds to one or more independent variables; an evaluation module, configured to evaluate the linear relationship between each independent variable and the target variable by using a preset Pearson correlation coefficient model and to determine a linear correlation degree; and a selection module, configured to select the independent variable with the highest linear correlation degree as a sample variable of the target variable in the case that the target variable corresponds to a plurality of independent variables whose linear correlation degrees are higher than a preset correlation threshold, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing independent variables to determine a predicted variable.

According to another aspect of the embodiments of the present invention, a nonvolatile storage medium is further provided, in which a program is stored, and when the program runs, a device where the nonvolatile storage medium is located is controlled to execute the variable screening method.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including: a memory and a processor, wherein the processor is configured to run a program stored in the memory, and the program executes the variable screening method when running.

In the embodiment of the invention, independent variables for evaluating a target variable are obtained, wherein the target variable corresponds to one or more independent variables; the linear relationship between each independent variable and the target variable is evaluated by using a preset Pearson correlation coefficient model, and a linear correlation degree is determined; and, in the case that the target variable corresponds to a plurality of independent variables whose linear correlation degrees are higher than a preset correlation threshold, the independent variable with the highest linear correlation degree is selected as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing independent variables to determine a predicted variable. The independent variables are thereby screened, which reduces the amount of training data required to train the target prediction model and improves the training efficiency of the target prediction model. This solves the technical problem that consumer portraits are determined inefficiently because the independent variables used to determine them cannot be screened.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of a variable screening method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a variable screening apparatus according to an embodiment of the present invention;

FIG. 3 is a block diagram of a computer terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In accordance with an embodiment of the present invention, there is provided a variable screening method embodiment, it being noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

Fig. 1 is a flowchart of a variable screening method according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:

Step S102, obtaining independent variables for evaluating a target variable, wherein the target variable corresponds to one or more independent variables;

Step S104, evaluating the linear relationship between each independent variable and the target variable by using a preset Pearson correlation coefficient model, and determining a linear correlation degree;

Step S106, in the case that the target variable corresponds to a plurality of independent variables whose linear correlation degrees are higher than a preset correlation threshold, selecting the independent variable with the highest linear correlation degree as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing independent variables to determine a predicted variable.

The variable screening method can be used in an electronic commerce environment, and the consumer portrait is used for representing the consumption behavior of the user in the electronic shopping process.

In the above step S102, the target variable may indicate "whether the user purchases a certain product", and the independent variable may indicate "the user's browsing history" or "the user's purchase history".

For example, if the user browses a certain product many times, this indicates that the user intends to purchase the product. Therefore, the number of times the user browses the product is used as an independent variable and whether the user purchases the product is used as the target variable; that is, whether the user purchases the product can be predicted from the number of times the user browses it.

For another example, if the user browses a certain product for a long time, this also indicates that the user intends to purchase the product. Therefore, the time the user spends browsing the product is used as an independent variable and whether the user purchases the product is used as the target variable; that is, whether the user purchases the product can be predicted from the user's browsing time.

For another example, if product B is a related product of product A and the user purchases product A, the probability that the user purchases product B is high. Therefore, the user's purchase of product A is used as an independent variable and whether the user purchases product B is used as the target variable; that is, whether the user purchases product B can be predicted from the user's purchase of product A.

In the above step S102, the target variable may also represent "whether the user commits credit fraud", and the independent variable may represent "the annual income of the user".

For example, credit fraud damages the user's credit. If the user has a high and stable annual income, credit fraud would have a large negative effect on the user's credit, and the gain from fraud would be out of proportion to its cost, so a user with a high and stable annual income is less likely to commit credit fraud. Therefore, the annual income of the user is used as an independent variable and whether the user commits credit fraud is used as the target variable; that is, whether the user commits credit fraud can be predicted from the user's annual income.

In step S104, the preset Pearson correlation coefficient model is mainly used to describe the linear relationship between two proportional-type variables, between two interval-type variables, or between a binary variable and an interval-type variable.

In step S104, the preset Pearson correlation coefficient model is as follows:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}$$

wherein $x$ represents the independent variable, $y$ represents the target variable, $\bar{x}$ and $\bar{y}$ are their sample means over the $n$ samples, and the correlation coefficient $r$ takes values in $[-1, +1]$. Empirically, different magnitudes of $r$ represent different degrees of linear correlation, with $|r|$ denoting the linear correlation degree: $|r| < 0.3$ represents a low degree of linear correlation; $0.3 \le |r| < 0.5$ represents a low-to-medium degree of linear correlation; $0.5 \le |r| < 0.8$ represents a moderate degree of linear correlation; and $0.8 \le |r| < 1$ represents a high degree of linear correlation.

Optionally, if $|r| > 0.6$ for multiple independent variables used to evaluate the target variable, only one of them needs to be retained.

Optionally, an independent variable whose linear correlation degree $|r|$ is greater than a preset correlation threshold may be selected as an independent variable for evaluating the target variable.
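Steps S104 and S106 can be illustrated with a minimal Python sketch. The function names, the toy data, and the use of 0.6 as the default threshold are assumptions made for illustration only; the embodiment does not prescribe an implementation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant variable is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def select_sample_variable(candidates, target, threshold=0.6):
    """Return (best_name, scores): the candidate with the highest |r| above
    the threshold (or None if no candidate exceeds it) and all |r| values."""
    scores = {name: abs(pearson_r(values, target))
              for name, values in candidates.items()}
    above = {name: s for name, s in scores.items() if s > threshold}
    best = max(above, key=above.get) if above else None
    return best, scores

# Hypothetical toy data: 1 means the user purchased the product.
target = [0, 0, 1, 1, 1, 0]
candidates = {
    "browse_count": [1, 2, 8, 9, 7, 2],
    "browse_minutes": [3, 1, 6, 9, 5, 4],
    "account_age_days": [10, 400, 30, 200, 90, 310],
}
best, scores = select_sample_variable(candidates, target)
print(best, scores)
```

In this toy data two candidates exceed the threshold, and only the one with the larger $|r|$ is kept as the sample variable.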

In the above step S106, the target prediction model is used to predict, from a known independent variable, a predicted variable related to that independent variable.

For example, if the user views a certain product many times, the user intends to purchase the product. Therefore, when the number of times the user views the product is used as the independent variable and is analyzed by the target prediction model to determine the predicted variable, the predicted variable can be expressed as the product to be purchased by the user.

Optionally, in the training phase of the target prediction model, model training is performed using the independent variables and the target variables as known training data; when the independent variables are analyzed using the trained target prediction model, the unknown predicted variables can be determined from the known independent variables.

Optionally, both the target variable and the predicted variable may be variables associated with the independent variable: if the associated variable has already occurred, it is the target variable; if the associated variable has not yet occurred, it is the predicted variable.

As an alternative embodiment, after selecting the independent variable with the highest linear correlation degree as the sample variable of the target variable, the method further includes: identifying a variable type of the independent variable, wherein the variable type comprises at least: proportional-type variables, interval-type variables, category-type variables and binary variables; in the case that the independent variable is a category-type variable, testing the association between the independent variable and the predicted variable of the target prediction model by using a preset chi-square test model; and in the case that the independent variable is not a category-type variable, testing the association between the independent variable and the predicted variable of the target prediction model by using a preset regression model.

Optionally, the preset chi-square test model is:

$$\chi^{2} = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{\left(f_{ij}^{o}-f_{ij}^{e}\right)^{2}}{f_{ij}^{e}}$$

wherein $f_{ij}^{o}$ represents the observed value of each cross-classification frequency (e.g., for the independent variable), $f_{ij}^{e}$ represents the expected value of each cross-classification frequency (e.g., for the target variable), the deviation of each observed cross-classification frequency from its expected value is $f_{ij}^{o}-f_{ij}^{e}$, and $R$ and $C$ are the numbers of rows and columns of the cross-classification table.

When the sample size is large, the $\chi^{2}$ statistic approximately follows a $\chi^{2}$ (chi-square) distribution with $(R-1)(C-1)$ degrees of freedom; the larger the $\chi^{2}$ value, the stronger the correlation between the independent variable and the target variable.

In the above embodiments of the present invention, the chi-square test is used to measure the correlation between categorical variables, including qualitative variables such as ordinal variables, and to compare two or more sample rates.
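As a rough illustration of the chi-square test described above, the following Python sketch computes the statistic and its degrees of freedom from a cross-classification table of observed counts. The function name and the example counts are hypothetical, and the expected frequencies are assumed to be derived from the row and column totals, as in the standard test of independence.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for an R x C table
    of observed counts (assumes every expected frequency is non-zero)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts: rows are two categories of an independent variable,
# columns are "event" and "non-event" of the target variable.
observed = [[30, 10], [20, 40]]
stat, dof = chi_square(observed)
print(stat, dof)
```

A larger statistic for the same degrees of freedom suggests a stronger association between the two categorical variables.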

Optionally, the preset regression model may be constructed based on an R-square method, and the independent variables having important predictive significance and value for the target variables are judged and selected by using an analysis algorithm of multiple linear regression through the R-square method.

Optionally, the preset regression model is:

$$R^{2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

wherein $R^{2}$ expresses the goodness of fit of the regression equation, $R^{2} \in (0, 1)$; the larger $R^{2}$ is, the better the regression equation fits the observed values of the samples (such as the independent variables). $R$ is also referred to as the sample multiple correlation coefficient of the target variable $Y$ with the independent variables $X_1, X_2, \ldots, X_p$, and it represents the linear relationship between $X_1, X_2, \ldots, X_p$ and $Y$ as a whole.

SSR denotes the regression sum of squares, SSE denotes the residual sum of squares, and SST denotes the total sum of squared deviations.
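The relationship between SSR, SSE, SST and the R-square value can be checked numerically. The sketch below is a simplification: it assumes a single independent variable fitted by ordinary least squares, whereas the embodiment refers to multiple linear regression; the function name and data are hypothetical.

```python
def r_squared(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return
    (R^2, SSR, SSE, SST)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]
    sst = sum((y - mean_y) ** 2 for y in ys)               # total sum of squares
    ssr = sum((f - mean_y) ** 2 for f in fitted)           # regression sum of squares
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # residual sum of squares
    return ssr / sst, ssr, sse, sst

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2, ssr, sse, sst = r_squared(xs, ys)
print(r2, ssr, sse, sst)
```

For a least-squares fit with an intercept, SSR + SSE equals SST, which is why the two forms of the R-square expression agree.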

In the above embodiment of the present invention, in the case that the independent variable is a category-type variable, the preset chi-square test model may be used to test the association between the independent variable and the predicted variable of the target prediction model; in the case that the independent variable is not a category-type variable, the preset regression model may be used instead. In this way, the association between independent variables of different types and the predicted variable can be tested with the model appropriate to each type, which ensures that the independent variables used for training the target prediction model are highly correlated with the target variable.

As an alternative embodiment, obtaining the independent variable for evaluating the target variable includes: acquiring an attribute value of an independent variable; analyzing the attribute values of the independent variables by using a preset evaluation algorithm, and determining the prediction values of the independent variables, wherein the prediction values are used for expressing the conformity degree of the prediction variables determined according to the independent variables and the target variables corresponding to the independent variables; and selecting the independent variable with the prediction value higher than the preset value threshold value as the independent variable for evaluating the target variable.

According to the embodiment of the invention, the attribute values of the independent variables are analyzed by using the preset evaluation algorithm, the prediction values of the independent variables can be determined, the independent variables with higher prediction values can be screened out from a plurality of independent variables used for evaluating the target variables according to the prediction values, the independent variables with higher prediction values can be used for training the target prediction model, and the accuracy of the trained target prediction model is ensured.

Optionally, independent variables with high prediction value are screened out by means of IV, WOE and Gini and are put into the target prediction model for training, so that more accurate analysis and prediction are provided for the potential value in data mining business scenarios.

As an alternative embodiment, the analyzing the attribute values of the independent variables by using a preset evaluation algorithm, and the determining the predicted values of the independent variables comprises: identifying a variable type of the argument, wherein the variable type comprises at least: proportional type variables, interval type variables, quantity category type variables and binary variables; under the condition that the independent variable belongs to the interval type variable, the independent variable is classified into a plurality of interval variables; and analyzing the attribute value of each interval variable by using a preset evaluation algorithm to determine the prediction value of the interval variable.

In the above embodiment of the present invention, when the independent variable is the interval type variable, the interval type variable may be binned into a plurality of interval variables in a binning manner, and the predictive value of the interval variable may be determined by using a preset evaluation algorithm, so that the predictive value of the interval type variable may be determined.

Optionally, in a scenario of "predicting whether a user is suspected of credit fraud in credit card usage", the target variable is "whether there is credit fraud", which is a binary variable (0, 1), where 0 represents no fraud and 1 represents fraud. Meanwhile, the independent variables include a field "the annual income of the user", which is an interval-type variable (Interval) in the original record of the data warehouse. If the evidence weight WOE and the information value IV are used as indices to judge whether this field is usable, that is, whether it is suitable to be put into the model as an independent variable for prediction, the interval-type variable "the annual income of the user" needs to be converted first so that it becomes a category-type variable (ordinal variable). For example, it is binned into a category-type variable with 4 intervals: less than 20000 yuan, [20000, 60000), [60000, 100000), and more than 100000 yuan, 4 categories in total. These 4 intervals are also called the 4 attribute values of the independent variable "the annual income of the user", and for each attribute value the evidence weight WOE in the sample data can be calculated.
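The binning of "the annual income of the user" into the 4 intervals above might look like the following Python sketch. The names are hypothetical, and assigning exactly 100000 yuan to the last interval is an assumption, since the text only says "more than 100000 yuan" while the preceding interval excludes 100000.

```python
from bisect import bisect_right

# Interval edges taken from the example; labels are illustrative.
EDGES = [20000, 60000, 100000]
LABELS = ["<20000", "[20000, 60000)", "[60000, 100000)", ">=100000"]

def income_bin(income):
    """Map an annual income to one of the 4 ordinal intervals.
    bisect_right places a value equal to an edge in the interval
    that starts at that edge (left-closed, right-open intervals)."""
    return LABELS[bisect_right(EDGES, income)]

# Hypothetical incomes.
incomes = [15000, 20000, 59999, 60000, 250000]
binned = [income_bin(v) for v in incomes]
print(binned)
```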

As an alternative embodiment, analyzing the attribute value of each interval variable by using a preset evaluation algorithm, and determining the predicted value of the interval variable includes: analyzing each interval variable by using a preset evidence weight algorithm, and determining an evidence weight of each interval variable, wherein the evidence weight is used for expressing the logarithm of the ratio of the good variable proportion and the bad variable proportion of the interval variable, the good variable proportion is the proportion of the good variable in each interval variable relative to the good variable in all the interval variables, and the bad variable proportion is the proportion of the bad variable in each interval variable relative to the bad variable in all the interval variables; and analyzing the attribute value of each interval variable by using a preset evaluation algorithm to determine the prediction value of the interval variable.

Optionally, the preset evidence weight algorithm is:

$$WOE_i = \ln\left(\frac{N_i^{event}/N^{event}}{N_i^{nonevent}/N^{nonevent}}\right)$$

wherein $N_i^{event}$ is the number of good variables in the $i$-th interval variable, $N^{event}$ is the number of good variables in all interval variables, $N_i^{nonevent}$ is the number of bad variables in the $i$-th interval variable, $N^{nonevent}$ is the number of bad variables in all interval variables, $N_i^{event}/N^{event}$ is the good variable proportion, and $N_i^{nonevent}/N^{nonevent}$ is the bad variable proportion.

Optionally, the good variable is a predicted event and the bad variable is a non-predicted event.
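A minimal Python sketch of the evidence weight calculation follows, assuming the natural logarithm and assuming that no interval has a zero event or non-event count. The function name and the counts are hypothetical.

```python
import math

def weight_of_evidence(bins):
    """bins maps an interval label to (event_count, nonevent_count).
    Returns the WOE of each interval: the log of the ratio between the
    interval's share of all events and its share of all non-events."""
    total_event = sum(e for e, _ in bins.values())
    total_nonevent = sum(n for _, n in bins.values())
    woe = {}
    for label, (event, nonevent) in bins.items():
        p_event = event / total_event            # good variable proportion
        p_nonevent = nonevent / total_nonevent   # bad variable proportion
        woe[label] = math.log(p_event / p_nonevent)
    return woe

# Hypothetical counts per annual-income interval: (fraud, no fraud).
bins = {
    "<20000": (40, 160),
    "[20000, 60000)": (30, 270),
    "[60000, 100000)": (20, 280),
    ">=100000": (10, 290),
}
woe = weight_of_evidence(bins)
print(woe)
```

A positive WOE means the interval holds a larger share of events than of non-events, and a negative WOE means the opposite.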

As an alternative embodiment, analyzing the attribute value of each interval variable by using a preset evaluation algorithm, and determining the predicted value of the interval variable includes: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, determining the information value of each interval variable, and taking the information value as a prediction value, wherein the information value represents the capability of an independent variable in distinguishing events and non-events in a target variable; or analyzing the evidence weight of each interval variable by using a preset Gini index model, determining the Gini index of each interval variable, and taking the Gini index as a prediction value, wherein the Gini index is used for evaluating the impurity degree of the interval variable.

According to the embodiment of the invention, the information value of each interval variable can be determined by using the preset information value evaluation model, the Gini index of each interval variable can be determined by using the preset Gini index model, and one of the preset information value evaluation model and the preset Gini index model is selected for determining the prediction value of the interval variable, so that the value evaluation of the interval variable is realized.

Optionally, a preset information value evaluation model is used for measuring the prediction capability of each variable on the target variable y, and is used for screening the independent variables.

Optionally, the preset information value evaluation model is as follows:
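As an illustration, assuming the commonly used definition $IV = \sum_i (p_i^{event} - p_i^{nonevent}) \cdot WOE_i$, where $p_i^{event}$ and $p_i^{nonevent}$ are the good and bad variable proportions of the i-th interval variable, the information value could be computed as in the following Python sketch. The function name `information_value`, the input layout and the smoothing constant `eps` are hypothetical.

```python
import math

def information_value(bins, eps=1e-9):
    """Sum (good_ratio - bad_ratio) * WOE over all bins.

    bins: list of (event_count, nonevent_count) pairs; both totals are
          assumed to be non-zero.
    eps:  hypothetical smoothing constant that avoids log(0).
    """
    total_event = sum(e for e, _ in bins)
    total_nonevent = sum(n for _, n in bins)
    iv = 0.0
    for event, nonevent in bins:
        good_ratio = event / total_event
        bad_ratio = nonevent / total_nonevent
        woe = math.log((good_ratio + eps) / (bad_ratio + eps))
        iv += (good_ratio - bad_ratio) * woe
    return iv

# Made-up counts: the first variable does not separate events from non-events.
iv_uninformative = information_value([(10, 20), (20, 40)])
iv_informative = information_value([(10, 40), (30, 30), (60, 30)])
```

Under this definition, a variable whose bins all have equal good and bad proportions scores 0, and the score grows as the bins separate events from non-events more strongly.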

Optionally, the preset Gini index model is:

Optionally, the Gini index represents the probability that a sample randomly selected from the sample set is misclassified. The smaller the Gini index, the lower the probability that a selected sample is misclassified, that is, the higher the purity of the set; conversely, the larger the Gini index, the less pure the set. When all samples in the set belong to one class, the Gini index is 0.
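The behaviour described above matches the usual Gini impurity, $Gini = 1 - \sum_k p_k^2$, where $p_k$ is the share of class $k$ in the set. Assuming that definition (the function name `gini_index` is hypothetical), a short Python sketch is:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini_index(["event"] * 5)               # all samples in one class
mixed = gini_index(["event", "nonevent"] * 5)  # two evenly split classes
```

A set containing a single class yields 0, and an even split between two classes yields 0.5, the maximum for two classes.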

In the above embodiment of the present invention, when the information value IV, the evidence weight WOE, and the Gini index are used as indexes, the following objectives can be achieved in data mining practice:

1. The optimal binning threshold is adjusted according to the change in the evidence weight WOE. The general method is to first divide an interval type variable into 10-20 temporary intervals and calculate the WOE value of each, then merge intervals according to the trend of the WOE across the intervals, and finally arrive at a reasonable interval division.

2. Independent variables with higher prediction value are screened by the information value IV or the Gini index score and put into model training.
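The binning adjustment in item 1 might be sketched as follows. This is only one possible reading: the text says that temporary intervals are combined according to the WOE trend but gives no concrete rule, so the merge criterion used here (merge adjacent intervals whose WOE values differ by less than a tolerance `tol`) and all names are assumptions.

```python
import math

def woe(event, nonevent, total_event, total_nonevent, eps=1e-9):
    """Evidence weight of one interval (natural log; eps avoids log(0))."""
    return math.log((event / total_event + eps) / (nonevent / total_nonevent + eps))

def merge_bins_by_woe(bins, tol=0.1):
    """Greedily merge adjacent (event, nonevent) bins with similar WOE.

    bins: ordered list of (event_count, nonevent_count) for the temporary intervals.
    tol:  hypothetical tolerance on the WOE difference between neighbours.
    """
    total_event = sum(e for e, _ in bins)
    total_nonevent = sum(n for _, n in bins)
    merged = [bins[0]]
    for event, nonevent in bins[1:]:
        last_event, last_nonevent = merged[-1]
        last_woe = woe(last_event, last_nonevent, total_event, total_nonevent)
        this_woe = woe(event, nonevent, total_event, total_nonevent)
        if abs(this_woe - last_woe) < tol:
            # Similar WOE: fold this interval into the previous one.
            merged[-1] = (last_event + event, last_nonevent + nonevent)
        else:
            merged.append((event, nonevent))
    return merged

# Made-up counts for five temporary intervals.
temporary = [(5, 20), (5, 20), (10, 10), (40, 25), (40, 25)]
final_bins = merge_bins_by_woe(temporary)
```

In practice 10-20 temporary intervals would be used; five are shown only to keep the example readable.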

As an alternative embodiment, analyzing the attribute value of each interval variable by using a preset evaluation algorithm, and determining the prediction value of the interval variable includes: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, and determining the information value of each interval variable, wherein the information value represents the capability of an independent variable for distinguishing events and non-events in a target variable; analyzing the evidence weight of each interval variable by using a preset Gini index model, and determining the Gini index of each interval variable, wherein the Gini index is used for evaluating the impurity degree of the interval variable; determining the product of the information value and a first preset weight as a first value; determining the product of the Gini index and a second preset weight as a second value; and determining the prediction value based on the sum of the first value and the second value.

According to the embodiment of the invention, the information value of each interval variable can be determined by using the preset information value evaluation model, the Gini index of each interval variable can be determined by using the preset Gini index model, and the prediction value is determined according to the preset information value evaluation model and the preset Gini index model together, so that the value evaluation of the interval variables is realized.
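A minimal Python sketch of this joint evaluation follows, assuming that the first value is the information value multiplied by the first preset weight and the second value is the Gini index multiplied by the second preset weight, so that both indexes contribute to the result. The function name and the default weights are hypothetical placeholders, not values taken from the text.

```python
def combined_prediction_value(information_value, gini_index,
                              first_weight=0.5, second_weight=0.5):
    """Weighted sum of the two scores; the default weights are placeholders."""
    first_value = information_value * first_weight
    second_value = gini_index * second_weight
    return first_value + second_value

# Made-up scores and weights for one interval variable.
score = combined_prediction_value(0.6, 0.4, first_weight=0.7, second_weight=0.3)
```

The two indexes are on different scales, so the weights would need to be chosen with that in mind.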

The method of the invention combines AI with linear correlation indexes as a data screening scheme. The Pearson correlation coefficient is mainly used for describing linear relationships between proportional variables, between interval variables, and between binary variables and interval variables. Drawing on the analysis algorithm of multiple linear regression, the R-squared method is used to judge and select independent variables that have important predictive significance and value for the target variable. The chi-square test is used to measure the correlation between categorical variables, including qualitative variables such as ordinal variables, and to compare two or more sample rates. Finally, independent variables with higher prediction value are screened by IV, WOE and Gini from the variables to be put into the model, and are put into the model for training. More accurate analysis and prediction are thereby provided for potential value in commercial data mining scenarios.
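The Pearson screening step can be illustrated with the Python sketch below. It assumes that the linear correlation degree is the absolute value of the Pearson coefficient and that the preset threshold is 0.5; both choices, the function names and the sample data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no linear relation to report
    return cov / math.sqrt(var_x * var_y)

def select_sample_variable(candidates, target, threshold=0.5):
    """Return the name of the candidate most correlated with target, or None.

    candidates: dict mapping a variable name to its list of values.
    threshold:  hypothetical preset correlation threshold applied to |r|.
    """
    scores = {name: abs(pearson_r(values, target))
              for name, values in candidates.items()}
    above = {name: r for name, r in scores.items() if r > threshold}
    return max(above, key=above.get) if above else None

# Made-up illustrative data.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "monthly_spend": [2.0, 4.1, 5.9, 8.2, 9.9],  # nearly linear in target
    "visits":        [1.0, 3.0, 2.0, 5.0, 4.0],  # correlated, but less so
    "noise":         [3.0, 1.0, 4.0, 1.0, 3.0],  # no linear relation
}
chosen = select_sample_variable(candidates, target)
```

Here two candidates exceed the threshold, and only the one with the highest correlation degree is kept as the sample variable.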

According to the embodiment of the present invention, an embodiment of a variable screening apparatus is further provided, and it should be noted that the variable screening apparatus may be configured to execute the variable screening method in the embodiment of the present invention, and the variable screening method in the embodiment of the present invention may be executed in the variable screening apparatus.

Fig. 2 is a schematic diagram of a variable screening apparatus according to an embodiment of the present invention. As shown in Fig. 2, the apparatus may include: an obtaining module 22, configured to obtain independent variables used for evaluating a target variable, where the target variable corresponds to one or more independent variables; an evaluation module 24, configured to evaluate the linear relationship between the independent variable and the target variable by using a preset Pearson correlation coefficient model, and determine a linear correlation degree; and a selecting module 26, configured to select, when the target variable corresponds to multiple independent variables whose linear correlation degrees are higher than a preset correlation degree threshold, the independent variable with the highest linear correlation degree as a sample variable of the target variable, where the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing the independent variables to determine the prediction variable.

It should be noted that the obtaining module 22 in this embodiment may be configured to execute step S102 in this embodiment, the evaluation module 24 in this embodiment may be configured to execute step S104 in this embodiment, and the selecting module 26 in this embodiment may be configured to execute step S106 in this embodiment. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the above embodiments.

In the embodiment of the invention, independent variables used for evaluating a target variable are obtained, wherein the target variable corresponds to one or more independent variables; the linear relation between the independent variable and the target variable is evaluated by using a preset Pearson correlation coefficient model, and the linear correlation degree is determined; and when the target variable corresponds to a plurality of independent variables with linear correlation degrees higher than a preset correlation degree threshold value, the independent variable with the highest linear correlation degree is selected as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing the independent variables to determine the prediction variable. The independent variables are thereby screened, which reduces the amount of training data required to train the target prediction model, achieves the technical effect of improving the training efficiency of the target prediction model, and solves the technical problem that consumer portraits are determined inefficiently because the independent variables used to determine them cannot be screened.

As an alternative embodiment, the apparatus further comprises: an identification module, configured to identify the variable type of the independent variable after the independent variable with the highest linear correlation degree is selected as the sample variable of the target variable, wherein the variable types include at least: proportional type variables, interval type variables, quantity category type variables and binary variables; a first checking module, configured to check the relevance between the independent variable and the prediction variable of the target prediction model by using a preset chi-square test model when the independent variable is a quantity category type variable; and a second checking module, configured to check the relevance between the independent variable and the prediction variable of the target prediction model by using a preset regression model when the independent variable is not a quantity category type variable.
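The type-dependent check performed by the two checking modules could look like the Python sketch below. It is a simplified stand-in under stated assumptions: the chi-square test is represented by the Pearson chi-square statistic of a contingency table (without a p-value), the regression model by the R-squared of a one-variable least-squares fit, and the type label `"category"` and all function names are hypothetical. Expected cell counts are assumed to be non-zero and the numeric series non-constant.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

def r_squared(xs, ys):
    """Coefficient of determination of a one-variable least-squares fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def check_relevance(variable_type, data):
    """Dispatch on the variable type; the type names are hypothetical labels."""
    if variable_type == "category":
        return "chi_square", chi_square_statistic(data)
    xs, ys = data
    return "r_squared", r_squared(xs, ys)

# Made-up inputs: a table with no association, and a perfectly linear pair.
independent = check_relevance("category", [[10, 20], [20, 40]])
linear = check_relevance("interval", ([1, 2, 3, 4], [2, 4, 6, 8]))
```

A chi-square statistic of 0 indicates that the observed counts equal those expected under independence, while an R-squared of 1 indicates a perfect linear fit.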

As an alternative embodiment, the obtaining module includes: an acquisition unit, configured to acquire an attribute value of an independent variable; a determining unit, configured to analyze the attribute values of the independent variables by using a preset evaluation algorithm and determine the prediction values of the independent variables, wherein the prediction value expresses the degree of conformity between the prediction variable determined according to the independent variable and the target variable corresponding to the independent variable; and a selecting unit, configured to select the independent variables whose prediction value is higher than a preset value threshold as the independent variables for evaluating the target variable.
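The selecting unit amounts to a threshold filter, sketched below in Python. The function name, the threshold of 0.1 and the example scores are hypothetical.

```python
def select_by_prediction_value(prediction_values, value_threshold=0.1):
    """Keep variables whose prediction value exceeds the preset threshold.

    prediction_values: dict mapping a variable name to its score (e.g. an IV).
    value_threshold:   hypothetical preset value threshold.
    """
    return [name for name, value in prediction_values.items()
            if value > value_threshold]

scores = {"age": 0.35, "region": 0.02, "tenure": 0.18}  # made-up scores
kept = select_by_prediction_value(scores)
```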

As an alternative embodiment, the determining unit includes: the identification unit is used for identifying the variable type of the independent variable, wherein the variable type at least comprises the following components: proportional type variables, interval type variables, quantity category type variables and binary variables; a binning unit for binning the independent variable into a plurality of interval variables when the independent variable belongs to the interval type variable; and the analysis unit is used for analyzing the attribute value of each interval variable by using a preset evaluation algorithm and determining the prediction value of the interval variable.
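The binning unit could be sketched in Python as follows. Equal-width binning is used only as an assumption, since the text does not fix a binning rule; the function name, the number of bins and the sample data are likewise hypothetical. The output is one `(event_count, nonevent_count)` pair per interval variable, which is the kind of input an evidence weight calculation needs.

```python
def bin_counts(values, is_event, n_bins=10):
    """Equal-width binning; returns a list of (event_count, nonevent_count).

    values:   numeric values of the interval type variable.
    is_event: booleans, True where the target is an event.
    n_bins:   hypothetical number of intervals.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant variable
    counts = [[0, 0] for _ in range(n_bins)]
    for value, event in zip(values, is_event):
        # The maximum value would land one past the end, so clamp it.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index][0 if event else 1] += 1
    return [tuple(c) for c in counts]

# Made-up data: customer ages and whether a purchase (the event) occurred.
ages = [18, 22, 25, 31, 38, 44, 52, 60]
bought = [False, False, True, False, True, True, True, True]
binned = bin_counts(ages, bought, n_bins=3)
```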

As an alternative embodiment, the analysis unit comprises: the first analysis subunit is used for analyzing each interval variable by using a preset evidence weight algorithm and determining the evidence weight of each interval variable, wherein the evidence weight is used for expressing the logarithm of the ratio of the good variable proportion and the bad variable proportion of the interval variable, the good variable proportion is the proportion of the good variable in each interval variable relative to the good variable in all the interval variables, and the bad variable proportion is the proportion of the bad variable in each interval variable relative to the bad variable in all the interval variables; and the second analysis subunit is used for analyzing the attribute value of each interval variable by using a preset evaluation algorithm and determining the prediction value of the interval variable.

As an alternative embodiment, the second analysis subunit comprises: the third analysis subunit is used for analyzing the evidence weight of each interval variable by using a preset information value evaluation model, determining the information value of each interval variable, and taking the information value as the prediction value, wherein the information value represents the capability of the independent variable in distinguishing events and non-events in the target variable; or the fourth analysis subunit is used for analyzing the evidence weight of each interval variable by using a preset Gini index model, determining the Gini index of each interval variable and taking the Gini index as the prediction value, wherein the Gini index is used for evaluating the impurity degree of the interval variable.

As an alternative embodiment, the second analysis subunit comprises: a fifth analysis subunit, configured to analyze the evidence weight of each interval variable by using a preset information value evaluation model and determine the information value of each interval variable, wherein the information value represents the capability of the independent variable in distinguishing events and non-events in the target variable; a sixth analysis subunit, configured to analyze the evidence weight of each interval variable by using a preset Gini index model and determine the Gini index of each interval variable, wherein the Gini index is used for evaluating the impurity degree of the interval variable; a first determining subunit, configured to determine the product of the information value and a first preset weight as a first value; a second determining subunit, configured to determine the product of the Gini index and a second preset weight as a second value; and a third determining subunit, configured to determine the prediction value based on the sum of the first value and the second value.

The embodiment of the invention can provide a computer terminal which can be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.

In this embodiment, the computer terminal may execute the program code of the following steps in the variable screening method: obtaining independent variables for evaluating target variables, wherein the target variables correspond to one or more independent variables;

evaluating the linear relation between the independent variable and the target variable by using a preset Pearson correlation coefficient model, and determining the linear correlation degree;

and under the condition that the target variable corresponds to a plurality of independent variables with linear correlation degrees higher than a preset correlation degree threshold value, selecting the independent variable with the highest linear correlation degree as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing and determining the prediction variable for the independent variables.

Optionally, Fig. 3 is a block diagram of a computer terminal according to an embodiment of the present invention. As shown in Fig. 3, the computer terminal 30 may include: one or more (only one shown) processors 32, and a memory 34.

The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the variable screening method and apparatus in the embodiments of the present invention, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the variable screening method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the terminal 30 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: obtaining independent variables for evaluating target variables, wherein the target variables correspond to one or more independent variables; evaluating the linear relation between the independent variable and the target variable by using a preset Pearson correlation coefficient model, and determining the linear correlation degree; and under the condition that the target variable corresponds to a plurality of independent variables with linear correlation degrees higher than a preset correlation degree threshold value, selecting the independent variable with the highest linear correlation degree as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing and determining the prediction variable for the independent variables.

Optionally, the processor may further execute the program code of the following steps: after selecting the independent variable with the highest linear correlation degree as a sample variable of a target variable, identifying the variable type of the independent variable, wherein the variable types include at least: proportional type variables, interval type variables, quantity category type variables and binary variables; when the independent variable is a quantity category type variable, checking the relevance between the independent variable and the prediction variable of the target prediction model by using a preset chi-square test model; and when the independent variable is not a quantity category type variable, checking the relevance between the independent variable and the prediction variable of the target prediction model by using a preset regression model.

Optionally, the processor may further execute the program code of the following steps: acquiring an attribute value of an independent variable; analyzing the attribute values of the independent variables by using a preset evaluation algorithm, and determining the prediction values of the independent variables, wherein the prediction values are used for expressing the conformity degree of the prediction variables determined according to the independent variables and the target variables corresponding to the independent variables; and selecting the independent variable with the prediction value higher than the preset value threshold value as the independent variable for evaluating the target variable.

Optionally, the processor may further execute the program code of the following steps: identifying the variable type of the independent variable, wherein the variable types include at least: proportional type variables, interval type variables, quantity category type variables and binary variables; when the independent variable is an interval type variable, binning the independent variable into a plurality of interval variables; and analyzing the attribute value of each interval variable by using a preset evaluation algorithm to determine the prediction value of the interval variable.

Optionally, the processor may further execute the program code of the following steps: analyzing each interval variable by using a preset evidence weight algorithm, and determining an evidence weight of each interval variable, wherein the evidence weight is used for expressing the logarithm of the ratio of the good variable ratio and the bad variable ratio of the interval variable, the good variable ratio is the ratio of the good variable in each interval variable relative to the good variable in all interval variables, and the bad variable ratio is the ratio of the bad variable in each interval variable relative to the bad variable in all interval variables; and analyzing the attribute value of each interval variable by using a preset evaluation algorithm to determine the prediction value of the interval variable.

Optionally, the processor may further execute the program code of the following steps: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, determining the information value of each interval variable, and taking the information value as the prediction value, wherein the information value represents the capability of an independent variable in distinguishing events and non-events in a target variable; or analyzing the evidence weight of each interval variable by using a preset Gini index model, determining the Gini index of each interval variable, and taking the Gini index as the prediction value, wherein the Gini index is used for evaluating the impurity degree of the interval variable.

Optionally, the processor may further execute the program code of the following steps: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, and determining the information value of each interval variable, wherein the information value represents the capability of an independent variable for distinguishing events and non-events in a target variable; analyzing the evidence weight of each interval variable by using a preset Gini index model, and determining the Gini index of each interval variable, wherein the Gini index is used for evaluating the impurity degree of the interval variable; determining the product of the information value and a first preset weight as a first value; determining the product of the Gini index and a second preset weight as a second value; and determining the prediction value based on the sum of the first value and the second value.

The embodiment of the invention provides a variable screening scheme. Independent variables used for evaluating a target variable are obtained, wherein the target variable corresponds to one or more independent variables; the linear relation between the independent variable and the target variable is evaluated by using a preset Pearson correlation coefficient model, and the linear correlation degree is determined; and when the target variable corresponds to a plurality of independent variables with linear correlation degrees higher than a preset correlation degree threshold value, the independent variable with the highest linear correlation degree is selected as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing the independent variables to determine the prediction variable. The independent variables are thereby screened, which reduces the amount of training data required to train the target prediction model, achieves the technical effect of improving the training efficiency of the target prediction model, and solves the technical problem that consumer portraits are determined inefficiently because the independent variables used to determine them cannot be screened.

It can be understood by those skilled in the art that the structure shown in Fig. 3 is only illustrative, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 3 does not limit the structure of the electronic device. For example, the computer terminal 30 may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in Fig. 3, or have a different configuration than shown in Fig. 3.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store the program code executed by the variable screening method provided in the foregoing embodiment.

Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining independent variables for evaluating target variables, wherein the target variables correspond to one or more independent variables; evaluating the linear relation between the independent variable and the target variable by using a preset Pearson correlation coefficient model, and determining the linear correlation degree; and under the condition that the target variable corresponds to a plurality of independent variables with linear correlation degrees higher than a preset correlation degree threshold value, selecting the independent variable with the highest linear correlation degree as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing and determining the prediction variable for the independent variables.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: after selecting the independent variable with the highest linear correlation degree as a sample variable of a target variable, identifying the variable type of the independent variable, wherein the variable types include at least: proportional type variables, interval type variables, quantity category type variables and binary variables; when the independent variable is a quantity category type variable, checking the relevance between the independent variable and the prediction variable of the target prediction model by using a preset chi-square test model; and when the independent variable is not a quantity category type variable, checking the relevance between the independent variable and the prediction variable of the target prediction model by using a preset regression model.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring an attribute value of an independent variable; analyzing the attribute values of the independent variables by using a preset evaluation algorithm, and determining the prediction values of the independent variables, wherein the prediction values are used for expressing the conformity degree of the prediction variables determined according to the independent variables and the target variables corresponding to the independent variables; and selecting the independent variable with the prediction value higher than the preset value threshold value as the independent variable for evaluating the target variable.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: identifying the variable type of the independent variable, wherein the variable types include at least: proportional type variables, interval type variables, quantity category type variables and binary variables; when the independent variable is an interval type variable, binning the independent variable into a plurality of interval variables; and analyzing the attribute value of each interval variable by using a preset evaluation algorithm to determine the prediction value of the interval variable.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: analyzing each interval variable by using a preset evidence weight algorithm, and determining an evidence weight of each interval variable, wherein the evidence weight is used for expressing the logarithm of the ratio of the good variable ratio and the bad variable ratio of the interval variable, the good variable ratio is the ratio of the good variable in each interval variable relative to the good variable in all interval variables, and the bad variable ratio is the ratio of the bad variable in each interval variable relative to the bad variable in all interval variables; and analyzing the attribute value of each interval variable by using a preset evaluation algorithm to determine the prediction value of the interval variable.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, determining the information value of each interval variable, and taking the information value as the prediction value, wherein the information value represents the capability of an independent variable in distinguishing events and non-events in a target variable; or analyzing the evidence weight of each interval variable by using a preset Gini index model, determining the Gini index of each interval variable, and taking the Gini index as the prediction value, wherein the Gini index is used for evaluating the impurity degree of the interval variable.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: analyzing the evidence weight of each interval variable by using a preset information value evaluation model, and determining the information value of each interval variable, wherein the information value represents the capability of an independent variable for distinguishing events and non-events in a target variable; analyzing the evidence weight of each interval variable by using a preset Gini index model, and determining the Gini index of each interval variable, wherein the Gini index is used for evaluating the impurity degree of the interval variable; determining the product of the information value and a first preset weight as a first value; determining the product of the Gini index and a second preset weight as a second value; and determining the prediction value based on the sum of the first value and the second value.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A variable screening method, comprising:

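obtaining independent variables for evaluating target variables, wherein the target variables correspond to one or more independent variables;

evaluating the linear relation between the independent variable and the target variable by using a preset Pearson correlation coefficient model, and determining the linear correlation degree;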

and under the condition that the target variable corresponds to a plurality of independent variables of which the linear correlation degrees are higher than a preset correlation degree threshold value, selecting the independent variable with the highest linear correlation degree as a sample variable of the target variable, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing and determining a prediction variable for the independent variable.
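
The screening step of claim 1 can be illustrated with the following minimal sketch, which is an assumption-laden example rather than the claimed implementation. It assumes that the linear correlation degree is the absolute value of the Pearson coefficient, that a constant column is treated as having zero correlation, and that a single passing variable is returned as-is. The function names, the dictionary layout and the threshold of 0.5 are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined for a constant column; treated as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_sample_variable(candidates, target, threshold=0.5):
    """Return the name of the candidate most correlated with the target.

    candidates: dict mapping a variable name to its list of observations.
    Only candidates whose absolute correlation exceeds the threshold are
    considered; None is returned when no candidate passes.
    """
    scores = {name: abs(pearson(values, target))
              for name, values in candidates.items()}
    passing = {name: score for name, score in scores.items() if score > threshold}
    if not passing:
        return None
    return max(passing, key=passing.get)


if __name__ == "__main__":
    target = [2, 4, 6, 8, 10]
    candidates = {
        "age": [1, 2, 3, 4, 5],
        "visits": [2, 1, 4, 3, 5],
        "noise": [1, -1, 1, -1, 1],
    }
    print(select_sample_variable(candidates, target))
```

In this toy data both "age" and "visits" exceed the threshold, so the variable with the higher correlation degree, "age", is kept as the sample variable.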

2. The method of claim 1, wherein after selecting the independent variable with the highest linear correlation as the sample variable of the target variable, the method further comprises:

identifying a variable type of the independent variable, wherein the variable type comprises at least: proportional-type variables, interval-type variables, quantity-category-type variables and binary variables;

in the case that the independent variable belongs to the quantity-category-type variable, checking the association between the independent variable and the prediction variable of the target prediction model by using a preset chi-square test model;

and in the case that the independent variable does not belong to the quantity-category-type variable, checking the association between the independent variable and the prediction variable of the target prediction model by using a preset regression model.
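
For the categorical branch of claim 2, a chi-square test of association can be sketched as below. This is only an illustrative example: it computes the Pearson chi-square statistic and the degrees of freedom from paired observations and leaves the comparison against a critical value to the caller. The function name and the input layout are hypothetical, and the claim does not specify how the preset chi-square test model is configured.

```python
from collections import Counter


def chi_square_statistic(categories, outcomes):
    """Chi-square statistic and degrees of freedom for two paired sequences.

    categories: category label of the independent variable per observation.
    outcomes: value of the prediction variable per observation.
    """
    n = len(categories)
    observed = Counter(zip(categories, outcomes))
    row_totals = Counter(categories)
    col_totals = Counter(outcomes)
    statistic = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            # Expected count under the hypothesis of independence.
            expected = row_n * col_n / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


if __name__ == "__main__":
    cats = ["a"] * 20 + ["b"] * 20
    outs = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15
    print(chi_square_statistic(cats, outs))
```

With one degree of freedom, a statistic above 3.841 indicates an association at the 0.05 significance level, so the toy data in the usage example would be judged associated.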

3. The method of claim 1, wherein obtaining independent variables for evaluating target variables comprises:

acquiring an attribute value of the independent variable;

analyzing the attribute values of the independent variables by using a preset evaluation algorithm, and determining the prediction values of the independent variables, wherein the prediction values are used for representing the degree of conformity between the prediction variables determined according to the independent variables and the target variables corresponding to the independent variables;

and selecting the independent variable whose prediction value is higher than a preset value threshold as the independent variable for evaluating the target variable.

4. The method of claim 3, wherein analyzing the attribute values of the independent variables by using a preset evaluation algorithm and determining the prediction values of the independent variables comprises:

in the case that the independent variable belongs to the interval-type variable, dividing the independent variable into a plurality of interval variables;

and analyzing the attribute value of each interval variable by using the preset evaluation algorithm to determine the prediction value of the interval variable.

5. The method of claim 4, wherein analyzing the attribute value of each interval variable by using the preset evaluation algorithm and determining the prediction value of the interval variable comprises:

analyzing each interval variable by using a preset evidence weight algorithm, and determining an evidence weight of each interval variable, wherein the evidence weight expresses the logarithm of the ratio of the good variable proportion to the bad variable proportion of each interval variable, the good variable proportion is the proportion of the good variables in each interval variable relative to the good variables in all interval variables, and the bad variable proportion is the proportion of the bad variables in each interval variable relative to the bad variables in all interval variables;
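
The division into interval variables (claim 4) and the evidence weight of claim 5 can be sketched together as follows. This is a minimal example under assumptions: observations labeled 1 are taken to be "good" and those labeled 0 "bad", the cut points are supplied by the caller, and an interval containing no good or no bad observations is reported as undefined (None) rather than smoothed. The function name and the input layout are hypothetical.

```python
import math
from bisect import bisect_right


def weight_of_evidence(values, labels, edges):
    """Evidence weight per interval.

    values: numeric observations of an interval-type independent variable.
    labels: 1 for a good observation, 0 for a bad one (an assumed encoding).
    edges: sorted interior cut points; len(edges) + 1 intervals are produced.
    """
    n_bins = len(edges) + 1
    good = [0] * n_bins
    bad = [0] * n_bins
    for value, label in zip(values, labels):
        index = bisect_right(edges, value)  # interval that the value falls into
        if label == 1:
            good[index] += 1
        else:
            bad[index] += 1
    total_good = sum(good)
    total_bad = sum(bad)
    weights = []
    for g, b in zip(good, bad):
        if g == 0 or b == 0:
            weights.append(None)  # logarithm undefined for an empty class
        else:
            # log of (good proportion / bad proportion) for this interval
            weights.append(math.log((g / total_good) / (b / total_bad)))
    return weights


if __name__ == "__main__":
    values = [10, 12, 15, 20, 25, 30, 35, 40]
    labels = [1, 1, 1, 0, 1, 0, 0, 0]
    print(weight_of_evidence(values, labels, edges=[22]))
```

A positive evidence weight means the interval holds a larger share of the good observations than of the bad ones, and a negative weight means the opposite.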

6. The method of claim 5, wherein analyzing the attribute value of each interval variable by using the preset evaluation algorithm and determining the prediction value of the interval variable comprises:

analyzing the evidence weight of each interval variable by using a preset information value evaluation model, determining the information value of each interval variable, and taking the information value as the prediction value, wherein the information value represents the ability of the independent variable to distinguish events from non-events in the target variable; or

analyzing the evidence weight of each interval variable by using a preset Gini index model, determining the Gini index of each interval variable, and taking the Gini index as the prediction value, wherein the Gini index is used for evaluating the purity of the interval variable.

7. The method of claim 5, wherein analyzing the attribute value of each interval variable by using the preset evaluation algorithm and determining the prediction value of the interval variable comprises:

analyzing the evidence weight of each interval variable by using a preset information value evaluation model, and determining the information value of each interval variable, wherein the information value represents the capability of the independent variable in distinguishing events and non-events in the target variable;

analyzing the evidence weight of each interval variable by using a preset Gini index model, and determining the Gini index of each interval variable, wherein the Gini index is used for evaluating the purity of the interval variable;

determining a product of the information value and a first preset weight as a first value;

determining a product of the information value and a second preset weight as a second value;

and determining the prediction value based on a sum of the first value and the second value.

8. A variable screening apparatus, comprising:

an acquisition module, configured to obtain independent variables for evaluating target variables, wherein the target variables correspond to one or more independent variables;

an evaluation module, configured to evaluate the linear relation between the independent variable and the target variable by using a preset Pearson correlation coefficient model, and to determine the linear correlation degree;

and a selection module, configured to select the independent variable with the highest linear correlation degree as a sample variable of the target variable in the case that the target variable corresponds to a plurality of independent variables whose linear correlation degrees are higher than a preset correlation degree threshold, wherein the sample variable and the target variable are used as training data for training a target prediction model, and the target prediction model is used for analyzing the independent variable and determining a prediction variable.

9. A non-volatile storage medium, wherein a program is stored in the non-volatile storage medium, and when the program runs, a device in which the non-volatile storage medium is located is controlled to execute the variable screening method according to any one of claims 1 to 7.

10. An electronic device, comprising: a memory and a processor for executing a program stored in the memory, wherein the program when executed performs the variable screening method of any one of claims 1 to 7.