CN111654853A - Data analysis method based on user information - Google Patents


Info

Publication number
CN111654853A
Authority
CN
China
Prior art keywords
variables
variable
value
model
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010769479.0A
Other languages
Chinese (zh)
Other versions
CN111654853B (en)
Inventor
邵俊
蔺静茹
张磊
曹新建
支磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Suoxinda Data Technology Co ltd
Soxinda Beijing Data Technology Co ltd
Original Assignee
Shenzhen Suoxinda Data Technology Co ltd
Soxinda Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Suoxinda Data Technology Co ltd, Soxinda Beijing Data Technology Co ltd filed Critical Shenzhen Suoxinda Data Technology Co ltd
Priority to CN202010769479.0A priority Critical patent/CN111654853B/en
Publication of CN111654853A publication Critical patent/CN111654853A/en
Application granted granted Critical
Publication of CN111654853B publication Critical patent/CN111654853B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W8/00Network data management
    • H04W8/18Processing of user or subscriber data, e.g. subscribed services, user preferences or user profiles; Transfer of user or subscriber data
    • H04W8/183Processing at user equipment or user record carrier
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2115Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/56Allocation or scheduling criteria for wireless resources based on priority criteria
    • H04W72/566Allocation or scheduling criteria for wireless resources based on priority criteria of the information or information source or recipient

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data analysis method and system based on user information. The method comprises the following steps: receiving user information; converting and summarizing the user information into a user big data set; randomly dividing the user big data set into two sets, namely a first set and a second set; performing binning-related processing on the first set to obtain a third set; constructing a first model by a factor analysis method based on the third set; and validating the first model based on the second set. Compared with the prior art, the method uses factor analysis to eliminate collinearity while preserving precision as far as possible. It avoids the loss of important variables and of precision that occurs when, in order to eliminate collinearity, only the single most representative variable in a cluster (for example, the variable most correlated with the principal component) is retained, and thereby improves the accuracy of data analysis.

Description

Data analysis method based on user information
Technical Field
The invention belongs to the field of big data analysis and data mining, and particularly relates to a data analysis method and system based on user information.
Background
With the development of mobile communication technology, the variety of mobile communication services keeps increasing and the demand for communication resources is growing rapidly. Currently available wireless communication resources are limited, however. How to allocate resources reasonably among multiple users and multiple services and improve the utilization efficiency of wireless resources is a current research focus and difficulty in the field of mobile communication, and a key problem in wireless resource scheduling is determining the priority of users.
Determining user priority is a multi-objective problem in which the constraints of several objectives, such as fairness of resource use among users, wireless resource utilization efficiency, system throughput and quality of service, must be considered together. Current methods for judging user priority consider only the technical requirements or only the service requirements and do not fully consider the factors relating to the user, so the resulting decisions on resource use and allocation are one-sided. Regression analysis is a statistical method for determining the quantitative relationship of interdependence between two or more variables, and it is very widely applied. According to the number of variables involved, regression analysis is divided into univariate regression analysis and multiple regression analysis; according to the number of independent variables, it can be divided into simple regression analysis and multiple regression analysis; according to the type of relationship between the independent variables and the dependent variable, it can be divided into linear regression analysis and nonlinear regression analysis. If a regression analysis includes only one independent variable and one dependent variable and the relationship between them can be approximated by a straight line, it is called a univariate linear regression analysis. If the regression analysis includes two or more independent variables and the relationship between the dependent variable and the independent variables is linear, it is called a multiple linear regression analysis.
Chinese patent ZL201510881058.6 provides an optimization analysis method for eliminating the collinearity problem of regression data in a complex system; in essence it is a method of successively screening variables based on principal component analysis. Each time a principal component is calculated, the method selects the variable most correlated with it, removes the other variables that are highly correlated with that principal component, and then calculates the next principal component. Although this selects variables, the method may have two drawbacks: the selected variables may not contribute much to the model; and, when variables are eliminated, the judgment of what counts as highly correlated is strongly subjective, so important variables are easily lost. Because the selected variables are not typical and important variables are lost, the data analysis of the system is inaccurate and its credibility is low. Therefore, how to classify, sort and model the massive data obtained rapidly and efficiently, and to extract the valuable or relevant data that meets preset conditions, is a technical problem in the field of big data analysis and data mining.
Disclosure of Invention
In view of the above-mentioned drawbacks in the prior art, an object of the present invention is to provide a method and a system for effectively improving mining accuracy based on user information.
In order to achieve the above object, the present invention provides a data analysis method based on user information, comprising the steps of:
receiving user information;
converting and summarizing the user information into a user big data set;
randomly dividing the user big data set into two sets, wherein the two sets comprise a first set and a second set, the first set is stored in a first database, and the second set is stored in a second database;
performing binning-related processing on the first set in the first database to obtain a third set, and storing the third set in a third database;
extracting a third set in the third database, and constructing a first model by adopting a factor analysis method based on the third set;
extracting a second set in the second database, and verifying the first model based on the second set;
wherein the constructing the first model based on the third set by using a factor analysis method specifically comprises:
carrying out variable clustering by using a factor analysis method;
performing a first variable screening on the variables in each class so that the number of remaining variables is not greater than a first threshold;
and performing a second variable screening on the variables remaining after the first screening through multiple iterations of backward elimination, until a preset condition is met.
Wherein randomly dividing the user big data set into two sets specifically comprises:
combining all the information of the users into a wide table;
and randomly dividing the wide table into two sets according to a certain proportion.
Wherein the first set is a training set used for modeling and model parameter estimation, and the second set is a test set used for model evaluation.
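As a rough illustration of the random split described above, the following Python sketch shuffles the rows of a wide table and cuts it at a given proportion. The default of 0.7 mirrors the 7:3 ratio used in the embodiments; the function name, the seed and the row layout are assumptions made for the example and are not part of the described method.

```python
import random

def split_wide_table(rows, train_ratio=0.7, seed=0):
    """Randomly split rows into (training set, test set)."""
    shuffled = list(rows)                      # copy so the caller's table is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed only to make the example repeatable
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical wide table: one row per user, with the label Y
wide_table = [{"user_id": i, "Y": i % 2} for i in range(10)]
train, test = split_wide_table(wide_table)
print(len(train), len(test))
```

The first returned list plays the role of the first set (training) and the second that of the second set (test).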
Wherein the first model is a logistic regression model.
Wherein performing binning-related processing on the first set in the first database to obtain a third set specifically comprises:
binning the data of the first set;
and performing a WOE transformation on each bin to obtain its WOE value, thereby obtaining the third set.
Wherein the factor analysis method specifically comprises:
given the feature vectors of N candidate variables, calculating a covariance matrix of the feature vectors, wherein the covariance matrix is an N × N matrix M whose entry $M_{ij}$ in row i and column j is the covariance of $X_i$ and $X_j$;
calculating the N eigenvalues and eigenvectors of the covariance matrix M;
denoting the N eigenvalues, in descending order, as $\lambda_1, \lambda_2, \dots, \lambda_N$, and denoting the N normalized eigenvectors corresponding to the eigenvalues so ordered as $v_1, v_2, \dots, v_N$.
Wherein the user big data set is obtained by the service provider, with the user's authorization, in response to a user request.
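Wherein a first threshold is obtained based on the factor analysis method, the first threshold being determined by
$$k = \min\Bigl\{ m : \frac{\sum_{i=1}^{m}\lambda_i}{\sum_{i=1}^{N}\lambda_i} > 0.75 \Bigr\}$$
which means that the sum of the first k largest eigenvalues accounts for more than 0.75 of the sum of all N eigenvalues.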
Wherein performing a first variable screening on the variables in each class so that the number of remaining variables is not greater than a first threshold specifically comprises:
the number of classes obtained by variable clustering is k, and the first threshold is 2k;
selecting two variables for each of the k classes, one with the highest IV (information value) and the other with the highest R² value; a high IV means that the variable contributes more to the model result, and a high R² value means that the variable is the most representative within its cluster.
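The first screening can be sketched as follows. The sketch assumes that the IV and R² values of every variable have already been computed elsewhere; the cluster contents and all numeric values are hypothetical. When one variable has both the highest IV and the highest R² in its class it is kept only once, which is why the number of remaining variables can be smaller than 2k but never larger.

```python
def first_screening(clusters, iv, r2):
    """Keep, per class, the variable with the highest IV and the one with the highest R^2."""
    selected = []
    for members in clusters.values():
        by_iv = max(members, key=lambda v: iv[v])   # contributes most to the model result
        by_r2 = max(members, key=lambda v: r2[v])   # most representative within the class
        for v in (by_iv, by_r2):
            if v not in selected:                   # the same variable may win both criteria
                selected.append(v)
    return selected

# Hypothetical classes and scores, for illustration only
clusters = {1: ["X_1", "X_2", "X_3"], 2: ["X_4", "X_5"]}
iv = {"X_1": 0.30, "X_2": 0.12, "X_3": 0.05, "X_4": 0.20, "X_5": 0.08}
r2 = {"X_1": 0.60, "X_2": 0.85, "X_3": 0.40, "X_4": 0.90, "X_5": 0.70}

remaining = first_screening(clusters, iv, r2)
print(remaining)  # ['X_1', 'X_2', 'X_4']
```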
Wherein performing a second variable screening on the variables remaining after the first screening through multiple iterations of backward elimination until a preset condition is met specifically comprises:
if the VIF value of any candidate variable is greater than 4, eliminating the variable with the highest p-value;
eliminating variables whose p-values are greater than a specified value;
and repeating the above steps until the p-values of all variables are less than the specified value and the VIFs of all variables are less than 4.
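The loop structure of this second screening can be sketched as below. This is a simplified reading, not a definitive implementation: it removes one variable per iteration (the one with the highest p-value) whenever either condition is violated, and it takes the model refit and the VIF computation as caller-supplied functions. The names `fit_p_values` and `compute_vifs`, the default of 0.05 for the specified value, and the stub data are assumptions for the example.

```python
def backward_elimination(variables, fit_p_values, compute_vifs, p_max=0.05, vif_max=4.0):
    """Drop the highest-p variable until all p-values and all VIFs are acceptable."""
    kept = list(variables)
    while kept:
        p = fit_p_values(kept)      # {variable: p-value} after refitting on `kept`
        vif = compute_vifs(kept)    # {variable: VIF} among `kept`
        worst = max(kept, key=lambda v: p[v])
        if any(vif[v] > vif_max for v in kept) or p[worst] > p_max:
            kept.remove(worst)      # one variable per iteration, then refit
            continue
        break                       # preset condition met
    return kept

# Hypothetical stubs standing in for a real model fit
P = {"X_1": 0.01, "X_2": 0.03, "X_3": 0.20, "X_4": 0.001}

def fake_p(kept):
    return {v: P[v] for v in kept}

def fake_vif(kept):
    # X_1 and X_2 are collinear only while both are still in the model
    collinear = {"X_1", "X_2"} <= set(kept)
    return {v: (6.0 if collinear and v in ("X_1", "X_2") else 1.2) for v in kept}

result = backward_elimination(["X_1", "X_2", "X_3", "X_4"], fake_p, fake_vif)
print(result)  # ['X_1', 'X_4']
```

In the stub run, X_3 is removed first (highest p-value while a VIF exceeds 4), then X_2, after which both conditions hold.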
Compared with the prior art, the data analysis system provided by the invention digitally processes the user information and converts it into data in the specific format of the system, and its modeling module uses factor analysis to eliminate collinearity while preserving precision as far as possible. This avoids the loss of important variables and of precision that occurs when, in order to eliminate collinearity, only the single most representative variable in a cluster (for example, the one most correlated with the principal component) is retained, and so improves the accuracy of data analysis.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a flow chart illustrating a method for data analysis based on user information in accordance with an embodiment of the present invention;
FIG. 2 is a flow diagram illustrating building a first model according to an embodiment of the invention;
FIG. 3 is a flow diagram illustrating a method of improving the accuracy of data analysis according to an embodiment of the invention;
FIG. 4 is a flow diagram illustrating a method of estimating a load matrix according to an embodiment of the invention;
FIG. 5 is a flow diagram illustrating logistic regression modeling according to an embodiment of the invention;
FIG. 6 is a schematic diagram illustrating a load matrix according to an embodiment of the invention;
FIG. 7 is a block diagram illustrating a system architecture for improving data analysis accuracy based on user information, according to an embodiment of the present invention; and
FIG. 8 is a schematic diagram showing an electronic apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are used only to distinguish … …. For example, the first … … can also be referred to as the second … … and similarly the second … … can also be referred to as the first … … without departing from the scope of embodiments of the present invention.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the article or device in which the element is included.
Alternative embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Take the communications industry as an example. The method of this embodiment first selects, from both the technical and the service perspective, user information that may influence the user's service authorization priority, thereby determining the object of the subsequent work. It then preprocesses the index data to provide complete and reliable data resources for the subsequent work. Finally, it screens the users by a factor analysis method. The aim is to take massive mobile communication field data as the basis and data mining technology as the means, make full use of the advantages of massive data, comprehensively consider the various influencing factors, and reasonably determine the priority with which users enjoy the service, so as to improve the utilization efficiency of wireless resources and the system throughput. The user information influencing the user's service authorization priority includes: channel quality indication, maximum user transmission rate, historical average user transmission rate, user packet loss rate, user delay, transmission rate required by the user, quality-of-service parameter identifier, total amount of allocable system resources, historical average system throughput, total user charges, the user's current service type, the amount of resources required by the user at the current moment and/or the progress of the current service, and basic user information (such as age, occupation and income). The invention is not limited to the communications field and can be applied to other industries such as medical care, health and finance.
Example one
As shown in fig. 1, the present invention discloses a data analysis method based on user information, which comprises the following steps:
receiving user information;
converting and summarizing the user information into a user big data set;
randomly dividing the user big data set into two sets, wherein the two sets comprise a first set and a second set, the first set is stored in a first database, and the second set is stored in a second database;
performing binning-related processing on the first set in the first database to obtain a third set, and storing the third set in a third database;
extracting a third set in the third database, and constructing a first model by adopting a factor analysis method based on the third set;
extracting a second set in the second database, and verifying the first model based on the second set;
Referring to FIG. 2, constructing the first model by a factor analysis method based on the third set specifically comprises:
carrying out variable clustering by using a factor analysis method;
performing a first variable screening on the variables in each class so that the number of remaining variables is not greater than a first threshold;
and performing a second variable screening on the variables remaining after the first screening through multiple iterations of backward elimination, until a preset condition is met.
Example two
On the basis of the first embodiment, the present embodiment further includes the following contents:
Personal information of the user is collected through a computer or a network, and an evaluation model is built from the collected information to quantify whether the user is a potential user of the value-added service and whether there is risk.
This is typically measured with a logistic regression model. Logistic regression is a supervised binary classification model. A series of collected features about the user (such as education level) are binned and transformed by WOE (Weight of Evidence; the transformation is given in formula 1), the transformed values are linearly summed, and the Sigmoid transformation, the mapping f(x) = 1/(1 + exp(-x)), is applied to the sum to obtain a value between 0 and 1. This value represents the predicted probability that the user is credible, and whether to authorize the corresponding operation is decided according to it.
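The scoring step can be sketched in a few lines of Python. The Sigmoid mapping is the one given in the text; the weights, the intercept and the WOE inputs are hypothetical values chosen for illustration, since in practice they would come from fitting the logistic regression on the training set.

```python
import math

def sigmoid(x):
    """The mapping f(x) = 1 / (1 + exp(-x)) described in the text."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_probability(woe_values, weights, intercept):
    """Linearly sum the WOE-transformed features, then squash the sum into (0, 1)."""
    linear_sum = intercept + sum(w * v for w, v in zip(weights, woe_values))
    return sigmoid(linear_sum)

# Hypothetical WOE values of two features and hypothetical fitted coefficients
probability = predict_probability([0.4, -0.2], weights=[1.0, 0.5], intercept=-0.3)
print(round(probability, 3))
```

A decision rule such as authorizing the operation only when the probability exceeds a chosen cut-off would follow, but the text does not specify the cut-off.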
More specifically, referring to FIG. 3, the process from receiving a user's service application to predicting the probability that the user is credible and deciding whether to authorize the service can be written as:
Step 1, after the user grants authorization, receiving hundreds of features related to the user, and summarizing all the features into a user big data set.
Step 2, combining the previously accumulated user features and the corresponding labels into a wide table, and randomly dividing the wide table into a training set and a test set at a ratio of 7:3. The label indicates whether the user is over-weighted; it is a binary result recorded as the dependent variable Y, where Y = 1 means the user is over-weighted and Y = 0 means the user is not. The training set is used for modeling and model parameter estimation, and the test set is used for model evaluation.
Step 3, in the training set, binning the numerical and text variables among the user feature variables, and applying the WOE (Weight of Evidence) transformation to each bin to convert it into a WOE value. The purpose of binning is:
1) to convert the values of text variables, which cannot be used in calculation, into numerical values that can;
2) to increase the stability of the model and prevent small perturbations of a value from causing large changes in the model result.
Suppose a variable X is divided into three bins x, y and z. The WOE value of bin x is calculated as:
$$\mathrm{WOE}(X=x)=\ln\frac{\#\{Y=1,\,X=x\}/\#\{Y=1\}}{\#\{Y=0,\,X=x\}/\#\{Y=0\}} \qquad (1)$$
where $\#\{A\}$ denotes the number of samples satisfying condition A, $\#\{A, B\}$ denotes the number of samples satisfying both conditions A and B, and ln is the natural logarithm.
Step 4, performing logistic regression modeling on the WOE-transformed training set, and performing model evaluation.
In this step, each variable contributes differently to the model, and many variables may be strongly correlated with one another. Entering these strongly correlated variables into the model at the same time makes it impossible to complete the estimation of the model parameters; this phenomenon is called the collinearity problem of the model. The decomposition of step 4 is described in detail below.
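The WOE transformation of step 3 (formula 1) can be illustrated with a small Python sketch. The bin names x, y and z follow the text; the sample data are hypothetical, and the sketch assumes that every bin contains at least one sample with Y = 1 and one with Y = 0, since otherwise the logarithm is undefined.

```python
import math
from collections import Counter

def woe_by_bin(bins, labels):
    """Return {bin: WOE} per formula (1): ln((#{Y=1,X=b}/#{Y=1}) / (#{Y=0,X=b}/#{Y=0}))."""
    pos = Counter(b for b, y in zip(bins, labels) if y == 1)
    neg = Counter(b for b, y in zip(bins, labels) if y == 0)
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    # Assumes each bin holds both classes; empty counts would need smoothing
    return {
        b: math.log((pos[b] / total_pos) / (neg[b] / total_neg))
        for b in set(bins)
    }

# Hypothetical training data: 4 samples in each of the bins x, y, z
bins = ["x"] * 4 + ["y"] * 4 + ["z"] * 4
labels = [1, 0, 0, 0,  1, 1, 0, 0,  1, 1, 1, 0]

woe = woe_by_bin(bins, labels)
print({b: round(w, 3) for b, w in sorted(woe.items())})
```

With these counts, bin x holds 1 of the 6 positives and 3 of the 6 negatives, so its WOE is ln(1/3); bin y is balanced and has WOE 0; bin z has WOE ln(3).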
Example three
On the basis of the second embodiment, the present embodiment further includes the following contents:
After the variables of the logistic regression model enter the final regression stage, the validity of the model is generally judged by two indexes: the p-value and the VIF (variance inflation factor). The p-value reflects the significance of a single variable: the larger the p-value, the lower the significance, and if the p-value is greater than 0.05 the variable is considered not significant and should be removed from the model. The VIF reflects the degree of collinearity of a variable: the higher the VIF, the stronger the collinearity. Generally, if the VIF is greater than 4, collinearity is considered to exist in the model and the variables need to be adjusted.
The VIF is the collinearity coefficient of the model, given by
$\mathrm{VIF} = 1/(1 - R^2)$, where R is the multiple correlation coefficient obtained by regressing one independent variable on the remaining independent variables.
The p-value is the significance that logistic regression characterizes with the z-statistic, i.e.,
$p = \Pr(|s| > |z|)$, where s follows the standard normal distribution and Pr denotes probability, i.e. the probability that |s| > |z|.
If the p-value is greater than 0.05, the variable is considered not significant and should be removed from the model.
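For illustration, the two-sided probability Pr(|s| > |z|) for a standard normal s can be evaluated with the complementary error function from Python's standard library. The z value used below is only an example input, not a figure from the patent.

```python
import math

def two_sided_p_value(z):
    """Pr(|s| > |z|) for s ~ N(0, 1), computed as erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

z = 1.96                      # example z-statistic of a regression coefficient
p = two_sided_p_value(z)
keep_variable = p <= 0.05     # variables with p > 0.05 are removed from the model
print(round(p, 4), keep_variable)
```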
To facilitate understanding of the collinearity coefficient, the multiple correlation coefficient and the significance mentioned above, each is described in detail below.
For a variable $X_i$ used in the invention, the relationship between its VIF value and the multiple correlation coefficient is as follows:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where the multiple correlation coefficient is the square root of $R_i^2$. The larger the multiple correlation coefficient, the larger $R_i^2$, and therefore the larger the collinearity coefficient of the variable. That is, $X_i$ is strongly correlated with the other variables, which can make it impossible to obtain stable parameter estimates during model training.
The multiple correlation coefficient of $X_i$ with respect to the other variables has the following specific meaning: among all the independent variables, take $X_i$ as the dependent variable and all the other variables $X_j$ ($j \neq i$) as independent variables, fit a linear regression model, and take the square root of its coefficient of determination $R^2$. In a linear regression model, let y be the dependent variable and X the independent variables; then
$$R^2 = \frac{\sum_m (\hat{y}_m - \bar{y})^2}{\sum_m (y_m - \bar{y})^2}$$
where $\bar{y}$ is the sample mean and $\hat{y}_m$ is the estimate of $y_m$ given by the linear model. The equation gives the share of the total variation in y that can be explained by the linear model, the remaining unexplained share being attributed to random perturbation caused by sampling. The larger the value, the better y is explained by the model and the stronger the correlation between y and the independent variables. In the context of the present invention, the above calculation is made with $X_i$ as the dependent variable y and the $X_j$ ($j \neq i$) as the independent variables.
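The calculation just described can be sketched with NumPy: each variable in turn is regressed on all the others, R² is taken as the share of variation explained by that regression, and VIF = 1/(1 − R²). The use of ordinary least squares with an intercept and the synthetic data (in which the third column is nearly a copy of the first) are assumptions made for the example.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress X[:, i] on the remaining columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # ordinary least squares fit
    y_hat = design @ coef
    r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic data: c is almost a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

In this construction the first and third columns should show VIF values far above the threshold of 4, while the independent second column stays close to 1.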
In addition, the significance mentioned above is an index of whether the null hypothesis should be rejected in statistical hypothesis testing. For example:
let H0 (the null hypothesis) be that the coefficient of variable X is 0, so that X has no explanatory power for the model result, i.e. X should not enter the model;
let H1 (the alternative hypothesis) be that the coefficient of variable X is not 0, i.e. X should enter the model;
the p-value is the probability, assuming H0 holds, of observing a statistic at least as extreme as the one actually obtained. If the p-value is greater than the set significance level of 0.05, there is considered to be insufficient reason to reject the null hypothesis, i.e. X should not enter the model. The larger the p-value, the more likely it is that the variable's apparent contribution to the model is due solely to sampling error, and the more the variable should be removed from the model.
Example four
On the basis of the third embodiment, before the decomposition of step 4, the following principle of factor analysis is introduced in this embodiment:
Suppose there are N candidate variables $X_1, X_2, \dots, X_N$ that need to be factor-analyzed. The factor analysis method assumes that there are k common factors $F_1, F_2, \dots, F_k$ such that each original variable can be written as a linear combination of these k common factors plus one specific factor. That is, any variable $X_i$ can be written as $X_i = a_{i1}F_1 + a_{i2}F_2 + \dots + a_{ik}F_k + \varepsilon_i$
where $\varepsilon_i$ is the specific factor of $X_i$ and the coefficients $a_{i1}, a_{i2}, \dots, a_{ik}$ are called loadings. Taken over all i in [1, N], they form a matrix A of size N × k, called the load matrix.
The load matrix may be estimated by the principal component method, the principal factor method or the maximum likelihood method, which are not discussed in detail here.
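For reference, the factor model can be written compactly in matrix form. The symbol ε for the specific factors and the diagonal matrix Ψ are notational assumptions. The second line holds only under the usual orthogonal factor model assumptions (uncorrelated, unit-variance common factors and mutually uncorrelated specific factors), which the text does not state explicitly.

```latex
\begin{aligned}
\mathbf{X} &= A\,\mathbf{F} + \boldsymbol{\varepsilon},
  \qquad A = (a_{ij}) \in \mathbb{R}^{N \times k},\quad
  \mathbf{F} = (F_1,\dots,F_k)^{\mathsf T},\quad
  \boldsymbol{\varepsilon} = (\varepsilon_1,\dots,\varepsilon_N)^{\mathsf T} \\
\operatorname{Cov}(\mathbf{X}) &= A A^{\mathsf T} + \Psi,
  \qquad \Psi = \operatorname{diag}\bigl(\operatorname{Var}(\varepsilon_1),\dots,\operatorname{Var}(\varepsilon_N)\bigr)
\end{aligned}
```

Under those assumptions the covariance matrix of the variables is largely determined by the load matrix, which is why the load matrix can be estimated from an eigen-decomposition of the covariance matrix.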
Example five
On the basis of the fourth embodiment, the present embodiment further includes the following contents:
Referring to FIG. 4, the invention estimates the load matrix by the principal component method, specifically as follows.
In the following expressions, a subscript such as the i in $X_i$ is an index, $\sum$ denotes a sum, and $\sum_{i=1}^{N}$ denotes the sum over i from 1 to N.
Constructing the factor analysis model involves estimating the number k of common factors and estimating the load matrix. The invention uses the principal component method to estimate both.
Given the original feature vectors of the N candidate variables, a covariance matrix is calculated. The covariance matrix is an N × N matrix M whose entry $M_{ij}$, in row i and column j, is the covariance of $X_i$ and $X_j$.
The N eigenvalues and eigenvectors of the covariance matrix M are calculated. The N eigenvalues, in descending order, are denoted $\lambda_1, \lambda_2, \dots, \lambda_N$, and the N normalized eigenvectors corresponding to the eigenvalues so ordered are denoted $v_1, v_2, \dots, v_N$;
wherein the number of common factors k is given by
[Formula image not reproduced in the source text: definition of k]
That is, the present invention selects the smallest k such that the sum of the first k largest eigenvalues is greater than 0.75.
The load matrix estimated using the principal component method is as follows:
[Formula image not reproduced in the source text: load matrix A]
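The estimation described in this embodiment can be sketched with NumPy as follows. Two points are assumptions rather than statements of the original: the 0.75 criterion is applied to the cumulative share of the total eigenvalue sum, and the load matrix uses the standard principal component solution in which column j equals $\sqrt{\lambda_j}\,v_j$. The function name `estimate_loadings` and the simulated data are hypothetical.

```python
import numpy as np


def estimate_loadings(X, threshold=0.75):
    """Estimate k and the N x k load matrix by the principal component method.

    X has shape (n_samples, N), one column per candidate variable.
    Assumption: the threshold is compared with the cumulative share of the
    total eigenvalue sum.
    """
    M = np.cov(X, rowvar=False)                # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(M)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative round-off
    share = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.argmax(share > threshold)) + 1  # smallest k whose share exceeds threshold
    A = eigvecs[:, :k] * np.sqrt(eigvals[:k])  # column j is sqrt(lambda_j) * v_j
    return k, A, eigvals


rng = np.random.default_rng(0)
# Hypothetical data: 6 variables driven by 2 latent factors plus small noise.
F = rng.normal(size=(500, 2))
W = rng.normal(size=(2, 6))
X = F @ W + 0.1 * rng.normal(size=(500, 6))
k, A, eigvals = estimate_loadings(X)
print(k, A.shape)
```

Because the eigenvectors are normalized, the squared norm of column j of `A` equals the j-th eigenvalue, which gives a quick sanity check on the result.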
Example six
Referring to fig. 5, on the basis of the fifth embodiment, the present embodiment returns to step 4 and proposes the following sub-steps for it (N = 10 and k = 3 are assumed below):
Step 4.1: after the WOE transformation, there are 10 candidate variables in total. Performing factor analysis on all of these candidate variables yields 3 common factors, so the load matrix has size 10 × 3. Assume that the solved load matrix is as shown in fig. 6.
Step 4.2: each variable is assigned to a class according to the common factor on which it has the largest load matrix coefficient.
As shown in fig. 6, the first row, enclosed by a horizontal frame, gives the coefficients of variable $X_1$ on the three common factors: $a_{11} = 0.82$, $a_{12} = 0.13$ and $a_{13} = 0.22$. The largest of these, $a_{11} = 0.82$, is the coefficient on the first common factor $F_1$, so $X_1$ is assigned to the first class. The vertical frames in fig. 6 mark the position of the largest coefficient for each of the 10 variables, i.e., the group to which each variable is assigned. The 10 variables are thus divided into 3 groups: $X_1$, $X_2$ and $X_3$ form the first group; $X_4$, $X_5$, $X_6$ and $X_7$ form the second group; and $X_8$, $X_9$ and $X_{10}$ form the third group.
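The assignment rule of step 4.2 can be sketched as follows. Only the first row of the matrix (0.82, 0.13, 0.22) comes from the text; all other entries are made-up values chosen to reproduce the grouping described above, and `group_by_largest_loading` is a hypothetical name.

```python
import numpy as np

# Hypothetical 10 x 3 load matrix; only the first row is taken from the text.
A = np.array([
    [0.82, 0.13, 0.22],
    [0.75, 0.20, 0.10],
    [0.68, 0.31, 0.05],
    [0.12, 0.79, 0.18],
    [0.25, 0.71, 0.09],
    [0.08, 0.66, 0.30],
    [0.19, 0.83, 0.14],
    [0.11, 0.24, 0.77],
    [0.30, 0.15, 0.69],
    [0.05, 0.28, 0.81],
])


def group_by_largest_loading(A):
    """Map each class (1-based common factor index) to the names of its variables."""
    best = np.argmax(A, axis=1)  # position of the largest coefficient in each row
    groups = {}
    for i, j in enumerate(best):
        groups.setdefault(int(j) + 1, []).append(f"X_{i + 1}")
    return groups


groups = group_by_largest_loading(A)
print(groups)
```

The sketch compares raw coefficients, as the text does. If a load matrix contained negative coefficients, comparing absolute values would be a natural variant, but that is not stated in the original.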
Step 4.3: in each of the k classes (3 classes in this example), two variables are selected: the variable with the highest IV value and the variable with the highest R2 value. A high IV value means that the variable contributes more to the model result, and a high R2 value means that the variable is the most representative in its cluster. Here, a high contribution means that the variable has a large influence on the probability value output by the model; for simplicity, this influence is taken to mean that the correlation between the variable and the output probability value is the largest. Being representative means having the largest Pearson correlation coefficient with the principal component of the cluster. The IV value is calculated as follows:
$\mathrm{IV} = \sum_{x} \left( \frac{\#\{Y=1,\, X=x\}}{\#\{Y=1\}} - \frac{\#\{Y=0,\, X=x\}}{\#\{Y=0\}} \right) \cdot \mathrm{WOE}(X=x)$
where #{A} denotes a count, i.e., the number of samples satisfying condition A, and #{A, B} denotes the number of samples satisfying both conditions A and B.
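The IV formula can be illustrated with a short computation. The WOE definition used here, the logarithm of the ratio between the share of Y=1 samples and the share of Y=0 samples in a bin, is the common convention and is an assumption of this sketch, as are the function name `woe_iv` and the toy data.

```python
import math
from collections import Counter


def woe_iv(x_bins, y):
    """Return ({bin: WOE}, IV) for a binned variable x_bins and binary labels y.

    Assumes every bin contains at least one sample of each class; otherwise
    the logarithm is undefined and some smoothing would be needed.
    """
    pos = Counter(b for b, t in zip(x_bins, y) if t == 1)  # #{Y=1, X=x}
    neg = Counter(b for b, t in zip(x_bins, y) if t == 0)  # #{Y=0, X=x}
    n_pos, n_neg = sum(pos.values()), sum(neg.values())    # #{Y=1}, #{Y=0}
    woe, iv = {}, 0.0
    for b in sorted(set(x_bins)):
        p1 = pos[b] / n_pos
        p0 = neg[b] / n_neg
        woe[b] = math.log(p1 / p0)   # assumed WOE convention
        iv += (p1 - p0) * woe[b]     # one term of the IV sum
    return woe, iv


# Hypothetical toy data with two bins.
x_bins = ["low"] * 4 + ["high"] * 4
y = [1, 0, 0, 0, 1, 1, 1, 0]
woe, iv = woe_iv(x_bins, y)
print(woe, iv)
```

With this convention each term of the sum is non-negative, because the difference of shares and the WOE of a bin always have the same sign, so IV grows with how strongly the bins separate the two classes.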
R2 measures how representative a variable is within its cluster. It is obtained by squaring the Pearson correlation coefficient between the variable and the first principal component of the cluster to which the variable belongs.
The two variables selected for a class may be the same variable. At most 2k variables are therefore retained, and all other variables are rejected.
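The first screening of step 4.3 can be sketched as follows. The IV values are supplied as given numbers, R2 is computed as the squared Pearson correlation with the first principal component of each class, and all names (`r2_with_first_pc`, `first_screening`) and data are hypothetical.

```python
import numpy as np


def r2_with_first_pc(Xg):
    """Squared Pearson correlation of each column of Xg with the group's first principal component."""
    Xc = Xg - Xg.mean(axis=0)
    if Xc.shape[1] == 1:
        return np.array([1.0])  # a single variable is its own principal component
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    pc1 = Xc @ vecs[:, -1]      # eigh sorts ascending, so the last column is the top eigenvector
    return np.array([np.corrcoef(Xc[:, j], pc1)[0, 1] ** 2 for j in range(Xc.shape[1])])


def first_screening(X, groups, iv):
    """Keep, per class, the variable with the highest IV and the one with the highest R2."""
    kept = []
    for cols in groups.values():
        r2 = r2_with_first_pc(X[:, cols])
        best_iv = max(cols, key=lambda c: iv[c])
        best_r2 = cols[int(np.argmax(r2))]
        for c in (best_iv, best_r2):
            if c not in kept:   # the two picks may coincide
                kept.append(c)
    return kept


rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(2, 300))
X = np.column_stack([
    f1 + 0.1 * rng.normal(size=300),
    f1 + 0.5 * rng.normal(size=300),
    f1 + 1.0 * rng.normal(size=300),
    f2 + 0.2 * rng.normal(size=300),
    f2 + 0.8 * rng.normal(size=300),
])
groups = {1: [0, 1, 2], 2: [3, 4]}                  # class -> column indices
iv = {0: 0.10, 1: 0.40, 2: 0.20, 3: 0.30, 4: 0.05}  # hypothetical IV values
kept = first_screening(X, groups, iv)
print(kept)
```

Since the two picks for a class can be the same column, the number of retained variables lies between k and 2k.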
Step 4.4: the 2k variables are screened iteratively by backward elimination. Specifically, a logistic regression model is fitted on all candidate variables entering this step and the VIF values of all variables are examined; if the VIF value of any variable is greater than 4, indicating that collinearity exists, the variable with the highest p value is eliminated.
Step 4.5: variables whose p values are greater than the specified value are rejected.
Step 4.6: the above steps are repeated until the p values of all variables are less than the specified value (0.05) and the VIFs of all variables are less than 4, at which point the collinearity of the model is considered eliminated.
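Steps 4.4 to 4.6 can be sketched in plain NumPy as below. This is an illustration under assumptions, not the original implementation: the p values are Wald p values from a logistic regression fitted by Newton's method, the VIF is computed as 1 / (1 - R^2) from regressing each variable on the others, and the loop removes one variable (the one with the highest p value) per iteration whenever either condition is violated. All function names and the simulated data are hypothetical.

```python
import math
import numpy as np


def logit_pvalues(X, y, n_iter=50):
    """Fit a logistic regression by Newton's method; return Wald p values of the slopes."""
    Z = np.column_stack([np.ones(len(X)), X])      # add an intercept column
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        H = Z.T @ (Z * (p * (1.0 - p))[:, None])   # information matrix
        step = np.linalg.solve(H, Z.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    pvals = [math.erfc(abs(b / s) / math.sqrt(2.0)) for b, s in zip(beta, se)]
    return np.array(pvals[1:])                     # drop the intercept


def vif(X):
    """Variance inflation factor of each column of X."""
    out = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(out)


def backward_eliminate(X, y, names, p_max=0.05, vif_max=4.0):
    """Drop the variable with the highest p value until all p values and VIFs pass."""
    names = list(names)
    while names:
        pvals, vifs = logit_pvalues(X, y), vif(X)
        if vifs.max() <= vif_max and pvals.max() <= p_max:
            break
        drop = int(np.argmax(pvals))
        X = np.delete(X, drop, axis=1)
        names.pop(drop)
    return names


rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)              # pure noise
prob = 1.0 / (1.0 + np.exp(-(1.5 * x1 + 1.0 * x3)))
y = (rng.uniform(size=n) < prob).astype(float)
X = np.column_stack([x1, x2, x3, x4])
kept = backward_eliminate(X, y, ["x1", "x2", "x3", "x4"])
print(kept)
```

In this simulated data the redundant copy `x2` is removed because it inflates the VIF while carrying no effect of its own, whereas the truly informative variables `x1` and `x3` survive. A statistics library could supply the p values and VIFs instead of the hand-written helpers.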
Example seven
With reference to fig. 1 to fig. 6, on the basis of the above embodiments, an embodiment of the present invention provides a data analysis method based on user information, including the following steps:
receiving user information; converting and summarizing the user information into a user big data set; randomly dividing the user big data set into two sets, namely a first set and a second set, wherein the first set is stored in a first database and the second set is stored in a second database; to ensure that the user big data set is obtained in a compliant and lawful manner, the user big data set is obtained by a server in response to a user request and after the server has been authorized by the user;
performing binning-related processing on the first set in the first database to obtain a third set, and storing the third set in a third database;
extracting a third set in the third database, and constructing a first model by adopting a factor analysis method based on the third set;
extracting a second set in the second database, and verifying the first model based on the second set;
wherein the constructing the first model based on the third set by using a factor analysis method specifically comprises:
carrying out variable clustering by using a factor analysis method;
performing a first variable screening on the variables in each class so that the number of remaining variables is not greater than a first threshold;
and performing a second variable screening on the variables remaining after the first screening by iterating backward elimination multiple times until a preset condition is met.
The data analysis system of the embodiment of the invention digitizes user information, converts it into data in a system-specific format, and uses the modeling module to eliminate the collinearity of the model by factor analysis while preserving precision as far as possible. This avoids the loss of important variables and of precision that occurs when, merely to eliminate collinearity, only the single most representative variable in each cluster (for example, the variable most correlated with the principal component) is retained, and the accuracy of data analysis is thereby improved.
To test the accuracy of the model built on the training set, a test set can be set aside: the model built on the training set is applied to the test set, and the results obtained are compared with the actual outcomes to evaluate the model. Accordingly, randomly dividing the user big data set into two sets specifically includes:
combining all the information of the users into a wide table;
and randomly dividing the wide table into two sets according to a certain proportion.
In a practical application scenario, of the two sets obtained by the division, the first set is typically used as a training set for modeling and model parameter estimation, and the second set is used as a test set for model evaluation.
So that both the numerical variables and the text variables among the user characteristic variables can be used to construct the first model by factor analysis, the characteristic variables of the user can be subjected to binning-related operations. In an actual application scenario, performing binning-related processing on the first set in the first database to obtain a third set specifically includes:
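The random division of the wide table can be sketched as follows. The 70/30 proportion, the fixed seed, the field names and the function name `split_wide_table` are assumptions for illustration; the text only specifies a random split in a certain proportion.

```python
import random


def split_wide_table(rows, train_ratio=0.7, seed=42):
    """Randomly split a list of records into (first_set, second_set)."""
    rng = random.Random(seed)
    shuffled = rows[:]                        # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)  # size of the first (training) set
    return shuffled[:cut], shuffled[cut:]


# Hypothetical wide table: one record per user with all fields side by side.
wide_table = [{"user_id": i, "age": 20 + i % 40, "label": i % 2} for i in range(10)]
first_set, second_set = split_wide_table(wide_table)
print(len(first_set), len(second_set))
```

Every user lands in exactly one of the two sets, so the model is later evaluated on records it has not seen during fitting.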
binning the first set of data;
and performing a WOE transformation on each bin to obtain its WOE value, thereby obtaining the third set.
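The two sub-steps above can be sketched for a single numerical variable as follows. Equal-frequency binning, the number of bins, the WOE convention (logarithm of the ratio between the share of Y=1 and the share of Y=0 in a bin) and the 0.5 smoothing constant are all assumptions of this sketch, as is the function name `woe_transform`.

```python
import numpy as np


def woe_transform(x, y, n_bins=4):
    """Equal-frequency binning of x followed by WOE encoding of each bin."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior cut points
    bins = np.digitize(x, edges)                                  # bin index 0 .. n_bins-1
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    woe = np.zeros(n_bins)
    for b in range(n_bins):
        in_bin = bins == b
        # 0.5 is added to each count so that an empty cell does not give log(0);
        # this smoothing is a choice of the sketch.
        p1 = ((y[in_bin] == 1).sum() + 0.5) / n_pos
        p0 = ((y[in_bin] == 0).sum() + 0.5) / n_neg
        woe[b] = np.log(p1 / p0)
    return woe[bins], bins, woe


# Hypothetical data: larger x makes Y=1 more likely, with two label flips.
x = np.arange(100.0)
y = (x >= 50).astype(int)
y[10], y[90] = 1, 0
encoded, bins, woe = woe_transform(x, y)
print(woe)
```

Each sample's raw value is replaced by the WOE of its bin, so the encoded column takes at most `n_bins` distinct values. Text variables could be handled the same way by treating each category, or group of categories, as a bin.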
Further, the WOE transform performed on each bin has the following advantages:
1. WOE reflects the contribution of the independent variable. The variation (fluctuation) of the WOE values within an independent variable can be combined with the coefficient fitted by the model to derive the contribution rate and relative importance of each independent variable. In general, the larger the fitted coefficient and the larger the WOE variance, the greater the contribution rate of the independent variable.
2. Standardization. After WOE encoding, the independent variables are in effect standardized. Specifically, the values within one independent variable can be compared directly with each other (by comparing their WOEs), and values of different independent variables can also be compared directly through their WOEs.
3. Insensitivity to abnormal values. Many extreme values can be turned into non-outliers by the WOE transformation, and many infrequent values can be merged by it.
As can be seen from the above description, the WOE transformation greatly improves the interpretability of the data, which is important to the accuracy of data analysis. In essence, WOE describes the direction and magnitude of the influence that the current bin of a variable has on judging whether an individual will respond (or which class the individual belongs to). When the WOE is positive, the current value of the variable has a positive influence on judging whether the individual will respond; when the WOE is negative, the influence is negative. The magnitude of the WOE value reflects the size of this influence.
In addition, the first model constructed based on the third set by the factor analysis method is a logistic regression model. The logistic regression model has the following advantages:
1. The form is simple and the model is highly interpretable. The influence of different features on the final result can be read from the feature weights; for example, the higher the weight of a feature, the greater its influence on the final result.
2. The model performs well. It is widely accepted in engineering practice (as a baseline): if the feature engineering is done well, the model performance will not be poor, and feature engineering can be carried out in parallel, which greatly speeds up development.
3. Training is fast. In classification, the amount of computation depends only on the number of features. In addition, distributed optimization of logistic regression with SGD (Stochastic Gradient Descent) is mature, and the training speed can be further improved by adding machines, so that several versions of the model can be iterated in a short time.
4. Resource usage is low, especially memory usage, because only the feature values of each dimension need to be stored.
5. The output is easy to adjust. Logistic regression readily yields the final classification result because it outputs a probability score for each sample, and a cutoff, i.e., a dividing threshold, can easily be applied to these scores (samples above the threshold belong to one class, samples below it to the other).
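The cutoff mentioned in item 5 can be illustrated with a minimal sketch. The weights, bias, feature values and the function name `predict_class` are hypothetical; only the idea of thresholding the output probability comes from the text.

```python
import math


def predict_class(weights, bias, features, cutoff=0.5):
    """Return (probability, class) for one sample under a logistic regression model."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-score))  # logistic (sigmoid) function
    return prob, int(prob > cutoff)        # class 1 above the cutoff, class 0 otherwise


# Hypothetical model parameters and a WOE-encoded sample.
weights, bias = [0.9, -0.4], -0.2
sample = [1.1, 0.3]
prob, label = predict_class(weights, bias, sample)
strict_prob, strict_label = predict_class(weights, bias, sample, cutoff=0.7)
print(prob, label, strict_label)
```

Raising the cutoff leaves the probability unchanged and only moves the boundary between the two classes, which is why the output is easy to adjust after training.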
After the characteristic variables of the user have undergone the binning-related operations, the first model can be constructed by factor analysis. In an actual application scenario, the factor analysis method specifically includes:
given the feature vectors of the N candidate variables (the contents of the third set), calculating a covariance matrix, wherein the covariance matrix is an N × N matrix M whose entry $M_{ij}$, in row i and column j, is the covariance of $X_i$ and $X_j$;
calculating the N eigenvalues (characteristic roots) and eigenvectors of the covariance matrix M;
denoting the N eigenvalues as $\lambda_1, \lambda_2, \dots, \lambda_N$ in descending order, and denoting the N normalized eigenvectors corresponding to the eigenvalues in that order as $v_1, v_2, \dots, v_N$.
In the embodiment of the present invention, when the first variable screening is performed on the variables in each class, a first threshold needs to be obtained based on the factor analysis method, where the first threshold is
[Formula image not reproduced in the source text]
which means that the sum of the first k largest eigenvalues is greater than 0.75.
After the first threshold is obtained, performing first variable screening on the variables in each class so that the number of remaining variables is not greater than the first threshold, specifically including:
the number of classes obtained by variable clustering is k, and the first threshold is 2k;
selecting two variables for each of the k classes, one being the variable with the highest IV value and the other being the variable with the highest R2 value; a high IV value means that the variable contributes more to the model result, and a high R2 value means that the variable is the most representative within its cluster.
After the second variable screening, all variables must satisfy a preset condition. Correspondingly, performing the second variable screening on the variables remaining after the first screening by iterating backward elimination multiple times until the preset condition is met specifically includes:
if the VIF value of the candidate variable is larger than 4, eliminating the variable with the highest p value;
rejecting variables with p values greater than a specified value;
and repeating the steps until the p values of all the variables are less than the specified value and the VIFs of all the variables are less than 4.
Example eight
As shown in fig. 7, the present invention also provides a data analysis system 700 based on user information, which includes:
a data decomposition module 703 for randomly dividing the user big data set into two sets, the two sets including a first set and a second set;
a binning processing module 704, configured to perform binning-related processing on the first set to obtain a third set;
a modeling module 705 for building a first model using factor analysis based on the third set;
a verification module 706 for verifying the first model based on the second set;
wherein the constructing the first model based on the third set by using a factor analysis method specifically comprises:
carrying out variable clustering by using a factor analysis method;
performing a first variable screening on the variables in each class so that the number of remaining variables is not greater than a first threshold;
and performing a second variable screening on the variables remaining after the first screening by iterating backward elimination multiple times until a preset condition is met.
The system 700 further comprises:
a data receiving module 701, configured to receive user information;
a data summarization module 702 for converting and summarizing the user information into a user big data set.
Example nine
As shown in fig. 8, this embodiment further provides an electronic device 800, where the electronic device 800 includes: at least one processor 801; and a memory 802 communicatively coupled to the at least one processor 801; wherein
the memory 802 stores instructions executable by the at least one processor 801 to cause the at least one processor 801 to perform the method steps described in the above embodiments.
Example ten
The disclosed embodiments provide a non-volatile computer storage medium having stored thereon computer-executable instructions that may perform the method steps as described in the embodiments above.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The foregoing describes preferred embodiments of the present invention. It is intended to make the spirit of the invention clear and easy to understand, not to limit the invention; all modifications, substitutions and alterations made within the spirit and principles of the invention fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A data analysis method based on user information comprises the following steps:
receiving user information;
converting and summarizing the user information into a user big data set;
randomly dividing the user big data set into two sets, wherein the two sets comprise a first set and a second set, the first set is stored in a first database, and the second set is stored in a second database;
performing binning-related processing on the first set in the first database to obtain a third set, and storing the third set in a third database;
extracting a third set in the third database, and constructing a first model by adopting a factor analysis method based on the third set;
extracting a second set in the second database, and verifying the first model based on the second set;
wherein the constructing the first model based on the third set by using a factor analysis method specifically comprises:
carrying out variable clustering by using a factor analysis method;
performing a first variable screening on the variables in each class so that the number of remaining variables is not greater than a first threshold;
and performing a second variable screening on the variables remaining after the first screening by iterating backward elimination multiple times until a preset condition is met.
2. The method of claim 1, wherein randomly dividing the user big data set into two sets specifically comprises:
combining all the information of the users into a wide table;
and randomly dividing the wide table into two sets according to a certain proportion.
3. The method of claim 2, wherein the first set is a training set used for modeling and model parameter estimation and the second set is a test set used for model evaluation.
4. The method of claim 1, wherein the first model is a logistic regression model.
5. The method according to claim 4, wherein the performing binning-related processing on the first set in the first database to obtain a third set specifically comprises:
binning the first set of data;
and performing a WOE transformation on each bin to obtain its WOE value, thereby obtaining the third set.
6. The method of claim 5, wherein the factor analysis method specifically comprises:
given the feature vectors of N candidate variables, calculating a covariance matrix of the feature vectors, wherein the covariance matrix is an N × N matrix M whose entry $M_{ij}$, in row i and column j, is the covariance of $X_i$ and $X_j$;
calculating the N eigenvalues (characteristic roots) and eigenvectors of the covariance matrix M;
denoting the N eigenvalues as $\lambda_1, \lambda_2, \dots, \lambda_N$ in descending order, and denoting the N normalized eigenvectors corresponding to the eigenvalues in that order as $v_1, v_2, \dots, v_N$.
7. The method of claim 6, wherein the first threshold is obtained based on the factor analysis method, the first threshold being
[Formula image not reproduced in the source text]
which means that the sum of the first k largest eigenvalues is greater than 0.75.
8. The method of claim 7, wherein the performing a first variable screening on the variables in each class such that the number of remaining variables is not greater than a first threshold value, specifically comprises:
the number of classes obtained by variable clustering is k, and the first threshold is 2k;
selecting two variables for each of the k classes, one being the variable with the highest IV value and the other being the variable with the highest R2 value; a high IV value means that the variable contributes more to the model result, and a high R2 value means that the variable is the most representative within its cluster.
9. The method according to claim 5, wherein the step of performing a second variable screening on the variables remaining after the first screening by using a plurality of backward elimination method iterations until a preset condition is satisfied specifically comprises:
if the VIF value of the candidate variable is larger than 4, eliminating the variable with the highest p value;
rejecting variables with p values greater than a specified value;
and repeating the steps until the p values of all the variables are less than the specified value and the VIFs of all the variables are less than 4.
10. The method of claim 1, wherein the user big data set is obtained by a service party after user authorization in response to a user request.
CN202010769479.0A 2020-08-04 2020-08-04 Data analysis method based on user information Active CN111654853B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010769479.0A CN111654853B (en) 2020-08-04 2020-08-04 Data analysis method based on user information

Publications (2)

Publication Number Publication Date
CN111654853A true CN111654853A (en) 2020-09-11
CN111654853B CN111654853B (en) 2020-11-10

Family

ID=72352607

Country Status (1)

Country Link
CN (1) CN111654853B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227644A1 (en) * 2021-04-26 2022-11-03 深圳前海微众银行股份有限公司 Data processing method and apparatus, and device, storage medium and program product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095391A (en) * 2016-05-31 2016-11-09 携程计算机技术(上海)有限公司 Based on big data platform and the computational methods of algorithm model and system
CN107103050A (en) * 2017-03-31 2017-08-29 海通安恒(大连)大数据科技有限公司 A kind of big data Modeling Platform and method
CN108399255A (en) * 2018-03-06 2018-08-14 中国银行股份有限公司 A kind of input data processing method and device of Classification Data Mining model
WO2019047790A1 (en) * 2017-09-08 2019-03-14 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples
CN110415111A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 Merge the method for logistic regression credit examination & approval with expert features based on user data
CN110728453A (en) * 2019-10-14 2020-01-24 山东嘉熙信息科技有限公司 Big data based policy automatic matching analysis system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant