CN109635010B - User characteristic and characteristic factor extraction and query method and system - Google Patents


Info

Publication number
CN109635010B
Authority
CN
China
Prior art keywords
characteristic
user
important
behavior
factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811619624.6A
Other languages
Chinese (zh)
Other versions
CN109635010A (en
Inventor
慕畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Mengwang Video Co ltd
Original Assignee
Shenzhen Mengwang Video Co ltd

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for extracting and querying user features and feature factors. The method first preprocesses the user behavior data of a given scene, discretizes the user features and reduces their dimensionality, then selects the important user features from the reduced set, further filters the feature factors within those important user features to find the important feature factors, and creates a two-dimensional matrix of important user features and important feature factors for the scene. The two-dimensional matrices created for different scenes are then associated through identical or similar user behavior prediction features to create a scene/user-behavior-prediction-feature two-dimensional matrix. With this method, when looking for features and factors that are likely to be important to the decision result in a similar scene, there is no need to spend heavily on searching the data again; the search range is narrowed, relatively accurate training results are obtained, and substantial resources and training cost are saved.

Description

User characteristic and characteristic factor extraction and query method and system
Technical Field
The invention relates to the technical field of data mining, and in particular to a method and a system for extracting and querying user features and feature factors.
Background
Existing feature screening techniques, such as principal component analysis (PCA), logistic regression, feature importance ranking with random forests, and generalized-weight evaluation of features with back-propagation (BP) neural networks, have two shortcomings:
Depth: dimension reduction generally considers only the features themselves and ignores how the different factors within a feature affect the output. In a purchasing decision, for example, age may be an important feature, but age divides into children, youth, middle-aged and elderly, and the influence of each age group is not distinguished. The result cannot be put into practice: it is known that age affects purchasing, but not how. If the analysis showed that the important feature is age, that the important positive factor is youth aged 20-30, and that the important negative factor is the elderly aged over 60, the conclusion would be far clearer. Existing feature screening is thus one-dimensional rather than two-dimensional.
Cost performance: effective features and factors are not combined, systematically summarized and recorded into a coded library, so the data has to be explored again whenever a similar scene is encountered. For example, suppose it is known that age, and the youth factor within age, strongly influences whether a store membership card is purchased. For a similar scene, such as whether a food coupon is purchased, the coded records in the effective feature-factor library of the similar scene can be consulted, the features with similar prediction results can be taken as a basic framework, and further features can be added on that basis, which saves a large amount of development cost. Without such a feature + factor two-dimensional library, every new data mining scene consumes substantial resources and time in repeated trial-and-error training. If no effective feature variables are found, the effort is likely to be misdirected, and the cost is incurred without obtaining a more accurate result. Design and development resources keep being spent on tuning algorithms when the problem is not the algorithm but the failure to find effective features, so twice the effort yields half the result, or the effort is wasted entirely.
Disclosure of Invention
An object of the embodiments of the invention is to provide a method for extracting and querying user features and feature factors, so as to solve the problems of low precision and wasted resources and cost in the prior art, which arise because data features are screened without combining the effectiveness of features and factors or summarizing and recording them systematically, so that the data has to be explored again in similar scenes.
The embodiment of the invention is realized as a method for extracting and querying user features and feature factors, comprising the following steps:
S1, creating feature-factor two-dimensional matrix libraries of the important user features and important factors of a plurality of scenes;
S2, associating the feature-factor two-dimensional matrix libraries created for different scenes according to identical or similar behavior prediction features to construct a scene-behavior two-dimensional matrix;
S3, searching the scene-behavior two-dimensional matrix for identical or similar behavior prediction features in associated scenes, looking up the associated feature-factor two-dimensional matrix according to those behavior prediction features, and obtaining the important user features and the important feature factors.
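Steps S1 to S3 can be pictured as two lookup tables. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the scene names, feature names, factor labels and the helper `query_important_features` are all hypothetical, chosen only to echo the membership card, food coupon and credit card examples used in this document.

```python
# Hypothetical feature-factor libraries, one per scene (step S1).
# Keys are important user features; values are their important factors.
feature_factor_library = {
    "credit_card_default": {"age": ["youth"], "liability_rate": ["super", "senior"]},
    "membership_card_purchase": {"age": ["youth"]},
}

# Hypothetical scene-behavior matrix (step S2): rows are scenes, columns are
# behavior prediction features; True marks that the scene predicts that behavior.
scene_behavior_matrix = {
    "credit_card_default": {"default": True, "purchase": False},
    "membership_card_purchase": {"default": False, "purchase": True},
    "food_coupon_purchase": {"default": False, "purchase": True},
}


def query_important_features(new_scene):
    """Step S3: collect feature-factor entries from scenes sharing a behavior."""
    behaviors = {b for b, on in scene_behavior_matrix[new_scene].items() if on}
    result = {}
    for scene, row in scene_behavior_matrix.items():
        if scene == new_scene or scene not in feature_factor_library:
            continue
        if behaviors & {b for b, on in row.items() if on}:
            result[scene] = feature_factor_library[scene]
    return result


related = query_important_features("food_coupon_purchase")
```

In this toy setup the food coupon scene has no library of its own, so the query returns the library of the membership card scene, which shares the "purchase" behavior, as a starting framework.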
A second object of the embodiments of the invention is to provide a method for extracting user features and feature factors, comprising the following steps (S101-S108):
S101, extracting a user behavior data set from a user behavior statistical database of a first scene;
S102, preprocessing the user behavior data set;
S103, normalizing and discretizing the preprocessed user behavior data set to obtain a first user behavior feature set;
S104, performing user feature dimension reduction on the first user behavior feature set to obtain a dimension-reduced second user behavior feature set;
S105, extracting a training set and a test set from the second user behavior feature set, establishing candidate data prediction models from the training set and the test set, and evaluating them to obtain an excellent data prediction model;
S106, screening the user features in the second user behavior feature set according to the selected excellent data prediction model, and selecting the important user features;
S107, filtering the feature factors of the important user features to obtain the important feature factors;
S108, constructing a feature-factor two-dimensional matrix library of the first scene from the combination of the important user features and the important feature factors.
A third object of the embodiments of the invention is to provide a system for extracting and querying user features and feature factors, the system comprising:
a feature-factor two-dimensional matrix library creating device, used for creating the feature-factor two-dimensional matrix libraries of the important user features and important factors of a plurality of scenes;
a scene-behavior two-dimensional matrix creating device, used for associating the feature-factor two-dimensional matrix libraries created for different scenes according to identical or similar behavior prediction features to construct a scene-behavior two-dimensional matrix;
and an important user feature and important feature factor query device, used for searching the scene-behavior two-dimensional matrix for identical or similar behavior prediction features in associated scenes, looking up the associated feature-factor two-dimensional matrix according to those behavior prediction features, and obtaining the important user features and the important feature factors.
A fourth object of the embodiments of the invention is to provide a user feature and feature factor extracting apparatus, the apparatus comprising:
a first-scene user behavior data set extraction module, used for extracting a user behavior data set from a user behavior statistical database of the first scene; the user behavior data set M1 comprises at least one user feature and a behavior prediction feature; each user feature comprises at least one feature factor; the behavior prediction feature is generated by a data prediction model that takes the user features as input variables; with the user features as the input variable x and the behavior prediction feature as the output variable y, y = model(x); the data prediction model comprises one or more of a neural network, a random forest, a support vector machine, a decision tree, logistic regression, ensemble learning, a K-nearest neighbor model and Bayesian linear discrimination;
a data preprocessing module, used for preprocessing the user behavior data set; the preprocessing comprises missing value processing, abnormal data processing and data redundancy processing;
a normalization and discretization processing module, used for normalizing and discretizing the preprocessed user behavior data set to obtain a first user behavior feature set;
a user feature dimension reduction processing module, used for performing user feature dimension reduction on the first user behavior feature set to obtain a dimension-reduced second user behavior feature set; the dimension reduction methods comprise a multicollinearity dimension reduction method and a regression dimension reduction method;
an excellent data prediction model acquisition device, used for extracting a training set and a test set from the second user behavior feature set, establishing candidate data prediction models from the training set and the test set, and evaluating the candidate data prediction models to obtain an excellent data prediction model; the training set and the test set are obtained by random sampling without replacement, equidistant sampling, stratified sampling or classified sampling;
an important user feature acquisition device, used for screening the user features in the second user behavior feature set according to the selected excellent data prediction model to select the important user features;
an important feature factor acquisition device, used for filtering the feature factors of the important user features to obtain the important feature factors;
and a feature-factor two-dimensional matrix library creation module, used for constructing a feature-factor two-dimensional matrix library of the first scene from the combination of the important user features and the important feature factors.
Advantageous Effects
The invention provides a method and a system for extracting and querying user features and feature factors. The method preprocesses the user behavior data of a scene, discretizes the user features and reduces their dimensionality, selects the important user features from the reduced set, further filters the feature factors within those features to find the important feature factors, and creates a two-dimensional matrix of important user features and important feature factors for the scene. The matrices created for different scenes are then associated through identical or similar user behavior prediction features to create a scene/user-behavior-prediction-feature two-dimensional matrix. With this method, when looking for features and factors likely to be important to the decision result in a similar scene, there is no need to spend heavily on exploring and searching the data again: the scene/user-behavior-prediction-feature two-dimensional matrix library leads directly to the associated two-dimensional matrix of important user features and important feature factors, so the search range is narrowed, relatively accurate training results are obtained, and substantial resources and training cost are saved.
Drawings
FIG. 1 is a flow chart of a method for extracting and querying user characteristics and characteristic factors according to a preferred embodiment of the present invention;
FIG. 2 is a flowchart of the method of creating a feature-factor two-dimensional matrix library of important user features and important factors for one of the scenes in FIG. 1;
FIG. 3 is a quartile box plot of the misjudgment rate of each candidate data prediction model in accordance with embodiments of the present invention;
FIG. 4 is a graph of the misjudgment rate after the user features are removed according to the embodiment of the present invention;
FIG. 5 is a line graph of age-liability misjudgment rate for an embodiment of the present invention;
FIG. 6 is a schematic diagram of important user characteristics and important characteristic factor storage in a credit card consumption scenario according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of storing scene and user behavior prediction feature data of different scenes according to an embodiment of the present invention;
FIG. 8 is a diagram of a system for extracting and querying user features and feature factors according to a preferred embodiment of the present invention;
FIG. 9 is a block diagram of the feature-factor two-dimensional matrix library creating apparatus of FIG. 8;
FIG. 10 is a block diagram of an excellent data prediction model acquisition apparatus of FIG. 9;
FIG. 11 is a diagram showing a structure of an important user feature acquiring apparatus in FIG. 9;
FIG. 12 is a structural diagram of an important characteristic factor acquiring apparatus in FIG. 9;
FIG. 13 is a view showing the construction of a candidate data prediction model evaluation apparatus in FIG. 10;
FIG. 14 is a diagram of a feature factor dimension reduction apparatus for the important user features of FIG. 12;
FIG. 15 is a structural diagram of the third misjudgment rate matrix creating apparatus in FIG. 12.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples, and for convenience of description, only parts related to the examples of the present invention are shown. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a method and a system for extracting and querying user features and feature factors. The method first performs user feature dimension reduction on the user behavior data of a scene, then selects the important user features from the reduced set, further filters the feature factors within those features to find the important feature factors, and creates a two-dimensional matrix of important user features and important feature factors for the scene. The matrices created for different scenes are then associated through identical or similar user behavior prediction features to create a scene/user-behavior-prediction-feature two-dimensional matrix. With this method, when looking for features and factors likely to be important to the decision result in a similar scene, there is no need to spend heavily on exploring and searching the data again: the scene/user-behavior-prediction-feature two-dimensional matrix library leads directly to the associated two-dimensional matrix of important user features and important feature factors, so the search range is narrowed, relatively accurate training results are obtained, and substantial resources and training cost are saved.
Example one
FIG. 1 is a flow chart of a method for extracting and querying user features and feature factors according to a preferred embodiment of the present invention; the method comprises steps S1-S3:
S1, creating feature-factor two-dimensional matrix libraries of the important user features and important factors of a plurality of scenes;
Specifically, the method for creating the feature-factor two-dimensional matrix library of the important user features and important factors of one scene comprises the following steps (S101-S108):
FIG. 2 is a flowchart of the method of creating a feature-factor two-dimensional matrix library of important user features and important factors for one of the scenes in FIG. 1;
S101, extracting a user behavior data set from a user behavior statistical database of a first scene;
The user behavior data set comprises at least one user feature and a behavior prediction feature; the behavior prediction feature is generated by a data prediction model that takes the user features as input variables; with the user features as the input variable x and the behavior prediction feature as the output variable y, y = model(x); the data prediction model comprises one or more of a neural network, a random forest, a support vector machine, a decision tree, logistic regression, ensemble learning, a K-nearest neighbor model and Bayesian linear discrimination.
Specifically, the embodiment of the invention is explained in detail using user behavior statistical data from a credit card consumption scene; Table 1 shows an excerpt of a user behavior data set in the credit card consumption scene;
Age  Education  Work age  Address  Income  Liability rate  Credit card liability  Other liability  Default
41   3          17        12       176     9.3             11.36                  5.01             1
27   1          10        6        31      17.3            1.36                   4                0
40   1          15        14       55      5.5             0.86                   2.17             0
41   1          15        14       120     2.9             2.66                   0.82             0
24   2          2         0        28      17.3            1.79                   3.06             1
41   2          5         5        25      10.2            0.39                   2.16             0
39   1          20        9        67      30.6            3.83                   16.67            0
43   1          12        11       38      3.6             0.13                   1.24             0
24   1          3         4        19      24.4            1.36                   3.28             1
36   1          0         13       25      19.7            2.78                   2.15             0
TABLE 1
In Table 1, the user features are age, education, work age, address, income, liability rate, credit card liability and other liability; the behavior prediction feature is default;
S102, preprocessing the user behavior data set;
The preprocessing comprises missing value processing, abnormal data processing and data redundancy processing, all of which are prior art.
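The source treats the three preprocessing operations as prior art and gives no specifics. As a minimal sketch only, the Python below shows one common way to handle each; the records, the income cap of 1000, the choice to treat abnormal values as missing, and median imputation are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical raw records: (age, income); None marks a missing value.
raw = [(41, 176), (27, 31), (27, 31), (40, None), (24, 28), (39, 9999)]


def preprocess(records, income_cap=1000):
    """Drop exact duplicates, blank out abnormal incomes, fill missing incomes."""
    # Data redundancy: keep the first occurrence of each identical record.
    deduped = list(dict.fromkeys(records))
    # Abnormal data: treat incomes above the (assumed) cap as missing.
    cleaned = [(a, i if i is not None and i <= income_cap else None) for a, i in deduped]
    # Missing values: impute with the median of the observed incomes.
    fill = median(i for _, i in cleaned if i is not None)
    return [(a, i if i is not None else fill) for a, i in cleaned]


clean = preprocess(raw)
```

Here the duplicate record is dropped, the income of 9999 is treated as abnormal, and both gaps are filled with the median of the remaining incomes.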
S103, carrying out normalization and discretization processing on the preprocessed user behavior data set to obtain a first user behavior feature set;
The normalization and discretization methods are all known in the art;
Further, each user feature comprises at least one feature factor;
Specifically, in the embodiment of the invention, after data preprocessing, normalization and discretization, the values of each user feature are classified by interval, and the resulting categories are the feature factors. Taking the user feature education as an example, its feature factors include bachelor's degree, junior college, and high school and below; the feature factors of age include teenagers, youth, older youth, middle-aged and elderly;
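Classifying feature values by interval amounts to binning. The sketch below illustrates this for the age feature; the source names the age factors but not their boundaries, so the cut points in `AGE_BOUNDS` and the English factor labels are assumptions.

```python
from bisect import bisect_right

# Assumed upper bounds (exclusive) of the first four age factors; the source
# names the factors but does not give the interval boundaries.
AGE_BOUNDS = [18, 30, 45, 60]
AGE_FACTORS = ["teenager", "youth", "older youth", "middle-aged", "elderly"]


def age_factor(age):
    """Map a numeric age to its feature factor by interval."""
    return AGE_FACTORS[bisect_right(AGE_BOUNDS, age)]


ages = [41, 27, 40, 41, 24, 41, 39, 43, 24, 36]  # Age column of Table 1
factors = [age_factor(a) for a in ages]
```

With these assumed boundaries every record in Table 1 falls into either "youth" or "older youth"; real boundaries would come from the discretization step.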
S104, performing user feature dimension reduction on the first user behavior feature set to obtain a dimension-reduced second user behavior feature set;
The dimension reduction comprises the following steps:
Step A1, using a multicollinearity dimension reduction method to find highly correlated user features (input variables x), retaining one feature of each highly correlated group and deleting the others;
Step A2, optimizing step by step with a regression dimension reduction method (linear, nonlinear, logistic), deleting the user features (input variables x) that have no influence on the behavior prediction feature.
The multicollinearity and regression dimension reduction methods are both known in the art;
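Step A1 can be illustrated with pairwise Pearson correlation, which is one simple way to detect collinear features. The sketch below is not the patent's procedure: the 0.9 threshold, the greedy keep-first rule and the `income_k` column (a rescaled copy of income) are assumptions, and step A2 is not covered.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_collinear(features, threshold=0.9):
    """Keep one feature from each highly correlated group (step A1)."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: "income_k" is a rescaled copy of "income".
features = {
    "age": [41, 27, 40, 24, 39],
    "income": [176, 31, 55, 28, 67],
    "income_k": [0.176, 0.031, 0.055, 0.028, 0.067],
}
kept = drop_collinear(features)
```

Because `income_k` is perfectly correlated with `income`, only the first of the two is retained, while `age` survives because its correlation with income is below the assumed threshold.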
S105, extracting a training set and a test set from the second user behavior feature set, establishing candidate data prediction models from the training set and the test set, and evaluating them to obtain an excellent data prediction model;
The ratio of the number of samples in the training set to the number in the test set is about 7:3; the training set and the test set can be obtained by random sampling without replacement, equidistant sampling, stratified sampling or classified sampling, all of which are known in the art.
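A 7:3 split by random sampling without replacement can be sketched as follows. The function name `split_train_test` and the fixed seed are illustrative assumptions; the other sampling schemes named above are not shown.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Random sampling without replacement into training and test sets."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = rng.sample(rows, len(rows))  # a permutation: no row is reused
    cut = round(len(rows) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-ins for the ten records of Table 1
train, test = split_train_test(rows)
```

Every row lands in exactly one of the two sets, so the training and test sets never overlap.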
Establishing candidate data prediction models from the training set and the test set and evaluating them to obtain an excellent data prediction model comprises the following steps:
Step B1, constructing candidate data prediction models from the user feature variables (input variables) and the behavior prediction feature variable (output variable) in the training set;
The candidate data prediction models comprise one or more of a neural network, a random forest, a support vector machine, a decision tree, logistic regression, ensemble learning, a K-nearest neighbor model, Bayes and linear discrimination;
Step B2, evaluating the candidate data prediction models, specifically comprising steps b201-b202:
b201: substituting the user feature variables of the test set into the candidate data prediction model and calculating the behavior prediction feature value (called the first behavior prediction feature value); comparing the first behavior prediction feature value with the original behavior prediction feature value of the test set, and building a confusion matrix from the resulting prediction errors;
Each column of the confusion matrix represents a predicted category, and the column total is the number of samples predicted as that category; each row represents the true category, and the row total is the number of samples that actually belong to that category. Each cell therefore gives the number of samples of the row's true category that were predicted as the column's category;
b202: calculating the misjudgment rate of the candidate prediction model and storing it in a first misjudgment rate matrix;
misjudgment rate = (number of wrongly predicted samples / total number of samples) × 100%;
The misjudgment rate of a candidate prediction model must be less than or equal to a first threshold, which is set by the user and generally does not exceed 50%; the smaller the misjudgment rate, the better the data prediction model performs;
In the embodiment of the invention, Table 2 shows a prediction-error confusion matrix for credit card consumption;
[Table 2 is an image in the source (Figure BDA0001922473860000081) and is not reproduced here.]
TABLE 2
Assume that, under a certain candidate prediction model in the embodiment of the invention, 10000 users actually did not default and 500 actually defaulted, while 9240 users were predicted not to default and 1260 were predicted to default. The number of prediction errors is the number of users predicted wrongly, here 800 + 40, and the total sample size is the total number of users in the test set, so the misjudgment rate = (800 + 40) / (10000 + 500) = 8%.
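The arithmetic above can be checked with a small Python sketch. The four cell values are not quoted from Table 2, which is an image in the source; they are derived from the row and column totals stated in the text, and the function name `misjudgment_rate` is illustrative.

```python
# Rows are true classes, columns are predicted classes, as described above.
# Cell values are derived from the totals in the text: 10000 true non-default,
# 500 true default, 9240 predicted non-default, 1260 predicted default.
confusion = {
    "non-default": {"non-default": 9200, "default": 800},
    "default": {"non-default": 40, "default": 460},
}


def misjudgment_rate(matrix):
    """Share of samples whose predicted class differs from the true class."""
    total = sum(sum(row.values()) for row in matrix.values())
    wrong = sum(n for true, row in matrix.items() for pred, n in row.items() if pred != true)
    return wrong / total


rate = misjudgment_rate(confusion)
print(f"{rate:.1%}")  # 840 / 10500, prints 8.0%
```

The 800 errors can only be non-defaulting users predicted as defaulting, since only 500 users defaulted in total; this is what fixes the layout of the four cells.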
In the embodiment of the invention, the misjudgment rate of each candidate data prediction model is shown in Table 3:
Neural network  Random forest  Support vector machine  Decision tree  Logistic regression  Ensemble learning  K-nearest neighbor
33.2%           33.7%          35.4%                   32.1%          29.9%                31.5%              23.8%
TABLE 3
Step B3: selecting an excellent data prediction model;
Specifically, the excellent data prediction models are screened by applying the quartile and box plot method to the misjudgment rates of the candidate prediction models;
In the embodiment of the invention, the quartile values of the misjudgment rates of the candidate data prediction models are shown in Table 4;
Position          Minimum  Lower quartile  Median  Mean    Upper quartile  Maximum
Misjudgment rate  23.8%    30.7%           32.10%  31.37%  33.45%          35.40%
TABLE 4
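The values in Table 4 follow from the seven misjudgment rates in Table 3. The sketch below recomputes them with the Python standard library; the choice of the "inclusive" quantile method is an assumption that happens to reproduce the table.

```python
from statistics import mean, median, quantiles

# Misjudgment rates (%) of the seven candidate models in Table 3.
rates = [33.2, 33.7, 35.4, 32.1, 29.9, 31.5, 23.8]

# method="inclusive" interpolates between the sample minimum and maximum.
lower_q, _, upper_q = quantiles(rates, n=4, method="inclusive")
summary = {
    "min": min(rates),
    "lower quartile": round(lower_q, 2),
    "median": median(rates),
    "mean": round(mean(rates), 2),
    "upper quartile": round(upper_q, 2),
    "max": max(rates),
}
```

The summary gives 23.8, 30.7, 32.1, 31.37, 33.45 and 35.4, matching Table 4.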
The quartile box plot of the misjudgment rate of each candidate data prediction model is shown in FIG. 3;
The effect associated with each quartile position of the misjudgment rate is shown in Table 5:
Position          Minimum           Lower quartile  Median           Mean             Upper quartile   Maximum
Misjudgment rate  23.8%             30.7%           32.10%           31.37%           33.45%           35.40%
Effect            Very significant  Significant     Not significant  Not significant  Not significant  Not significant
TABLE 5
The quartile and box plot methods are both prior art. The quartile box plot shows that the K-nearest neighbor model has the most significant effect, so it can be selected as the excellent data prediction model;
Variable (data prediction model)  Misjudgment rate  Selected as excellent data model
Neural network                    33.2%
Random forest                     33.7%
Support vector machine            35.4%
Decision tree                     32.1%
Logistic regression               29.9%
Ensemble learning                 31.5%
K-nearest neighbor                23.8%             Yes
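The source does not state the exact rule by which the box plot singles out the K-nearest neighbor model. One plausible reading, sketched below as an assumption, is the conventional box plot outlier rule: a model stands out when its misjudgment rate lies below the lower fence, Q1 minus 1.5 times the interquartile range.

```python
from statistics import quantiles

rates = {
    "neural network": 33.2, "random forest": 33.7, "support vector machine": 35.4,
    "decision tree": 32.1, "logistic regression": 29.9, "ensemble learning": 31.5,
    "k-nearest neighbor": 23.8,
}

q1, _, q3 = quantiles(rates.values(), n=4, method="inclusive")
lower_fence = q1 - 1.5 * (q3 - q1)  # conventional box plot whisker limit

# Models whose misjudgment rate falls below the fence stand out as unusually good.
excellent = [name for name, r in rates.items() if r < lower_fence]
```

Under this assumed rule the fence is about 26.6%, and the K-nearest neighbor model is the only candidate below it, which agrees with the selection shown in the table above.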
S106, screening the user features in the second user behavior feature set according to the selected excellent data prediction model, and selecting the important user features;
Specifically:
Step C1: establishing a user feature loop model, iterating in a loop, calculating the misjudgment rate after each user feature is removed, and storing the results as a second misjudgment rate matrix;
Specifically: based on the selected excellent data prediction model, assume that one user feature is removed and determine whether the misjudgment rate rises or falls. If the misjudgment rate rises after the feature is removed, the feature is judged to have a significant positive influence on the predicted behavior result; if it falls, the feature is judged to have a significant negative influence; if it changes little, the feature is judged to have no significant influence. The process is repeated for each user feature in turn.
The higher the misjudgment rate after a user feature is removed, the more significant the influence of that user variable; the misjudgment rate is calculated in the same way as above.
FIG. 4 plots the misjudgment rate after each user feature is removed, based on the results of the loop iteration; the horizontal axis is the removed variable (user feature), and the vertical axis is the misjudgment rate after that user feature is assumed to be removed;
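The loop in step C1 is a leave-one-feature-out comparison against a baseline. The sketch below shows the control flow only: `evaluate` stands for whatever trains and tests the chosen model on a feature subset, the `tolerance` of 0.01 for "changes little" is an assumption, and the rates in `toy_rates` are made-up stand-ins rather than results from the source.

```python
def leave_one_out_effects(features, evaluate, tolerance=0.01):
    """Classify each feature by how the misjudgment rate moves when it is removed.

    `evaluate` is any callable that trains and tests the chosen model on the
    given feature subset and returns its misjudgment rate (0..1).
    """
    baseline = evaluate(features)
    effects = {}
    for removed in features:
        rate = evaluate([f for f in features if f != removed])
        if rate - baseline > tolerance:
            effects[removed] = "positive"  # removing it raises the error
        elif baseline - rate > tolerance:
            effects[removed] = "negative"  # removing it lowers the error
        else:
            effects[removed] = "not significant"
    return effects


# Toy stand-in for a real model: made-up rates keyed by the remaining features.
toy_rates = {
    ("age", "income", "address"): 0.25,
    ("income", "address"): 0.31,
    ("age", "address"): 0.25,
    ("age", "income"): 0.22,
}
effects = leave_one_out_effects(["age", "income", "address"], lambda fs: toy_rates[tuple(fs)])
```

In this toy run, removing age raises the misjudgment rate (positive influence), removing address lowers it (negative influence), and removing income leaves it unchanged.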
Step C2: selecting important user features;
The important user features under the excellent data prediction model are selected with the box plot and quartiles in the same way as the excellent data prediction model was selected, so the details are not repeated here.
In the embodiment of the invention, the quartiles of the misjudgment rate after user features are removed, and the corresponding effects, are shown in Table 6; the quartile box plot for removed user features is not shown;
Position          Minimum               Lower quartile   Median           Mean             Upper quartile        Maximum
Misjudgment rate  22.86%                25%              25%              26.51%           30%                   30.71%
Effect            Significant negative  Not significant  Not significant  Not significant  Significant positive  Very significant positive
TABLE 6
The significant variables identified in this way include age and liability rate.
[Two images in the source (Figure BDA0001922473860000091 and Figure BDA0001922473860000101) are not reproduced here.]
S107, filtering the characteristic factors of the important user characteristics to obtain important characteristic factors;
the method specifically comprises the following steps:
step D1: reducing the dimension of the characteristic factors of the important user characteristics;
the dimension reduction method comprises the following steps (d101-d 103):
d101, performing discretization processing on the feature factors in the important user features;
the discretization processing method is the prior art;
d102, converting the discretized feature factors into simulated user features;
the discretized feature factors are treated as simulated user features: the values of each feature factor are divided into intervals and classified, and the category names (feature factor variables) are set as simulated user feature variables;
in the embodiment of the invention, the data set is named data, and the behavior to be predicted is whether a credit card default occurs; the simulated user feature is the liability rate, whose values are divided by numerical interval into categories 1, 2, 3, 4, 5 and 6; the categories represent the user's consumption liability level and are named super, senior, medium, mediocre, low and extra-low, respectively; they are set as simulated user feature variables (feature factor variables) with the program function factor:
data$debt <- factor(data$debt, levels = c(1, 2, 3, 4, 5, 6), labels = c("super", "senior", "medium", "mediocre", "low", "extra-low"))
d103, deleting the feature factors (simulated user features) that are irrelevant to behavior prediction by using a regression dimensionality reduction method (linear, nonlinear or logistic). The regression dimensionality reduction method is prior art;
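Step d102 can be illustrated with a short R sketch. The sample liability-rate values and the interval boundaries are assumptions made only for illustration; the source does not give them. `cut` performs the interval division, and `factor` assigns the category names as in the call above, with 1 taken as the highest liability level.

```r
# Hypothetical liability-rate values in [0, 1]; not taken from the source.
debt_rate <- c(0.95, 0.72, 0.55, 0.38, 0.21, 0.04)

# Assumed interval boundaries (illustrative only).
breaks <- c(0, 0.1, 0.3, 0.5, 0.65, 0.85, 1)

# cut() numbers the intervals 1..6 from low to high; reverse the numbering so
# that category 1 is the highest liability level.
level_code <- 7 - as.integer(cut(debt_rate, breaks = breaks, include.lowest = TRUE))

data <- data.frame(debt = level_code)
data$debt <- factor(data$debt, levels = c(1, 2, 3, 4, 5, 6),
                    labels = c("super", "senior", "medium", "mediocre", "low", "extra-low"))
print(data$debt)
```

Each category name then behaves as one level of the simulated user feature, which is what the later loop over feature factors iterates on.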
step D2: performing cyclic iteration on the feature factors after dimensionality reduction by using a feature factor cyclic iteration method, calculating the misjudgment rate after eliminating the feature factor combination, and storing the misjudgment rate as a third misjudgment rate matrix;
the method specifically comprises the following steps (d201-d202):
d201, vectorizing the characteristic factors;
in the credit card default prediction model of the embodiment of the invention, it is assumed that, after dimension reduction of the feature factors, 2 important user features have a relatively significant influence on the behavior prediction result: the liability rate (extremely high, high, medium, lower-middle, low and extremely low) and the age (teenager, young, older youth, middle-aged and elderly). The feature factors of these important user features are iterated over to find the important feature factor combinations.
Vectorizing the characteristic factors:
debt=c("super","senior","medium","mediocre","low","extra-low")
Age=c("children","young","single youth","midlife","old")
d202, establishing a characteristic factor cycle model and iterating, calculating the misjudgment rate after eliminating the characteristic factor combination, and storing the misjudgment rate as a third misjudgment rate matrix;
the method specifically comprises the following steps: for each characteristic factor combination of the important user characteristics, assume that the combination is removed and determine whether the misjudgment rate increases or decreases: if the misjudgment rate increases after the combination is removed, the combination is judged to have a relatively significant positive influence on the predicted behavior result; if the misjudgment rate decreases, the combination is judged to have a relatively significant negative influence on the behavior result; if the misjudgment rate changes little, the combination is judged to have no significant influence on the behavior result; the characteristic factors of the important user characteristics are looped over repeatedly, repeating this process.
The higher the misjudgment rate after the characteristic factor combination is removed, the more obvious the influence of the corresponding characteristic factor combination is. The method of calculating the misjudgment rate is the same as that described above.
In the embodiment of the invention, for a given age group, removing each liability-rate level in turn (extremely high, high, medium, lower-middle, low and extremely low) yields 6 misjudgment rate values; for a given liability-rate level, for example extremely high, removing each age group in turn (teenager, young, older youth, middle-aged and elderly) yields 5 misjudgment rate values; iterating in this way yields 6 × 5 = 30 values; the embodiment of the invention thus uses a two-level nested loop over age and liability rate;
the 30 misjudgment rate values can be arranged into a 6 × 5 matrix, which is created and filled with the misjudgment rate values as follows:
debt=c("super","senior","medium","mediocre","low","extra-low")
Age=c("children","young","single youth","midlife","old")
the misjudgment rates (in %) after removing each characteristic factor combination of the important user characteristics are shown in Table 7:
Liability rate | Teenager | Young | Older youth | Middle-aged | Elderly
Extremely high | 29.28 | 24.28 | 23.57 | 24.28 | 25.71
High | 26.42 | 25.71 | 27.14 | 26.42 | 26.42
Medium | 25.71 | 24.28 | 25.71 | 25 | 24.28
Lower-middle | 26.42 | 23.57 | 24.28 | 24.28 | 22.14
Low | 23.57 | 24.28 | 25 | 25 | 25
Extremely low | 24.28 | 24.28 | 24.28 | 24.28 | 24.28
TABLE 7
FIG. 5 is a line graph of the age-liability rate misjudgment rates plotted from the above table; the horizontal axis is the removed characteristic factor combination, and the vertical axis is the misjudgment rate after removal (i.e., the misjudgment rate assuming that combination is removed);
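As a minimal sketch, the third misjudgment rate matrix can be held in R as a 6 × 5 matrix indexed by the two factor vectors. The values are those of Table 7; the variable names `rates`, `third_matrix`, `idx` and `top` are illustrative and do not come from the source.

```r
debt <- c("super", "senior", "medium", "mediocre", "low", "extra-low")
Age  <- c("children", "young", "single youth", "midlife", "old")

# Misjudgment rates (%) from Table 7, entered row by row (one row per debt level).
rates <- c(29.28, 24.28, 23.57, 24.28, 25.71,
           26.42, 25.71, 27.14, 26.42, 26.42,
           25.71, 24.28, 25.71, 25.00, 24.28,
           26.42, 23.57, 24.28, 24.28, 22.14,
           23.57, 24.28, 25.00, 25.00, 25.00,
           24.28, 24.28, 24.28, 24.28, 24.28)

third_matrix <- matrix(rates, nrow = length(debt), ncol = length(Age),
                       byrow = TRUE, dimnames = list(debt, Age))

# Locate the combination with the highest misjudgment rate after removal.
idx <- which(third_matrix == max(third_matrix), arr.ind = TRUE)
top <- c(debt = debt[idx[1, 1]], Age = Age[idx[1, 2]])

print(third_matrix)
print(top)
```

Naming the rows and columns lets any cell be read directly, for example `third_matrix["super", "children"]`.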
step D3: selecting an important characteristic factor combination;
important characteristic factor combinations of the important user characteristics are selected using a box plot and quartiles, in the same way as the important user characteristics are selected; the details are not repeated herein.
In the embodiment of the invention, the quartiles of the misjudgment rate after each characteristic factor combination is removed, together with the corresponding influence effects, are shown in Table 8; the quartile box plot for the removed characteristic factor combinations is not shown;
Position | Minimum | Lower quartile | Median | Mean | Upper quartile | Maximum
Value | 22.14% | 24.29% | 24.29% | 24.98% | 25.71% | 29.28%
Influence effect | Significant negative effect | Negative effect | Not significant | Not significant | Positive effect | Significant positive effect
TABLE 8
According to the quartile-box diagram with the characteristic factor combination removed and the third misjudgment rate matrix, acquiring the characteristic factor combination with remarkable influence effect;
in the embodiment of the invention, the characteristic factor combination (extremely high liability rate, teenager) is found to be relatively significant: the misjudgment rate after removing it is 29.28%, so it has a positive influence on the predicted credit card default, which indicates that teenagers with an extremely high liability rate have a high probability of credit card default; corresponding strategies can be formulated for card approval and credit granting for this group.
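The quartile screening can be sketched in R as follows. The 30 values are those of Table 7. Flagging values beyond 1.5 times the interquartile range is the common box-plot convention and is an assumption here, since the source does not state its cut-off. Because the Table 7 entries are rounded, the computed statistics may differ from Table 8 in the last digit.

```r
# Misjudgment rates (%) after removing each factor combination (Table 7).
rates <- c(29.28, 24.28, 23.57, 24.28, 25.71,
           26.42, 25.71, 27.14, 26.42, 26.42,
           25.71, 24.28, 25.71, 25.00, 24.28,
           26.42, 23.57, 24.28, 24.28, 22.14,
           23.57, 24.28, 25.00, 25.00, 25.00,
           24.28, 24.28, 24.28, 24.28, 24.28)

# Quartiles, mean and extremes, i.e. the quantities listed in Table 8.
q <- quantile(rates, probs = c(0.25, 0.5, 0.75), names = FALSE)
stats <- c(min = min(rates), lower = q[1], median = q[2],
           mean = mean(rates), upper = q[3], max = max(rates))
print(round(stats, 2))

# Assumed rule: values above the upper quartile by more than 1.5 * IQR are
# treated as combinations with a significant positive influence.
iqr <- q[3] - q[1]
high_outliers <- rates[rates > q[3] + 1.5 * iqr]
print(high_outliers)
```

Under this assumed rule only the 29.28% value is flagged, which corresponds to the (extremely high liability rate, teenager) combination discussed above.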
S108, constructing a feature-factor two-dimensional matrix library of the first scene according to the important user features and the important feature factor combination;
in the embodiment of the invention, the important user characteristics and the important characteristic factors are liability rate (extremely high) and age (teenager).
Liability rate | Age | Default
1 | 1 | 1
Liability rate values 1, 2, 3, 4, 5 and 6 denote extremely high, high, medium, lower-middle, low and extremely low, respectively; age values 1, 2, 3, 4 and 5 denote teenager, young, older youth, middle-aged and elderly, respectively; default values 1 and 0 denote yes and no, respectively;
FIG. 6 is a schematic diagram of important user characteristics and important characteristic factor storage in a credit card consumption scenario; black dots indicate that the corresponding characteristic factors under the corresponding user characteristics have stored data;
S2, correlating the created feature-factor two-dimensional matrix library under different scenes according to the same or similar behavior prediction features to construct a scene-behavior two-dimensional matrix;
FIG. 7 is a schematic diagram of storing scene and user behavior prediction feature data of different scenes; the black dots indicate that the behavior prediction characteristics under the corresponding scene have stored data; the scene comprises shopping malls, automobiles, credit cards, real estate, recruitment, training and traveling; the same or similar behavioral prediction characteristics include purchase, membership, bill overdue;
S3, searching the same or similar behavior prediction characteristics under the related scene according to the scene-behavior two-dimensional matrix, searching the related characteristic-factor two-dimensional matrix according to the same or similar behavior prediction characteristics, and acquiring the important user characteristics and the important characteristic factors.
For example, suppose the people who purchase real estate and their characteristics are to be identified (i.e., the user characteristics and characteristic factors corresponding to the purchase behavior prediction characteristic in the real estate scene are sought), and real estate purchases are known to be closely related to automobile purchases (the same behavior prediction characteristic); first, the purchase behavior prediction characteristic in the automobile scene is found through the scene-behavior two-dimensional matrix; then the characteristic-factor two-dimensional matrix library of the automobile scene corresponding to that characteristic is located, and the related user characteristics and characteristic factors are retrieved from it, thereby obtaining the important user characteristics and important characteristic factors that significantly influence real estate purchasing behavior; from the storage diagram of scene and user behavior prediction characteristic data in FIG. 7, it can be seen quickly and intuitively that the purchase behavior prediction characteristic in the automobile scene has stored data.
As another example, suppose the people who purchase training courses and their characteristics are to be identified (i.e., the user characteristics and characteristic factors corresponding to the purchase behavior prediction characteristic in the training scene are sought), and purchasing training courses is known to be related to membership in the recruitment scene (a similar behavior prediction characteristic); first, the membership behavior prediction characteristic in the recruitment scene is found; then the characteristic-factor two-dimensional matrix library of the recruitment scene corresponding to that characteristic is located, and the related user characteristics and characteristic factors are retrieved from it, thereby obtaining the important user characteristics and important characteristic factors that significantly influence the purchase of training courses; from the storage diagram of scene and user behavior prediction characteristic data in FIG. 7, it can be seen quickly and intuitively that the membership behavior prediction characteristic in the recruitment scene has stored data.
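The two-step lookup in these examples can be sketched in R. The scene and behavior names follow FIG. 7 as described in the text; only the two cells that the examples mention are marked as having stored data. The library contents (`feature_A`, `factor_A1` and so on), the key format and the helper `query_library` are hypothetical placeholders, not part of the source.

```r
scenes    <- c("shopping mall", "automobile", "credit card", "real estate",
               "recruitment", "training", "travel")
behaviors <- c("purchase", "membership", "bill overdue")

# Scene-behavior matrix: TRUE means a feature-factor library is stored.
# Only the two cells mentioned in the examples are set in this sketch.
scene_behavior <- matrix(FALSE, nrow = length(scenes), ncol = length(behaviors),
                         dimnames = list(scenes, behaviors))
scene_behavior["automobile", "purchase"]    <- TRUE
scene_behavior["recruitment", "membership"] <- TRUE

# Feature-factor libraries keyed by "scene/behavior"; contents are placeholders.
feature_factor_lib <- list(
  "automobile/purchase"    = data.frame(feature = "feature_A", factor = "factor_A1"),
  "recruitment/membership" = data.frame(feature = "feature_B", factor = "factor_B1")
)

# Return the stored library for a related scene and behavior, or NULL if none.
query_library <- function(scene, behavior) {
  if (!scene_behavior[scene, behavior]) return(NULL)
  feature_factor_lib[[paste(scene, behavior, sep = "/")]]
}

print(query_library("automobile", "purchase"))  # reused for real estate purchases
print(query_library("travel", "purchase"))      # nothing stored in this sketch
```

Checking the matrix first avoids searching libraries for scenes that have no stored data, which is the saving the method aims at.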
Example two
A method for extracting user features and feature factors, which is the same as steps S101 to S108 in the first embodiment, and is not described herein again.
Example three
FIG. 8 is a diagram of a system for extracting and querying user features and feature factors according to a preferred embodiment of the present invention;
the invention discloses a system for extracting and inquiring user characteristics and characteristic factors in a preferred embodiment; the system comprises:
the characteristic-factor two-dimensional matrix base creating device is used for creating the characteristic-factor two-dimensional matrix base of the important user characteristics and the important factors of a plurality of scenes;
the scene-behavior two-dimensional matrix creating device is used for correlating the created feature-factor two-dimensional matrix library under different scenes according to the same or similar behavior prediction features to construct a scene-behavior two-dimensional matrix;
and the important user characteristic and important characteristic factor inquiry device is used for searching the same or similar behavior prediction characteristics under the associated scene according to the scene-behavior two-dimensional matrix, searching the associated characteristic-factor two-dimensional matrix according to the same or similar behavior prediction characteristics, and acquiring the important user characteristics and the important characteristic factors.
Further, fig. 9 is a structural diagram of the feature-factor two-dimensional matrix library creating apparatus in fig. 8;
the feature-factor two-dimensional matrix library creating device includes:
the user behavior data set extraction module of the first scene is used for extracting a user behavior data set 1 from a user behavior statistical database of the first scene; the user behavior data set comprises at least one user characteristic and a behavior prediction characteristic; the user characteristics include at least one characteristic factor; the behavior prediction characteristics are generated by a data prediction model that takes the user characteristics as input variables; with the user characteristics as the input variable x and the behavior prediction characteristics as the output variable y, y = model(x); the data prediction model comprises one or more of a neural network, a random forest, a support vector machine, a decision tree, logistic regression, ensemble learning, a K-nearest-neighbor model, Bayes and linear discriminant analysis.
The data preprocessing module is used for preprocessing the user behavior data set; the preprocessing comprises missing value processing, abnormal data processing and data redundancy processing, all of which are prior art.
The normalization and discretization processing module is used for performing normalization and discretization processing on the preprocessed user behavior data set to obtain a first user behavior feature set; the normalization and discretization processing methods are all known methods in the technical field;
the user characteristic dimension reduction processing module is used for carrying out user characteristic dimension reduction processing on the first user behavior characteristic set to obtain a second user behavior characteristic set after dimension reduction; the dimension reduction processing method comprises the following steps: the multiple collinearity dimension reduction method and the regression dimension reduction method are all known methods in the technical field;
the excellent data prediction model acquisition device is used for extracting a training set and a test set from the second user behavior feature set, establishing a candidate data prediction model according to the training set and the test set, evaluating the candidate data prediction model and acquiring an excellent data prediction model; the ratio of the number of samples in the training set to the number of samples in the test set is about 7/3; the training set and test set acquisition methods can adopt non-return random sampling, equidistant sampling, layered sampling and classified sampling methods which are all known in the technical field.
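A minimal R sketch of the roughly 7/3 split by random sampling without replacement follows. The data frame `feature_set`, its size and the fixed seed are assumptions for illustration only.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
n <- 100     # hypothetical number of samples in the second feature set
feature_set <- data.frame(id = seq_len(n))

# Simple random sampling without replacement: about 70% for training.
train_idx <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)
train_set <- feature_set[train_idx, , drop = FALSE]
test_set  <- feature_set[-train_idx, , drop = FALSE]

cat("training samples:", nrow(train_set), " test samples:", nrow(test_set), "\n")
```

The other sampling schemes named in the text (equidistant, stratified, classified) would change only how `train_idx` is drawn.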
The important user characteristic obtaining device is used for screening the user characteristics in the second user behavior characteristic set according to the selected excellent data prediction model to select the important user characteristics;
the important characteristic factor acquisition device is used for filtering the characteristic factors of the important user characteristics to acquire important characteristic factors;
the characteristic-factor two-dimensional matrix base creation module is used for constructing a characteristic-factor two-dimensional matrix base of the first scene according to the important user characteristics and the important characteristic factor combination;
further, fig. 10 is a structural view of the excellent data prediction model acquisition apparatus in fig. 9;
the excellent data prediction model acquisition device comprises a candidate data prediction model construction device, a candidate data prediction model evaluation device and an excellent data prediction model extraction device,
the candidate data prediction model construction device is used for constructing a candidate data prediction model according to the user characteristic variables (input variables) and the behavior prediction characteristic variables (output variables) in the training set; the candidate data prediction model comprises one or more of a neural network, a random forest, a support vector machine, a decision tree, logistic regression, ensemble learning, a K neighbor model, Bayes and linear discrimination;
the candidate data prediction model evaluation device is used for evaluating the candidate data prediction model;
excellent data prediction model extraction means for selecting an excellent data prediction model; the method specifically comprises the following steps: screening excellent data prediction models by adopting a quartile and box plot method for the misjudgment rate of each candidate prediction model;
further, fig. 11 is a structural diagram of the important user characteristic acquiring apparatus in fig. 9;
the important user characteristic acquisition device comprises a second misjudgment rate matrix creation module and an important user characteristic acquisition module;
the second misjudgment rate matrix creating module is used for creating a user characteristic cycle model, performing cycle iteration, calculating the misjudgment rate after the user characteristics are removed, and storing the misjudgment rate as a second misjudgment rate matrix;
the method specifically comprises the following steps: determining whether the misjudgment rate is increasing or decreasing based on the selected excellent data prediction model by assuming that any one of the user features is eliminated: if the misjudgment rate is increased after the user characteristic is removed, judging that the positive influence of the user characteristic on the predicted behavior result is more obvious; if the error judgment rate is reduced after the user characteristic is removed, judging that the negative influence of the user characteristic on the predicted behavior result is more obvious; if the error judgment rate is not changed greatly after the user characteristic is removed, judging that the user characteristic has no significant influence on the predicted behavior result; the above process is cyclically repeated. The higher the misjudgment rate after the user features are removed, the more significant the influence of the corresponding user variables is, and the calculation method of the misjudgment rate is the same as that in the previous step.
The important user characteristic acquisition module is used for selecting important user characteristics;
the method specifically comprises the following steps: the method for selecting the important user features under the excellent data prediction model by using the box line graph and the quartile is the same as that for selecting the excellent data prediction model, and is not repeated herein.
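The user characteristic loop described for the second misjudgment rate matrix can be sketched in R. The data are synthetic, and logistic regression (`glm`) stands in for whichever excellent data prediction model was selected; the variable names, coefficients and the 0.5 threshold are all assumptions for illustration.

```r
set.seed(42)
n <- 400
# Synthetic data: 'age' and 'debt' drive the outcome, 'noise' does not.
d <- data.frame(age = rnorm(n), debt = rnorm(n), noise = rnorm(n))
d$default <- rbinom(n, 1, plogis(1.5 * d$debt - 1.0 * d$age))

train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit a logistic regression on the given features and return the
# misjudgment rate (%) on the test set.
misjudgment_rate <- function(features) {
  fit  <- glm(reformulate(features, response = "default"),
              data = train, family = binomial)
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred != test$default) * 100
}

features <- c("age", "debt", "noise")
baseline <- misjudgment_rate(features)

# Remove one feature at a time and record the resulting misjudgment rate.
second_matrix <- sapply(features, function(f) misjudgment_rate(setdiff(features, f)))

print(baseline)
print(second_matrix)
```

Comparing each entry of `second_matrix` with `baseline` gives the increase or decrease that the text uses to judge positive, negative or insignificant influence.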
Further, fig. 12 is a structural diagram of the important characteristic factor acquiring apparatus in fig. 9;
the important characteristic factor acquisition device comprises a characteristic factor dimension reduction device of the important user characteristic, a third misjudgment rate matrix creation device and an important characteristic factor acquisition device;
the characteristic factor dimension reduction device of the important user characteristic is used for reducing the dimension of the characteristic factor of the important user characteristic;
the third misjudgment rate matrix creating device is used for performing loop iteration on the feature factors after dimension reduction by using a feature factor loop iteration method, calculating the misjudgment rate after each feature factor combination is removed, and storing the misjudgment rates as a third misjudgment rate matrix;
the important characteristic factor acquisition device is used for selecting an important characteristic factor combination; the method specifically comprises the following steps: the box line graph and the quartile are used for selecting the important feature factor combination in the important user features, and the method is the same as that for selecting the important user features, and is not repeated herein.
Further, fig. 13 is a structural view of a candidate data prediction model evaluation apparatus in fig. 10;
the candidate data prediction model evaluation device comprises a confusion matrix creation module and a first misjudgment rate matrix creation module;
a confusion matrix creating module, configured to substitute the user characteristic variables in the test set into the candidate data prediction model, and calculate a behavior prediction characteristic value (referred to as a first behavior prediction characteristic value); then comparing the first behavior prediction characteristic value with the original behavior prediction characteristic value of the test set, and establishing a confusion matrix according to the compared prediction error; wherein each column of the confusion matrix represents a prediction category, and the total of each column is the number of data items predicted as that category; each row represents the true category of the data, and the total of each row is the number of data instances of that category. Each value in a column is the number of data items of the corresponding true category that were predicted as that column's category;
the first misjudgment rate matrix creating module is used for calculating the misjudgment rate of the candidate prediction model and storing the misjudgment rate as a first misjudgment rate matrix; misjudgment rate = (number of incorrectly predicted data items / total number of samples) × 100%; wherein the misjudgment rate of the candidate prediction model is less than or equal to a first threshold value, the first threshold value is set by a user and generally does not exceed 50%; the smaller the misjudgment rate of the candidate prediction model, the better the data prediction model performs;
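The confusion matrix and misjudgment rate can be illustrated with a small R sketch. The label vectors are hypothetical and serve only to show the calculation; `table` builds the matrix with true categories as rows and predicted categories as columns, matching the layout described above.

```r
# Hypothetical true and predicted labels for a test set (1 = default, 0 = no default).
actual    <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0)

# Rows are the true categories, columns are the predicted categories.
confusion <- table(actual = factor(actual, levels = c(0, 1)),
                   predicted = factor(predicted, levels = c(0, 1)))

# Misjudgment rate = (number of prediction errors / total samples) * 100%.
errors <- sum(confusion) - sum(diag(confusion))
misjudgment_rate <- errors / sum(confusion) * 100

print(confusion)
cat("misjudgment rate:", misjudgment_rate, "%\n")
```

The off-diagonal cells are the prediction errors, so the rate here is 2 errors out of 10 samples.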
further, FIG. 14 is a diagram of a structure of the feature factor dimension reduction apparatus for the important user features in FIG. 12;
the feature factor dimension reduction device for the important user features comprises a feature factor discretization processing module for the important user features, a feature factor conversion module and a feature factor dimension reduction processing module for the important user features;
the characteristic factor discretization processing module of the important user characteristic is used for discretizing the characteristic factors in the important user characteristic; the discretization processing method is the prior art;
the characteristic factor conversion module is used for converting the characteristic factors after the discretization treatment into the characteristics of the simulation user; setting the discretized characteristic factors as simulated user characteristics, carrying out interval division and classification on the values of the characteristic factors, and setting classified names (characteristic factor variables) as simulated user characteristic variables;
and the feature factor dimension reduction processing module of the important user features is used for deleting feature factors (simulated user features) which have no influence on behavior prediction by using a regression dimension reduction method (linear, nonlinear and Logistic). The regression dimensionality reduction method is the prior art;
further, fig. 15 is a structural diagram of the third misjudgment rate matrix creating apparatus in fig. 12.
The third misjudgment rate matrix creating device comprises a characteristic factor vectorization module and a characteristic factor loop iteration module;
the characteristic factor vectorization module is used for vectorizing the characteristic factors;
the characteristic factor loop iteration module is used for establishing a characteristic factor loop model, iterating, calculating the misjudgment rate after eliminating the characteristic factor combination, and storing the misjudgment rate as a third misjudgment rate matrix;
the method specifically comprises the following steps: for each characteristic factor combination of the important user characteristics, assume that the combination is removed and determine whether the misjudgment rate increases or decreases: if the misjudgment rate increases after the combination is removed, the combination is judged to have a relatively significant positive influence on the predicted behavior result; if the misjudgment rate decreases, the combination is judged to have a relatively significant negative influence on the behavior result; if the misjudgment rate changes little, the combination is judged to have no significant influence on the behavior result; the characteristic factors of the important user characteristics are looped over repeatedly, repeating this process. The higher the misjudgment rate after the characteristic factor combination is removed, the more obvious the influence of the corresponding characteristic factor combination is. The method of calculating the misjudgment rate is the same as that described above.
Example four
A device for creating a feature-factor two-dimensional matrix library, whose structure is the same as that of the feature-factor two-dimensional matrix library creating device in the third embodiment; details are not repeated here.
It will be understood by those skilled in the art that all or part of the steps in the method according to the above embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, such as ROM, RAM, magnetic disk, optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (18)

1. A method for extracting and querying user characteristics and characteristic factors is characterized by comprising the following steps:
s1, creating a feature-factor two-dimensional matrix library of important user features and important factors of a plurality of scenes;
s2, correlating the created feature-factor two-dimensional matrix library under different scenes according to the same or similar behavior prediction features to construct a scene-behavior two-dimensional matrix;
s3, searching the same or similar behavior prediction characteristics under the related scene according to the scene-behavior two-dimensional matrix, searching the related characteristic-factor two-dimensional matrix according to the same or similar behavior prediction characteristics, and acquiring important user characteristics and important characteristic factors;
the method for creating the feature-factor two-dimensional matrix library of the important user features and the important factors of one scene comprises the following steps (S101-S108):
s101, extracting a user behavior data set from a user behavior statistical database of a first scene;
s102, preprocessing the user behavior data set;
s103, carrying out normalization and discretization processing on the preprocessed user behavior data set to obtain a first user behavior feature set;
s104, performing user characteristic dimension reduction processing on the first user behavior feature set to obtain a second user behavior feature set after dimension reduction;
s105, extracting a training set and a test set from the second user behavior feature set, establishing a candidate data prediction model according to the training set and the test set, and evaluating to obtain an excellent data prediction model;
s106, screening the user characteristics in the second user behavior characteristic set according to the selected excellent data prediction model, and selecting important user characteristics;
s107, filtering the characteristic factors of the important user characteristics to obtain important characteristic factors;
and S108, constructing a feature-factor two-dimensional matrix library of the first scene according to the important user features and the important feature factor combination.
2. The method for extracting and querying the user characteristics and characteristic factors according to claim 1, wherein the user behavior data set M1 comprises at least one user characteristic and behavior prediction characteristic; the behavior prediction characteristics are generated by a data prediction model that takes the user characteristics as input variables; with the user characteristics as the input variable x and the behavior prediction characteristics as the output variable y, y = model(x);
the user characteristics include at least one characteristic factor.
3. The method for extracting and querying the user characteristics and characteristic factors according to claim 2,
the preprocessing comprises missing value processing, abnormal data processing and data redundancy processing;
the training set and test set acquisition method adopts a non-return random sampling, equidistant sampling, layered sampling and classified sampling method;
the method for performing user characteristic dimension reduction processing on the first user behavior characteristic set comprises the following steps:
step A1, finding out the highly associated user characteristics by using a multiple collinearity dimension reduction method, retaining one of them and deleting the others;
and step A2, gradually optimizing by using a regression dimensionality reduction method, and deleting user characteristics irrelevant to behavior prediction characteristics.
4. The method of claim 1, wherein the data prediction model comprises one or more of neural network, random forest, support vector machine, decision tree, logistic regression, ensemble learning, K-nearest neighbor model, bayes, and linear discriminant.
5. The method for extracting and querying the user features and the feature factors according to claim 4, wherein the step of establishing a candidate data prediction model according to the training set and the test set and evaluating the candidate data prediction model to obtain an excellent data prediction model comprises the following steps:
step B1, constructing a candidate data prediction model according to the user characteristic variables and the behavior prediction characteristic variables in the training set;
step B2, evaluating the candidate data prediction model; the method specifically comprises the following steps:
b201: substituting the user characteristic variables in the test set into the candidate data prediction model, and calculating a behavior prediction characteristic value, namely a first behavior prediction characteristic value; comparing the first behavior prediction characteristic value with the original behavior prediction characteristic value of the test set, and establishing a confusion matrix according to the compared prediction errors;
each column of the confusion matrix represents a prediction category, and the total of each column is the number of data items predicted as that category; each row represents the true category of the data, and the total of each row is the number of data instances of that category; each value in a column is the number of data items of the corresponding true category that were predicted as that column's category;
b202: calculating the misjudgment rate of the candidate prediction model, and storing the misjudgment rate as a first misjudgment rate matrix;
misjudgment rate = (number of incorrectly predicted data items / total number of samples) × 100%;
step B3: selecting an excellent data prediction model;
the method specifically comprises the following steps: screening for excellent data prediction models by applying a quartile and box plot method to the misjudgment rate of each candidate prediction model.
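The confusion matrix and misjudgment rate described in claim 5 can be illustrated with a minimal sketch. The class labels and the sample predictions are invented for the example; rows hold the true category and columns hold the predicted category, as in the claim.

```python
def confusion_matrix(actual, predicted, labels):
    """Build a matrix whose rows are true classes and columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def misjudgment_rate(matrix):
    """(number of incorrectly predicted items / total samples) * 100."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * (total - correct) / total

# Made-up test-set labels and model outputs.
actual    = ["buy", "buy", "skip", "skip", "skip", "buy", "skip", "buy"]
predicted = ["buy", "skip", "skip", "skip", "buy", "buy", "skip", "buy"]

cm = confusion_matrix(actual, predicted, labels=["buy", "skip"])
rate = misjudgment_rate(cm)
print(cm, rate)
```

Here two of the eight samples fall off the diagonal, so the misjudgment rate is 25%.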
6. The method of claim 5, wherein the candidate data prediction model comprises one or more of neural network, random forest, support vector machine, decision tree, logistic regression, ensemble learning, K-nearest neighbor model, Bayes, and linear discriminant;
the misjudgment rate of the candidate prediction model is less than or equal to a first threshold value, and the first threshold value is set by a user and is not more than 50%.
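The claims do not spell out how the quartiles and the box plot are applied to the misjudgment rates. The sketch below is one possible reading, stated as an assumption: candidates above the first threshold are discarded, and among the rest only those at or below the first quartile are kept as "excellent". The model names and rates are made up.

```python
from statistics import quantiles

def screen_models(rates, first_threshold=50.0):
    """Return the names of candidate models judged excellent.

    `rates` maps a model name to its misjudgment rate in percent.
    Assumed rule (not taken from the claims): rate <= first_threshold
    and rate <= first quartile of the eligible rates.
    Needs at least two eligible candidates to compute quartiles.
    """
    eligible = {m: r for m, r in rates.items() if r <= first_threshold}
    q1, _median, _q3 = quantiles(eligible.values(), n=4, method="inclusive")
    return sorted(m for m, r in eligible.items() if r <= q1)

# Hypothetical first misjudgment rate matrix, flattened to a mapping.
rates = {
    "logistic_regression": 18.0,
    "decision_tree": 22.0,
    "random_forest": 12.0,
    "knn": 30.0,
    "svm": 55.0,  # exceeds the 50% ceiling on the first threshold
}
selected = screen_models(rates)
print(selected)
```

Other readings are equally plausible, for example keeping every candidate that is not an upper outlier of the box plot; only the filtering rule inside `screen_models` would change.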
7. The method as claimed in claim 6, wherein the user characteristics in the second user behavior characteristic set are screened according to the selected excellent data prediction model to select important user characteristics;
the method specifically comprises the following steps:
step C1: establishing a user characteristic cycle model, performing cycle iteration, calculating the misjudgment rate after the user characteristics are removed, and storing the misjudgment rate as a second misjudgment rate matrix;
the method specifically comprises the following steps: based on the selected excellent data prediction model, assuming in turn that any one of the user features is removed, and determining whether the misjudgment rate increases or decreases: if the misjudgment rate increases after the user feature is removed, judging that the user feature has a relatively significant positive influence on the predicted behavior result; if the misjudgment rate decreases after the user feature is removed, judging that the user feature has a relatively significant negative influence on the predicted behavior result; if the misjudgment rate does not change much after the user feature is removed, judging that the user feature has no significant influence on the predicted behavior result; the above process is repeated in a loop;
step C2: selecting important user characteristics;
the method specifically comprises the following steps: selecting the important user features under the excellent data prediction model by using the box plot and the quartiles.
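Step C1 amounts to removing one user feature at a time and observing how the misjudgment rate moves. The following sketch assumes a nearest-centroid classifier as a stand-in for the excellent data prediction model and a 5-point tolerance for "not changed greatly"; both are illustrative choices, and the feature names and data are invented.

```python
def fit_centroids(rows, labels, cols):
    """Stand-in model: per-class mean over the chosen feature columns."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for k, c in enumerate(cols):
            acc[k] += row[c]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row, cols):
    """Return the class whose centroid is closest to the row."""
    def dist(center):
        return sum((row[c] - center[k]) ** 2 for k, c in enumerate(cols))
    return min(centroids, key=lambda label: dist(centroids[label]))

def misjudgment_rate(train, test, cols):
    """Fit on the training set and return the test misjudgment rate in percent."""
    (x_tr, y_tr), (x_te, y_te) = train, test
    model = fit_centroids(x_tr, y_tr, cols)
    wrong = sum(predict(model, row, cols) != y for row, y in zip(x_te, y_te))
    return 100.0 * wrong / len(y_te)

def feature_effects(train, test, names, tolerance=5.0):
    """Remove each feature in turn and classify its influence."""
    all_cols = list(range(len(names)))
    baseline = misjudgment_rate(train, test, all_cols)
    effects = {}
    for i, name in enumerate(names):
        cols = [c for c in all_cols if c != i]
        delta = misjudgment_rate(train, test, cols) - baseline
        if delta > tolerance:
            effects[name] = "positive"        # error rises without it
        elif delta < -tolerance:
            effects[name] = "negative"        # error falls without it
        else:
            effects[name] = "not significant"
    return baseline, effects

# Made-up data: columns are watch_minutes, login_days, noise.
names = ["watch_minutes", "login_days", "noise"]
train = ([[0, 1, 0], [2, 1, 0], [8, 1, 20], [10, 1, 20]], [0, 0, 1, 1])
test = ([[1, 1, 0], [9, 1, 20], [1, 1, 11], [9, 1, 9], [2, 1, 20]], [0, 1, 0, 1, 0])

baseline, effects = feature_effects(train, test, names)
print(baseline, effects)
```

In this toy data the baseline misjudgment rate is 20%; removing "watch_minutes" raises it to 60%, removing "noise" lowers it to 0%, and removing the constant "login_days" leaves it unchanged. The per-feature deltas play the role of the second misjudgment rate matrix.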
8. The method for extracting and querying user characteristics and characteristic factors according to claim 7,
wherein filtering the feature factors in the important user features to obtain the important feature factors comprises:
d1, reducing the dimension of the characteristic factors of the important user characteristics;
d2, performing circular iteration on the feature factors after dimensionality reduction by using a feature factor circular iteration method, calculating the misjudgment rate after eliminating a feature factor combination, and storing the misjudgment rate as a third misjudgment rate matrix;
d3, selecting an important characteristic factor combination;
namely selecting an important feature factor combination in the important user features by using the box plot and the quartiles.
9. The method for extracting and querying the user features and the feature factors according to claim 8, wherein the method for performing dimension reduction on the feature factors of the important user features comprises the following steps (d101-d103):
d101, performing discretization processing on the feature factors in the important user features;
d102, converting the feature factors after the discretization treatment into the simulation user features;
namely setting the discretized characteristic factors as simulated user characteristics, dividing the values of the characteristic factors into intervals and classifying them, and setting the names of the resulting classes, also called characteristic factor variables, as simulated user characteristic variables;
d103, deleting the simulation user characteristics irrelevant to the behavior prediction by using a regression dimension reduction method;
performing loop iteration on the feature factors after dimensionality reduction by using a feature factor loop iteration method, calculating the misjudgment rate after eliminating a feature factor combination, and storing the misjudgment rate as a third misjudgment rate matrix specifically comprises (d201-d202):
d201, vectorizing the characteristic factors;
d202, establishing a characteristic factor cycle model and iterating, calculating the misjudgment rate after eliminating the characteristic factor combination, and storing the misjudgment rate as a third misjudgment rate matrix;
the method specifically comprises the following steps: assuming that any characteristic factor combination in the important characteristics is removed, and judging whether the misjudgment rate increases or decreases: if the misjudgment rate increases after the characteristic factor combination is removed, judging that the characteristic factor combination has a relatively significant positive influence on the predicted behavior result; if the misjudgment rate decreases after the characteristic factor combination is removed, judging that the characteristic factor combination has a relatively significant negative influence on the behavior result; and if the misjudgment rate does not change much after the characteristic factor combination is removed, judging that the characteristic factor combination has no significant influence on the behavior result; and cycling through the characteristic factors of the important user characteristics multiple times, repeating the above process.
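Steps d101 and d102 can be pictured with a short sketch: the values of one user feature are divided into intervals, each interval gets a class name (a feature factor variable), and each name becomes a 0/1 "simulated user feature". The interval edges, the names and the sample ages are assumptions made for illustration.

```python
from bisect import bisect_right

def discretize(values, edges, names):
    """Assign each value the name of its interval.

    names[0] covers values below edges[0]; names[i] covers the half-open
    interval [edges[i-1], edges[i]); the last name covers values at or
    above edges[-1].
    """
    if len(names) != len(edges) + 1:
        raise ValueError("need exactly one more name than edges")
    return [names[bisect_right(edges, v)] for v in values]

def to_simulated_features(categories, names):
    """Turn interval names into 0/1 indicator columns (simulated user features)."""
    return {name: [int(c == name) for c in categories] for name in names}

# Hypothetical intervals for an "age" user feature.
edges = [18, 30, 45]
names = ["under_18", "18_29", "30_44", "45_plus"]
ages = [17, 18, 29, 30, 45, 60]

categories = discretize(ages, edges, names)
simulated = to_simulated_features(categories, names)
print(categories)
print(simulated["18_29"])
```

Step d103 would then treat each indicator column like an ordinary user feature and delete the ones that regression shows to be irrelevant to the behavior prediction.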
10. A method for extracting user features and feature factors, which is the same as the method for creating the feature-factor two-dimensional matrix library of important user features and important factors of one scene in claim 1.
11. A system for extracting and querying user characteristics and characteristic factors is characterized by comprising:
the characteristic-factor two-dimensional matrix base creating device is used for creating the characteristic-factor two-dimensional matrix base of the important user characteristics and the important factors of a plurality of scenes;
the scene-behavior two-dimensional matrix creating device is used for correlating the created feature-factor two-dimensional matrix library under different scenes according to the same or similar behavior prediction features to construct a scene-behavior two-dimensional matrix;
the important user characteristic and important characteristic factor inquiry device is used for searching the same or similar behavior prediction characteristics under the associated scene according to the scene-behavior two-dimensional matrix, searching the associated characteristic-factor two-dimensional matrix according to the same or similar behavior prediction characteristics, and acquiring the important user characteristics and the important characteristic factors;
wherein the feature-factor two-dimensional matrix library creating device includes:
the user behavior data set extraction module of the first scene is used for extracting a user behavior data set M1 from a user behavior statistical database of the first scene; the user behavior data set M1 comprises at least one user characteristic and behavior prediction characteristic; each user characteristic includes at least one characteristic factor; the behavior prediction characteristic is generated by a data prediction model that takes the user characteristics as input variables; with the user characteristics set as an input variable x and the behavior prediction characteristic set as an output variable y, y = model(x); the data prediction model comprises one or more of a neural network, a random forest, a support vector machine, a decision tree, logistic regression, ensemble learning, a K-nearest neighbor model, Bayes, and linear discriminant;
the data preprocessing module is used for preprocessing the user behavior data set; the preprocessing comprises missing value processing, abnormal data processing and data redundancy processing;
the normalization and discretization processing module is used for performing normalization and discretization processing on the preprocessed user behavior data set to obtain a first user behavior feature set;
the user characteristic dimension reduction processing module is used for carrying out user characteristic dimension reduction processing on the first user behavior characteristic set to obtain a second user behavior characteristic set after dimension reduction; the dimension reduction processing method comprises a multicollinearity dimension reduction method and a regression dimension reduction method;
the excellent data prediction model acquisition device is used for extracting a training set and a test set from the second user behavior feature set, establishing a candidate data prediction model according to the training set and the test set, evaluating the candidate data prediction model and acquiring an excellent data prediction model; the training set and the test set are obtained by random sampling without replacement, equidistant sampling, stratified sampling and classified sampling;
the important user characteristic obtaining device is used for screening the user characteristics in the second user behavior characteristic set according to the selected excellent data prediction model to select the important user characteristics;
the important characteristic factor acquisition device is used for filtering the characteristic factors of the important user characteristics to acquire important characteristic factors;
and the characteristic-factor two-dimensional matrix base creation module is used for constructing a characteristic-factor two-dimensional matrix base of the first scene according to the important user characteristics and the important characteristic factor combination.
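The relationship between the per-scene feature-factor libraries and the scene-behavior matrix can be sketched with plain dictionaries. This is only an assumed data layout: the scene names, behavior names, features and factors are invented, and the "similar" behavior matching mentioned in the claim is reduced to exact matching.

```python
# Per-scene library: important user feature -> important feature factors.
feature_factor_library = {
    "video_app": {"age": ["18_29"], "watch_minutes": ["over_60"]},
    "music_app": {"age": ["18_29", "30_44"], "login_days": ["over_20"]},
    "news_app": {"region": ["tier_1_city"]},
}

# Scene-behavior matrix: which behavior prediction features each scene has.
scene_behavior = {
    "video_app": {"will_subscribe"},
    "music_app": {"will_subscribe", "will_churn"},
    "news_app": {"will_share"},
}

def query_important_features(behavior, scene_behavior, library):
    """Look up every scene that predicts `behavior` and return its library entry.

    Only exact matches are handled; a similarity measure between behavior
    prediction features would be needed for the "similar" case.
    """
    return {
        scene: library[scene]
        for scene, behaviors in scene_behavior.items()
        if behavior in behaviors and scene in library
    }

result = query_important_features("will_subscribe", scene_behavior, feature_factor_library)
print(result)
```

A query for "will_subscribe" returns the entries for the two scenes that share that behavior prediction feature, which is the narrowed search range the abstract refers to.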
12. The system for user characteristic and characteristic factor extraction and query as claimed in claim 11, wherein the excellent data prediction model obtaining means comprises candidate data prediction model constructing means, candidate data prediction model evaluating means and excellent data prediction model extracting means,
the candidate data prediction model construction device is used for constructing a candidate data prediction model according to the user characteristic variables and the behavior prediction characteristic variables in the training set; the candidate data prediction model comprises one or more of a neural network, a random forest, a support vector machine, a decision tree, logistic regression, ensemble learning, a K-nearest neighbor model, Bayes, and linear discriminant;
the candidate data prediction model evaluation device is used for evaluating the candidate data prediction model;
the excellent data prediction model extraction device is used for selecting an excellent data prediction model; the method specifically comprises the following steps: screening for excellent data prediction models by applying a quartile and box plot method to the misjudgment rate of each candidate prediction model.
13. The system for extracting and querying user characteristics and characteristic factors according to claim 11, wherein the important user characteristic obtaining means includes a second misjudgment rate matrix creating module and an important user characteristic obtaining module;
the second misjudgment rate matrix creating module is used for creating a user characteristic cycle model, performing cycle iteration, calculating the misjudgment rate after the user characteristics are removed, and storing the misjudgment rate as a second misjudgment rate matrix;
the method specifically comprises the following steps: based on the selected excellent data prediction model, assuming in turn that any one of the user features is removed, and determining whether the misjudgment rate increases or decreases: if the misjudgment rate increases after the user feature is removed, judging that the user feature has a relatively significant positive influence on the predicted behavior result; if the misjudgment rate decreases after the user feature is removed, judging that the user feature has a relatively significant negative influence on the predicted behavior result; if the misjudgment rate does not change much after the user feature is removed, judging that the user feature has no significant influence on the predicted behavior result; the above process is repeated in a loop;
the important user characteristic acquisition module is used for selecting important user characteristics;
the method specifically comprises the following steps: selecting the important user features under the excellent data prediction model by using the box plot and the quartiles, in the same manner as the excellent data prediction model is selected.
14. The system for extracting and querying user characteristics and characteristic factors according to claim 11, wherein the important characteristic factor obtaining means includes a characteristic factor dimension reduction means, a third misjudgment rate matrix creating means, and an important characteristic factor obtaining means for important user characteristics;
the characteristic factor dimension reduction device of the important user characteristic is used for reducing the dimension of the characteristic factor of the important user characteristic;
the third misjudgment rate matrix creating device is used for performing circular iteration on the feature factors after the dimension reduction by using a feature factor circular iteration method, calculating the misjudgment rate after a feature factor combination is eliminated, and storing the misjudgment rate as a third misjudgment rate matrix;
the important characteristic factor acquisition device is used for selecting an important characteristic factor combination; the method specifically comprises the following steps: selecting an important feature factor combination in the important user features by using the box plot and the quartiles.
15. The system for user feature and feature factor extraction and query as claimed in claim 12,
the candidate data prediction model evaluation device comprises a confusion matrix creation module and a first misjudgment rate matrix creation module;
the confusion matrix creating module is used for substituting the user characteristic variables in the test set into the candidate data prediction model, calculating a behavior prediction characteristic value, namely a first behavior prediction characteristic value, comparing the first behavior prediction characteristic value with the original behavior prediction characteristic value of the test set, and creating a confusion matrix according to the compared prediction error; wherein each column of the confusion matrix represents a predicted category, and the total of each column is the number of data items predicted as that category; each row represents the true category of the data, and the total of each row is the number of data instances of that category; the value in each cell is the number of data items of the row's true category that are predicted as the column's category;
the first misjudgment rate matrix creating module is used for calculating the misjudgment rate of the candidate prediction model and storing the misjudgment rate as a first misjudgment rate matrix; the misjudgment rate = (number of incorrectly predicted data items / total number of samples) × 100%; and the misjudgment rate of the candidate prediction model is less than or equal to a first threshold value, and the first threshold value is set by a user.
16. The system for extracting and querying user characteristics and characteristic factors according to claim 14, wherein the characteristic factor dimension reduction means for the important user characteristics includes a characteristic factor discretization processing module for the important user characteristics, a characteristic factor transformation module, and a characteristic factor dimension reduction processing module for the important user characteristics,
the characteristic factor discretization processing module of the important user characteristic is used for discretizing the characteristic factors in the important user characteristic;
the characteristic factor conversion module is used for converting the characteristic factors after the discretization treatment into simulated user characteristics, namely setting the discretized characteristic factors as simulated user characteristics, dividing the values of the characteristic factors into intervals and classifying them, and setting the names of the resulting classes, also called characteristic factor variables, as simulated user characteristic variables;
and the feature factor dimension reduction processing module of the important user features is used for deleting the simulated user features that are irrelevant to the behavior prediction by using a regression dimension reduction method.
17. The system for extracting and querying user characteristics and characteristic factors according to claim 14, wherein the third misjudgment rate matrix creating means includes a characteristic factor vectorization module and a characteristic factor loop iteration module,
the characteristic factor vectorization module is used for vectorizing the characteristic factors;
the characteristic factor loop iteration module is used for establishing a characteristic factor loop model, iterating, calculating the misjudgment rate after eliminating the characteristic factor combination, and storing the misjudgment rate as a third misjudgment rate matrix;
the method specifically comprises the following steps: assuming that any characteristic factor combination in the important characteristics is removed, and judging whether the misjudgment rate increases or decreases: if the misjudgment rate increases after the characteristic factor combination is removed, judging that the characteristic factor combination has a relatively significant positive influence on the predicted behavior result; if the misjudgment rate decreases after the characteristic factor combination is removed, judging that the characteristic factor combination has a relatively significant negative influence on the behavior result; and if the misjudgment rate does not change much after the characteristic factor combination is removed, judging that the characteristic factor combination has no significant influence on the behavior result; and cycling through the characteristic factors of the important user characteristics multiple times, repeating the above process.
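The loop over characteristic factor combinations can be sketched as follows. The sketch assumes the factors have already been vectorized into 0/1 indicator columns and uses a deliberately trivial stand-in model (predict 1 when any retained indicator is 1); the factor names, the data, the maximum combination size and the 5-point tolerance are all hypothetical.

```python
from itertools import combinations

def misjudgment_rate(rows, labels, kept):
    """Stand-in model: predict 1 when any retained factor indicator is 1."""
    wrong = sum(
        int(any(row[f] for f in kept)) != y for row, y in zip(rows, labels)
    )
    return 100.0 * wrong / len(labels)

def combination_effects(rows, labels, factors, max_size=2, tolerance=5.0):
    """Remove every factor combination up to `max_size` and classify its influence."""
    baseline = misjudgment_rate(rows, labels, factors)
    effects = {}
    for size in range(1, max_size + 1):
        for combo in combinations(factors, size):
            kept = [f for f in factors if f not in combo]
            delta = misjudgment_rate(rows, labels, kept) - baseline
            if delta > tolerance:
                effects[combo] = "positive"
            elif delta < -tolerance:
                effects[combo] = "negative"
            else:
                effects[combo] = "not significant"
    return baseline, effects

# Made-up vectorized factors and behavior labels.
factors = ["18_29", "30_44", "over_60_min"]
rows = [
    {"18_29": 1, "30_44": 0, "over_60_min": 1},
    {"18_29": 1, "30_44": 0, "over_60_min": 0},
    {"18_29": 0, "30_44": 1, "over_60_min": 0},
    {"18_29": 0, "30_44": 1, "over_60_min": 1},
    {"18_29": 0, "30_44": 0, "over_60_min": 0},
    {"18_29": 0, "30_44": 1, "over_60_min": 0},
]
labels = [1, 1, 0, 1, 0, 0]

baseline, effects = combination_effects(rows, labels, factors)
for combo, effect in effects.items():
    print(combo, effect)
```

The resulting mapping from combination to effect corresponds to the third misjudgment rate matrix; in a real setting the stand-in rule would be replaced by the selected excellent data prediction model.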
18. A user feature and feature factor extracting device having the same structure as the feature-factor two-dimensional matrix library creating device of claim 11.
CN201811619624.6A 2018-12-26 2018-12-26 User characteristic and characteristic factor extraction and query method and system Active CN109635010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811619624.6A CN109635010B (en) 2018-12-26 2018-12-26 User characteristic and characteristic factor extraction and query method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811619624.6A CN109635010B (en) 2018-12-26 2018-12-26 User characteristic and characteristic factor extraction and query method and system

Publications (2)

Publication Number Publication Date
CN109635010A (en) 2019-04-16
CN109635010B (en) 2021-10-08

Family

ID=66078580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811619624.6A Active CN109635010B (en) 2018-12-26 2018-12-26 User characteristic and characteristic factor extraction and query method and system

Country Status (1)

Country Link
CN (1) CN109635010B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135628A (en) * 2019-04-23 2019-08-16 上海淇玥信息技术有限公司 A kind of monetary device automatic generation method, device, system and recording medium
CN110602569B (en) * 2019-09-18 2021-11-09 深圳市梦网视讯有限公司 Bandwidth multiplexing method and system based on bandwidth trend
CN111104979B (en) * 2019-12-18 2023-08-01 北京思维造物信息科技股份有限公司 Method, device and equipment for generating user behavior value evaluation model
CN113671904B (en) * 2020-05-13 2022-09-06 Tcl科技集团股份有限公司 Machine monitoring method and device, machine, readable storage medium and terminal equipment
CN111950600B (en) * 2020-07-20 2024-05-14 奇富数科(上海)科技有限公司 Method and device for predicting overdue user resource return performance and electronic equipment
CN118095646A (en) * 2024-03-11 2024-05-28 深圳九间科技有限公司 Career planning method, device, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469263A (en) * 2014-09-24 2016-04-06 阿里巴巴集团控股有限公司 Commodity recommendation method and device
CN105868847A (en) * 2016-03-24 2016-08-17 车智互联(北京)科技有限公司 Shopping behavior prediction method and device
CN106445147A (en) * 2016-09-28 2017-02-22 北京百度网讯科技有限公司 Behavior management method and device of conversational system based on artificial intelligence
CN107424007A (en) * 2017-07-12 2017-12-01 北京京东尚科信息技术有限公司 A kind of method and apparatus for building electronic ticket susceptibility identification model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
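Title
"Research and Application of Personal Credit Risk Assessment Based on Internet Data"; Xiao Qin (肖琴); China Master's Theses Full-text Database, Economics and Management Science Series; 20180215; thesis body, chapters 3 and 5 *
"Research on a Recommendation Algorithm Based on a Collaborative Relational Topic Regression Model"; Ding Xuetao (丁雪涛); China Master's Theses Full-text Database, Information Science and Technology Series; 20140715; thesis body, chapter 3 *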

Also Published As

Publication number Publication date
CN109635010A (en) 2019-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 518000 Guangdong city of Shenzhen province Nanshan District Guangdong streets high in the four Longtaili Technology Building Room 325 No. 30
Applicant after: Shenzhen mengwang video Co., Ltd
Address before: 518000 Guangdong city of Shenzhen province Nanshan District Guangdong streets high in the four Longtaili Technology Building Room 325 No. 30
Applicant before: SHENZHEN MONTNETS ENCYCLOPEDIA INFORMATION TECHNOLOGY Co.,Ltd.
GR01 Patent grant