CN107977413A - Feature selection method, apparatus, computer device and storage medium for user data - Google Patents

Info

Publication number
CN107977413A
Authority
CN
China
Prior art keywords
variable
characteristic variable
user
data
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711172183.5A
Other languages
Chinese (zh)
Inventor
徐定坚
赖晓彬
刘奕慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dingfeng Cattle Technology Co Ltd
Original Assignee
Shenzhen Dingfeng Cattle Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dingfeng Cattle Technology Co Ltd
Priority to CN201711172183.5A
Publication of CN107977413A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques

Abstract

This application relates to a feature selection method, apparatus, computer device and storage medium for user data. The method includes: obtaining characteristic information of user data and extracting the characteristic variables corresponding to the characteristic information; clustering the characteristic variables to obtain multiple cluster results; combining the characteristic variables in each of the cluster results to obtain multiple feature combinations, each feature combination including multiple combined characteristic variables; obtaining a target variable and using it to perform a correlation test on the combined characteristic variables; when the test is passed, adding an interactive tag to the combined characteristic variable; parsing out the corresponding characteristic variables from the combined characteristic variables that carry the interactive tag; and generating the user's optimal characteristic variables from the characteristic variables obtained by parsing. The method can improve the accuracy of feature selection on user data.

Description

Feature selection method, apparatus, computer device and storage medium for user data
Technical field
This application relates to the field of computer technology, and in particular to a feature selection method, apparatus, computer device and storage medium for user data.
Background
With the rapid development of the internet and big data, data mining plays an increasingly important role in retaining customers, marketing to customers and discovering high-value customers, because valuable information can be mined from massive data. Feature selection is an important direction in data mining: it selects a subset of optimal features from the full set of original features. The optimal features selected from the data are used to build and analyze a risk control model, which is then used to assess the credit of users.
In the traditional approach, filter, wrapper and embedded feature selection methods are typically used. A filter method assigns a weight to each feature dimension, where the weight represents the importance of that dimension, and then ranks the features by weight. A wrapper method generates different combinations of feature subsets, evaluates each combination, compares it with the other combinations, and selects the better subset. An embedded method, with the model fixed, learns which attributes best improve model accuracy, so that features meaningful to the model are picked out automatically while the learner is trained. However, these methods do not fully reflect the interaction between features. They also classify features with a single classifier and analyze the feature selection result from the classification result, so the resulting optimal features are subject to randomness. How to improve the accuracy of selecting the optimal feature subset has therefore become a technical problem to be solved.
Summary of the invention
In view of the above technical problem, it is necessary to provide a feature selection method, apparatus, computer device and storage medium for user data that can improve the accuracy of feature selection.
A feature selection method for user data, including:
Obtaining characteristic information of user data, and extracting the characteristic variables corresponding to the characteristic information;
Clustering the characteristic variables to obtain multiple cluster results;
Combining the characteristic variables in each of the multiple cluster results to obtain multiple feature combinations, the feature combinations including multiple combined characteristic variables;
Obtaining a target variable, and performing a correlation test on the multiple combined characteristic variables using the target variable;
When the test is passed, adding an interactive tag to the combined characteristic variable;
Parsing out the corresponding characteristic variables from the combined characteristic variables to which the interactive tag has been added;
Generating the user's optimal characteristic variables from the characteristic variables obtained by parsing.
In one of the embodiments, the step of testing the multiple combined characteristic variables using the target variable includes:
Calculating the P-value of the combined characteristic variable using the target variable;
Comparing the P-value with a first threshold, and when the P-value is less than the first threshold, recording that the combined characteristic variable passes the test.
In one of the embodiments, the step of parsing out the corresponding characteristic variables from the combined characteristic variables to which the interactive tag has been added includes:
Counting the frequency with which each characteristic variable occurs in the combined characteristic variables to which the interactive tag has been added;
Calculating the variance corresponding to the frequency, and comparing the variance with a second threshold;
When the variance reaches the second threshold, recording the characteristic variable corresponding to the frequency as a user's optimal characteristic variable.
In one of the embodiments, before the step of obtaining the characteristic information of the user data, the method further includes:
Obtaining the registration data of the user and the historical data of the user from a database;
Obtaining the user's behavioral data on third-party platforms according to the registration data of the user;
Analyzing the registration data, historical data and behavioral data to obtain the analyzed user data;
Obtaining preset keywords, and extracting the characteristic information from the user data using the preset keywords.
A feature selection apparatus for user data, including:
An acquisition module, configured to obtain characteristic information of user data and extract the characteristic variables corresponding to the characteristic information;
A cluster module, configured to cluster the characteristic variables to obtain multiple cluster results;
An inspection module, configured to combine the characteristic variables in each of the multiple cluster results to obtain multiple feature combinations, the feature combinations including multiple combined characteristic variables; to obtain a target variable and perform a correlation test on the multiple combined characteristic variables using the target variable; and, when the test is passed, to add an interactive tag to the combined characteristic variable;
A parsing module, configured to parse out the corresponding characteristic variables from the combined characteristic variables to which the interactive tag has been added, and to generate the user's optimal characteristic variables from the characteristic variables obtained by parsing.
In one of the embodiments, the inspection module is further configured to calculate the P-value of the combined characteristic variable using the target variable; to compare the P-value with a first threshold; and, when the P-value is less than the first threshold, to record that the combined characteristic variable passes the test.
In one of the embodiments, the parsing module is further configured to count the frequency with which each characteristic variable occurs in the combined characteristic variables to which the interactive tag has been added; to calculate the variance corresponding to the frequency and compare the variance with a second threshold; and, when the variance reaches the second threshold, to record the characteristic variable corresponding to the variance as a user's optimal characteristic variable.
In one of the embodiments, the acquisition module is configured to obtain the registration data of the user and the historical data of the user; to obtain the user's behavioral data on third-party platforms; to analyze the registration data, historical data and behavioral data and merge them into user data; and to extract the characteristic information from the user data.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the above method when executing the program.
A computer-readable storage medium on which a computer program is stored, characterized in that the steps of the above method are implemented when the program is executed by a processor.
With the above feature selection method, apparatus, computer device and storage medium for user data, the characteristic information of the user data is obtained and the corresponding characteristic variables are extracted. Clustering the characteristic variables yields multiple cluster results, and combining the characteristic variables in each cluster result yields multiple feature combinations. A target variable is obtained and used to perform a correlation test on the combined characteristic variables, which gives the correlation between the characteristic variables and the target variable. When the test is passed, an interactive tag is added to the combined characteristic variable; the presence of the tag indicates that the correlation between the characteristic variables and the target variable is high. The corresponding characteristic variables are then parsed out from the tagged combined characteristic variables, and the user's optimal characteristic variables are generated from them. This makes the selected features more accurate and effective, so that optimal characteristic variables with higher correlation can be selected and the accuracy of feature selection is improved.
Brief description of the drawings
Fig. 1 is a flow chart of the feature selection method for user data in one embodiment;
Fig. 2 is a schematic diagram of the internal structure of the feature selection apparatus for user data in one embodiment;
Fig. 3 is a schematic diagram of the internal structure of the computer device in one embodiment.
Detailed description
To make the objectives, technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it. It will be appreciated that the terms "first", "second" and so on used in this application may describe various elements, but the elements are not limited by these terms. The terms are only used to distinguish one element from another.
In one embodiment, as shown in Fig. 1, a feature selection method for user data is provided. The method is described by taking its application to a server as an example, and specifically includes the following steps:
Step 102: obtain the characteristic information of user data, and extract the characteristic variables corresponding to the characteristic information.
Data mining is particularly important when building a risk control model. The user's data on different platforms needs to be obtained, and the data features in that data need to be selected. The selected features make it possible to judge the user's consumption, credit standing and so on, and thus to assess the user's credit.
The server obtains the registration data of the user and the historical data of the user from a database, where the registration data includes the user's basic information and the historical data includes the user's balance data. The user's data on third-party platforms can also be obtained, for example behavioral data from platforms such as Alipay, JD.com and WeChat; the behavioral data includes identity matching data, user behavior data, balance data, consumption data and so on. After obtaining these user data, the server analyzes them to obtain the characteristic information of the user data, and extracts the characteristic variables corresponding to the characteristic information.
Step 104: cluster the characteristic variables to obtain multiple cluster results.
After extracting the characteristic variables corresponding to the characteristic information, the server clusters them. Specifically, k-means clustering can be used, where the value of k can be 2. Multiple cluster results are obtained by clustering the characteristic variables repeatedly. First, two variables are arbitrarily selected from the characteristic variables as the initial cluster centre points, and the similarity between each characteristic variable and each cluster centre point is calculated. The similarity can also be expressed as the distance between the characteristic variable and the cluster centre point, and can be calculated with a mean squared error function. According to its similarity to the cluster centre points, each characteristic variable is assigned to the cluster whose centre point it is most similar to, which yields multiple cluster results.
Specifically, k characteristic variables are randomly selected from all characteristic variables as the cluster centre points $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$, where k is the preset number of clusters. The calculation formula can be:
$$c^{(i)} = \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2$$
Wherein, $c^{(i)}$ denotes the class, among the k classes, that is closest to characteristic variable i; i indexes the characteristic variables; $\arg\min(\cdot)$ denotes the argument at which the objective function takes its minimum value; $x^{(i)}$ denotes the i-th characteristic variable in the set of characteristic variables; $\mu_j$ denotes the cluster centre point of cluster j; and j indexes the clusters.
For each class j, the cluster centre point of that class is recalculated, and the sum of squared distances from each characteristic variable to its cluster centre point is obtained. The specific formula can be:
$$J(c, \mu) = \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$$
Wherein, $J(c, \mu)$ denotes the sum of squared distances from each characteristic variable to its cluster centre point; m denotes the number of characteristic variables; i indexes the characteristic variables; $x^{(i)}$ denotes the i-th characteristic variable; and $\mu_{c^{(i)}}$ denotes the cluster centre point of the cluster to which characteristic variable i is assigned.
The k-means algorithm adjusts J towards its minimum. If the current J has not reached the minimum, the centroid $\mu_j$ of each class can be fixed while the class $c^{(i)}$ of each sample is adjusted, which decreases J. Likewise, $c^{(i)}$ can be fixed while the centroid $\mu_j$ of each class is adjusted, which also decreases J. These two steps form the inner loop that makes J decrease monotonically. When J reaches its minimum, μ and c converge as well, so that the cluster centre points no longer change or change very little. The characteristic variables are thereby classified accurately, and the cluster results are obtained.
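The alternating procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions: each characteristic variable is assumed to be represented by a numeric vector (a tuple of floats), and the function names `squared_distance` and `kmeans`, the iteration cap and the random seed are hypothetical and do not come from the source.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=2, max_iter=100, seed=0):
    """Alternate the assignment step and the centroid step until stable.

    Returns (assignment, centers, cost), where cost is J(c, mu).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial cluster centre points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: c(i) = argmin_j ||x(i) - mu_j||^2 with mu fixed.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments unchanged, so the centres no longer move
        assignment = new_assignment
        # Centroid step: move each mu_j to the mean of its members with c fixed.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(squared_distance(p, centers[c]) for p, c in zip(points, assignment))
    return assignment, centers, cost


# Hypothetical vectors standing in for six characteristic variables.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
assignment, centers, cost = kmeans(points, k=2)
```

Running the function with different seeds gives different initial centre points, which is one way to obtain the multiple cluster results the text mentions.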
Step 106: combine the characteristic variables in each of the multiple cluster results to obtain multiple feature combinations, the feature combinations including multiple combined characteristic variables.
The characteristic variables in the multiple cluster results are combined separately. Specifically, they can be combined in pairs, which yields multiple feature combinations, each of which includes multiple characteristic variables.
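Pairwise combination can be sketched with the standard library. The sketch assumes that pairs are formed within each cluster result, which is one reading of the text; the function name `pairwise_feature_combinations` and the variable names in the example are hypothetical.

```python
from itertools import combinations


def pairwise_feature_combinations(clusters):
    """Return every pair of characteristic variables within each cluster."""
    combos = []
    for members in clusters.values():
        # Sorting makes the order of the generated pairs deterministic.
        combos.extend(combinations(sorted(members), 2))
    return combos


# Hypothetical cluster results: cluster id -> names of characteristic variables.
clusters = {
    0: ["age", "income"],
    1: ["education", "marital_status", "occupation"],
}
pairs = pairwise_feature_combinations(clusters)
```

A cluster with n variables contributes n(n-1)/2 pairs, so the example yields one pair from the first cluster and three from the second.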
Step 108: obtain a target variable, and perform a correlation test on the multiple combined characteristic variables using the target variable.
The target variable can be a preset variable. In a risk control model, a characteristic variable obtained after analysis can be preset as the target variable. The server obtains the target variable and performs a correlation test on the multiple combined characteristic variables according to it. Specifically, the correlation test can be a chi-square test: the chi-square distribution of the combined characteristic variable and the target variable is calculated, and the degree of deviation between the actual observed values of the combined characteristic variable and the theoretical values inferred from the target variable is measured. This degree of deviation gives the chi-square value. The chi-square value of the combined characteristic variable is then converted into a P-value, which gives the correlation between the characteristic variables and the target variable.
The server obtains a preset first threshold, which can be 0.05, and compares the obtained P-value with it. If the P-value is less than the first threshold, the combined characteristic variable exhibits interaction, and the combined characteristic variable is recorded as passing the test.
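The test and threshold comparison can be illustrated for the simplest case. The sketch assumes that the combined characteristic variable and the target variable are both binary, so their joint counts form a 2x2 contingency table with one degree of freedom and non-zero row and column totals; the source does not fix the table shape. The function names are hypothetical, and the conversion to a P-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and P-value for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Theoretical count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def passes_test(table, first_threshold=0.05):
    """True when the P-value is below the first threshold."""
    _, p_value = chi_square_2x2(table)
    return p_value < first_threshold


# Rows: combined variable present / absent; columns: target 1 / target 0 (made-up counts).
chi2, p_value = chi_square_2x2([[30, 10], [10, 30]])
```

For tables larger than 2x2 the degrees of freedom change and the `erfc` shortcut no longer applies; a statistics library would normally be used in that case.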
Step 110: when the test is passed, add an interactive tag to the combined characteristic variable.
Step 112: parse out the corresponding characteristic variables from the combined characteristic variables to which the interactive tag has been added.
When a combined characteristic variable passes the test, the server adds an interactive tag to it, and combined variables that did not pass the test can be deleted. The server then parses out the corresponding characteristic variables from the tagged combined characteristic variables. Specifically, the server calculates the variance of the frequency with which each characteristic variable occurs in the tagged combined characteristic variables and compares it with a second threshold to obtain the parsed characteristic variables.
Step 114: generate the user's optimal characteristic variables from the characteristic variables obtained by parsing.
Specifically, the server counts the frequency with which each characteristic variable occurs in the tagged combined characteristic variables and calculates the variance of the frequency. The calculated variance is compared with the second threshold. When the result reaches the second threshold, the characteristic variable corresponding to the frequency is recorded as a user's optimal characteristic variable, and characteristic variables whose result is below the second threshold are deleted.
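The counting and thresholding step can be sketched as follows. The source does not say how the variance of a single frequency is computed, so the sketch makes a loudly labelled assumption: the score of a variable is the squared deviation of its frequency from the mean frequency, and only variables that occur more often than the mean are eligible. The function name `select_optimal_features` and the example data are hypothetical.

```python
from collections import Counter


def select_optimal_features(tagged_pairs, second_threshold):
    """Pick variables whose frequency deviates enough from the mean frequency.

    ASSUMPTION: score = (frequency - mean frequency) ** 2, and only variables
    above the mean are kept. This is one possible reading of the text.
    """
    frequency = Counter(name for pair in tagged_pairs for name in pair)
    if not frequency:
        return [], frequency
    mean = sum(frequency.values()) / len(frequency)
    selected = []
    for name, count in frequency.items():
        score = (count - mean) ** 2
        if count > mean and score >= second_threshold:
            selected.append(name)
    return sorted(selected), frequency


# Hypothetical combined characteristic variables that carry the interactive tag.
tagged_pairs = [
    ("age", "income"),
    ("age", "education"),
    ("age", "occupation"),
    ("income", "education"),
]
selected, frequency = select_optimal_features(tagged_pairs, second_threshold=1.0)
```

In the example "age" occurs three times against a mean of two, so it is the only variable whose score reaches the threshold while lying above the mean.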
In this embodiment, the characteristic information of the user data is obtained and the corresponding characteristic variables are extracted. Clustering the characteristic variables yields multiple cluster results, and combining the characteristic variables in each cluster result yields multiple feature combinations. A target variable is obtained and used to perform a correlation test on the combined characteristic variables, which gives the correlation between the characteristic variables and the target variable. When the test is passed, an interactive tag is added to the combined characteristic variable; the presence of the tag indicates that the correlation between the characteristic variables and the target variable is high. The corresponding characteristic variables are parsed out from the tagged combined characteristic variables, and the user's optimal characteristic variables are generated from them. This makes the selected features more accurate and effective, so that optimal characteristic variables with higher correlation can be selected and the accuracy of feature selection is improved.
In one embodiment, the step of testing the multiple combined characteristic variables using the target variable includes: calculating the P-value of the combined characteristic variable using the target variable; comparing the P-value with a first threshold; and, when the P-value is less than the first threshold, recording that the combined characteristic variable passes the test.
The target variable can be a preset variable. In a risk control model, a characteristic variable obtained after analysis can be preset as the target variable. The server obtains the target variable and performs correlation testing and analysis on the multiple combined characteristic variables according to it. The P-value takes the target variable as the hypothesis: it is the probability that the combined characteristic variable occurs when the given hypothesis is true.
Specifically, a chi-square test can be used. The chi-square test measures the degree of deviation between the actual observed values of a statistical sample and the theoretically inferred values, and this degree of deviation determines the size of the chi-square value. The larger the chi-square value, the less the actual observed values agree with the theoretical values; the smaller the chi-square value, the smaller the deviation and the more closely the actual observed values agree with the theoretical values.
Further, the target variable is taken as the hypothesis $H_0$. Let the population be X with distribution function F(x). The value range of X is divided into k mutually disjoint intervals $A_1, A_2, A_3, \ldots, A_k$, for example $A_1 = (a_0, a_1]$, $A_2 = (a_1, a_2]$, ..., $A_k = (a_{k-1}, a_k)$. The number of sample values falling into the i-th interval $A_i$ is denoted $f_i$ and is called the observed frequency; the sum of all observed frequencies $f_1 + f_2 + \ldots + f_k$ equals the sample size n. When $H_0$ is true, the probability $p_i$ that a value of X falls into the i-th interval $A_i$ can be calculated from the hypothesized theoretical distribution, and $n p_i$ is then the theoretical frequency of sample values falling into $A_i$. When $H_0$ is true, the relative frequency $f_i / n$ with which sample values fall into $A_i$ in n trials should be very close to the probability $p_i$; when $H_0$ is not true, $f_i / n$ and $p_i$ differ greatly. The specific formula can be:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - n p_i)^2}{n p_i}$$
Wherein, $\chi^2$ is a statistic that approximately obeys the chi-square distribution with k-1 degrees of freedom; the expected frequency of level i equals the total frequency n multiplied by the expected probability $p_i$ of level i; $f_i$ is the observed frequency of level i; $p_i$ is the expected probability of level i; n is the total frequency; and k is the number of levels (intervals) used when calculating $p_i$. When n is sufficiently large, the chi-square value of the combined characteristic variable can be obtained according to the chi-square distribution.
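As a small worked example of the statistic defined from $f_i$, $p_i$ and n, the sketch below sums $(f_i - n p_i)^2 / (n p_i)$ over the intervals. The function name `pearson_chi_square` and the counts are made up for illustration and do not come from the source.

```python
def pearson_chi_square(observed, expected_probs):
    """Sum of (f_i - n * p_i)^2 / (n * p_i) over all intervals."""
    n = sum(observed)  # total frequency, equal to the sample size
    return sum(
        (f - n * p) ** 2 / (n * p)
        for f, p in zip(observed, expected_probs)
    )


# Four intervals, each with hypothesized probability 0.25; n = 100.
observed = [18, 22, 20, 40]
statistic = pearson_chi_square(observed, [0.25, 0.25, 0.25, 0.25])
# Theoretical frequency is 25 per interval, so the statistic is
# (49 + 9 + 25 + 225) / 25 = 12.32.
```

With four intervals the statistic is compared against a chi-square distribution with three degrees of freedom; a larger value means the observed frequencies agree less with the hypothesized probabilities.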
The chi-square value of the combined characteristic variable is converted into a P-value. A P-value level can be preset as the first threshold, and the P-value is compared with it. When the P-value is less than the first threshold, the combined characteristic variable is recorded as passing the test, which gives the correlation between the characteristic variables and the target variable.
In one embodiment, the step of parsing out the corresponding characteristic variables from the combined characteristic variables to which the interactive tag has been added includes: counting the frequency with which each characteristic variable occurs in the tagged combined characteristic variables; calculating the variance corresponding to the frequency and comparing the variance with a second threshold; and, when the variance reaches the second threshold, recording the characteristic variable corresponding to the frequency as a user's optimal characteristic variable.
The characteristic variables are obtained and clustered, and the clustered characteristic variables are combined separately to obtain combined characteristic variables. A correlation test is performed on the multiple combined characteristic variables using the target variable. When a combined characteristic variable passes the test, an interactive tag is added to it, and combined variables that did not pass the test can be deleted. The corresponding characteristic variables are then parsed out from the tagged combined characteristic variables.
Specifically, the variance of the frequency with which each characteristic variable occurs in the tagged combined characteristic variables can be calculated, and the result compared with the preset second threshold. When the result reaches the second threshold, the characteristic variable corresponding to the frequency is recorded as a user's optimal characteristic variable, which gives the parsed characteristic variables. This makes the selected features more accurate and effective and thus improves the accuracy of feature selection.
In one embodiment, before the step of obtaining the characteristic information of the user data, the method further includes: obtaining the registration data of the user and the historical data of the user from a database; obtaining the user's behavioral data on third-party platforms according to the registration data of the user; analyzing the obtained registration data, historical data and behavioral data to obtain the analyzed user data; and obtaining preset keywords and extracting the characteristic information from the user data using the preset keywords.
Data mining is particularly important when building a risk control model: the user's data on different platforms needs to be obtained so that the user's credit can be assessed. The server obtains the registration data of the user and the historical data of the user from a database, where the registration data includes the user's basic information and the historical data includes the user's balance data. The user's data on third-party platforms can also be obtained according to the registration data, for example behavioral data from platforms such as Alipay, JD.com and WeChat; the behavioral data includes identity matching data, user behavior data, balance data, consumption data and so on.
The server analyzes the obtained data and removes duplicates to obtain the analyzed user data. Before the characteristic information of the user data is extracted, keywords can be preset, such as "gender", "age", "education", "marital status", "property status" and "employment status". The server obtains the preset keywords and uses them to extract the characteristic information from the user data; after the characteristic variables corresponding to the characteristic information are extracted, feature selection is performed on them. Obtaining the characteristic information of user data from each platform improves the quality and quantity of the features, which makes the selected features more accurate and effective and improves the accuracy of feature selection. Mining valuable user data reveals the data features of high-value users, which helps to retain customers and manage them effectively.
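The merging, de-duplication and keyword-based extraction described above can be sketched with plain dictionaries. Everything named here is hypothetical: the field names, the English keyword identifiers in `PRESET_KEYWORDS`, and the rule that the first occurrence of a duplicated field is kept are assumptions made for illustration.

```python
# Hypothetical English identifiers for the preset keywords listed in the text.
PRESET_KEYWORDS = [
    "gender", "age", "education",
    "marital_status", "property_status", "employment_status",
]


def merge_user_records(*sources):
    """Merge records from several sources, keeping the first value of each field."""
    merged = {}
    for source in sources:
        for key, value in source.items():
            merged.setdefault(key, value)  # repeated fields are dropped
    return merged


def extract_feature_info(user_data, keywords=PRESET_KEYWORDS):
    """Keep only the fields whose name matches a preset keyword."""
    return {key: user_data[key] for key in keywords if key in user_data}


# Made-up records standing in for registration, historical and third-party data.
registration = {"user_id": 1, "gender": "F", "age": 30}
history = {"age": 30, "balance": 1200.0}
behaviour = {"education": "bachelor", "monthly_spend": 300.0}

user_data = merge_user_records(registration, history, behaviour)
feature_info = extract_feature_info(user_data)
```

Fields such as `balance` and `monthly_spend` survive the merge but are not extracted, because they do not match any preset keyword in this sketch.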
In one embodiment, as shown in Fig. 2, a feature selection apparatus for user data is provided. The apparatus includes an acquisition module 202, a cluster module 204, an inspection module 206 and a parsing module 208, wherein:
The acquisition module 202 is configured to obtain the characteristic information of user data and extract the characteristic variables corresponding to the characteristic information.
The cluster module 204 is configured to cluster the characteristic variables to obtain multiple cluster results.
The inspection module 206 is configured to combine the characteristic variables in each of the multiple cluster results to obtain multiple feature combinations, the feature combinations including multiple combined characteristic variables; to obtain a target variable and perform a correlation test on the multiple combined characteristic variables using the target variable; and, when the test is passed, to add an interactive tag to the combined characteristic variable.
The parsing module 208 is configured to parse out the corresponding characteristic variables from the combined characteristic variables to which the interactive tag has been added, and to generate the user's optimal characteristic variables from the characteristic variables obtained by parsing.
In one embodiment, the inspection module 206 is further configured to calculate the P-value of the combined characteristic variable using the target variable; to compare the P-value with a first threshold; and, when the P-value is less than the first threshold, to record that the combined characteristic variable passes the test.
In one embodiment, the parsing module 208 is further configured to count the frequency with which each characteristic variable occurs in the tagged combined characteristic variables; to calculate the variance corresponding to the frequency and compare the variance with a second threshold; and, when the variance reaches the second threshold, to record the characteristic variable corresponding to the variance as a user's optimal characteristic variable.
In one embodiment, the acquisition module 202 is further configured to obtain the registration data of the user and the historical data of the user from a database; to obtain the user's behavioral data on third-party platforms according to the registration data of the user; to analyze the registration data, historical data and behavioral data to obtain the analyzed user data; and to obtain preset keywords and extract the characteristic information from the user data using the preset keywords.
In one embodiment, Fig. 3 shows a schematic diagram of the internal structure of a computer device. For example, the computer device can be a server, and the server can be a standalone server or a server cluster. The computer device includes a processor, a non-volatile storage medium, an internal memory and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores a database, an operating system and a computer program; the database can store information such as user data, characteristic information and characteristic variables. The processor of the computer device provides computing and control capabilities and supports the operation of the whole server. The processor is configured to perform a feature selection method for user data, which it implements when the computer program is executed. The internal memory provides an environment for running the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals over a network connection, for example to obtain user data from a terminal. Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the server to which the solution is applied. A specific server can include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
In one embodiment, a computer device is provided, and the computer device may be a server. The computer device includes a processor and a memory. The memory stores a computer program which, when executed by the processor, may cause the processor to perform the following steps: obtaining feature information of user data, and extracting the feature variables corresponding to the feature information; clustering the feature variables to obtain multiple clustering results; combining the feature variables within each of the multiple clustering results to obtain multiple feature combinations, each feature combination including multiple combined feature variables; obtaining a target variable, and performing a correlation test on the multiple combined feature variables using the target variable; when the test is passed, adding an interaction tag to the combined feature variable; parsing the corresponding feature variables from the combined feature variables to which interaction tags have been added; and generating optimal user feature variables from the feature variables obtained by parsing.
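The combination step above can be sketched as follows. This is a minimal illustration, not the method as filed: it assumes the clustering results are already available as a mapping from cluster id to feature names, that "combining" means forming every pair of features inside a cluster, and that a combined feature variable is named by joining the two feature names with ":". The function name `combine_within_clusters`, the separator, and the sample feature names are all hypothetical.

```python
from itertools import combinations


def combine_within_clusters(clusters, sep=":"):
    """Build pairwise combined feature names inside each cluster.

    `clusters` maps a cluster id to the feature variables assigned to it.
    Pairwise combination and the "a:b" naming are illustrative assumptions.
    """
    combined = {}
    for cluster_id, features in clusters.items():
        # Sort so each pair gets one canonical name regardless of input order.
        pairs = combinations(sorted(features), 2)
        combined[cluster_id] = [sep.join(pair) for pair in pairs]
    return combined


# Hypothetical clustering result for five feature variables.
clusters = {0: ["age", "income", "city"], 1: ["device", "os"]}
feature_combinations = combine_within_clusters(clusters)
```

Restricting combinations to features of the same cluster keeps the number of candidate combined variables far below the number of all possible pairs, which is presumably why clustering precedes combination.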
In one of the embodiments, the step of testing the multiple combined feature variables using the target variable includes: calculating the P-value of each combined feature variable using the target variable; comparing the P-value with a first threshold; and, when the P-value is less than the first threshold, recording that the combined feature variable passes the test.
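The text does not name the statistical test used to obtain the P-value. As one possible sketch, the example below uses a two-sided permutation test on the Pearson correlation between a combined feature variable and the target variable, then applies the first-threshold comparison. The choice of test, the default threshold of 0.05, the function names, and the sample data are assumptions made for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(feature, target, n_perm=2000, seed=0):
    """Two-sided permutation P-value for the correlation of feature and target."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(pearson_r(feature, target))
    shuffled = list(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(feature, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def passes_test(feature, target, first_threshold=0.05):
    """A combined feature passes when its P-value is below the first threshold."""
    return permutation_p_value(feature, target) < first_threshold


# Hypothetical data: one combined feature tracks the target, the other does not.
target = [1, 2, 3, 4, 5, 6, 7, 8]
related = [2 * t + 1 for t in target]
unrelated = [1, -1, -1, 1, 1, -1, -1, 1]
```

Here `passes_test(related, target)` holds because almost no shuffling of the target reproduces a perfect linear relationship, while `unrelated` has zero sample correlation with the target and therefore fails.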
In one of the embodiments, the step of parsing the corresponding feature variables from the combined feature variables to which interaction tags have been added includes: counting the frequency with which each feature variable appears in the combined feature variables to which interaction tags have been added; calculating the variance corresponding to the frequency, and comparing the variance with a second threshold; and, when the variance reaches the second threshold, recording the feature variable corresponding to the frequency as an optimal user feature variable.
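The frequency and variance step can be read in more than one way, since the text does not define "the variance corresponding to the frequency". The sketch below adopts one assumed reading: the value for each feature is the squared deviation of its frequency from the mean frequency, and only features that occur more often than average are eligible. The tag format "a:b", the function name `select_by_frequency`, the above-average condition, and the sample data are all hypothetical.

```python
from collections import Counter


def select_by_frequency(tagged_combinations, second_threshold, sep=":"):
    """Pick base features that occur unusually often in tagged combinations."""
    # Count how often each base feature appears across the tagged combinations.
    counts = Counter(
        name for combo in tagged_combinations for name in combo.split(sep)
    )
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    selected = []
    for name, freq in counts.items():
        # Assumed reading of "variance corresponding to the frequency":
        # the squared deviation of this feature's frequency from the mean.
        deviation = (freq - mean) ** 2
        # Assumption: only above-average frequencies qualify as "optimal".
        if freq > mean and deviation >= second_threshold:
            selected.append(name)
    return sorted(selected)


# Hypothetical combined feature variables that passed the correlation test.
tagged = ["age:income", "age:city", "age:device", "income:city"]
optimal = select_by_frequency(tagged, second_threshold=1.0)
```

In this sample, `age` appears three times against a mean frequency of two, so it is the only feature recorded; `device` deviates from the mean by the same amount but in the other direction and is excluded by the above-average condition.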
In one of the embodiments, before the step of obtaining the feature information of the user data, the steps further include: obtaining the login data of a user and the historical data of the user from a database; obtaining the user's behavior data on a third-party platform according to the user's login data; analyzing the login data, the historical data, and the behavior data to obtain analyzed user data; and obtaining preset keywords and using the preset keywords to extract the feature information from the user data.
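How the preset keywords are matched against the user data is not specified. A minimal sketch, assuming the analyzed user data is a flat mapping from field name to value and that a field counts as feature information when its name contains one of the keywords, might look like this. The function name `extract_feature_info`, the field names, and the values are hypothetical.

```python
def extract_feature_info(user_data, preset_keywords):
    """Keep the fields of `user_data` whose name contains a preset keyword."""
    keywords = [k.lower() for k in preset_keywords]
    return {
        field: value
        for field, value in user_data.items()
        # Case-insensitive substring match is an illustrative assumption.
        if any(k in field.lower() for k in keywords)
    }


# Hypothetical merged user record (login, historical, and behavior data).
user_data = {
    "login_count": 42,
    "last_login_city": "Shenzhen",
    "purchase_total": 1280.5,
    "session_token": "abc123",
}
feature_info = extract_feature_info(user_data, ["login", "purchase"])
```

Fields that match no keyword, such as `session_token` here, are left out of the feature information.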
In one embodiment, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program which, when executed by a processor, may cause the processor to perform the following steps: obtaining feature information of user data, and extracting the feature variables corresponding to the feature information; clustering the feature variables to obtain multiple clustering results; combining the feature variables within each of the multiple clustering results to obtain multiple feature combinations, each feature combination including multiple combined feature variables; obtaining a target variable, and performing a correlation test on the multiple combined feature variables using the target variable; when the test is passed, adding an interaction tag to the combined feature variable; parsing the corresponding feature variables from the combined feature variables to which interaction tags have been added; and generating optimal user feature variables from the feature variables obtained by parsing.
In one of the embodiments, the step of testing the multiple combined feature variables using the target variable includes: calculating the P-value of each combined feature variable using the target variable; comparing the P-value with a first threshold; and, when the P-value is less than the first threshold, recording that the combined feature variable passes the test.
In one of the embodiments, the step of parsing the corresponding feature variables from the combined feature variables to which interaction tags have been added includes: counting the frequency with which each feature variable appears in the combined feature variables to which interaction tags have been added; calculating the variance corresponding to the frequency, and comparing the variance with a second threshold; and, when the variance reaches the second threshold, recording the feature variable corresponding to the frequency as an optimal user feature variable.
In one of the embodiments, before the step of obtaining the feature information of the user data, the steps further include: obtaining the login data of a user and the historical data of the user from a database; obtaining the user's behavior data on a third-party platform according to the user's login data; analyzing the login data, the historical data, and the behavior data to obtain analyzed user data; and obtaining preset keywords and using the preset keywords to extract the feature information from the user data.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (10)

1. A feature selection method for user data, comprising:
obtaining feature information of user data, and extracting the feature variables corresponding to the feature information;
clustering the feature variables to obtain multiple clustering results;
combining the feature variables within each of the multiple clustering results to obtain multiple feature combinations, each feature combination including multiple combined feature variables;
obtaining a target variable, and performing a correlation test on the multiple combined feature variables using the target variable;
when the test is passed, adding an interaction tag to the combined feature variable;
parsing the corresponding feature variables from the combined feature variables to which interaction tags have been added; and
generating optimal user feature variables from the feature variables obtained by parsing.
2. The method according to claim 1, characterized in that the step of testing the multiple combined feature variables using the target variable comprises:
calculating the P-value of each combined feature variable using the target variable; and
comparing the P-value with a first threshold and, when the P-value is less than the first threshold, recording that the combined feature variable passes the test.
3. The method according to claim 1, characterized in that the step of parsing the corresponding feature variables from the combined feature variables to which interaction tags have been added comprises:
counting the frequency with which each feature variable appears in the combined feature variables to which interaction tags have been added;
calculating the variance corresponding to the frequency, and comparing the variance with a second threshold; and
when the variance reaches the second threshold, recording the feature variable corresponding to the frequency as an optimal user feature variable.
4. The method according to any one of claims 1 to 3, characterized in that, before the step of obtaining the feature information of the user data, the method further comprises:
obtaining the login data of a user and the historical data of the user from a database;
obtaining the user's behavior data on a third-party platform according to the user's login data;
analyzing the login data, the historical data, and the behavior data to obtain analyzed user data; and
obtaining preset keywords, and using the preset keywords to extract the feature information from the user data.
5. A feature selection device for user data, comprising:
an acquisition module, configured to obtain feature information of user data and extract the feature variables corresponding to the feature information;
a clustering module, configured to cluster the feature variables to obtain multiple clustering results;
an inspection module, configured to combine the feature variables within each of the multiple clustering results to obtain multiple feature combinations, each feature combination including multiple combined feature variables; to obtain a target variable and perform a correlation test on the multiple combined feature variables using the target variable; and, when the test is passed, to add an interaction tag to the combined feature variable; and
a parsing module, configured to parse the corresponding feature variables from the combined feature variables to which interaction tags have been added, and to generate optimal user feature variables from the feature variables obtained by parsing.
6. The device according to claim 5, characterized in that the inspection module is further configured to: calculate the P-value of each combined feature variable using the target variable; compare the P-value with a first threshold; and, when the P-value is less than the first threshold, record that the combined feature variable passes the test.
7. The device according to claim 5, characterized in that the parsing module is further configured to: count the frequency with which each feature variable appears in the combined feature variables to which interaction tags have been added; calculate the variance corresponding to the frequency, and compare the variance with a second threshold; and, when the variance reaches the second threshold, record the feature variable corresponding to the frequency as an optimal user feature variable.
8. The device according to any one of claims 5 to 7, characterized in that the acquisition module is further configured to: obtain the login data of a user and the historical data of the user from a database; obtain the user's behavior data on a third-party platform according to the user's login data; analyze the login data, the historical data, and the behavior data to obtain analyzed user data; and obtain preset keywords and use the preset keywords to extract the feature information from the user data.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 4.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711172183.5A CN107977413A (en) 2017-11-22 2017-11-22 Feature selection approach, device, computer equipment and the storage medium of user data

Publications (1)

Publication Number Publication Date
CN107977413A true CN107977413A (en) 2018-05-01

Family

ID=62010880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711172183.5A Pending CN107977413A (en) 2017-11-22 2017-11-22 Feature selection approach, device, computer equipment and the storage medium of user data

Country Status (1)

Country Link
CN (1) CN107977413A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170193521A1 (en) * 2016-01-04 2017-07-06 International Business Machines Corporation Proactive customer relation management process based on application of business analytics
CN105938523A (en) * 2016-03-31 2016-09-14 陕西师范大学 Feature selection method and application based on feature identification degree and independence
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering
CN107292097A (en) * 2017-06-14 2017-10-24 华东理工大学 The feature selection approach of feature based group and traditional Chinese medical science primary symptom system of selection

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816034A (en) * 2019-01-31 2019-05-28 清华大学 Signal characteristic combines choosing method, device, computer equipment and storage medium
CN109816034B (en) * 2019-01-31 2021-08-27 清华大学 Signal characteristic combination selection method and device, computer equipment and storage medium
CN111612548A (en) * 2020-05-27 2020-09-01 恩亿科(北京)数据科技有限公司 Information acquisition method and device, computer equipment and readable storage medium

Similar Documents

Publication Publication Date Title
Zannettou et al. On the origins of memes by means of fringe web communities
CN109922032B (en) Method, device, equipment and storage medium for determining risk of logging in account
Ojha et al. Towards universal fake image detectors that generalize across generative models
CN103793484B (en) The fraud identifying system based on machine learning in classification information website
US10482477B2 (en) Stratified sampling applied to A/B tests
WO2019061664A1 (en) Electronic device, user's internet surfing data-based product recommendation method, and storage medium
CN107818334A (en) A kind of mobile Internet user access pattern characterizes and clustering method
CN105654198B (en) Brand advertisement effect optimization method capable of realizing optimal threshold value selection
CN109831454B (en) False traffic identification method and device
CN108022146A (en) Characteristic item processing method, device, the computer equipment of collage-credit data
CN108829769B (en) Suspicious group discovery method and device
CN105760649A (en) Big-data-oriented creditability measuring method
CN110336838A (en) Account method for detecting abnormality, device, terminal and storage medium
CN111159563A (en) Method, device and equipment for determining user interest point information and storage medium
CN106803039A (en) The homologous decision method and device of a kind of malicious file
CN106789837A (en) Network anomalous behaviors detection method and detection means
CN106301979B (en) Method and system for detecting abnormal channel
CN108664501B (en) Advertisement auditing method and device and server
CN107977413A (en) Feature selection approach, device, computer equipment and the storage medium of user data
CN111787002A (en) Method and system for analyzing service data network security
CN110110219B (en) Method and device for determining user preference according to network behavior
CN116701772B (en) Data recommendation method and device, computer readable storage medium and electronic equipment
CN105550250B (en) A kind of processing method and processing device of access log
CN111612531A (en) Click fraud detection method and system
CN110995681A (en) User identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20210312
