
CN105718600A - Heterogeneous data set feature quality visualization method

Info

Publication number: CN105718600A
Application number: CN201610130663.4A
Authority: CN
Inventors: 汤奇峰; 薛守辉
Original assignee: Shanghai Zamplus Technology Development Co Ltd
Current assignee: Shanghai Zamplus Technology Development Co Ltd
Priority date: 2016-03-08
Filing date: 2016-03-08
Publication date: 2016-06-29

Abstract

The invention relates to a method for visualizing feature quality on heterogeneous data sets. The method gathers statistics on the feature distributions of a heterogeneous training set and validation set, introduces the occurrence rate of discrete feature values, and visualizes both the feature set and the set of feature category values in a polar coordinate system. For each category value it calculates the positive sample occurrence rate, the standardized occurrence rate, the drift ratio and the comprehensive occurrence rate, and then draws a feature quality graph in polar coordinates with the drift ratio as the radius and the comprehensive occurrence rate as the polar angle. The visualization method helps with four typical feature engineering problems in supervised learning: feature evaluation, feature attribution, feature selection and feature improvement. It addresses the distribution difference between the training set and the test set that arises when a supervised machine learning model faces a cross-domain transfer learning problem, or when the domain is the same but the data distribution drifts over time, so that features can be effectively evaluated, attributed and selected, and the model can even be improved by improving its features.

Description

A method for visualizing feature quality on heterogeneous data sets

Technical field

The present invention relates to the field of machine learning, and in particular to a method for visualizing feature quality on heterogeneous data sets.

Background art

In recent years, with the growth of the big data industry, many sectors have generated massive amounts of data, and the variety, scale and dimensionality of data keep expanding. To extract knowledge and value from this data, machine learning algorithms are applied ever more widely in industry. Besides the growing number of samples, the variety and dimensionality of features are also increasing rapidly, and the feature dimension can become very large.

Massive feature sets cause problems for downstream machine learning algorithms in both scalability and effectiveness. Effectiveness suffers for two main reasons: 1) many features are unrelated or only weakly related to the prediction target, i.e. their feature relevance (FRS, Feature Relevance Score) is poor; 2) some features are strongly related to the prediction target, but their distribution differs markedly between the training set and the test set (or between the training stage and the application stage), i.e. their feature stability (FSL, Feature Stability Level) is poor.

Feature engineering is a very important part of supervised learning. The problems it has to solve can be divided into feature evaluation, feature attribution, feature selection and feature improvement. Traditional feature selection methods usually assess feature quality by relevance alone, for example the mutual information between a feature and the label, and do not treat feature stability and feature relevance as a joint indicator for quantitative study or visual analysis. The invention therefore considers both feature relevance and feature stability, and visualizes the two-tuple formed by these indicators in a polar coordinate system. In this document, feature quality (FQ, Feature Quality) refers to the two-tuple formed by feature relevance and feature stability, or to the importance of the feature to a particular prediction model that this two-tuple expresses.

The invention is applicable to: 1) transfer learning, where the training set and the test set come from different industries or different domains; 2) non-transfer learning, where the training set and the test set are collected at different times and their distributions differ considerably.

Under the traditional machine learning framework, the task is to learn a classification or regression model from sufficient training data, and then use the learned model to classify or predict samples in the test set. In practice, however, new domains keep emerging, for example from traditional news to web pages, pictures, blogs and podcasts, and such new domains or data sets often lack annotations. Moreover, traditional machine learning assumes that the training data and the test data follow the same distribution, and in practice this assumption often does not hold. How to use the large amount of existing labelled training data that follows a different distribution, and transfer its knowledge to help learning, is the problem that transfer learning needs to solve.

The goal of transfer learning (Transfer Learning) is to use knowledge learned in one environment to help with learning tasks in a new environment. Its key characteristic is that it does not assume the training set and the test set are identically distributed, i.e. the two data sets are heterogeneous. In transfer learning the training set usually has a great deal of data and many features. Therefore, from the feature perspective alone, for a model learned on the training set to predict the test set effectively, the large number of features must be evaluated and selected in order to pick a feature set whose distribution changes little and which is relevant to the prediction target.

For example, in an advertising conversion rate model, a model is often learned from industry data to predict whether the advertisements of a particular client in that industry will convert; or a model is trained on one industry to predict whether advertisements in a similar industry will convert. Transfer learning problems of this kind require feature evaluation, feature attribution, feature selection and feature improvement by means of a feature visualization method.

In addition, non-transfer learning also encounters heterogeneous training and test sets. For example, in an advertising conversion rate model, a client has an ordinary-day data set and a holiday data set; predicting the holiday data with a model trained on the ordinary-day data may give inaccurate predictions. This is therefore also a "heterogeneous data set" problem in the sense of this patent.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a polar coordinate visualization method for feature evaluation and feature selection on heterogeneous data sets. The method not only improves the intuitive understanding of a prediction problem and produces a highly interpretable feature evaluation report, but also allows feature selection and feature improvement to be carried out according to that report, so that a subsequent supervised machine learning model can overcome the adverse effects of feature instability on heterogeneous data sets and learn more effectively. The invention is applicable to the following situations: 1) under the heterogeneous data set assumption, the training set and the test set have different generating mechanisms, come from different domains or have a hierarchical relationship, which includes typical transfer learning; 2) under the homogeneous data set assumption, the data drifts periodically or aperiodically over time; 3) under the homogeneous data set assumption, the data has endogenous fluctuation, i.e. intrinsic randomness, which shows up as a large variance in the distribution of some features; 4) under the homogeneous data set assumption, the data distribution does not change, i.e. the training set and the test set are identically distributed.

The object of the invention is achieved through the following technical solution:

A method for visualizing feature quality on heterogeneous data sets (Heterogeneous Dataset Feature Quality Visualization, hereinafter HeDFQV), comprising at least the following steps:

Step 1, given two labelled binary-classification heterogeneous data sets D(A) and D(B) and a feature f, construct the set of category values of the feature, V = {v1, v2, ..., vN};

Step 2, for each heterogeneous data set D(d), where d is A or B, calculate the overall positive sample occurrence rates r(A) and r(B) using r(d) = pos(d)/ins(d), where pos(d) is the total number of positive samples in D(d) and ins(d) is the total number of samples in D(d);

Step 3, for each heterogeneous data set D(d), where d is A or B, and for each category value v in V, calculate its positive sample occurrence rate r(v, d) = pos(v, d)/ins(v, d), where pos(v, d) and ins(v, d) are respectively the number of positive samples and the total number of samples in D(d) that contain v;

Step 4, for each heterogeneous data set D(d), where d is A or B, and for each category value v in V, calculate its standardized occurrence rate sr(v, d) = r(v, d)/r(d), where r(v, d) is the occurrence rate of v on data set D(d) and r(d) is the overall occurrence rate on D(d);

Step 5, for each category value v in the category value set V, calculate its comprehensive occurrence rate t(v) and drift ratio s(v): t(v) = sr(v, A) + sr(v, B), i.e. the sum of the standardized occurrence rates of v on D(A) and D(B); s(v) = sr(v, B)/sr(v, A), i.e. the ratio of the standardized occurrence rate of v on D(B) to that on D(A);

Step 6, for each category value v in V, take the comprehensive occurrence rate t(v) as the polar angle and the drift ratio s(v) as the radius, and plot the category value in the polar coordinate system at p(v) = (t(v), s(v));

Step 7, construct an auxiliary circle in the polar coordinate system, with radius 1 and centred on the origin, to form the feature quality graph of feature f on the heterogeneous data set D, which completes the visualization of feature f.
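Steps 2 to 6 can be illustrated with a minimal Python sketch. The function name `occurrence_indicators` and the representation of each sample as a (category value, label) pair are assumptions made for illustration only; the patent does not prescribe a data layout. The sketch also assumes that category values missing from either data set, or with sr(v, A) equal to zero, are skipped, because s(v) is undefined for them; the patent does not say how such cases are handled.

```python
from collections import defaultdict


def occurrence_indicators(samples_a, samples_b):
    """Return {v: (t(v), s(v))} for one feature.

    Each sample is a (category_value, label) pair, with label 1 for a
    positive sample and 0 for a negative one (illustrative layout).
    """

    def standardized_rates(samples):
        pos, ins = defaultdict(int), defaultdict(int)
        for v, label in samples:
            ins[v] += 1          # ins(v, d)
            pos[v] += label      # pos(v, d)
        total_pos = sum(pos.values())
        if total_pos == 0:
            raise ValueError("r(d) is zero: the data set has no positive samples")
        overall = total_pos / len(samples)                     # r(d)
        return {v: (pos[v] / ins[v]) / overall for v in ins}   # sr(v, d)

    sr_a = standardized_rates(samples_a)
    sr_b = standardized_rates(samples_b)

    result = {}
    for v in sr_a.keys() & sr_b.keys():
        if sr_a[v] == 0:
            continue  # assumption: skip values whose drift ratio is undefined
        t = sr_a[v] + sr_b[v]   # comprehensive occurrence rate, used as the polar angle
        s = sr_b[v] / sr_a[v]   # drift ratio, used as the radius
        result[v] = (t, s)
    return result
```

Each returned pair is the polar coordinate p(v) of step 6, so the result can be passed directly to any polar plotting routine.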

In the above method, the "construct the set of category values of the feature" operation of step 1 is performed as follows:

Step 1.1, determine whether feature f is a numerical feature. If so, discretize f into category values using the formula int(log2(c)), where c is a value of feature f, int denotes rounding to an integer and log2 denotes the base-2 logarithm, which yields the category value set V0 on the given data set; if not, proceed to step 1.2;

Step 1.2, set a threshold and merge all category values whose sample count is below the threshold into a single category value, thereby converting the category value set V0 into the feature category value set V.
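Steps 1.1 and 1.2 can be sketched as follows. The function names and the `"OTHER"` bucket name are hypothetical. The sketch assumes that c is strictly positive (log2 is undefined otherwise, and the patent does not state how non-positive values are treated) and that "int" truncates toward zero, as Python's `int` does.

```python
import math
from collections import Counter


def discretize(c):
    """Step 1.1: map a positive numeric feature value c to int(log2(c))."""
    return int(math.log2(c))


def merge_rare(values, threshold, other="OTHER"):
    """Step 1.2: fold category values with fewer than `threshold` samples
    into a single bucket (the bucket name is an illustrative choice)."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else other for v in values]
```

For instance, values 8 and 10 both fall into category 3, and with a threshold of 2 any category value seen only once is replaced by the shared bucket.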

In the above method, the feature evaluation flow (Heterogeneous Dataset Feature Evaluation Pipeline, hereinafter HeDFEP) comprises at least the following steps:

Step 1, for the heterogeneous data set D, specify the feature set F and the number N of features to select.

Step 2, for each feature in F, calculate the indicators used in the visualization process, including the occurrence rate, standardized occurrence rate, drift ratio and comprehensive occurrence rate, which form the indicator set M; and draw the feature quality graph of each feature in F, which together form the graph set G.

Step 3, using M and G, evaluate the stability and relevance of the features in F to obtain the feature evaluation conclusion.

Step 4, using M and G, determine whether the performance bottleneck of the prediction model is feature stability or feature relevance, to obtain the feature attribution conclusion.

Step 5, using M and G, select the N best-quality features from F to form the feature selection result set.

Step 6, using M and G, improve those features in F that have good relevance and many category values but poor stability: cluster the category values with similar comprehensive occurrence rates, which reduces the total number of categories while improving the overall stability of the feature, and yields the feature improvement recommendations.

Step 7, combine the feature evaluation conclusion, feature attribution conclusion, feature selection result set and feature improvement recommendations into the feature evaluation report.

The invention can judge the overall drift ratio and overall relevance of a feature from its feature quality graph. Specifically, the overall quality of the feature is judged from the shape in which the point set of its category values is distributed in the graph: the more the point set is spread out along the angular coordinate, the better the feature relevance; the closer the point set lies to the unit circle along the radial coordinate, the better the feature stability. In general, the points may fall both inside and outside the unit circle.

The point set of a feature's category values in the feature quality graph generally takes one of four shapes, from which the overall quality of the feature can be judged (see Fig. 3):

(1) Long thin arc pattern: the point set lies on or near the circle and is widely spread in the angular direction, forming a long, thin arc along the circle. Such a feature is "strongly relevant, strongly stable" and has the best feature quality; see Fig. 3 (upper left);

(2) Long fat arc pattern: the point set lies radially far from the circle and is widely spread in the angular direction, forming a long, fat arc along the circle. Such a feature is "strongly relevant, weakly stable" and has average feature quality; see Fig. 3 (upper right);

(3) Short thin arc pattern: the point set lies radially close to the circle and is concentrated in the angular direction, forming a short, thin arc along the circle. Such a feature is "weakly relevant, strongly stable" and has average feature quality; see Fig. 3 (lower left);

(4) Ray pattern: the point set lies radially far from the circle and is concentrated in the angular direction, forming a line segment like a ray emitted from the origin. Such a feature is "weakly relevant, weakly stable" and has the worst feature quality; see Fig. 3 (lower right).
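The four patterns are described qualitatively and are judged from the graph. As a rough illustration only, the sketch below turns them into a rule by measuring the angular spread (range of t) and the largest radial deviation from the unit circle (|s - 1|). The function name and both thresholds are hypothetical values chosen for this example; the patent gives no numerical criteria.

```python
def classify_pattern(points, spread_threshold=1.0, drift_tolerance=0.2):
    """Classify a feature from the polar points (t, s) of its category values.

    spread_threshold and drift_tolerance are illustrative assumptions,
    not values taken from the patent.
    """
    angles = [t for t, _ in points]
    angular_spread = max(angles) - min(angles)              # wide spread: strong relevance
    radial_deviation = max(abs(s - 1.0) for _, s in points) # near circle: strong stability

    relevant = angular_spread >= spread_threshold
    stable = radial_deviation <= drift_tolerance

    if relevant and stable:
        return "long thin arc"    # strongly relevant, strongly stable
    if relevant:
        return "long fat arc"     # strongly relevant, weakly stable
    if stable:
        return "short thin arc"   # weakly relevant, strongly stable
    return "ray"                  # weakly relevant, weakly stable
```

In practice the thresholds would need to be tuned against graphs that a person has already judged by eye.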

The drift ratio and comprehensive occurrence rate of each individual category value can also be read from the polar radius and polar angle of its point. When the point lies on the circle, the occurrence rate of that category value has not drifted. Points may fall inside or outside the unit circle: a point outside the circle means the drift ratio of that category value is greater than 1, i.e. its standardized occurrence rate on the test set is higher than on the training set; conversely, a point inside the circle means the drift ratio is less than 1, i.e. its standardized occurrence rate on the test set is lower than on the training set. The larger the polar angle of a point, the larger its comprehensive occurrence rate.
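The reading of a single point can be expressed as a small helper. The function name and the tolerance used to decide that a point is "on" the circle are assumptions for illustration.

```python
import math


def describe_point(t, s, tol=1e-9):
    """Interpret the polar point (t, s) of one category value.

    tol is an illustrative tolerance for treating s as exactly 1.
    """
    if math.isclose(s, 1.0, abs_tol=tol):
        drift = "on the circle: no drift"
    elif s > 1.0:
        drift = "outside the circle: standardized rate is higher on the test set"
    else:
        drift = "inside the circle: standardized rate is lower on the test set"
    return f"comprehensive occurrence rate {t:.4f}; {drift}"
```

Applied to the Monday point of Embodiment 1, (1.8987, 1.3549), the helper reports a point outside the circle.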

In summary, compared with the prior art, the invention has the following advantages and beneficial effects:

1. The feature polar coordinate visualization method proposed by the invention visualizes feature category values for the first time. Category value indicators comprising the two dimensions of relevance and comprehensive occurrence rate are shown as a two-dimensional graph: each category value of a feature (numerical features must first be discretized into category values) is mapped to a point in the polar coordinate system; the stability or drift ratio of the category value is then judged from the radial coordinate of the point, and the deviation of the category value's occurrence rate from the mean is judged from the angular coordinate of the point.

2. The feature polar coordinate visualization method proposed by the invention visualizes feature quality for the first time. Feature quality, comprising the two dimensions of relevance and stability, is shown as a two-dimensional graph: the category set of a feature (or the category set formed by discretizing a numerical feature) is mapped to a point set in the polar coordinate system, and the "four pattern criteria of the feature quality graph" are proposed, so that the feature quality formed by feature relevance and feature stability can be judged from the overall shape of the point set.

3. The feature polar coordinate visualization method proposed by the invention studies features by means of visualization for the first time, including a feature evaluation method, a feature attribution method, a feature selection method and a feature improvement method, each based on the feature quality graph.

4. The feature polar coordinate visualization method and the polar coordinate visualization feature evaluation flow proposed by the invention on the one hand improve the intuitive understanding of a prediction problem, produce a highly interpretable feature evaluation report, deepen the understanding of the modelling problem and assist manual feature selection and feature improvement; on the other hand they allow feature selection and feature improvement to be carried out according to the feature evaluation report, so that a subsequent supervised machine learning model can overcome the adverse effects of feature instability on heterogeneous data sets and learn more effectively.

Brief description of the drawings

Fig. 1 is a flow chart of the heterogeneous data set feature quality visualization method of the invention.

Fig. 2 is the feature quality graph of Embodiment 1 of the invention.

Fig. 3 shows the four distribution patterns of the feature quality graph of the invention.

Detailed description of the invention

Embodiment 1

Table 1

This embodiment concerns a conversion model for an advertiser in the e-commerce industry. The training set D(A) is the sample data set of the e-commerce industry, referred to as the "industry data"; the test set D(B) is the sample data set of this advertiser, referred to as the "company data". In this embodiment, the steps of the heterogeneous data set feature quality visualization method HeDFQV are as follows:

Step 1, given two labelled binary-classification heterogeneous data sets D(A) and D(B) and a feature f, construct the category value set V = {v1, v2, ..., vN} of the feature. Here f is dayofweek, i.e. the day of the week, and V = {1, 2, 3, 4, 5, 6, 7} represents Monday to Sunday respectively.

Step 2, for each D(d), where d is A or B, calculate the overall positive sample occurrence rates r(A) and r(B) using r(d) = pos(d)/ins(d), where pos(d) is the total number of positive samples and ins(d) is the total number of samples. The overall positive sample occurrence rate of training set A is r(A) = 941/8706 = 0.1081, and that of test set B is r(B) = 561/6029 = 0.0931.

Step 3, for each D(d), where d is A or B, and for each category value v in the category value set V, calculate its positive sample occurrence rate r(v, d) = pos(v, d)/ins(v, d), where pos(v, d) and ins(v, d) are respectively the number of positive samples and the total number of samples in D(d) that contain v.

For example, for v = 1 and d = A, i.e. the positive sample occurrence rate of the category value "Monday", r(1, A) = pos(1, A)/ins(1, A) = 99/1136 = 0.0871; for the other data see the r(v, d) column of Table 1.

Step 4, for each D(d), where d is A or B, and for each category value v in the category value set V, calculate its standardized occurrence rate sr(v, d) = r(v, d)/r(d).

For example, for v = 1 and d = B, i.e. the standardized occurrence rate of the category value "Monday", sr(1, B) = r(1, B)/r(B) = 0.1017/0.0931 = 1.0924; for the other data see the sr(v, d) column of Table 1.

Step 5, for each category value v in V, calculate its comprehensive occurrence rate t(v) and drift ratio s(v). t(v) = sr(v, A) + sr(v, B), i.e. the sum of the standardized occurrence rates of v on D(A) and D(B); for example, t(1) = sr(1, A) + sr(1, B) = 0.8063 + 1.0924 = 1.8987, and for the other data see the t(v) column of Table 1. s(v) = sr(v, B)/sr(v, A), i.e. the ratio of the standardized occurrence rate of v on D(B) to that on D(A); for example, s(1) = sr(1, B)/sr(1, A) = 1.0924/0.8063 = 1.3549, and for the other data see the s(v) column of Table 1.

Step 6, for each category value v in V, take the comprehensive occurrence rate t(v) as the polar angle and the drift ratio s(v) as the radius, and plot the category value as a point in the polar coordinate system at p(v) = (t(v), s(v)); see Fig. 2, in which point P is the point with the smallest angular coordinate t(v), where v = 3, i.e. the category value "Wednesday" is point P in the polar coordinates.

Step 7, construct the auxiliary circle, namely the unit circle, to form the feature quality graph (Feature Quality Graph, FQG) of feature f on data set D, which completes the visualization of feature f.
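The Monday figures quoted in this embodiment can be checked with a few lines of Python, and steps 6 and 7 can be sketched with a plotting helper. Only the counts and rates quoted in the text are used; Table 1 is not reproduced. The helper `draw_fqg`, its output file name and the use of the third-party matplotlib library are assumptions for illustration, and the sketch assumes t(v) is interpreted as an angle in radians, which the patent does not state. Because the text rounds intermediate values, the computed results agree with the quoted 0.8063, 1.0924, 1.8987 and 1.3549 only to about three decimal places.

```python
import math

# Figures quoted in Embodiment 1.
r_A = 941 / 8706      # overall positive sample occurrence rate, training set
r_B = 561 / 6029      # overall positive sample occurrence rate, test set
r_1A = 99 / 1136      # Monday on the training set
r_1B = 0.1017         # Monday on the test set (quoted rate; counts are not given)

sr_1A = r_1A / r_A    # standardized occurrence rate on D(A)
sr_1B = r_1B / r_B    # standardized occurrence rate on D(B)
t_1 = sr_1A + sr_1B   # comprehensive occurrence rate (polar angle)
s_1 = sr_1B / sr_1A   # drift ratio (radius)


def draw_fqg(points, path="fqg.png"):
    """Draw a feature quality graph from {category_value: (t, s)}.

    Requires matplotlib; t is treated as radians (an assumption).
    """
    import matplotlib
    matplotlib.use("Agg")  # headless backend, so no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    circle = [2 * math.pi * i / 360 for i in range(361)]
    ax.plot(circle, [1.0] * len(circle), linestyle="--")  # auxiliary unit circle
    for v, (t, s) in points.items():
        ax.scatter([t], [s])
        ax.annotate(str(v), (t, s))
    fig.savefig(path)
    plt.close(fig)
```

Calling `draw_fqg({1: (t_1, s_1)})` would place the Monday point outside the dashed unit circle, consistent with a drift ratio above 1.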

Embodiment 2

This embodiment concerns a conversion model for an advertiser. Assume the advertiser belongs to the e-commerce industry; let the training set D(A) be the sample data set of the e-commerce industry, referred to as the "industry data", and let the test set D(B) be the sample data set of this advertiser, referred to as the "company data". In this embodiment, the steps of the heterogeneous data set feature visualization evaluation flow HeDFEP are as follows:

Step 1, given the heterogeneous data sets D(A) and D(B), specify the feature set F and the number N of features to select. The feature set F contains two features, {hourofday, dayofweek}, representing the hour of the day and the day of the week respectively, and the number of features to select is N = 1.

Step 2, calculate the indicators of features hourofday and dayofweek respectively, including the number of category values, occurrence rate, standardized occurrence rate, drift ratio and comprehensive occurrence rate, which form the indicator set M; draw the feature quality graphs of hourofday and dayofweek, which form the graph set G.

Step 3, evaluating the features in F using M and G gives the following feature evaluation conclusion: feature hourofday has poor stability and good relevance; feature dayofweek has good stability and good relevance.

Step 4, using M and G, inspection of the feature quality graphs of hourofday and dayofweek shows that the performance bottleneck of a prediction model using these two features is mainly the poor stability of hourofday; this forms the feature attribution conclusion.

Step 5, using M and G, the N = 1 feature with the best overall performance is dayofweek, because both its stability and its relevance are good; this forms the feature selection result set.

Step 6, using M and G, the poorly performing feature hourofday is improved by clustering its values, for example into the following time periods: late night MN {0-6}, morning M {7-10}, noon N {11-14}, afternoon AN {15-18} and evening E {19-23}. The new category value set is then {MN, M, N, AN, E}, 5 values in total; this forms the feature improvement recommendation.
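The regrouping of hourofday in step 6 amounts to a lookup from hour to period. A minimal sketch, using the period boundaries given in the text; the names `PERIODS` and `hour_to_period` are illustrative.

```python
# Period boundaries as listed in step 6 of Embodiment 2.
PERIODS = {
    "MN": range(0, 7),    # late night, hours 0-6
    "M": range(7, 11),    # morning, hours 7-10
    "N": range(11, 15),   # noon, hours 11-14
    "AN": range(15, 19),  # afternoon, hours 15-18
    "E": range(19, 24),   # evening, hours 19-23
}


def hour_to_period(hour):
    """Map an hour of the day (0-23) to its clustered category value."""
    for name, hours in PERIODS.items():
        if hour in hours:
            return name
    raise ValueError(f"hour out of range: {hour}")
```

After this mapping the 24 original category values collapse to 5, and the indicators and feature quality graph can be recomputed on the new feature.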

1. The feature polar coordinate visualization method proposed by the invention visualizes feature category values for the first time. Category value indicators comprising the two dimensions of relevance and comprehensive occurrence rate are shown as a two-dimensional graph: each category value of a feature (numerical features must first be discretized into category values) is mapped to a point in the polar coordinate system; the stability or drift ratio of the category value, and the direction in which its occurrence rate changes, are then judged from the radial coordinate of the point, and the deviation of the occurrence rate from the mean is judged from the angular coordinate of the point.

2. The feature polar coordinate visualization method proposed by the invention visualizes feature quality for the first time. Feature quality, comprising the two dimensions of relevance and stability, is shown as a two-dimensional graph: each category in the category set of a feature (or in the category set formed by discretizing a numerical feature) is mapped to a point in the polar coordinate system, and four pattern criteria for the feature quality graph are proposed, so that feature relevance, feature stability and feature quality can be judged from the overall shape of the point set.

4. The feature polar coordinate visualization method and the polar coordinate visualization feature evaluation flow proposed by the invention on the one hand improve the intuitive understanding of a prediction problem, produce a highly interpretable feature evaluation report, deepen the understanding of the modelling problem and assist manual feature selection and feature improvement; on the other hand they allow feature selection and feature improvement to be carried out according to the feature evaluation report, so that a subsequent supervised machine learning model can overcome the adverse effects of feature instability on heterogeneous data sets (and equally on homogeneous data sets) and learn more effectively.

The embodiments described above merely illustrate the technical idea and features of the invention. Their purpose is to enable those skilled in the art to understand and implement the invention; they do not limit its patent scope. Any equivalent change or modification made in the spirit disclosed by the invention still falls within the patent scope of the invention.

Claims

1. A method for visualizing feature quality on heterogeneous data sets, characterized by comprising at least the following steps:

Step 4, for each heterogeneous data set D(d), where d is A or B, and for each category value v in V, calculate its standardized occurrence rate sr(v, d) = r(v, d)/r(d);

Step 7, construct an auxiliary circle in the polar coordinate system, with radius 1 and centred on the origin, to form the feature quality graph of feature f on the heterogeneous data set D.

2. The method for visualizing feature quality on heterogeneous data sets according to claim 1, characterized in that, in step 1, the "construct the set of category values of the feature" operation is performed as follows:

3. The method for visualizing feature quality on heterogeneous data sets according to claim 1, characterized in that its feature evaluation flow comprises at least the following steps: