CN102385719A - Regression prediction method and device - Google Patents

Regression prediction method and device

Info

Publication number
CN102385719A
Authority
CN
China
Prior art keywords
data point
data
predicted
dimension
value
Prior art date
Legal status
Pending
Application number
CN2011103392241A
Other languages
Chinese (zh)
Inventor
李锐
张帅
王斌
李鹏
张冠元
鲁凯
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN2011103392241A priority Critical patent/CN102385719A/en
Publication of CN102385719A publication Critical patent/CN102385719A/en


Abstract

The invention provides a regression prediction method in which not only the similarity between the independent variables X but also the similarity between the dependent variables Y of the raw data is taken into consideration, and the development of the output value y is modelled from the historical perspective of the near neighbours. Compared with conventional models that do not consider how the data develop, only one preprocessing stage is added over the data set, so the information of the data points can be enriched without extra resources; the information of the raw data points X is enriched, and the prediction effect is finally improved. Furthermore, the regression prediction method can be realised on a MapReduce framework, so the parallelism of the framework can be exploited to increase the execution speed.

Description

Regression prediction method and device
Technical field
The invention belongs to the field of statistical regression analysis and prediction, and relates in particular to a regression prediction method and device for statistical machine learning.
Background art
Regression analysis (Regression Analysis) is a statistical method for analysing data; it is mainly used to find out whether a particular relationship exists between data. Regression analysis builds a model of the relationship between the dependent variables Y (also called response variables) and the independent variables X (also called predictors or independent variables). In statistical machine learning, regression prediction methods are mainly used to forecast and analyse data. X is generally multi-dimensional and Y is generally numeric, which is called multiple regression; according to the regression equation it can further be divided into linear regression, non-linear regression and so on. The most basic linear regression formula is Y = βX + β0.
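For illustration only, the following minimal sketch fits this basic linear formula by ordinary least squares in Python with NumPy; the data values and variable names are invented for the example and are not taken from the patent:

```python
import numpy as np

# Invented sample data: each row of X is a data point, y holds the numeric outputs.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.1, 2.9, 7.2, 6.8])

# Append a constant column so the intercept beta_0 is estimated together with beta.
Xa = np.hstack([X, np.ones((X.shape[0], 1))])

# Ordinary least squares: minimise ||Xa @ beta - y||^2.
beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)

x_new = np.array([2.5, 2.5, 1.0])  # a new point, with the constant column appended
print(beta, x_new @ beta)          # coefficients and the predicted Y = beta*X + beta_0
```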
Existing regression prediction methods have the following two problems. First, because of missing data or the absence of feature selection, the original data points themselves sometimes do not contain enough information to predict the output by regression (this problem is referred to below as missing features). Second, the data on some dimensions of a data point X may not be numeric and may not follow the changing pattern and range of numeric values, such as angles of a periodic type, sex of a Boolean type, or colours of an enumerated type; this affects the effectiveness and accuracy of regression prediction to a certain extent (this problem is referred to below as heterogeneous features). To overcome these problems, existing methods all rely on experience to perform simple format conversions of the features, which is neither standardised nor extensible: when the data set changes even slightly, the format conversion method has to be changed. The problems of missing features and heterogeneous features therefore cannot be solved well.
In addition, with the development of cloud computing technology, platforms for massively parallel data processing have appeared, for example MapReduce and Hadoop. Researchers have studied how to implement regression prediction methods on these platforms, hoping to exploit the parallelism of these cloud computing platforms to improve the performance of regression prediction. For example, locally weighted linear regression (LWLR) based on MapReduce dynamically finds, for each newly input data point to be predicted, a number of neighbours in the original data set and performs local linear regression on the neighbours' data to obtain the prediction function; that is, neighbour search and regression prediction must be done for every data point to be predicted. First, the neighbours of the data point to be predicted are found according to the similarity (which can also be called distance) of the independent variables; then a curve is fitted to the neighbours to obtain the prediction function; finally the output value of the point to be predicted is predicted with the prediction function.
The benefit of LWLR is that it is easy to parallelise and that, since it predicts from the neighbours' data, it considers the relations between the independent variables, which can improve the prediction accuracy to a certain extent. However, because it skips the matrix inversion stage, it cannot consider the relations between the dependent variables Y of the original data points X, nor the relation between the original data points X and the output y_new of the data point x_new to be predicted. In other words, accurate neighbours of the data point to be predicted are not easy to find, and whether the neighbours are accurate has a decisive influence on the quality of the prediction result. Moreover, this method does not solve the problems of missing features and heterogeneous features either.
Summary of the invention
The object of the invention is therefore to overcome the above defects of the prior art and to provide a feature extension method for regression prediction, which uses the predicted values (y) corresponding to the original data (X) to enrich the information of the data points and thus improve the regression prediction.
The object of the invention is achieved through the following technical solutions:
In one aspect, the invention provides a feature extension method YET (Y-axis ExTension) for regression prediction, said method comprising:
selecting, among the original data points, the neighbours of the data point to be predicted, said neighbours being a series of original data points whose values on one or several dimensions are equal or similar to those of the data point to be predicted;
using these neighbours and their corresponding dependent variable values to extend the dimensions of the original data points and of the data point to be predicted.
In another aspect, a feature extension method based on MapReduce is provided, said method comprising:
step 1) selecting, among the original data points, the neighbours of the data point to be predicted, said neighbours being a series of original data points whose values on one or several dimensions are equal or similar to those of the data point to be predicted;
step 2) splitting each original data point into D2-D1+1 parts, where D2 is the dimension of the original data point after extension and D1 is its dimension before extension, each part being a (key, value) pair in which the key is the identifier of the data point that should receive this part and the value contains the index of the dimension to be extended in the data point receiving this part and the dependent variable value of the original data point sending this part;
step 3) each original data point extracting, from the received data, the dimension indices and dependent variable values contained in the values and extending its own dimensions accordingly.
In another aspect, a regression prediction method is provided, said method comprising:
step a) extending the dimensions of each original data point X with the above feature extension method to obtain extended data points;
step b) performing regression prediction for the data point to be predicted based on the extended data points.
In another aspect, a regression prediction method based on MapReduce is provided, the method comprising:
step 41) extending the dimensions of each original data point X with the above feature extension method to obtain extended data points;
step 42) based on the extended data points, computing the similarity to the data point to be predicted and distributing (key, value) pairs, in which the key is the identifier of the data point to be predicted and the value is the identifier of an extended data point together with its similarity to the data point to be predicted;
step 43) based on the computed similarities, selecting the K extended data points most similar to the data point to be predicted, and performing regression prediction for the data point to be predicted with locally weighted linear regression.
In the above regression prediction method, said step 42) may use the KL distance, the cosine distance or the Euclidean distance to compute the similarity for different extended dimensions.
In another aspect, a regression prediction device based on MapReduce is provided, said device comprising:
a device for extending the dimensions of each original data point X with the above feature extension method to obtain extended data points;
a device for computing, based on the extended data points, the similarity to the data point to be predicted and distributing (key, value) pairs, in which the key is the identifier of the data point to be predicted and the value is the identifier of an extended data point together with its similarity to the data point to be predicted;
a device for selecting, based on the computed similarities, the K extended data points most similar to the data point to be predicted, and performing regression prediction for the data point to be predicted with locally weighted linear regression.
In another aspect, a supervised machine learning method is provided, said method comprising:
1) performing feature extraction and dimension reduction on the training data to form data points X (x1, x2, ...) with labels y;
2) extending the data points X with the above feature extension method;
3) selecting the model form for predicting y from the extended data points, determining the model parameter types and the number of parameters, and training on the training set;
4) applying the trained model and parameters to regression prediction or classification, finally obtaining the regression prediction result or the classification result.
In the above machine learning method, the model form for predicting y from X in step 3) may be a regression prediction model, and said step 4) may use the above regression prediction method to predict and obtain the prediction result.
The above machine learning method can be used for weather forecasting, disease prediction, prediction of users' purchasing behaviour, music recommendation, friend recommendation in social networks, book recommendation, game outcome prediction, information retrieval, spam classification, news importance prediction, and the like.
Compared with the prior art, the invention has the following advantages:
Not only is the similarity between the independent variables X considered, but also the similarity between the dependent variables Y of the original data, and the development pattern of the output value y is considered from the perspective of the neighbours and the neighbours' history.
Compared with previous models that do not consider the development pattern of the data, the invention adds only one preprocessing stage over the data set and enriches the information of the data points without extra resources. In terms of execution speed, the time complexity added by this preprocessing is the N/M required for scanning the data, where N is the number of data points and M is the number of Mappers of MapReduce. In terms of effect, the information of the original data points X is enriched, and the prediction accuracy is ultimately improved.
Description of drawings
The embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of the regression prediction method according to an embodiment of the invention;
Fig. 2 is a schematic structural diagram of the regression prediction device according to an embodiment of the invention;
Fig. 3 is a comparison of the prediction effect of conventional linear regression with that of the regression prediction according to an embodiment of the invention.
Embodiment
To make the objects, technical solutions and advantages of the invention clearer, the invention is explained further below through specific embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are intended only to illustrate the invention, not to limit it.
To better understand the invention, some background knowledge is introduced first.
MapReduce (Jeffrey Dean, Sanjay Ghemawat. MapReduce: a flexible data processing tool [J]. Communications of the ACM, January 2010, v.53, n.1) is a parallel framework (cloud computing framework) for large-scale data proposed by Google in recent years. It is also a programming model and specification for large-scale data processing that provides good low-level encapsulation and makes writing parallel programs convenient. MapReduce adopts a divide-and-conquer approach; in its basic form it has two processing stages, map (mapping) and reduce (reduction). A large-scale data processing task is split into many subtasks, which are distributed to several machines and completed in parallel as a batch job. The map stage converts the original input (generally key/value pairs) into intermediate results, and the reduce stage merges, sorts and outputs the intermediate results produced earlier. The framework takes care of much of the difficult work, such as data partitioning, scheduling, co-location of data and code, inter-process synchronisation and communication, fault tolerance and crash handling, and load balancing, and makes these functions transparent to the developer. The developer therefore only needs to implement interfaces such as map and reduce, without paying attention to the underlying system, to easily develop parallel programs on a distributed cluster.
The traditional regression prediction method can be implemented with MapReduce. However, the traditional regression prediction method needs matrix inversion or gradient descent for its solution; if block-parallel computation is used, each block of data needs global information before the matrix inversion can be completed, and the same holds for gradient descent. The MapReduce framework, on the other hand, has the drawbacks that global information is not easy to share and random disk access is inefficient. Such traditional regression prediction software therefore cannot exploit the parallelism of the MapReduce framework well to improve performance.
Locally weighted linear regression (LWLR) based on MapReduce predicts from the neighbours' data and skips the matrix inversion stage, so it can exploit the parallelism of MapReduce. But, as mentioned above, it has the problem that accurate neighbours of a new data point are not easy to find, and it does not solve the problems of missing features or heterogeneous features. The basic steps of LWLR are as follows. First the data format is regularised, identifying the independent variables X (generally multi-dimensional, hence the capital X; each dimension of X can also be called an attribute or column) and the dependent variable y (generally a one-dimensional predicted value, hence the lowercase y); each data record generally has the form (x1(i), x2(i), ..., xj(i), ..., xn(i), y(i)), where the subscript j ∈ [1, n] indexes the columns (attributes) and the superscript i ∈ [1, m] is the number of the original data point, so the original data are expressed as a large m×(n+1) matrix. Then, after a new data point X_new = (x_new,1, x_new,2, x_new,3, ..., x_new,n) is received, the Euclidean distance between X_new and each original data point X is computed as the similarity, the K most similar original data points are chosen, a regression model h(θ) is trained from these top-K points, and finally the dependent variable y is predicted with the trained regression model h(θ). (C. Chu, S. Kim, Y. A. Lin, et al. Map-reduce for machine learning on multicore [C] // NIPS 19, 2007.)
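To make these steps concrete, the following is a minimal sketch of LWLR for a single query point, written in Python with NumPy; the Gaussian weighting with bandwidth tau and the helper name lwlr_predict are illustrative assumptions, not the reference implementation of the cited work:

```python
import numpy as np

def lwlr_predict(X, y, x_new, k=5, tau=1.0):
    """Locally weighted linear regression for one query point.
    X: (m, n) original data points, y: (m,) outputs, x_new: (n,) query point."""
    # 1) Similarity as the Euclidean distance to every original point.
    dist = np.linalg.norm(X - x_new, axis=1)
    # 2) Keep the K most similar original data points.
    idx = np.argsort(dist)[:k]
    Xk, yk, dk = X[idx], y[idx], dist[idx]
    # 3) Gaussian weights: closer neighbours influence the local fit more.
    w = np.exp(-(dk ** 2) / (2.0 * tau ** 2))
    # 4) Weighted least squares on the neighbours, with an intercept column.
    Xa = np.hstack([Xk, np.ones((Xk.shape[0], 1))])
    W = np.diag(w)
    theta = np.linalg.pinv(Xa.T @ W @ Xa) @ (Xa.T @ W @ yk)
    # 5) Prediction h(theta) for the query point.
    return float(np.append(x_new, 1.0) @ theta)
```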
According to one embodiment of the invention, a feature extension method for regression prediction is provided. This feature extension method considers not only the relations between the original data points (the independent variables X) but also the relations between the dependent variables Y of the original data points. By recombining and extending the attributes of X with the dependent variable values y of the 'neighbours' of the original data points, it enriches the features of the original data points and of the test data points.
For convenience of explanation, the independent variables are denoted X (X1, X2, X3, ...), the extended independent variables are denoted X+, and the dependent variables corresponding to the independent variables are denoted Y (y1, y2, y3, ...). The data point to be predicted is denoted X_new, and the prediction output is y_new.
More specifically, the method comprises the following steps:
Step 1: select the 'neighbours' of the data point to be predicted among the original data points.
The dependent variables y corresponding to these 'neighbours' are used in step 2 below to extend new features. In this embodiment, 'neighbours' are defined as a series of original data points whose values on one or several dimensions are equal or similar to those of the data point X_new to be predicted.
For each dimension Xi of the data point X_new to be predicted, domain knowledge and experience, possibly combined with existing pattern mining methods such as Apriori, GSP and PrefixSpan, are used to find the required original data points (Xi1, Xi2, Xi3, ...) as 'neighbours'; these neighbours can serve as background off-line knowledge.
An example using domain knowledge: suppose the price y of a product is predicted from its attributes X, and one column of X, 'place of origin', contains country names; experience suggests that larger regions are better features, i.e. 'made in Europe', 'made in Asia' and so on have a greater influence on the result. All original data points whose value in this column is a European country can therefore be regarded as 'neighbours', and their y values can be used as the extension.
Another example, using pattern mining: a simple approach is to analyse, from the regression equation obtained in earlier training, which features are useful (for example, features with larger coefficients) and which features should be useful but do not play their expected role (for example, floor area x intuitively affects the price y, but its coefficient is small). Such under-used features need to be extended.
Yet another example: when judging from user preferences X whether a user likes a certain product y, statistics may show that 'likes online shopping' and 'often stays up late' are strongly associated. Original data points whose values on these two columns are identical or similar can then be taken as 'neighbours', and their y values can be used as the extension, which also compensates for insufficient neighbours. As known off-line knowledge, similar information can then be used to extend new data sets and other similar data sets.
Step 2: use these neighbours and their corresponding dependent variables y to extend the dimensions of the original data points (the independent variables X) and of the test data points. One or more dimensions can be added; the independent variables obtained after extending X are denoted X+. The number of added dimensions can be determined according to the actual requirements, the size of the data set and the affordable algorithmic complexity; which dimensions to add can be determined from domain knowledge, experience, pattern mining, user preferences, user requirements, and so on.
The two steps above are explained in more detail below with a concrete example.
For example, the sales volume of a product of a certain company is to be predicted, and some original data already exist; sample data are shown in Table 1. The original data are the data before October 2011, and the data point to be predicted is 108002. The dimensions of the independent variable X are: supply of raw material A, supply of raw material B, month, investment, product type and product colour, six columns (six attributes or dimensions) in total; the output value Y is the sales volume of the product.
Table 1
(Table 1 is reproduced as an image in the original publication.)
First, the extension can proceed from the time angle, according to some domain knowledge or experience. Experience suggests, first, that the sales volume of the product does not differ greatly within half a year, so the sales of adjacent months are correlated; and second, that the product has busy and slack seasons, so the sales of the same month in different years are also somewhat correlated. For data point 106864, on the 'month' dimension, the previous month's point 106863, the point 106862 of the month before that, and the data point of the same month of the previous year are all its 'neighbours', and the y values of these neighbours are extended onto the corresponding dimensions. The extension on 'month' therefore adds to each X: last month's sales (X7), the sales of the month before last (X8), and the year-on-year sales of the same month last year (X9). This can be called the 'history of the historical data' on the time dimension: the 'neighbours' of the data point to be predicted, X_new (108002), include the data points 106864, 106865, etc., and the historical sales y = 334 of the 'neighbour' 106863 of its neighbour (historical data) 106864 are extended into the 7th dimension X7 of 106864. In other words, the prediction of 108002 uses the historical sales y of its historical data 106864 (the sales 334 of 106863).
From the product-design angle, the sales of models in the same series are correlated, so the extension can also be triggered from the model angle by extending on 'model', i.e. adding to X the sales of products of the same series (X10).
Pattern mining methods such as Apriori, GSP and PrefixSpan can also be used to find that the ratio of the supply of A (X1) to the supply of B (X2) is related to a certain interval of the sales y, for example X1/X2 = 0.7 together with y ∈ [298, 335] occurring frequently. Original data points with similar A/B ratios can therefore be used as 'neighbours'; for instance, the A/B ratios of data points 106861 and 106862 in the table are both about 0.5, so 106861 and 106862 are each other's 'neighbours'.
Next, according to the selected neighbours and their y values, the dimensions of each original data point X are extended; the extended independent variables and their corresponding dependent variables are denoted (X+, y).
In one embodiment, MapReduce is used to implement the extension step:
Step 21, the Map stage: each original data point is split into D2-D1+1 parts, where D2 is the dimension of X+ after extension and D1 is the dimension of X before extension. Each part is a (key, value) pair, in which the key identifies the original data point that should receive this part (the id of the receiving data point can be used as the key), and the value consists of the index of the column (dimension) to be written (for example dimension 7, 8 or 9 above) and the y value of the original data point that sends this part.
Step 22, the Reduce stage: collect the received information, arrange the data by column index (dimension) and y value, and output them.
The extension steps are again explained with the sample data shown in Table 1:
For each original data point x, (key, value) pairs are distributed according to the dimensions to be added; for example, if k dimensions are added, k pairs are distributed for each original data point x. The extension dimensions for 'month' are handled first: for each data point x the target key is generated and the pair is distributed (step 21). Taking data point 106863 as an example, the pair sent is (key, value) = (106864, (7, 334)), where the key is the key of the data point of the following month and the value (7, 334) indicates that y = 334 is to be merged into column 7 of the extended data. Because the month of 106863 is 2010.12, it is the 'history' of the following month 2011.1, corresponding to 'last month's sales', i.e. column 7. Likewise, for the month after next it would be 'the sales of the month before last', i.e. column 8, so the pair sent to the data point of the month after next carries the value (8, 334).
Taking 106862 as another example: according to the analysis above, the y of 106862 should be extended into column X8 of 106864. The id of the receiving sample point is used as the key, and each value consists of the column (dimension) index and the corresponding y value, so for this part the output of 106862 is key = 106864, value = (8, 325). The details of distribution and collection are handled by the MapReduce framework; this method only needs to emit the data described above.
The new extended data points are then generated (step 22): the data with the same key are received and merged into a new record according to the columns. For example, the data point with key = 106864 receives the three values (7, 334), (8, 325) and (9, ?), so column 7 of the extended data is 334, column 8 is 325, and so on, where the question mark stands for the sales y of 2010.1. After this is done, the extended data gain three dimensions (X7 = 334, X8 = 325, X9 = ?).
For the extension of the 'model' column (extending to X10), a similar method is used: data points with the same model are treated as 'neighbours', the Map and Reduce stages are run again, and the extension of column X10 is completed.
After completion, the original data point 106864 in this example becomes, after extension, (33, 38, 2011.1, 120, AF002, red-blue, 334, 325, ?, 371), extended from the original 6 dimensions to 10 dimensions.
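The following is a minimal Python sketch of the Map and Reduce functions of steps 21 and 22 for the 'month' extension; the rule that the point one month later has id+1 and the point two months later has id+2, and the exact column values used for 106864, are simplifying assumptions for illustration only:

```python
# Original columns of the receiving point (taken loosely from Table 1) and the
# y values of the two sending points used in the text (106862: 325, 106863: 334).
x_columns = {106864: [33, 38, "2011.1", 120, "AF002", "red-blue"]}
sender_y = {106862: 325, 106863: 334}

def expand_map(point_id, y):
    """Step 21 (Map): emit the (key, value) parts of one original point.
    Assumed rule: consecutive ids are consecutive months, so the point one month
    later (id+1) receives column 7 (last month's sales) and the point two months
    later (id+2) receives column 8 (sales of the month before last)."""
    yield point_id + 1, (7, y)
    yield point_id + 2, (8, y)

def expand_reduce(point_id, parts, d2=10):
    """Step 22 (Reduce): merge the collected parts into one extended point."""
    extended = list(x_columns[point_id])        # original columns 1..D1
    extended += [None] * (d2 - len(extended))   # room for the extended columns
    for col, y in parts:                        # fill columns 7, 8, ...
        extended[col - 1] = y
    return point_id, extended

# Tiny driver imitating the shuffle that the MapReduce framework would perform.
shuffled = {}
for pid, y in sender_y.items():
    for key, value in expand_map(pid, y):
        shuffled.setdefault(key, []).append(value)

print(expand_reduce(106864, shuffled[106864]))
# -> (106864, [33, 38, '2011.1', 120, 'AF002', 'red-blue', 334, 325, None, None])
```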
Using the above extension method for regression prediction can solve the problems of missing features and heterogeneous features of traditional regression prediction. For the heterogeneous feature problem, for example, traditional methods generally convert all non-numeric features into numbers, such as converting Wednesday and Friday into 3 and 5; the converted data then differ from the original data: 5 is greater than 3, but Friday is not simply 'greater than' Wednesday, and the output (y) of Friday is not simply greater or smaller than the output (y) of Wednesday. Other methods convert such features into enumerations for comparison, so that the similarity of Saturday with Saturday is 1 and its similarity with every other day is 0; this also loses some information, since Saturday and Sunday may well have some similarity. Yet other methods assign a similarity between Saturday and Sunday empirically, but without sufficient theoretical support. The feature extension method of the embodiment of the invention, by contrast, explicitly uses y as the feature extension: for example, the sales y of Fridays can be used as an extended feature, so that the newly added dimensions of X are obtained from y data and are therefore all numeric features. Because the predicted value y has no heterogeneity problem, the extended features can be used effectively as new features for regression prediction. This greatly improves the quality of the features and ultimately the prediction. For instance, the month entries '2010.11' and '2011.1' in Table 1 are data of a different type, and converting them into numeric features is a tricky problem; using the y values of their 'neighbours' as features solves this problem neatly.
It should be pointed out that the feature extension method described above can also be applied to many supervised learning methods, because features play an important role in many supervised and semi-supervised (the original data have y) machine learning methods, such as classification. It is generally believed that the choice of classification algorithm affects the result, but that when the parameters are optimal this influence is not the largest; the quality of the features has a larger influence on the result. Therefore, if the feature extension method of the embodiment of the invention is also used to enrich the features in classification, the effect can likewise be improved in the end.
According to yet another embodiment of the invention, a regression prediction method based on the above feature extension method is also provided. This regression prediction method first selects the 'neighbours' of the data point to be predicted among the original data points; then, according to the selected neighbours and their y values, extends the dimensions of each original data point X to obtain X+; and then performs regression prediction for the data point X_new to be predicted based on X+ to obtain the predicted value y_new.
As can be seen above, the feature extension method according to the embodiment of the invention can be implemented on the MapReduce framework; therefore, in one embodiment, MapReduce can be used to implement the above regression prediction method, in which LWLR is used, based on X+, to perform regression prediction for the data point X_new and obtain the predicted value y_new.
However, as indicated above, the LWLR method has problems in the neighbour computation; in LWLR the neighbours are generally computed from the Euclidean distance between data points. Therefore, in yet another embodiment, after the features of the original data have been extended with the above feature extension method, a more flexible distance computation scheme is used to find the neighbours x+ of the new data point x_new, and finally the prediction function is trained with these neighbours to predict y for the new data point.
It should be pointed out that X and its extension can be serialised structures, trees or even graphs, but in the data they can only be expressed as different dimensions, and the structural information is recorded separately. In the embodiment above, for example, X7, X8 and X9 represent the historical development trend of the monthly sales. For 106864, on the 'month' dimension, the previous month 106863, the month before last 106862 and the data point of the same month of the previous year are all its 'neighbours', and their corresponding y values are extended into X7, X8 and X9; what is used as the extension is in fact the ordered monthly sales y, i.e. serialised data with an order, so when the similarity is computed the KL distance (Kullback-Leibler divergence, also called relative entropy) can be chosen as the distance over X7, X8 and X9; see Table 2 for details.
For example: X7, X8 and X9 have an ordinal relation. If only the Euclidean distance were computed over these three dimensions, the following problem would arise: the Euclidean distance treats every dimension alike and uses the sum of squared residuals, so for a point A(1, 2, 3) the points B(2, 3, 4) and C(2, 1, 2) are at the same distance; but in fact A and B both rise gradually (for monthly sales, they increase month by month) while C swings up and down, and the Euclidean distance cannot capture this difference. The KL distance computes the relative entropy and reaches the correct conclusion that B is closer to A.
Table 2
(Table 2 is reproduced as an image in the original publication.)
It should be pointed out that, in the course of computing the similarity, different distance computation methods can be used for different dimensions. For example, the Euclidean distance D can be used directly for the original dimensions X1 to X6, while the extended dimensions X7, X8 and X9, which have an ordinal relation, use the KL distance D_KL according to Table 2. The final similarity can be obtained with various fusion methods, for example the weighted sum of the two (λ1·D + λ2·D_KL), the minimum of the two, min(D, D_KL), or the maximum of the two, max(D, D_KL). Which scheme to choose depends on the concrete situation; the results obtained with the various fusion methods can be compared on the data set, and the concrete parameters can also be obtained by training.
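As an illustration of such a mixed measure, a sketch follows; normalising the ordered sales into distributions before taking the KL divergence, and the split a[:6] / a[6:9], are assumptions made so that the example is self-contained and well defined:

```python
import numpy as np

def kl_distance(p, q, eps=1e-9):
    """KL divergence (relative entropy) between two positive sequences,
    normalised to distributions so that the formula is well defined."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def mixed_distance(a, b, lam1=0.5, lam2=0.5):
    """Euclidean distance on the (here assumed numeric) original columns a[:6],
    KL distance on the ordered extension X7..X9 in a[6:9],
    fused as the weighted sum lam1*D + lam2*D_KL."""
    d = np.linalg.norm(np.asarray(a[:6], dtype=float) - np.asarray(b[:6], dtype=float))
    d_kl = kl_distance(a[6:9], b[6:9])
    return lam1 * d + lam2 * d_kl

# The example from the text: B follows A's rising trend while C fluctuates,
# so the KL part correctly ranks B closer to A than C.
A, B, C = [1, 2, 3], [2, 3, 4], [2, 1, 2]
print(kl_distance(A, B), kl_distance(A, C))  # the first value is the smaller one
```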
Compared with the original data before extension: first, the feature dimensionality has been extended, the 6-dimensional features in the table having been extended to 10-dimensional features; second, the new features are numeric and can carry structured relations, for example X7, X8 and X9 have a sequence relation, so the sequence can be used to predict the sales y; finally, the neighbours found are not only points close in X but also points close in y, close not only in a single value but also in the sequence and even in the structure. On the basis of these three points, because the feature extension enriches the information of the original data points, the neighbours of the test data point can be found more accurately and more richly, which improves the quality of the final regression function and the prediction accuracy.
In one embodiment, the regression prediction method can also be implemented on MapReduce; in particular, it can be divided into the following stages (a code sketch of these stages follows the list):
Step 31, the Map (mapping) stage: compute the similarity of each X+ to the point X_new to be predicted. The similarity can be computed flexibly, either with the distance computation methods given above or with other existing methods as needed. The output key is the id of the point to be predicted, and the value is X+ together with the similarity. Note that the order of the extended dimensions is generally fixed: X7, X8 and X9 all represent the corresponding sales of different historical months (possibly increasing month by month) and are therefore sequential, so distance formulas that ignore the sequence (such as the cosine distance) are unsuitable here.
Step 32, the Shuffle and Sort stages: the Shuffle stage is executed automatically by the MapReduce framework, and the Sort stage can use existing techniques to sort by similarity (for example heap sort) and find, for each point X_new to be predicted, the K most similar extended data points X+ and their y values.
Step 33, the Reduce (reduction) stage: collect the information produced before (the K most similar data points can also be filtered once more); because the number of data points at this stage is small, various regression prediction methods can be called to predict y_new and output it.
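A minimal sketch of these three stages as a map/reduce pair is given below; the plain-Python organisation, the reuse of a generic similarity function, and the unweighted local least-squares fit in the reducer are illustrative choices, not the only possible realisation:

```python
import heapq
import numpy as np

def predict_map(extended_records, new_points, similarity):
    """Step 31 (Map): for each extended point X+ and each point to be predicted,
    emit (id of the point to be predicted, (similarity, x_plus, y))."""
    for new_id, x_new in new_points:
        for _pid, x_plus, y in extended_records:
            yield new_id, (similarity(x_plus, x_new), x_plus, y)

def predict_reduce(candidates, x_new, k=5):
    """Steps 32 and 33 (Sort + Reduce): keep the K most similar extended points
    and fit a local linear model on them to predict y_new."""
    top_k = heapq.nlargest(k, candidates, key=lambda c: c[0])
    Xk = np.array([c[1] for c in top_k], dtype=float)
    yk = np.array([c[2] for c in top_k], dtype=float)
    Xa = np.hstack([Xk, np.ones((Xk.shape[0], 1))])  # add an intercept column
    theta, *_ = np.linalg.lstsq(Xa, yk, rcond=None)  # local regression on the top K
    return float(np.append(np.asarray(x_new, dtype=float), 1.0) @ theta)
```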
The example of Table 1 is again used to discuss the steps of performing regression prediction for the test data:
1) The previously generated data x+ are used to compute the similarity to the data point x_new (108002) to be predicted. Several computation methods are possible and can be looked up in Table 2: for a single extended y (from one angle, i.e. one iteration of the preprocessing that added only one dimension) the simple Euclidean distance suffices; for several y values, such as the month-angle extension described above, which added the 3 dimensions X7, X8 and X9 with a sequence relation between them, the similarity should be computed jointly, so Table 2 indicates the KL distance.
In this example, for 106864, the order of the features (7, 334), (8, 325), (9, ?) is fixed; apart from the traditional cosine distance and Euclidean distance, other distances can also be used to compute the similarity. For example, the negative KL distance can be used as the similarity to find data points with a similar development trend. The computed similarity is then added to the value.
2) In the local combine of MapReduce, according to the similarity S between (108002) and each data point computed above, the top K most similar points in each data block are found and output to the next step. The output format is, as before, key: 108002, value: (106864, S);
3) The most similar data points collected in the previous step are gathered; for the point X_new 108002, the previously extended data 106861, 106862, 106863, 106864, 106865, etc. are used to perform ordinary linear regression or non-linear regression, etc., and the product sales y_new of the pending data point 108002 is predicted. Because there can also be many data points X_new to be predicted, this step is still executed in parallel in the MapReduce framework.
Fig. 1 shows a schematic flow chart of the regression prediction method according to an embodiment of the invention. The method mainly comprises the following steps, of which steps 3-5 form the off-line processing phase and steps 6-8 the on-line processing phase:
1) read in the raw data and the data to be predicted (there can be several);
2) if the data have already been processed off-line, skip to step 6;
3) use domain knowledge and experience or techniques such as pattern mining to obtain the features that need to be extended, and write the configuration file;
4) Map stage: distribute the information of each original data point according to step 3, the key being the receiver id and the value being the extended dimension and the extension value;
5) Reduce stage: collect all information by key, merge it and save it as new data points x+;
6) Map stage: on the extended data, compute the similarity between the data x_new to be predicted and the extended original data x+, using the more flexible distance formulas;
7) Reduce stage: select the K most similar data points Xi+, i ∈ [1, K];
8) use the K data points Xi+, i ∈ [1, K], to predict the required y by regression.
Fig. 2 shows a schematic block diagram of the regression prediction device according to an embodiment of the invention. The device mainly comprises the following modules:
Data analysis module: analyses the original data and performs pattern mining to obtain the X dimensions that need to be extended;
Data preprocessing module: according to the output of the data analysis module, extends the features of the original data with the feature extension method introduced above;
Regression prediction module: uses the extended data and the new distance computation methods mentioned above to obtain more effective neighbours, and finally outputs the predicted value by regression.
Fig. 3 shows a comparison of the prediction effect with and without the feature extension method proposed by the invention.
The left figure shows the effect of regression prediction without the feature extension method proposed by the invention; the points in the figure are the chosen 'neighbour' points (in the embodiment described here, the abscissa can be taken as the month and the ordinate as the sales), and the dashed line is the linear prediction function.
The right figure shows the effect of regression prediction after applying the feature extension proposed by the invention (in this embodiment, the abscissa can be taken as the extended X7, X8, X9 and y, representing last month, the month before last, the same month last year and this month, and the ordinate as the sales value); the dashed line is the historical sales trend of the point X_new to be predicted. The y value corresponding to the point in the upper right corner is the predicted value.
As can be seen from Fig. 3, because the feature extension method provided by the invention is adopted, the development trend of the dependent variable y can be grasped as a whole, so a more pronounced effect is obtained. This effect also clearly shows the relation between the final predicted value y_new and the extended data points x+, and can provide a proper explanation for the prediction of the invention: the new data point is closer in its attributes to the original data points shown in the figure, and the development trends of the dependent variable also converge, so the data after extension can make a better prediction for the point to be predicted.
In summary, in the method of the embodiment of the invention, useful pattern information and historical information are obtained by mining the relations and patterns between the dependent variables Y of the data points, so that the internal patterns of the original data are screened. The mined patterns are mainly aimed at non-numeric features, in order to solve the heterogeneous feature problem. Generally speaking, time and structured dimensions (dimensions whose values contain inclusion or association relations, such as the place of origin) are the focus of the pattern mining. The number of extended dimensions is generally judged from the requirements and the size of the data set, and should be chosen under the premise of acceptable algorithmic complexity.
Moreover, for the data point x_new to be predicted, similar Y are mined and X is extended to obtain X+ and Y; this is implemented with a MapReduce algorithm, so that data mining and preprocessing are completed quickly and in parallel under a unified MapReduce framework. Compared with previous models that do not consider the development pattern of the data, the invention adds only one preprocessing stage over the data set and enriches the information of the data points without extra resources. In terms of execution speed, the time complexity added by this preprocessing is the N/M required for scanning the data, where N is the number of data points and M is the number of Mappers of MapReduce. In terms of effect, the information of the original data points X is enriched, and the prediction accuracy is ultimately improved.
In addition, using X+, Y and X_new to predict y_new turns what used to be the prediction of a single value into a prediction over data points that carry structural information (that is, the sequence in the embodiment above), with a more suitable similarity measure (such as the KL distance used in the embodiment above). The prediction effect on the data is therefore greatly improved, and the reason for the predicted value, together with the corresponding data, can also be given.
According to one embodiment of the invention, a supervised machine learning method is also provided:
1) perform feature extraction and dimension reduction on the training data to form X (x1, x2, ...) with labels Y;
2) extend X with the feature extension method introduced in the embodiments above;
3) select the model form for predicting y from x, and determine the model parameter types and the number of parameters;
4) on the training set, train the parameters that give the best effect with various learning methods;
5) apply the trained model and parameters to the concrete problem, finally achieving the purpose of machine learning and prediction, for example giving the regression prediction equation and the prediction result, or giving the classifier and obtaining the classification result.
In practical applications, the features of the training data are often insufficient but have a great influence on the final result; using the feature extension method of the invention can overcome this problem.
As another example, the method of the invention can also be used for weather forecasting.
Weather forecasting can use rule-based methods or statistical machine learning methods. The main steps of the latter are, for example: 1) collect weather data, with temperature, humidity, longitude, latitude, date, season, etc. as X and whether it rains as Y; 2) select a classification model, such as logistic regression; the number of parameters is determined accordingly and equals the dimension of X; 3) use the known weather data of every past day to train the best parameters; 4) input the data of tomorrow, the day after tomorrow, next week, etc. into the trained model and parameters to obtain the rain/no-rain result.
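For illustration only, a minimal sketch of steps 2) to 4) with an off-the-shelf logistic regression is shown below; scikit-learn is assumed to be available, and the toy numbers and the three-column encoding of X are invented placeholders, not real weather data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: columns = [temperature, humidity, month]; label = rained (1) or not (0).
X = np.array([[25, 80, 6], [30, 40, 7], [18, 90, 4], [28, 70, 8], [15, 95, 3]])
y = np.array([1, 0, 1, 0, 1])

model = LogisticRegression()           # one coefficient per dimension of X, plus an intercept
model.fit(X, y)                        # step 3): train on the known past days

X_tomorrow = np.array([[22, 85, 6]])   # step 4): feed tomorrow's data
print(model.predict(X_tomorrow), model.predict_proba(X_tomorrow))
```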
In practical applications, the features in 1) are often non-numeric, such as the season or the date; adopting the feature extension method of the invention solves this kind of problem satisfactorily and finally achieves a better prediction effect.
As another example, the method of the invention can also be used for disease prediction: 1) obtain training data (disease prediction data) from existing cases, with X: age, sex, working environment, medical history, weight, heartbeat, platelet count, red blood cell count, white blood cell count, CT, symptoms, etc., and Y: whether a certain disease is present; 2) select the classification model and method; 3) train the parameters with machine learning; 4) input the information of the person being examined and compute the probability of illness with the model and parameters.
In practice, most of the case data are non-numeric features, and although white and red blood cell counts are numbers, larger is not simply better or worse. Using the method of the invention can therefore achieve a better prediction result and provide a better reference for the doctor's diagnosis.
As another example, the method of the invention can also be used to predict users' purchasing behaviour.
Predicting users' purchasing behaviour is very important for product distribution and advertising; finding users with a strong desire to purchase can effectively reduce the investment in distribution and advertising. The implementation steps are, for example: 1) collect user data, using the information forms filled in when users buy the product as training data, for example including age, time spent online, sex, family income, address, whether a product with similar functions has been bought, purchase channel, occupation, etc.; 2) select the model, the computation method and the required parameters; 3) use machine learning methods to learn the best parameters; 4) treat each user who has not yet bought as the current user, feed the corresponding information into the previous model and parameters, and predict for each user the probability of buying the product.
In practice, the features in 1) are mostly heterogeneous and non-numeric. Using the method of the invention can therefore better find the users similar to the current user, achieve a better prediction effect and target the advertising investment, finally saving product advertising costs.
As another example, the invention can also be used for music recommendation: the number of times a user listens to a song and the user's rating are used as Y, and the various other user features and melody features as X, to judge the probability that a user likes some new song, so that music the user prefers can be recommended. The invention can likewise be used here to enrich and quantify the features. Similar applications include friend recommendation in social networks and book recommendation.
Of course, the invention can also be used in other fields such as game outcome prediction, information retrieval, spam classification and news importance prediction.
Although the invention has been described through preferred embodiments, the invention is not limited to the embodiments described here, and also covers various changes and variations made without departing from the scope of the invention.

Claims (9)

1. A feature extension method for regression prediction, said method comprising:
selecting, among the original data points, the neighbours of the data point to be predicted, said neighbours being a series of original data points whose values on one or several dimensions are equal or similar to those of the data point to be predicted;
using these neighbours and their corresponding dependent variable values to extend the dimensions of the original data points and of the data point to be predicted.
2. A feature extension method based on MapReduce, said method comprising:
step 1) selecting, among the original data points, the neighbours of the data point to be predicted, said neighbours being a series of original data points whose values on one or several dimensions are equal or similar to those of the data point to be predicted;
step 2) splitting each original data point into D2-D1+1 parts, where D2 is the dimension of the original data point after extension and D1 is its dimension before extension, each part being a (key, value) pair in which the key is the identifier of the data point that should receive this part and the value contains the index of the dimension to be extended in the data point receiving this part and the dependent variable value of the original data point sending this part;
step 3) each original data point extracting, from the received data, the dimension indices and dependent variable values contained in the values and extending its own dimensions accordingly.
3. A regression prediction method, said method comprising:
step a) extending the dimensions of each original data point X with the method according to claim 1 or claim 2 to obtain extended data points;
step b) performing regression prediction for the data point to be predicted based on the extended data points.
4. A regression prediction method based on MapReduce, the method comprising:
step 41) extending the dimensions of each original data point X with the method according to claim 2 to obtain extended data points;
step 42) based on the extended data points, computing the similarity to the data point to be predicted and distributing (key, value) pairs, in which the key is the identifier of the data point to be predicted and the value is the identifier of an extended data point together with its similarity to the data point to be predicted;
step 43) based on the computed similarities, selecting the K extended data points most similar to the data point to be predicted, and performing regression prediction for the data point to be predicted with locally weighted linear regression.
5. The regression prediction method according to claim 4, wherein said step 42) uses the KL distance, the cosine distance or the Euclidean distance to compute the similarity for different extended dimensions.
6. A regression prediction device based on MapReduce, said device comprising:
a device for extending the dimensions of each original data point X with the method according to claim 2 to obtain extended data points;
a device for computing, based on the extended data points, the similarity to the data point to be predicted and distributing (key, value) pairs, in which the key is the identifier of the data point to be predicted and the value is the identifier of an extended data point together with its similarity to the data point to be predicted;
a device for selecting, based on the computed similarities, the K extended data points most similar to the data point to be predicted, and performing regression prediction for the data point to be predicted with locally weighted linear regression.
7. A supervised machine learning method, said method comprising:
1) performing feature extraction and dimension reduction on the training data to form data points X (x1, x2, ...) with labels y;
2) extending the data points X with the feature extension method according to claim 1 or claim 2;
3) selecting the model form for predicting y from the extended data points, determining the model parameter types and the number of parameters, and training on the training set;
4) applying the trained model and parameters to regression prediction or classification, finally obtaining the regression prediction result or the classification result.
8. The machine learning method according to claim 7, wherein the model form in step 3) is a regression prediction model, and said step 4) uses the regression prediction method according to one of claims 3, 4 and 5 to predict and obtain the prediction result.
9. The machine learning method according to claim 7 or 8, said method being used for weather forecasting, disease prediction, prediction of users' purchasing behaviour, music recommendation, friend recommendation in social networks, book recommendation, game outcome prediction, information retrieval, spam classification, news importance prediction, and the like.
CN2011103392241A 2011-11-01 2011-11-01 Regression prediction method and device Pending CN102385719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103392241A CN102385719A (en) 2011-11-01 2011-11-01 Regression prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103392241A CN102385719A (en) 2011-11-01 2011-11-01 Regression prediction method and device

Publications (1)

Publication Number Publication Date
CN102385719A true CN102385719A (en) 2012-03-21

Family

ID=45825116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103392241A Pending CN102385719A (en) 2011-11-01 2011-11-01 Regression prediction method and device

Country Status (1)

Country Link
CN (1) CN102385719A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014108768A1 (en) * 2013-01-11 2014-07-17 International Business Machines Corporation Computing regression models
CN104794537A (en) * 2015-04-17 2015-07-22 中国农业科学院柑桔研究所 Method for building prediction models for unaspis yanonensis kuwana emergence periods of mandarins
US9385934B2 (en) 2014-04-08 2016-07-05 International Business Machines Corporation Dynamic network monitoring
CN106294490A (en) * 2015-06-08 2017-01-04 富士通株式会社 The feature Enhancement Method of data sample and device and classifier training method and apparatus
JP2017102710A (en) * 2015-12-02 2017-06-08 日本電信電話株式会社 Data analysis device, data analysis method, and data analysis processing program
CN106940731A (en) * 2017-03-30 2017-07-11 福建师范大学 A kind of data based on non-temporal Attribute Association generation method true to nature
CN107998661A (en) * 2017-12-26 2018-05-08 苏州大学 A kind of aid decision-making method, device and the storage medium of online battle game
CN108052953A (en) * 2017-10-31 2018-05-18 华北电力大学(保定) The relevant sample extended method of feature based
CN108074628A (en) * 2016-11-15 2018-05-25 中国移动通信有限公司研究院 A kind of further consultation patient Forecasting Methodology and device
US10043194B2 (en) 2014-04-04 2018-08-07 International Business Machines Corporation Network demand forecasting
CN108932648A (en) * 2017-07-24 2018-12-04 上海宏原信息科技有限公司 A kind of method and apparatus for predicting its model of item property data and training
WO2019056502A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 Variety game result prediction method and apparatus, and storage medium
CN109684302A (en) * 2018-12-04 2019-04-26 平安科技(深圳)有限公司 Data predication method, device, equipment and computer readable storage medium
US10361924B2 (en) 2014-04-04 2019-07-23 International Business Machines Corporation Forecasting computer resources demand
CN110110209A (en) * 2018-01-22 2019-08-09 青岛科技大学 A kind of intersection recommended method and system based on local weighted linear regression model (LRM)
US10439891B2 (en) 2014-04-08 2019-10-08 International Business Machines Corporation Hyperparameter and network topology selection in network demand forecasting
CN111382890A (en) * 2018-12-27 2020-07-07 珠海格力电器股份有限公司 Household appliance installation quantity prediction method, system and storage medium
US10713574B2 (en) 2014-04-10 2020-07-14 International Business Machines Corporation Cognitive distributed network
CN116777508A (en) * 2023-06-25 2023-09-19 急尼优医药科技(上海)有限公司 Medical supply analysis management system and method based on big data

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104937544A (en) * 2013-01-11 2015-09-23 国际商业机器公司 Computing regression models
US9152921B2 (en) 2013-01-11 2015-10-06 International Business Machines Corporation Computing regression models
US9159028B2 (en) 2013-01-11 2015-10-13 International Business Machines Corporation Computing regression models
WO2014108768A1 (en) * 2013-01-11 2014-07-17 International Business Machines Corporation Computing regression models
CN104937544B (en) * 2013-01-11 2017-06-13 国际商业机器公司 Method, computer-readable medium and computer system for calculating task result
US10043194B2 (en) 2014-04-04 2018-08-07 International Business Machines Corporation Network demand forecasting
US11082301B2 (en) 2014-04-04 2021-08-03 International Business Machines Corporation Forecasting computer resources demand
US10650396B2 (en) 2014-04-04 2020-05-12 International Business Machines Corporation Network demand forecasting
US10361924B2 (en) 2014-04-04 2019-07-23 International Business Machines Corporation Forecasting computer resources demand
US10250481B2 (en) 2014-04-08 2019-04-02 International Business Machines Corporation Dynamic network monitoring
US10257071B2 (en) 2014-04-08 2019-04-09 International Business Machines Corporation Dynamic network monitoring
US9722907B2 (en) 2014-04-08 2017-08-01 International Business Machines Corporation Dynamic network monitoring
US11848826B2 (en) 2014-04-08 2023-12-19 Kyndryl, Inc. Hyperparameter and network topology selection in network demand forecasting
US9385934B2 (en) 2014-04-08 2016-07-05 International Business Machines Corporation Dynamic network monitoring
US10439891B2 (en) 2014-04-08 2019-10-08 International Business Machines Corporation Hyperparameter and network topology selection in network demand forecasting
US9705779B2 (en) 2014-04-08 2017-07-11 International Business Machines Corporation Dynamic network monitoring
US10693759B2 (en) 2014-04-08 2020-06-23 International Business Machines Corporation Dynamic network monitoring
US10771371B2 (en) 2014-04-08 2020-09-08 International Business Machines Corporation Dynamic network monitoring
US10713574B2 (en) 2014-04-10 2020-07-14 International Business Machines Corporation Cognitive distributed network
CN104794537A (en) * 2015-04-17 2015-07-22 中国农业科学院柑桔研究所 Method for building prediction models for unaspis yanonensis kuwana emergence periods of mandarins
CN106294490A (en) * 2015-06-08 2017-01-04 富士通株式会社 The feature Enhancement Method of data sample and device and classifier training method and apparatus
CN106294490B (en) * 2015-06-08 2019-12-24 富士通株式会社 Feature enhancement method and device for data sample and classifier training method and device
JP2017102710A (en) * 2015-12-02 2017-06-08 日本電信電話株式会社 Data analysis device, data analysis method, and data analysis processing program
CN108074628A (en) * 2016-11-15 2018-05-25 中国移动通信有限公司研究院 A kind of further consultation patient Forecasting Methodology and device
CN106940731A (en) * 2017-03-30 2017-07-11 福建师范大学 A kind of data based on non-temporal Attribute Association generation method true to nature
CN108932648A (en) * 2017-07-24 2018-12-04 上海宏原信息科技有限公司 A kind of method and apparatus for predicting its model of item property data and training
WO2019056502A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 Variety game result prediction method and apparatus, and storage medium
CN108052953A (en) * 2017-10-31 2018-05-18 华北电力大学(保定) The relevant sample extended method of feature based
CN107998661A (en) * 2017-12-26 2018-05-08 苏州大学 A kind of aid decision-making method, device and the storage medium of online battle game
CN110110209A (en) * 2018-01-22 2019-08-09 青岛科技大学 A kind of intersection recommended method and system based on local weighted linear regression model (LRM)
CN109684302A (en) * 2018-12-04 2019-04-26 平安科技(深圳)有限公司 Data predication method, device, equipment and computer readable storage medium
CN109684302B (en) * 2018-12-04 2023-08-15 平安科技(深圳)有限公司 Data prediction method, device, equipment and computer readable storage medium
CN111382890A (en) * 2018-12-27 2020-07-07 珠海格力电器股份有限公司 Household appliance installation quantity prediction method, system and storage medium
CN111382890B (en) * 2018-12-27 2022-04-12 珠海格力电器股份有限公司 Household appliance installation quantity prediction method, system and storage medium
CN116777508A (en) * 2023-06-25 2023-09-19 急尼优医药科技(上海)有限公司 Medical supply analysis management system and method based on big data
CN116777508B (en) * 2023-06-25 2024-03-12 急尼优医药科技(上海)有限公司 Medical supply analysis management system and method based on big data

Similar Documents

Publication Publication Date Title
CN102385719A (en) Regression prediction method and device
Velt et al. Entrepreneurial ecosystem research: Bibliometric mapping of the domain
Ismail et al. A hybrid model of self-organizing maps (SOM) and least square support vector machine (LSSVM) for time-series forecasting
Hong et al. A job recommender system based on user clustering.
Shilong Machine learning model for sales forecasting by using XGBoost
CN107862173A (en) A kind of lead compound virtual screening method and device
Shang et al. Moving from mass customization to social manufacturing: A footwear industry case study
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
CN106407349A (en) Product recommendation method and device
Ramli et al. Real-time fuzzy regression analysis: A convex hull approach
Chen et al. Development and application of big data platform for garlic industry chain
CN105046323B (en) Regularization-based RBF network multi-label classification method
Broekel Measuring technological complexity-Current approaches and a new measure of structural complexity
Ozcan et al. Human resources mining for examination of R&D progress and requirements
Ming-Te et al. Using data mining technique to perform the performance assessment of lean service
CN114647465A (en) Single program splitting method and system for multi-channel attention-chart neural network clustering
Petrozziello et al. Distributed neural networks for missing big data imputation
Satinet et al. A supervised machine learning classification framework for clothing products’ sustainability
Zhang et al. Common community structure in time-varying networks
Xu et al. E-Commerce Online Shopping Platform Recommendation Model Based on Integrated Personalized Recommendation
Canetta* et al. Applying two-stage SOM-based clustering approaches to industrial data analysis
Jiang Prediction and management of regional economic scale based on machine learning model
US7272583B2 (en) Using supervised classifiers with unsupervised data
Shorfuzzaman Leveraging cloud based big data analytics in knowledge management for enhanced decision making in organizations

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120321