Method for predicting the purchase intention of high-potential users based on big-data user behavior analysis
Technical field
The invention belongs to the field of data analysis, including machine learning and data mining, and more particularly relates to techniques such as sample definition, data partitioning, feature construction, and model building and optimization.
Background technology
While maintaining rapid growth, online-shopping e-commerce has accumulated hundreds of millions of loyal users and massive amounts of real data. How to find patterns in historical data, predict users' future purchasing needs, and bring the most suitable products to the people who need them most is the key problem in applying big data to precision marketing, and it is the core technology that every e-commerce platform needs when upgrading to intelligent services.
Recommendation algorithms can be roughly divided into three classes: content-based recommendation algorithms, collaborative filtering recommendation algorithms, and knowledge-based recommendation algorithms.
Content-based recommendation rests on the principle that users like items that are similar in content to items they have already paid attention to. For example, if you have watched Harry Potter I, a content-based algorithm finds that Harry Potter II-VI are strongly related in content to what you watched (they share many keywords) and recommends them to you. This method avoids the item cold-start problem (cold start: if an item has never received attention, other recommendation algorithms will seldom recommend it, whereas a content-based algorithm can analyze the relations between items and still recommend it). One drawback is that the recommended items may be repetitive. News recommendation is the typical case: if you have read one news story about MH370, the recommended stories are likely to have the same content as the one you have already read. Another drawback is that for multimedia items (such as music, films and pictures) content features are hard to extract, so recommendation is difficult; one solution is to tag these items manually.
Collaborative filtering rests on the principle that users like the products liked by other users with similar interests. For example, if your friend likes the film Harry Potter I, it will be recommended to you. This is the simplest form, user-based collaborative filtering; there is also item-based collaborative filtering. Both approaches read all user data into memory for computation.
The last class is knowledge-based recommendation, which some people also classify as content-based recommendation. This method typically builds a domain ontology, or establishes certain rules, and recommends on that basis. Hybrid recommendation algorithms fuse the above methods, for example by weighting or by combining them in series or in parallel.
The prior art mainly makes recommendations to users on the basis of association rules. Its approach is to find similar users according to a user's ratings of other products in a given category, and then to recommend to the user the products that those similar users rated highly. This scheme simply uses the users' rating information and neglects the behavioral features of the users themselves. The present invention extracts features from users' historical behavior to build a model, and uses machine learning algorithms such as gradient boosting decision trees to predict whether a user has purchase intention and which products the user intends to purchase, thereby achieving personalized and precise recommendation.
Summary of the invention
The present invention aims to solve the above problems of the prior art, and proposes a method for predicting the purchase intention of high-potential users based on big-data user behavior analysis. The technical solution is as follows:
A method for predicting the purchase intention of high-potential users based on big-data user behavior analysis, comprising the following steps:
101. Data preprocessing step: preprocess the historical behavior data of e-commerce users, including deduplication, deleting the abnormal data records of days whose daily transaction volume exceeds 3 times the monthly average transaction volume, and assigning a weight to each behavior category according to its importance;
102. Sample definition and labeling step: extract the user-product pairs that interacted within a time window spanning 5 days, build samples indexed by user id and product id, and label them;
103. Training set and test set division step: using time-window division, divide the data labeled in step 102 into training sets and test sets at different time granularities;
104. Feature extraction step: in the feature engineering stage, features are extracted mainly as user behavior feature groups, ranking feature groups and score feature groups;
105. Algorithm design and implementation step: for the class-imbalanced classification problem of the data set, a clustering-based similar-sample removal algorithm is proposed, comprising the steps of: first dividing the original samples into two parts according to their labels; then clustering the larger part; next randomly sampling a portion of the samples in each cluster; and finally merging the samples randomly drawn from the different clusters into a new data set. The sampling fraction is the ratio of the size of the smaller part to the size of the larger part in the original data; for example, if the ratio of positive to negative samples in the original data is 1:4, the corresponding sampling fraction after clustering is 0.25;
A two-layer iterative model algorithm is further proposed to predict whether a user will finally purchase products in the product subset P, comprising the steps of: training a first-layer model on the training set obtained after the preceding data preprocessing, labeling, training set division and feature extraction, and using the first-layer model to predict probabilities for the test set; sorting the predicted probabilities of the test set from high to low, and outputting the top 1/10 as positive samples and the bottom 1/10 as negative samples; randomly sampling the output positive samples and negative samples; adding the sampled positive samples to the original training set as newly added positive samples, and adding the sampled negative samples to the original training set as newly added negative samples; training on the rebuilt training set with xgboost (extreme gradient boosting decision trees) to obtain a second-layer model; and repeating until the total number of positive samples output across all rounds equals the actual number of positive samples in the test set.
Further, the data preprocessing step 101 includes:
S1011. Obtain users' historical behavior data from the merchant platform. The raw data include basic user information, product data and user behavior data. The basic user information includes fields such as user ID, age, gender, user level and registration date. The product data include product number, attribute 1, attribute 2, attribute 3, category ID and brand ID. The user behavior data include user number, product number, behavior time, clicked module number, behavior type (type), category ID and brand ID, where the behavior types are: 1. browsing a product detail page; 2. adding a product to the shopping cart; 3. deleting a product from the shopping cart; 4. placing an order for a product; 5. adding a product to favorites; and 6. clicking a product link;
S1012. If on a given day there is only order behavior, with neither add-to-cart behavior nor detail-page browsing behavior, delete all historical interaction data of the user-product pairs that placed orders on that day;
S1013. Merge the behavior of clicking a product link and the behavior of browsing a product detail page into a single behavior, namely viewing the product;
S1014. Deduplicate the raw data using user_id, sku_id and type as keys at a time granularity of one minute, to reduce the data redundancy caused by crawlers repeatedly crawling the data;
S1015. Convert the original behaviors by assigning each a weight according to its relative importance: behavior category 1 (browsing a product detail page) is assigned weight 0.1; behavior category 2 (adding a product to the shopping cart) is assigned weight 1; behavior category 3 (deleting a product from the shopping cart) is assigned weight 0.2; behavior category 4 (placing an order) is assigned weight -0.5; and behavior category 5 (following a product or adding it to favorites) is assigned weight 0.2.
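Steps S1013 to S1015 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the invention. The record layout (dicts with keys user_id, sku_id, type and time) and the function name preprocess are assumptions made for the example; the weights are those stated in S1015.

```python
# Weights from S1015 for behavior types 1-5. Type 6 (clicking a product link)
# is merged into type 1 (browsing a detail page) as described in S1013.
BEHAVIOR_WEIGHTS = {1: 0.1, 2: 1.0, 3: 0.2, 4: -0.5, 5: 0.2}


def preprocess(records):
    """Merge type 6 into type 1, deduplicate per minute, attach weights.

    `records` is an iterable of dicts with the hypothetical keys
    user_id, sku_id, type (int 1-6) and time (datetime.datetime).
    """
    seen = set()
    cleaned = []
    for rec in records:
        # S1013: treat "click product link" as "browse detail page".
        btype = 1 if rec["type"] == 6 else rec["type"]
        # S1014: deduplicate on (user_id, sku_id, type) at minute granularity.
        minute = rec["time"].replace(second=0, microsecond=0)
        key = (rec["user_id"], rec["sku_id"], btype, minute)
        if key in seen:
            continue
        seen.add(key)
        # S1015: attach the importance weight of the behavior category.
        cleaned.append({**rec, "type": btype, "weight": BEHAVIOR_WEIGHTS[btype]})
    return cleaned
```

In this sketch the merge of types 1 and 6 is applied before deduplication, so a link click and a detail-page view by the same user on the same product within one minute collapse into a single record.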
Further, the sample definition and labeling step 102 is specifically:
S1021. Extract the user-product pairs that interacted within a specific time window, and build samples indexed by user ID and product number sku_id;
S1022. Label each sample according to whether the user-product pair results in a purchase within the following 5 days: if the product is purchased, the label is defined as 1; if it is not purchased, the label is defined as 0.
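The labeling rule of S1021 and S1022 can be illustrated with a short sketch. The record layout (dicts with keys user_id, sku_id, type and day), the function name label_pairs and its parameters are assumptions made for this example; behavior type 4 denotes placing an order, as in S1011.

```python
from datetime import timedelta


def label_pairs(records, window_start, window_end, label_days=5):
    """Build (user_id, sku_id) samples and label them 1/0.

    `records` is a list of dicts with the hypothetical keys
    user_id, sku_id, type (int) and day (datetime.date).
    A pair is a sample if it interacted inside [window_start, window_end];
    its label is 1 if an order (type 4) occurs in the following
    `label_days` days, otherwise 0.
    """
    label_end = window_end + timedelta(days=label_days)
    # S1021: pairs that interacted inside the feature window.
    pairs = {
        (r["user_id"], r["sku_id"])
        for r in records
        if window_start <= r["day"] <= window_end
    }
    # S1022: pairs ordered within the label window after the feature window.
    bought = {
        (r["user_id"], r["sku_id"])
        for r in records
        if r["type"] == 4 and window_end < r["day"] <= label_end
    }
    return {pair: int(pair in bought) for pair in pairs}
```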
Further, the feature extraction step 104 is specifically:
S1041. Divide the raw data by time window at different time granularities;
S1042. Using user_id as the key and the other attributes as values, extract the user behavior feature group, including: the number of product detail pages browsed by the user, the number of times the user added products to the shopping cart, the number of times the user deleted products from the shopping cart, the number of orders placed by the user, the product category in which the user placed the most orders, the brand for which the user placed the most orders, the product most recently browsed by the user, the product most recently purchased by the user, the product most recently added to the shopping cart by the user, and the product most recently deleted from the shopping cart by the user;
S1043. Using brand as the key and the other attributes as values, extract the merchant feature group, including: counts of each behavior type for the brand, the brand's sales ranking within its category, the brand's share of sales within its category, the brand's average daily number of orders for each day of the week (Monday to Sunday), the brand's maximum single-day number of orders in the month, the brand's minimum single-day number of orders in the month, the brand's number of orders divided by the brand's number of add-to-cart actions, and the brand's number of orders divided by the brand's number of detail-page views;
S1044. Using user_id+brand as the key and the other attributes as values, extract the user-merchant behavior feature group, including: counts of the 5 behavior types of the user on the brand, the user's number of detail-page views of the brand divided by the user's total number of detail-page views, and the user's number of detail-page views of the brand divided by the brand's total number of detail-page views;
S1045. Using user_id+sku_id as the key and the other attributes as values, extract the user-product behavior feature group, including: the weighted interaction-count score, the daily interaction frequency, the time of the most recent interaction, the time of the initial interaction, whether the product was added to the shopping cart, whether the product was purchased, the number of interactions in the two days after the first interaction, the number of interactions on the day before adding to the shopping cart, the number of interactions on the day after adding to the shopping cart, the number of interactions on the day before the last interaction, the interactions in the most recent 3, 5, 7, 9 and 11 days, the first interaction time, the last interaction time, the user's preferred interaction time period, and the average interaction duration.
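A few of the user-product features listed in S1045 can be computed as in the sketch below. It covers only a subset (weighted score, first and last interaction day, cart and purchase flags, and recent-day interaction counts). The record layout, the function name user_sku_features and the feature names are assumptions made for this example; the weights are those of S1015.

```python
from collections import defaultdict
from datetime import timedelta

# Behavior weights from S1015 (type 6 is assumed to be merged into type 1).
WEIGHTS = {1: 0.1, 2: 1.0, 3: 0.2, 4: -0.5, 5: 0.2}


def user_sku_features(records, ref_day, recent_days=(3, 5, 7, 9, 11)):
    """Return {(user_id, sku_id): feature dict} for a subset of S1045 features.

    `records` is a list of dicts with the hypothetical keys
    user_id, sku_id, type (int 1-5) and day (datetime.date);
    `ref_day` is the last day of the feature window.
    """
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["user_id"], r["sku_id"])].append(r)

    features = {}
    for pair, recs in grouped.items():
        days = [r["day"] for r in recs]
        row = {
            "weighted_score": sum(WEIGHTS[r["type"]] for r in recs),
            "first_day": min(days),
            "last_day": max(days),
            "added_to_cart": int(any(r["type"] == 2 for r in recs)),
            "purchased": int(any(r["type"] == 4 for r in recs)),
        }
        # Interaction counts in the most recent n days up to ref_day.
        for n in recent_days:
            cutoff = ref_day - timedelta(days=n)
            row[f"count_last_{n}d"] = sum(1 for d in days if cutoff < d <= ref_day)
        features[pair] = row
    return features
```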
The advantages and beneficial effects of the present invention are as follows:
The present invention makes full use of users' historical behavior data. By analyzing behaviors such as browsing, following and adding to the shopping cart, the invention finds that a user's purchase intention toward a product is often related to the user's historical behavior. On this basis, the invention extracts users' historical behavior features, builds a model using gradient boosting decision trees, and uses the two-layer iterative model algorithm proposed by the invention to predict whether a user will finally purchase a product and for which products the purchase intention is higher. Merchants can use the invention to push promotional information on high-purchase-intention products to potential high-purchase-intention users, achieving precision marketing. The invention can also be used to predict a merchant's sales volume, supporting decisions such as restocking.
For the class-imbalanced classification problem of the data set, the clustering-based similar-sample removal algorithm proposed by the present invention comprises the steps of: first dividing the original samples into two parts according to their labels; then clustering the larger part; next randomly sampling a portion of the samples in each cluster; and finally merging the samples randomly drawn from the different clusters into a new data set. The sampling fraction is the ratio of the size of the smaller part to the size of the larger part in the original data; for example, if the ratio of positive to negative samples in the original data is 1:4, the corresponding sampling fraction after clustering is 0.25. While changing the ratio of positive to negative samples, the algorithm retains the user behavior features with high discriminative power, and can effectively reduce model complexity and improve prediction accuracy.
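The clustering-based undersampling described above can be sketched as follows. The text does not name a clustering method, so this example assumes a small pure-Python k-means; the number of clusters k, the function names kmeans and cluster_undersample, and the data layout (feature tuples with 0/1 labels) are likewise assumptions made for the illustration.

```python
import random


def kmeans(points, k, iters=20, rng=None):
    """Tiny k-means returning a cluster index for each point (tuples of floats)."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_undersample(features, labels, k=3, seed=0):
    """Return sorted indices kept after undersampling the majority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Sampling fraction = minority size / majority size (e.g. 1:4 -> 0.25).
    fraction = len(minority) / len(majority)
    assign = kmeans([tuple(features[i]) for i in majority], k, rng=rng)
    kept = list(minority)
    for c in range(k):
        members = [i for i, a in zip(majority, assign) if a == c]
        n_keep = round(len(members) * fraction)
        kept.extend(rng.sample(members, n_keep))
    return sorted(kept)
```

Sampling inside each cluster, rather than over the whole majority class, is what keeps every region of the majority class represented after the ratio is changed.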
The two-layer iterative model algorithm proposed by the present invention makes full use of the model's output. Building on traditional classification machine learning algorithms and taking into account the uncertainty of the output, the algorithm treats the part of the output with higher probability as positive samples and the part with lower probability as negative samples, randomly samples them, and adds them to the original training set. This increases the number of samples in the original training set and also adds users' behavior features within the training-set time window. The algorithm effectively improves the prediction accuracy of the original model.
Brief description of the drawings
Fig. 1 is an overall flow chart of a preferred embodiment of the present invention;
Fig. 2 is a chart of the abnormal user behavior on 2016-03-15, provided by embodiment one of the present invention;
Fig. 3 is a schematic diagram of training set division at different time granularities, provided by embodiment one of the present invention;
Fig. 4 is a table of feature influence results, provided by embodiment one of the present invention;
Fig. 5 is a diagram of the model fusion scheme, provided by embodiment one of the present invention.
Embodiment
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical solution by which the present invention solves the above technical problems is as follows:
Embodiment one
In real online shopping, we often choose the items we like through behaviors such as browsing, following or adding to the shopping cart, and when buying a particular item we often browse a variety of products. The purpose of this example is therefore to use users' historical behavior to predict whether a user has purchase intention and for which products the purchase intention is higher. The following symbols are defined in this example:
S: the full set of products provided;
P: the candidate product subset, where P is a subset of S;
U: the set of users;
A: the set of behavior data of users on S.
Our goal is then to use the historical sales data of products under multiple categories of the e-commerce platform to build an algorithm model that predicts users' purchase intention toward the products in P over the following 5 days.
1) Overview
This example uses the user behavior data from 2016-02-01 to 2016-04-15 provided by the e-commerce platform, and is required to predict whether users will place orders for products in P from 2016-04-16 to 2016-04-20. The solution of the present invention is introduced from the aspects of data preprocessing, sample definition and labeling, training set division, feature engineering and model construction. First, in the data preprocessing stage, operations such as deduplication, behavior conversion and removal of abnormal user behavior are applied to the raw data. Then, through analysis, the invention proposes 4 data partition schemes. In the feature engineering stage, the invention extracts user features, product features and product review features according to the time window division. Finally, in the model construction stage, the invention uses xgboost and GBDT, two models that can directly output prediction probabilities for samples, to select and fuse samples.
2) Data processing
In the data processing stage, the present invention analyzes the interaction behavior of user-product pairs in more detail. It finds that on 3.15 users had only order behavior and no add-to-cart behavior, so the invention treats this as abnormal data and removes all historical interaction information of the user-product pairs that placed orders on 3.15. Analysis also shows that behavior 1 and behavior 6 are strongly correlated, so the invention replaces behavior 6 with behavior 1. To further reduce the data volume, the invention deduplicates the raw data per minute by user_id, sku_id and type, reducing the data redundancy caused by crawlers repeatedly crawling the data. The invention also tried selecting only the interaction information with cate equal to 8, but found the results unsatisfactory and abandoned that strategy, because association rules may exist among the products that users purchase. The invention further converts the original behaviors by assigning each a weight according to its relative importance; for example, behavior 1 is not representative and is assigned only weight 0.1, whereas behavior 2 is assigned weight 1.
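The removal of order-only days, such as 3.15 in this example, can be sketched as follows. The sketch follows one reading of the rule: a day counts as abnormal when orders (type 4) occur but neither detail-page views (type 1) nor add-to-cart actions (type 2) occur. The record layout and the function name drop_order_only_days are assumptions made for this example.

```python
from collections import defaultdict


def drop_order_only_days(records):
    """Remove all history of user-product pairs that ordered on an abnormal day.

    `records` is a list of dicts with the hypothetical keys
    user_id, sku_id, type (int) and time (datetime.datetime).
    """
    # Collect the set of behavior types seen on each calendar day.
    types_by_day = defaultdict(set)
    for rec in records:
        types_by_day[rec["time"].date()].add(rec["type"])

    # Abnormal day: orders present, but no browsing and no add-to-cart.
    abnormal_days = {
        day for day, types in types_by_day.items()
        if 4 in types and not (types & {1, 2})
    }

    # Pairs that placed an order on an abnormal day.
    bad_pairs = {
        (r["user_id"], r["sku_id"])
        for r in records
        if r["type"] == 4 and r["time"].date() in abnormal_days
    }

    # Drop every record of those pairs, on any day.
    return [r for r in records if (r["user_id"], r["sku_id"]) not in bad_pairs]
```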
3) Sample definition and labeling
The scheme adopted by the present invention is to extract the user-product pairs that interacted within a specific time window, and to label each user-product (u_s) pair according to whether it results in a purchase within the following 5 days.
4) Training set and test set division
In offline experiments, the present invention tried using the 5 days, 3 days, 10 days and so on before the inspection day as the labeling window, and found the results unsatisfactory. Analysis of the raw data then showed that it has strong periodicity, so the invention chose to label according to the day of the week of the inspection day. When dividing the offline training set and test set, the invention finally adopted the scheme of dividing multiple training sets and fusing them at the end, choosing 1-day, 3-day, 5-day and 10-day labeling. The results of the various schemes are fused by weight to obtain the final result, with the weights assigned in descending order from near to far relative to the inspection day, because behavior closer to the inspection day indicates a stronger desire to purchase.
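The weighted fusion of the labeling schemes can be sketched as a normalized weighted average of the per-scheme prediction probabilities. The text gives only the ordering principle (nearer windows get larger weights), not the weight values, so the weights shown in the usage note are purely hypothetical, as are the function name fuse_predictions and the data layout.

```python
def fuse_predictions(scheme_probs, scheme_weights):
    """Fuse per-scheme probabilities into one score per user-product pair.

    `scheme_probs` maps a scheme name to {(user_id, sku_id): probability};
    `scheme_weights` maps the same scheme names to non-negative weights.
    Weights are normalized so that they sum to 1.
    """
    total = sum(scheme_weights.values())
    fused = {}
    for scheme, probs in scheme_probs.items():
        w = scheme_weights[scheme] / total
        for pair, p in probs.items():
            fused[pair] = fused.get(pair, 0.0) + w * p
    return fused
```

For example, hypothetical weights such as {"1d": 0.4, "3d": 0.3, "5d": 0.2, "10d": 0.1} follow the stated principle of decreasing weight with distance from the inspection day; in practice the values would have to be tuned on validation data.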
5) Feature selection and extraction
After the preceding sample definition, data preprocessing, labeling and data division, the feature engineering stage is described in detail next. In the feature engineering stage, the present invention mainly examines user behavior features: the raw data are first divided by time window at different time granularities such as 1 day, 3 days, 5 days and 10 days, and then behavior count features, ranking features, score features and so on are extracted. The details are as follows:
Using user_id as the key and the other attributes as values, extract the user behavior feature group, including: the number of product detail pages browsed by the user, the number of times the user added products to the shopping cart, the number of times the user deleted products from the shopping cart, the number of orders placed by the user, the product category in which the user placed the most orders, the brand for which the user placed the most orders, the product most recently browsed by the user, the product most recently purchased by the user, the product most recently added to the shopping cart by the user, and the product most recently deleted from the shopping cart by the user.
Using brand as the key and the other attributes as values, extract the merchant feature group, including: counts of each behavior type for the brand, the brand's sales ranking within its category, the brand's share of sales within its category, the brand's average daily number of orders for each day of the week (Monday to Sunday), the brand's maximum single-day number of orders in the month, the brand's minimum single-day number of orders in the month, the brand's number of orders divided by the brand's number of add-to-cart actions, and the brand's number of orders divided by the brand's number of detail-page views.
Using user_id+brand as the key and the other attributes as values, extract the user-merchant behavior feature group, including: counts of the 5 behavior types of the user on the brand, the user's number of detail-page views of the brand divided by the user's total number of detail-page views, and the user's number of detail-page views of the brand divided by the brand's total number of detail-page views.
Using user_id+sku_id as the key and the other attributes as values, extract the user-product behavior feature group, including: the weighted interaction-count score, the daily interaction frequency, the time of the most recent interaction, the time of the initial interaction, whether the product was added to the shopping cart, whether the product was purchased, the number of interactions in the two days after the first interaction, the number of interactions on the day before adding to the shopping cart, the number of interactions on the day after adding to the shopping cart, the number of interactions on the day before the last interaction, the interactions in the most recent 3, 5, 7, 9 and 11 days, the first interaction time, the last interaction time, the user's preferred interaction time period, and the average interaction duration.
The detailed wide table of features is as follows:
(1) User-product features
The most recent time and the earliest time at which the user clicked, favorited, added to the shopping cart or purchased the product, and the ranking of these times.
The number of times the user clicked, favorited, added to the shopping cart or purchased the product.
With time divided into segments, and no overlap between segments of the same granularity, the number of times the user clicked, favorited, added to the shopping cart or purchased the product in each segment, for segment granularities of 4 hours, 12 hours, 1 day, 2 days and 3 days.
(2) User-product and user-category combination features
The number of times the user clicked, favorited, added to the shopping cart or purchased other products in the same category, excluding this product.
The user's most recent click, favorite, add-to-cart and purchase times for this product minus the user's most recent click, favorite, add-to-cart and purchase times for other products in the same category, and the ranking of the purchase time.
The user's number of clicks on this product minus the user's average number of clicks on other products in the same category.
(3) User features
The user's most recent click, favorite, add-to-cart and purchase times.
The user's numbers of clicks, favorites, add-to-cart actions and purchases.
The user's conversion rates, that is, the user's number of purchases divided by the user's number of clicks, favorites and add-to-cart actions respectively.
The mean and variance of the user's numbers of clicks, favorites, add-to-cart actions and purchases within 7 days (not computed by period).
(4) Product features
The numbers of times the product was clicked, favorited, added to the shopping cart and purchased.
The product's purchase conversion rate.
The mean and variance of the numbers of times the product was clicked, favorited, added to the shopping cart and purchased within 7 days.
(5) Category features
The numbers of times products in the category were clicked, favorited, added to the shopping cart and purchased.
The conversion rate of products in the category.
6) Model selection and training
The present invention proposes a two-layer iterative model algorithm to predict the final result. The specific scheme is as follows. Step one: train the first-layer model on the training set obtained after the preceding data preprocessing, labeling, training set division and feature extraction; here the invention uses GBDT (gradient boosting decision trees) as the first-layer model. Step two: use the first-layer model to predict probabilities for the test set; sort the predicted probabilities of the test set from high to low, and output the top 1/10 as positive samples and the bottom 1/10 as negative samples. Randomly sample the output positive samples and randomly sample the output negative samples. Add the sampled positive samples to the original training set as newly added positive samples, and add the sampled negative samples to the original training set as newly added negative samples. Train on the rebuilt training set with xgboost (extreme gradient boosting decision trees) to obtain the second-layer model. Repeat from step two until the total number of positive samples output across all rounds equals the actual number of positive samples in the test set.
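The iteration described above can be sketched in a model-agnostic way. The sketch makes several assumptions that the text does not state: a model is represented as a callable that returns a probability for one sample, and fit_first and fit_second are hypothetical functions that train such a model (in the embodiment they would wrap GBDT and xgboost respectively); samples already output are removed from the candidate pool in later rounds; the target number of positives n_target_pos is supplied by the caller; and sample_rate is a hypothetical parameter for the random sampling step.

```python
import random


def two_layer_iterate(train_X, train_y, test_X, fit_first, fit_second,
                      n_target_pos, frac=0.1, sample_rate=0.5, seed=0):
    """Iteratively output confident test samples and augment the training set.

    Returns (sorted indices of test samples output as positive, final model).
    """
    rng = random.Random(seed)
    X, y = list(train_X), list(train_y)
    model = fit_first(X, y)                # first-layer model
    remaining = list(range(len(test_X)))   # test samples not yet output
    positives = []
    while remaining:
        n = max(1, int(len(remaining) * frac))
        if len(remaining) < 2 * n:
            break
        # Rank the remaining test samples by predicted probability.
        ranked = sorted(remaining, key=lambda i: model(test_X[i]), reverse=True)
        top, bottom = ranked[:n], ranked[-n:]
        positives.extend(top)              # top fraction output as positive
        if len(positives) >= n_target_pos:
            break
        # Randomly sample pseudo-labeled positives and negatives and add
        # them to the training set.
        for group, label in ((top, 1), (bottom, 0)):
            k = max(1, int(len(group) * sample_rate))
            for i in rng.sample(group, k):
                X.append(test_X[i])
                y.append(label)
        used = set(top) | set(bottom)
        remaining = [i for i in remaining if i not in used]
        model = fit_second(X, y)           # second-layer model on rebuilt set
    return sorted(positives), model
```

Because the newly added labels come from the model's own predictions, errors in the first-layer model can propagate; restricting the additions to the most confident top and bottom fractions, as the text describes, is what limits that risk.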
The above embodiments should be understood as merely illustrating the present invention and not as limiting its scope of protection. After reading the content recorded in the present invention, a person skilled in the art may make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.