Summary of the Invention
In view of the shortcomings of the prior art, an object of the invention is to provide a distributed-computing-based system for optimizing Internet information delivery channels. The system optimizes the selection of delivery channels according to users' browsing behavior, so that Internet information is recommended more accurately and user needs are met.
To achieve the above object, the invention adopts the following technical solution:
The invention provides a distributed-computing-based system for optimizing Internet information delivery channels. The system comprises a data collection module, a data preprocessing module, a training module, an information delivery channel contribution prediction module, and a conversion rate prediction module, wherein:
Data collection module: this module collects user behavior data through a web server. The collected user behavior is divided into two parts: one part records the complete browsing behavior of a given user, and the other records the access features of the different channels through which the same piece of information is delivered;
Data preprocessing module: this module cleans, integrates, and reduces the user behavior data collected by the server, so that the collected user behavior information is simplified and standardized;
Training module: the input of this module is the training set. The module iterates with an EM-like algorithm until two parameters of the additive probability model converge, namely the factor for the intensity of influence on the user and the factor for the decay of that influence over time, thereby completing the estimation of these two parameters;
Information delivery channel contribution prediction module: the input of this module is the test set. The module computes the contribution of each information delivery channel m, then sums the contributions according to the website or type to which each channel m belongs, giving the contribution of each website and each type. Finally, the websites and types are ranked by contribution from high to low, and information is pushed through the top-ranked websites or types in order to obtain a better delivery effect;
Conversion rate prediction module: the input of this module is the test set. The module scores each user with the survival function, predicts the users who are most likely to convert, and pushes Internet information to these users.
The system uses distributed computing based on the Hadoop platform: the computation in all of the above modules is carried out on Hadoop. Complex computations are distributed across multiple nodes, so that multiple tasks are processed in parallel, waiting between tasks is reduced, resources are allocated more reasonably, and computing speed is greatly increased.
Compared with the prior art, the invention has the following advantages:
The distributed-computing-based system for optimizing Internet information delivery channels proposed by the invention greatly improves the accuracy of predicting the contribution of information delivery channels, which makes it easy to select the most effective websites or types for displaying information. It also selects the group of users who are most likely to convert, which makes information recommendation more targeted. The best recommendation effect can therefore be obtained at the lowest cost. In addition, all data processing in the invention is based on the Hadoop platform and is carried out in parallel on multiple computers, which greatly reduces the computing power and memory required of each computer when processing big data and greatly increases computing speed.
Detailed Description of the Embodiments
The invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but they do not limit the invention in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept. These all fall within the protection scope of the invention.
As shown in Fig. 1, the server-based information model diagram of the invention shows that the collection of user information, the formation of user profiles, and the recommendation module built by the invention are all stored on the server and processed by the server. The client computer used by the user is not responsible for storing or processing user information.
As shown in Fig. 2, the distributed-computing-based system for optimizing Internet information delivery channels of the invention comprises:
Data collection module: collects user behavior through a web server and divides the collected user behavior into two parts, web browsing records and information records. The web browsing records contain the complete browsing behavior of a given user and reflect the features of that user's web browsing. The information records contain the access features of the different channels through which the same piece of information is delivered and reflect the click history and features of the information delivery channels.
Data preprocessing module: performs data cleaning, integration, and reduction on the user behavior data collected by the server.
Training module: takes the data in the training set as input and, based on maximum likelihood estimation, iterates with an EM-like algorithm to complete the parameter estimation of the additive probability model.
Information delivery channel contribution prediction module and conversion rate prediction module: these modules take the parameters estimated from the training set and apply them to the test data, thereby predicting the contribution of each information delivery channel and whether each user will convert.
As shown in Fig. 3, the distributed computing framework diagram shows the Hadoop-based distributed computing used in the invention. The computation in all modules of the system is carried out on the Hadoop platform. Complex computations are distributed across multiple nodes and processed in parallel, which saves a large amount of system resources and greatly increases computing speed.
As shown in Fig. 4, this embodiment provides a distributed-computing-based system for optimizing Internet information delivery channels, which is trained and tested on a real data set. The embodiment compares the system with the two approaches most widely used at present for predicting the contribution of Internet information delivery: a system based on the last-click method and a system based on logistic regression. The test results show that the invention outperforms both systems, in the accuracy of predicting the contribution of different channels and in the accuracy of predicting which users may convert. The invention can also output the top N users who are most likely to convert and the most effective information delivery channels.
In this embodiment, the method is applied to the optimization of information delivery channels on the Internet. The system comprises:
1. Data collection module
This module is based on a web server. It records the complete browsing behavior of a given user by behavior tracking, and records the access features of the different channels through which the same piece of information is delivered by web log mining. This completes the collection of user information, which is stored on the web server.
2. Data preprocessing module
This module performs data cleaning, integration, and reduction. Data cleaning mainly consists of ignoring incomplete tuples and removing redundant records; ignoring tuples is acceptable because records with missing values make up only a very small proportion of the collected data. Data integration mainly unifies the units of the collected data. Data reduction mainly performs numerosity reduction: click times are converted into model parameters, and the result is a data set with four fields, namely user ID, information delivery channel, time, and click. Part of this data set is then extracted as the training set, and the remaining data is used as the test set. At this point the user information is in a standardized form that is convenient for subsequent use.
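The cleaning and splitting steps described above can be sketched in Python as follows. This is a minimal illustration, not the system's actual implementation: the field names, the `preprocess` function, the 70/30 split ratio, and the choice to split by user are all assumptions introduced here, and the times are assumed to be already expressed in a unified unit.

```python
import random

FIELDS = ("user_id", "channel", "time", "click")  # hypothetical field names


def preprocess(raw_records, train_fraction=0.7, seed=0):
    """Clean raw click records and split them into training and test sets."""
    seen = set()
    cleaned = []
    for rec in raw_records:
        # Data cleaning, step 1: ignore tuples with a missing value.
        if any(rec.get(f) is None for f in FIELDS):
            continue
        row = tuple(rec[f] for f in FIELDS)
        # Data cleaning, step 2: remove redundant (exactly duplicated) records.
        if row in seen:
            continue
        seen.add(row)
        cleaned.append(dict(zip(FIELDS, row)))

    # Assumption: split by user, so one user's history stays in a single set.
    users = sorted({r["user_id"] for r in cleaned})
    random.Random(seed).shuffle(users)
    train_users = set(users[: int(len(users) * train_fraction)])
    train = [r for r in cleaned if r["user_id"] in train_users]
    test = [r for r in cleaned if r["user_id"] not in train_users]
    return train, test
```

Splitting by user rather than by record is a design choice of this sketch; it keeps each user's full behavior sequence together, which the per-user model described later requires.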
3. Training module
This module is trained on the data in the training set and completes the parameter estimation of the additive probability model.
Based on how information is delivered in practice, the training module first makes the following assumptions:
(1) Each display of information exerts an influence on the user's conversion;
(2) The influence of each display on the user's conversion decays with time;
(3) The same information has the same influence strength and the same decay rate for all users;
(4) The influences of information delivered through different channels can be linearly superposed;
(5) The user's instantaneous conversion probability is proportional to the influence.
Based on the above assumptions, the training module establishes the additive probability model, that is, the conditional intensity function λ_u(t) of user behavior:
where: the users are denoted by the set {1, ..., U}, the information channels by the set {1, ..., n}, and the observed user behavior by the set {C_1, ..., C_U}. The behavior record C_u of user u consists of the pairs (k_i^u, t_i^u), i = 1, ..., l_u, together with x_u and t_u, where k_i^u is the information delivery channel id of the i-th behavior of user u, t_i^u is the time of the i-th behavior of user u, x_u is the conversion result of the user (x_u = 1 means the user converted, x_u = 0 means the user did not), and l_u is the total number of behaviors of user u. If user u converted, t_u is the conversion time; otherwise it is the end of the observation time window. α is the factor for the intensity with which information delivered through different channels influences the user, and ω is the factor for the decay of that influence over time. k is the information delivery channel id; α_k and ω_k are, respectively, the influence intensity factor and the time decay factor of information delivery channel k. T_u denotes the conversion time or the end of the observation time window.
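One form of the conditional intensity function that is consistent with assumptions (1) to (5) can be sketched as follows, under the additional assumption, not stated above, that each influence decays exponentially with time. The symbols k_i^u and t_i^u denote the channel id and the time of the i-th behavior of user u; these names are chosen here for illustration.

```latex
\lambda_u(t) \;=\; \sum_{i:\; t_i^u < t} \alpha_{k_i^u}\,
  \exp\!\bigl(-\omega_{k_i^u}\,(t - t_i^u)\bigr)
```

Each term is the influence of one display: α sets its initial strength, ω sets its decay rate, and the sum over past displays expresses the linear superposition of assumption (4).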
Next, to represent the user conversion rate, the survival function S_u(t) is established:
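In survival analysis, a conditional intensity λ_u(t) determines the survival function through S_u(t) = exp(-∫_0^t λ_u(s) ds). The Python sketch below evaluates both quantities for one user. It assumes exponentially decaying influences (the decay form is an assumption of this sketch), strictly positive decay factors, and hypothetical function and parameter names.

```python
import math


def intensity(touches, alpha, omega, t):
    """lambda_u(t): sum of the decayed influences of all touches before time t.

    touches: list of (channel_id, time) pairs for one user.
    alpha, omega: dicts mapping channel_id to the intensity and decay factors.
    """
    return sum(
        alpha[k] * math.exp(-omega[k] * (t - ti))
        for k, ti in touches
        if ti < t
    )


def survival(touches, alpha, omega, t):
    """S_u(t) = exp(-integral of lambda_u from 0 to t), in closed form.

    Assumes omega[k] > 0 for every channel (exponential-decay assumption).
    """
    cumulative = sum(
        alpha[k] / omega[k] * (1.0 - math.exp(-omega[k] * (t - ti)))
        for k, ti in touches
        if ti < t
    )
    return math.exp(-cumulative)
```

With this form, 1 - S_u(t) is the probability that the user has converted by time t, which is the quantity used for scoring users in the conversion rate prediction module.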
The parameters are then estimated with the EM-like algorithm:
E-step:
M-step:
Let  ; this yields:
This completes the training process.
4. Information delivery channel contribution prediction module
This module applies the trained additive probability model to the test set and obtains the contribution of each information delivery channel.
The contribution of information delivery channel m can be written as:
The contributions are then summed according to the website or type to which each information delivery channel m belongs, giving the contribution of each website and each type. Finally, the websites or types with high contribution are chosen for pushing information, which ensures that the chosen push channels are effective.
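The aggregation and ranking step can be sketched in Python as follows. The per-channel contributions are taken as given inputs, and the function name `rank_by_group` and the mapping from channel id to website or type are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def rank_by_group(channel_contribution, channel_group):
    """Sum channel contributions per website (or type) and rank them.

    channel_contribution: dict mapping channel id to its contribution.
    channel_group: dict mapping channel id to its website or type.
    Returns a list of (group, total contribution), highest first.
    """
    totals = defaultdict(float)
    for channel, contribution in channel_contribution.items():
        totals[channel_group[channel]] += contribution
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Information would then be pushed through the groups at the head of the returned list.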
5. Conversion rate prediction module
This module predicts whether a user will convert. It scores each user with 1 - S_u(T_u), sorts the users by score, and selects the N users with the highest scores as the users who are likely to convert. Information is then pushed to these predicted converters, which makes information recommendation more targeted and improves the push effect.
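The scoring and selection step can be sketched in Python as follows, assuming that the survival value S_u(T_u) of each user has already been computed. The function name and input layout are hypothetical.

```python
def top_n_likely_converters(survival_at_T, n):
    """Score each user with 1 - S_u(T_u) and return the n highest-scoring users.

    survival_at_T: dict mapping user id to S_u(T_u).
    Returns a list of (user id, score), highest score first.
    """
    scores = {user: 1.0 - s for user, s in survival_at_T.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```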
6. Distributed computing based on the Hadoop platform
The data in the data set are assigned programmatically to multiple mappers, which produce a batch of intermediate results in the form <key, value>. The reducers then process the intermediate results and merge the items that have the same key. The merged result is output as the result α, ω of the current iteration. This result is then fed back into the mappers as parameters, which realizes the iterative computation of the parameter estimation. In this way, one complex task is divided into many finer-grained subtasks. These subtasks can be scheduled among idle processing nodes, so that faster nodes process more tasks. This prevents slow nodes from prolonging the completion time of the whole task and increases computing speed. It also avoids waiting between tasks and saves system resources.
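The map, group-by-key, and reduce flow described above can be illustrated with a small in-process Python simulation. This sketch does not use the Hadoop API, and the per-record statistics it accumulates (a touch count and the elapsed time to the end of observation) are hypothetical stand-ins for whatever per-channel sums one iteration of the parameter estimation would need.

```python
from collections import defaultdict


def mapper(record):
    """Emit one <key, value> pair per behavior record, keyed by channel id."""
    # Hypothetical per-record statistics: (touch count, elapsed time to T).
    yield record["channel"], (1, record["T"] - record["time"])


def reducer(key, values):
    """Merge all intermediate values that share the same key."""
    count = sum(v[0] for v in values)
    elapsed = sum(v[1] for v in values)
    return key, (count, elapsed)


def map_reduce(records):
    """Run the map, shuffle, and reduce phases in a single process."""
    groups = defaultdict(list)
    for record in records:  # map phase (spread across mappers on a cluster)
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group pairs by key
    # Reduce phase: one merged result per key.
    return dict(reducer(key, values) for key, values in groups.items())
```

In an iterative setting, the merged per-channel results would be turned into updated parameters and passed back to the mappers for the next round.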
Implementation Results
The above technical solution was evaluated on a real data set.
First, the invention uses the F1 score to assess the quality of the system.
The F1 score is computed as follows:
F1 = 2PR / (P + R)
where P is the precision, equal to (the number of user IDs for which the prediction agrees with the actual outcome) / (the total number of user IDs predicted), and R is the recall, equal to (the number of converted user IDs for which the prediction agrees with the actual outcome) / (the total number of converted user IDs in the test set).
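The metric can be computed as in the following Python sketch, which treats the prediction and the ground truth as sets of user IDs. The function name and the convention of returning 0.0 for an empty denominator are choices made here for illustration.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for predicted converters.

    predicted: iterable of user IDs predicted to convert.
    actual: iterable of user IDs that actually converted in the test set.
    """
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)  # predictions that agree with reality
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```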
In Fig. 4(a), the F1 score of the invention is clearly higher than those of the last-click method and logistic regression. This shows that the invention predicts the top N users who may convert more accurately than the other two systems.
Next, the three systems are compared with precision as the abscissa and recall as the ordinate. Fig. 4(b) shows that, at the same recall, the precision of the invention is higher than that of the other two systems. It is worth noting that the invention still performs very well when the recall reaches about 0.9; that is, the invention remains highly practical even when nearly all of the data is covered.
The above tests show that the distributed-computing-based system for optimizing Internet information delivery channels of the invention effectively improves both the accuracy of predicting the contribution of different information delivery channels and the accuracy of predicting user conversion, and therefore gives better prediction results and meets the needs of users. All data processing in the invention is based on the Hadoop platform and is carried out in parallel on multiple computers, which greatly reduces the computing power and memory required of each computer when processing big data and greatly increases computing speed.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the particular embodiments described above, and those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.