Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide an Internet information delivery channel optimization system based on distributed computing. The system uses users' browsing behavior to optimize the selection of information delivery channels, so that Internet information is recommended more accurately and user needs are met.
To achieve the above object, the present invention adopts the following technical solution:
The invention provides an Internet information delivery channel optimization system based on distributed computing. The system comprises a data collection module, a data preprocessing module, a training module, an information delivery channel contribution prediction module, and a conversion rate prediction module, wherein:
Data collection module: this module collects user behavior data through a web server. The collected user behavior is divided into two parts: one part records the complete browsing behavior of a given user, and the other part records the access characteristics of the different channels through which the same information is delivered;
Data preprocessing module: this module cleans, integrates, and reduces the user behavior data collected by the server, so that the collected user behavior information is simplified and standardized;
Training module: the input of this module is the training set. It performs iterative computation with an EM-like algorithm until the two parameters of the cumulative probability model, namely the user influence intensity factor and the factor describing the decay of influence over time, converge, thereby completing the estimation of these two parameters;
Information delivery channel contribution prediction module: the input of this module is the test set. It constructs the contribution of each information delivery channel m, then sums the contributions according to the website or type to which each channel m belongs, yielding the contribution of each website and each type. Finally, the websites and types are sorted by contribution from high to low, and the top-ranked websites or types are selected for information pushing, so as to obtain a better delivery effect;
Conversion rate prediction module: the input of this module is the test set. It uses the survival function to score each user, predicts the users who are most likely to convert, and pushes Internet information to these users.
Distributed computing based on the Hadoop platform: all computation in the above modules is carried out on the Hadoop platform. Complex computation is distributed across multiple nodes, which enables tasks to be processed in parallel, reduces waiting between tasks, allocates resources more reasonably, and greatly increases computation speed.
Compared with the prior art, the present invention has the following beneficial effects:
The Internet information delivery channel optimization system based on distributed computing proposed by the invention can greatly improve the accuracy of predicting the contribution of each information delivery channel, which makes it easy to choose the most effective websites or types for displaying information. It also selects the group of users most likely to convert, making information recommendation more targeted. The best recommendation effect can therefore be obtained at the lowest cost. In addition, all data processing in the invention is based on the Hadoop platform, which enables parallel processing across multiple computers. This greatly reduces the computing power and memory required of each computer when processing large data sets and, at the same time, greatly increases computation speed.
Embodiment
The present invention is described in detail below with reference to a specific embodiment. The following embodiment will help those skilled in the art to further understand the invention, but does not limit the invention in any form. It should be pointed out that those skilled in the art can make a number of variations and improvements without departing from the concept of the invention. All of these fall within the protection scope of the present invention.
Figure 1 shows the server-based information collection model of the present invention. The figure shows the collection of user information and the formation of user profiles. The recommendation module built by the invention is stored on the server and processed by the server; the client computer used by the user is not responsible for storing or processing user information.
As shown in Figure 2, the Internet information delivery channel optimization system based on distributed computing comprises:
Data collection module: uses a web server to collect user behavior. The collected user behavior is divided into two parts: web page browsing messages and information messages. A web page browsing message records the complete browsing behavior of a given user and reflects the characteristics of that user's page browsing. An information message records the access characteristics of the different channels through which the same information is delivered and reflects the click history and characteristics of each information delivery channel.
Data preprocessing module: performs data cleaning, integration, and reduction on the user behavior data collected by the server.
Training module: takes the data in the training set as input and, based on maximum likelihood estimation, performs iterative computation with an EM-like algorithm, thereby completing the parameter estimation of the cumulative probability model;
Information delivery channel contribution prediction module and conversion rate prediction module: take the parameters obtained from the training set and apply them to the test data, thereby predicting the contribution of each information delivery channel and predicting whether each user will convert.
Figure 3 shows the distributed computing framework of the present invention, which is based on the Hadoop platform. All computation in the modules of the Internet information delivery channel optimization system is carried out on the Hadoop platform. Complex computation is distributed across multiple nodes and processed in parallel, which saves a large amount of system resources and greatly increases computation speed.
As shown in Figure 4, the present embodiment provides an Internet information delivery channel optimization system based on distributed computing, which is trained and tested on a real data set. The embodiment is compared with the two systems most widely used at present in the field of Internet information delivery contribution prediction: a last-click system and a logistic regression system. The experimental results show that the present invention outperforms both systems, both in the accuracy of predicting the contribution of different channels and in the accuracy of predicting which users are likely to convert. The invention can also output the top N users most likely to convert and the most effective information delivery channels.
The present embodiment applies the described method to the optimization of information delivery channels on the Internet. The system comprises:
1. Data collection module
This module is based on a web server. It uses behavior tracking to record the complete browsing behavior of a given user and log mining to record the access characteristics of the different channels through which the same information is delivered. This completes the collection of user information, which is stored on the web server.
2. Data preprocessing module
This module performs data cleaning, integration, and reduction. Data cleaning mainly ignores tuples with missing values and removes redundant records; ignoring such tuples is acceptable because records with missing values make up only a very small proportion of the collected data. Data integration mainly unifies the units of the collected data. Data reduction mainly performs numerosity reduction: click times are converted into model parameters, and the result is a data set with four fields: user ID, information delivery channel, time, and click. Part of this data set is then extracted as the training set, and the remaining data form the test set. At this point the user information is in a standard form that is convenient for the subsequent use of the data.
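The cleaning and splitting steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tuple layout `(user_id, channel, time, click)` follows the four fields named in the text, while the function names, the use of `None` for a missing value, and the positional split with a `train_fraction` parameter are assumptions made for the example.

```python
def clean(records):
    """Drop tuples with missing values and remove exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(value is None for value in rec):
            continue  # ignore tuples that contain a missing value
        if rec in seen:
            continue  # remove redundant (duplicate) records
        seen.add(rec)
        cleaned.append(rec)
    return cleaned


def split(records, train_fraction=0.5):
    """Split records by position into a training set and a test set.

    The split rule and the default fraction are hypothetical; the text
    only says that part of the data is extracted as the training set.
    """
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


raw = [
    ("u1", "c1", 10, 1),
    ("u1", "c1", 10, 1),   # duplicate record
    ("u2", None, 5, 0),    # missing channel
    ("u2", "c2", 7, 0),
    ("u3", "c1", 3, 1),
    ("u4", "c3", 8, 0),
]
train, test = split(clean(raw))
```

In practice the split would more likely be made per user, so that all records of one user end up in the same set; the positional split is used here only to keep the sketch short.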
3. Training module
This module trains on the data in the training set and completes the parameter estimation of the cumulative probability model.
Based on how information is delivered in practice, the training module first makes the following assumptions:
(1) each display of information exerts an influence on the user's conversion;
(2) the influence of each display of information on the user's conversion decays over time;
(3) the same information has the same influence and the same decay rate for all users;
(4) the influences of information delivered through different channels can be linearly superposed;
(5) the user's instantaneous conversion probability is proportional to the influence.
Based on the above assumptions, the training module sets up the cumulative probability model, that is, the conditional intensity function of user behavior $\lambda_u(t)$:
(equation missing in the source)
where the users form the set $\{1, \ldots, U\}$, the information channels form the set $\{1, \ldots, n\}$, and the observed user behavior is the set $\{C_1, \ldots, C_U\}$. The behavior record $C_u$ of user $u$ contains, for each behavior $i = 1, \ldots, l_u$, the ID of the information delivery channel of the $i$-th behavior and the time of the $i$-th behavior, together with the conversion result $x_u$ of the user ($x_u = 1$ means that the user converted, $x_u = 0$ means that the user did not). Here $l_u$ is the total number of behaviors of user $u$, and $T_u$ is the conversion time if user $u$ converted, or otherwise the end of the observation window. $\alpha$ is the influence intensity factor of information delivered through the different channels, $\omega$ is the factor describing how the influence decays over time, $k$ is the information delivery channel ID, and $\alpha_k$ and $\omega_k$ are the influence intensity factor and the time-decay factor of channel $k$, respectively.
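As an illustration only, assumptions (1) to (5) are compatible with an intensity in which every exposure through channel k contributes `alpha[k] * exp(-omega[k] * (t - t_i))` and the contributions are summed. The exponential form of the decay is an assumption of this sketch, since the formula itself is not reproduced in this text; the function name `intensity` and the `(channel_id, time)` record layout are likewise hypothetical.

```python
import math


def intensity(touches, t, alpha, omega):
    """Assumed conditional intensity of one user at time t.

    touches: list of (channel_id, time) pairs for the user.
    alpha, omega: dicts mapping channel_id to the influence intensity
    factor and the time-decay factor of that channel.
    Only exposures that happened before t contribute, and their
    contributions are added (linear superposition).
    """
    return sum(
        alpha[k] * math.exp(-omega[k] * (t - t_i))
        for k, t_i in touches
        if t_i < t
    )


alpha = {"c1": 2.0, "c2": 1.0}
omega = {"c1": 1.0, "c2": 0.5}
touches = [("c1", 0.0), ("c2", 1.0)]
value = intensity(touches, 2.0, alpha, omega)
```

With these inputs the two exposures contribute `2.0 * exp(-2.0)` and `1.0 * exp(-0.5)` respectively, and `value` is their sum.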
Then, to represent the user's conversion rate, the survival function $S_u(t)$ is set up:
(equation missing in the source)
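In survival analysis, the survival function is related to the intensity by S(t) = exp(-integral of the intensity from 0 to t). The sketch below applies that standard relation under the assumption that each exposure decays exponentially, in which case the integral has a closed form. The exponential kernel, the function name `survival`, and the `(channel_id, time)` layout are assumptions of this example and are not taken from the source.

```python
import math


def survival(touches, t, alpha, omega):
    """Assumed probability that the user has not converted by time t.

    For an exposure at time t_i through channel k, the integral of
    alpha[k] * exp(-omega[k] * (s - t_i)) for s from t_i to t equals
    alpha[k] / omega[k] * (1 - exp(-omega[k] * (t - t_i))).
    """
    cumulative = sum(
        alpha[k] / omega[k] * (1.0 - math.exp(-omega[k] * (t - t_i)))
        for k, t_i in touches
        if t_i < t
    )
    return math.exp(-cumulative)


alpha = {"c1": 1.0}
omega = {"c1": 1.0}
touches = [("c1", 0.0)]
s_early = survival(touches, 1.0, alpha, omega)
s_late = survival(touches, 5.0, alpha, omega)
```

The quantity `1 - survival(...)` evaluated at the end of the observation window is then the kind of score that a conversion prediction step can rank users by.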
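The EM-like algorithm is then applied. The equations of the E-step and the M-step, and the parameter update expressions derived from them, are missing in the source. Iterating these steps until the parameters converge completes the training process.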
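Because the update equations are not reproduced in this text, the following is only a sketch of what one EM-style iteration could look like under an assumed exponential-decay intensity. It shows the E-step (the share of each conversion credited to each exposure) and a closed-form update of the intensity factors alpha with the decay factors omega held fixed; the update of omega is deliberately left out. All names and the `(touches, converted, T)` record layout are hypothetical.

```python
import math
from collections import defaultdict


def log_likelihood(users, alpha, omega):
    """Log-likelihood under the assumed exponential-decay intensity."""
    total = 0.0
    for touches, converted, T in users:
        # Subtract the integrated intensity over the observation period.
        total -= sum(
            alpha[k] / omega[k] * (1.0 - math.exp(-omega[k] * (T - t)))
            for k, t in touches
        )
        if converted:
            # Add the log intensity at the conversion time.
            total += math.log(
                sum(alpha[k] * math.exp(-omega[k] * (T - t)) for k, t in touches)
            )
    return total


def em_step_alpha(users, alpha, omega):
    """One E-step plus the alpha part of the M-step (omega held fixed)."""
    credited = defaultdict(float)  # conversions credited to each channel
    exposure = defaultdict(float)  # integrated decay kernel per channel
    for touches, converted, T in users:
        weights = [alpha[k] * math.exp(-omega[k] * (T - t)) for k, t in touches]
        norm = sum(weights)
        for (k, t), w in zip(touches, weights):
            exposure[k] += (1.0 - math.exp(-omega[k] * (T - t))) / omega[k]
            if converted and norm > 0:
                credited[k] += w / norm  # E-step responsibility
    # Channels with no exposure keep their previous value.
    return {
        k: (credited[k] / exposure[k] if exposure[k] > 0 else alpha[k])
        for k in alpha
    }


users = [
    ([("a", 0.0), ("b", 1.0)], True, 2.0),
    ([("a", 0.5)], False, 3.0),
    ([("b", 0.2), ("b", 1.5)], True, 2.5),
]
alpha = {"a": 0.5, "b": 0.5}
omega = {"a": 1.0, "b": 1.0}
before = log_likelihood(users, alpha, omega)
new_alpha = em_step_alpha(users, alpha, omega)
after = log_likelihood(users, new_alpha, omega)
```

Since the alpha update maximizes the EM lower bound for fixed omega, `after` is never smaller than `before`, which is a convenient sanity check when experimenting with such an update.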
4. Information delivery channel contribution prediction module
This module applies the trained cumulative probability model to the test set and obtains the contribution of each information delivery channel.
The contribution of information delivery channel m can be written as:
(equation missing in the source)
The contributions are then summed according to the website or type to which each information delivery channel m belongs, yielding the contribution of each website and each type. Finally, the websites or types with the highest contribution are chosen for information pushing, which ensures that the chosen push channels are highly effective.
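The summing and ranking step can be sketched as below. The per-channel contribution values are taken as given inputs, because the contribution formula itself is not reproduced in this text; the function name, the example channel IDs, and the site names are hypothetical.

```python
from collections import defaultdict


def rank_groups(channel_contribution, channel_group):
    """Sum channel contributions per website (or type) and rank them.

    channel_contribution: dict mapping channel ID to its contribution.
    channel_group: dict mapping channel ID to the website or type it
    belongs to.
    Returns (group, total) pairs sorted from high to low.
    """
    totals = defaultdict(float)
    for channel, contribution in channel_contribution.items():
        totals[channel_group[channel]] += contribution
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


contribution = {"m1": 0.1, "m2": 0.5, "m3": 0.3}
site_of = {"m1": "siteA", "m2": "siteB", "m3": "siteA"}
ranking = rank_groups(contribution, site_of)
```

Here `siteB` ranks first with a total of 0.5 and `siteA` second with about 0.4, so `siteB` would be the first candidate for information pushing.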
5. Conversion rate prediction module
This module predicts whether a user will convert. It scores each user with $1 - S(T_u)$, sorts the users by score from high to low, and selects the top N users with the highest scores as the users who are likely to convert. Information is then pushed to these predicted converters, which makes information recommendation more targeted and improves the push effect.
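Selecting the top N users from their scores can be sketched as follows. The scores are assumed to have been computed already as 1 - S(T_u); the function name and the example values are hypothetical.

```python
def top_n_users(scores, n):
    """Return the n user IDs with the highest conversion scores.

    scores: dict mapping user ID to the score 1 - S(T_u).
    """
    return sorted(scores, key=scores.get, reverse=True)[:n]


scores = {"u1": 0.2, "u2": 0.9, "u3": 0.5, "u4": 0.1}
selected = top_n_users(scores, 2)
```

With these example scores the selected users are `u2` and `u3`, in that order.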
6. Distributed computing based on the Hadoop platform
The data in the data set are assigned programmatically to a number of different mappers, which produce a batch of intermediate results of the form <key, value>. The reducer processes the intermediate results and merges the items that have the same key. The merged result is then output as the result α, ω of the current iteration. This result is fed back into the mappers as parameters, which realizes the iterative computation of the parameter estimation. In this way a complex task is divided into many finer-grained subtasks. These subtasks can be scheduled across idle processing nodes, so that faster nodes process more tasks and slow nodes do not prolong the completion time of the whole task, which increases computation speed. At the same time, waiting between tasks is avoided and system resources are saved.
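The mapper and reducer roles described above can be imitated in plain Python to show the data flow. This sketch does not use any Hadoop API and runs in a single process; the function names and the choice of the channel ID as the key are assumptions made for the example.

```python
from collections import defaultdict


def mapper(partition):
    """Emit one <key, value> pair per record of a data partition."""
    for channel, value in partition:
        yield channel, value


def reducer(pairs):
    """Merge intermediate pairs that share the same key by summing."""
    merged = defaultdict(float)
    for key, value in pairs:
        merged[key] += value
    return dict(merged)


def run(partitions):
    """Apply the mapper to every partition, then reduce all pairs."""
    intermediate = [pair for part in partitions for pair in mapper(part)]
    return reducer(intermediate)


partitions = [
    [("a", 1.0), ("b", 2.0)],  # records handled by the first mapper
    [("a", 3.0)],              # records handled by the second mapper
]
result = run(partitions)
```

In an iterative parameter estimation, the merged sums of one pass would be turned into updated parameters and passed to the mappers of the next pass; on a real cluster the partitions would be processed on different nodes rather than in one list comprehension.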
Implementation results
The above technical solution was evaluated on a real data set.
First, the quality of the system is evaluated by the F1 score.
The F1 score is computed as $F_1 = \frac{2PR}{P + R}$,
where P is the precision, equal to (number of predicted IDs that match the actual result)/(total number of predicted IDs), and R is the recall, equal to (number of predicted IDs that match an actual conversion)/(total number of IDs with a conversion in the test set).
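The evaluation measure can be computed as in the short sketch below, which uses the standard definition of the F1 score as the harmonic mean of precision and recall. The function name and the example ID sets are hypothetical.

```python
def f1_score(predicted_ids, converted_ids):
    """F1 score of a predicted set of converters against the actual set."""
    predicted = set(predicted_ids)
    actual = set(converted_ids)
    hits = len(predicted & actual)  # predictions that match a real conversion
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


predicted = ["u1", "u2", "u3", "u4"]
converted = ["u2", "u3", "u5"]
score = f1_score(predicted, converted)
```

In this example two of the four predicted IDs are real converters, so the precision is 1/2, the recall is 2/3, and the F1 score is 4/7.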
As (a) in Figure 4 clearly shows, the F1 score of the present invention is higher than that of the last-click and logistic regression systems. This indicates that the invention predicts the top N users likely to convert far more accurately than the other two systems.
The three systems are then compared with precision on the horizontal axis and recall on the vertical axis. As (b) in Figure 4 shows, at the same recall the precision of the present invention is higher than that of the other two systems. It is worth noting that the invention still performs very well when the recall reaches about 0.9; that is, the invention remains highly practical even when nearly all of the data are covered.
The above tests show that the Internet information delivery channel optimization system based on distributed computing effectively improves both the accuracy of predicting the contribution of different information delivery channels and the accuracy of predicting user conversion, thereby giving better prediction results and meeting user needs. All data processing in the invention is based on the Hadoop platform, which enables parallel processing across multiple computers. This greatly reduces the computing power and memory required of each computer when processing large data sets and, at the same time, greatly increases computation speed.
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the specific implementations described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.