CN107622086A

CN107622086A - Click-through rate prediction method and device

Info

Publication number: CN107622086A
Application number: CN201710701071.8A
Authority: CN
Inventors: 王颖帅; 李晓霞; 苗诗雨
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-08-16
Filing date: 2017-08-16
Publication date: 2018-01-23

Abstract

The invention discloses a click-through rate prediction method and device, and relates to the field of computer technology. One embodiment of the method includes: starting a Spark cluster and importing the relevant classes of the machine learning library MLlib; obtaining and parsing source data, and then dividing the source data into a training set, a validation set, and a test set; creating a click-through rate prediction ensemble gradient boosting tree model in MLlib, and then training the model in Spark. The embodiment enables intelligent click-through rate prediction.

Description

A click-through rate prediction method and device

Technical field

The present invention relates to the field of computer technology, and more particularly to a click-through rate prediction method and device.

Background

At present, with the development of the internet, data volumes keep growing. How to find useful information in large and heterogeneous data, and how to help users find content of interest on an e-commerce website, has become a new challenge. Take the personalized ranking of discovery articles as an example. The main logic of the prior art is as follows: on the basis of an underlying article-label model built by data analysts (article numbers clustered under different labels), an article material recall pool is filtered out for each user according to the user's real-time and offline label preferences for articles. (When discovery-article material is ranked, a comparatively large amount of article material is first pulled according to the user's label preferences, category preferences, and so on; all of the pulled article material makes up the recall pool.) An analyst then provides a scoring formula based on business experience, and the article material is ranked with it.

In the course of implementing the present invention, the inventors found at least the following problems in the prior art. When personalized ranking is performed on specific business data (for example, articles or commodities), a recall pool must first be drawn for the user, and an analyst then provides a scoring formula for the recall pool based on business experience. As the business data changes, this formula has to be revised frequently, the analyst has to invest considerable effort in analysis, and the server side also has to update its code frequently. In summary, the prior art consumes considerable resources and is not an artificially intelligent recommendation algorithm.

Summary of the invention

In view of this, embodiments of the present invention provide a click-through rate prediction method and device that enable intelligent click-through rate prediction.

To achieve the above object, according to one aspect of the embodiments of the present invention, a click-through rate prediction method is provided, including: starting a Spark cluster and importing the relevant classes of the machine learning library MLlib; obtaining and parsing source data, and then dividing the source data into a training set, a validation set, and a test set; creating a click-through rate prediction ensemble gradient boosting tree model in MLlib, and then training the model in Spark.

Optionally, obtaining and parsing the source data includes: loading, by Spark, the source data stored in an HDFS directory, and then parsing the source data.

Optionally, the click-through rate prediction ensemble gradient boosting tree model is a strong learner that uses decision trees as weak learners and is formed by combining at least two weak learners.

Optionally, creating the click-through rate prediction ensemble gradient boosting tree model includes: training weak learner 1 on the training set with initial weights, and computing the learning error of weak learner 1; updating the weights of the training samples according to the learning error of weak learner 1; training weak learner 2 on the training set with the adjusted weights, and repeating the training until the number of weak learners reaches a preset number n; and integrating the n weak learners through a combination strategy to obtain the strong learner.

Optionally, after the click-through rate prediction ensemble gradient boosting tree model is trained in Spark, the method includes: evaluating the trained model in Spark MLlib.

Optionally, after the click-through rate prediction ensemble gradient boosting tree model is trained in Spark, the method further includes: saving the model to an HDFS path.

In addition, according to another aspect of the embodiments of the present invention, a click-through rate prediction device is provided, including: a starting module for starting a Spark cluster and importing the relevant classes of the machine learning library MLlib; a data acquisition module for obtaining and parsing source data, and then dividing the source data into a training set, a validation set, and a test set; and a model forming module for creating a click-through rate prediction ensemble gradient boosting tree model in MLlib, and then training the model in Spark.

Optionally, the data acquisition module obtaining and parsing the source data includes: loading, by Spark, the source data stored in an HDFS directory, and then parsing the source data.

Optionally, the model forming module creating the click-through rate prediction ensemble gradient boosting tree model includes: training weak learner 1 on the training set with initial weights, and computing the learning error of weak learner 1; updating the weights of the training samples according to the learning error of weak learner 1; training weak learner 2 on the training set with the adjusted weights, and repeating the training until the number of weak learners reaches a preset number n; and integrating the n weak learners through a combination strategy to obtain the strong learner.

Optionally, after the model forming module trains the click-through rate prediction ensemble gradient boosting tree model in Spark, the module evaluates the trained model in Spark MLlib.

Optionally, after the model forming module trains the click-through rate prediction ensemble gradient boosting tree model in Spark, the module further saves the model to an HDFS path.

According to another aspect of the embodiments of the present invention, an electronic device is further provided, including:

one or more processors; and

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any of the above embodiments.

According to another aspect of the embodiments of the present invention, a computer-readable medium is further provided, on which a computer program is stored; when the program is executed by a processor, the method of any of the above embodiments is implemented.

One embodiment of the above invention has the following advantage or beneficial effect: because a click-through rate prediction ensemble gradient boosting tree model is formed in a Spark environment, the technical problem that an analyst traditionally has to supply a click-through rate prediction formula based on business experience is overcome, and the online business as a whole becomes essentially intelligent.

Further effects of the above non-conventional optional implementations are described below in connection with the specific embodiments.

Brief description of the drawings

The accompanying drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of the main flow of a click-through rate prediction method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the main flow of a click-through rate prediction method according to a referable embodiment of the present invention;

Fig. 3 is a schematic diagram of the main modules of a click-through rate prediction device according to an embodiment of the present invention;

Fig. 4 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;

Fig. 5 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.

Fig. 1 shows a click-through rate prediction method according to an embodiment of the present invention. As shown in Fig. 1, the click-through rate prediction method includes:

Step S101: start a Spark cluster and import the relevant classes of the machine learning library MLlib.

Spark is a general-purpose cluster computing system for processing big-data workflows. Spark outperforms MapReduce in speed, ease of use, and analytical capability. Spark also provides a unified runtime interface for connecting to various big-data storage sources, such as HDFS, as well as a large number of top-level libraries for different big-data computing tasks, such as the machine learning library MLlib.

In an embodiment, the Spark cluster development environment is started, that is, a connection to the Spark cluster is opened. The relevant classes of the machine learning library MLlib are then imported; in other words, the machine learning packages in Spark need to be called, and these packages are all located in MLlib.

Step S102: obtain and parse source data, and then divide the source data into a training set, a validation set, and a test set.

As an embodiment, the source data is read and parsed in a Spark program. The source data is stored in a directory of the cluster HDFS (the Hadoop Distributed File System, a distributed file system designed to run on commodity hardware), so the Spark program has to be connected to the HDFS storage directory. Preferably, the source data stored in the HDFS directory is loaded and parsed, and the parsed source data is divided into a training set, a validation set, and a test set.

In addition, the server-side exposure log is parsed and landed. That is, the server records the ranked articles shown to a user, from the first to the last; this record is called the exposure log. The log is originally in JSON format; after it is parsed by the data developers, it is landed in Hive tables.

Step S103: create a click-through rate prediction ensemble gradient boosting tree model in MLlib, and then train the model in Spark.

As an embodiment, the click-through rate prediction ensemble gradient boosting tree model combines multiple individual learners to obtain generalization performance that is significantly better than that of a single individual learner. (In the description below, an individual learner is called a weak learner, and the combination of multiple weak learners is called a strong learner.) Further, the criteria for choosing individual learners are that each individual learner must have a certain accuracy, so that its predictive ability is not too poor, and that the individual learners must be diverse.

Further, the click-through rate prediction ensemble gradient boosting tree model may use decision trees as weak learners. The model, that is, the strong learner, can then be expressed as an additive model of decision trees.
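The additive form mentioned above can be written out as follows. This is only a sketch in standard boosting notation: the symbols $h_m$ (the m-th decision tree), $\alpha_m$ (its combination weight), and $F_m$ (the ensemble after m trees) are introduced here for illustration and do not appear in the original text.

```latex
F_n(x) = \sum_{m=1}^{n} \alpha_m \, h_m(x),
\qquad
F_m(x) = F_{m-1}(x) + \alpha_m \, h_m(x), \quad F_0(x) = 0 .
```

Read this way, each training round adds one tree to the running sum, and the strong learner is the sum after the preset number n of rounds.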

Preferably, when the click-through rate prediction ensemble gradient boosting tree model is created, weak learner 1 is first trained on the training set with initial weights, and the learning error of weak learner 1 is computed. The weights of the training samples are then updated according to the learning error of weak learner 1, and weak learner 2 is trained on the training set with the adjusted weights. The training is repeated until the number of weak learners reaches a preset number n. Finally, the n weak learners are integrated through a combination strategy to obtain the strong learner.

As another embodiment, after the click-through rate prediction ensemble gradient boosting tree model is trained in Spark, the trained model needs to be evaluated in Spark MLlib. Further, online follow-up testing can be performed on the model.

Furthermore, the trained click-through rate prediction ensemble gradient boosting tree model can be saved to an HDFS path. When click-through rate prediction is needed, the saved HDFS path can be loaded in order to call the model.

As can be seen from the embodiments described above, compared with the traditional approach in which an analyst supplies a click-through rate prediction formula based on business experience, the click-through rate prediction method of the present invention makes the online business as a whole essentially intelligent and saves the analyst the time spent analyzing feature coefficients. Secondly, the input features can be configured automatically, and the algorithm is more comprehensive in mining latent cross features, so the predicted ranking makes fuller use of the effective information. Meanwhile, against the background of the rapid development of internet artificial intelligence and big data, the click-through rate prediction method is of significant value in promoting the intelligent construction of personalized recommendation platforms.

Fig. 2 is a schematic diagram of the main flow of a click-through rate prediction method according to a referable embodiment of the present invention. The click-through rate prediction method may include:

Step S201: start a Spark cluster and import the relevant classes of the machine learning library MLlib.

Step S202: load the source data stored in an HDFS directory, and then parse the source data.

In an embodiment, the source data is read and parsed in a Spark program. The source data is stored in a directory of the cluster HDFS (the Hadoop Distributed File System, a distributed file system designed to run on commodity hardware), so the Spark program has to be connected to the HDFS storage directory.

Further, wide tables of valid data are stored in the HDFS directory (a wide table is a database table in which the indicators, dimensions, and attributes related to a business subject are joined together). For example, if the source data stored in the HDFS directory is article channel data, the wide tables may include: a user attribute wide table, a category dimension wide table, an article material dimension wide table, and a user behavior preference dimension wide table.

Further, the user attribute wide table includes: user gender; the user's usual shipping province; the user's usual shipping city; whether the user has children; whether the user is a PLUS member (PLUS membership is one way of differentiating user levels; a PLUS member receives 5 free-shipping coupons per month and a discount on some commodities); user purchasing-power level; the user's browsing loyalty toward commodities (after browsing a commodity, the user browses commodities in the same category again some time later, which indicates that the earlier click was not accidental); the user's order loyalty toward commodities (after buying a commodity, the user buys it again some time later, which indicates that the user is loyal); the user's spending power; user age; and the user value score (a judgment, based on dimensions such as the orders and amounts the user can bring, of whether the user browses often and is willing to place orders). The category dimension wide table includes: first-level category id, second-level category id, third-level category id, third-level category quality score, third-level category customer unit price, and third-level category shopping cycle.

It should be noted that commodities have a "first-level category", a "second-level category", and a "third-level category". For example, "large household appliances" is a first-level category, "Haier" is a second-level category, and "Haier refrigerator" is a third-level category; each third-level category has a numeric ID corresponding to its name. Further, "third-level category customer unit price" is the price obtained by deduplicating all commodities under the third-level category and dividing the total amount by the number of third-level categories. In addition, "shopping cycle" is the time interval between a user buying a commodity and buying that commodity a second time.

The article material dimension wide table includes: article commodity material score (if the article recommended to the user recommends a commodity, the quality score of that commodity); article history material score (a composite score based on how the article recommended to the user has performed over a past period); article publisher quality score (an overall evaluation score for the author of the article); the label of the article material (the label the article belongs to after articles are aggregated under labels); gender of the commodity in the article material (the probability that the commodity recommended in the article is a men's item and the probability that it is a women's item); whether the commodity in the article material is a new product (whether the recommended commodity was listed only recently); whether the commodity in the article material belongs to a premium brand (whether the brand of the recommended commodity is in JD.com's premium brand library); whether the commodity in the article material is a PLUS commodity (whether the recommended commodity is designated as carrying a discount for PLUS members); material similarity feature score (whether the article currently shown to the user is similar to an article the user clicked recently: 1 if so, otherwise 0); popular article material score (whether the article currently shown to the user is a popular article: 1 if so, otherwise 0); and popular label (whether the label of the article shown to the user is the label of a currently popular article: 1 if so, otherwise 0).

The user behavior preference dimension wide table includes: the user's real-time label preference for articles, the user's real-time extended label preference for articles, the user's offline label preference for articles, and the user's offline extended label preference for articles. An extended label means that the user has no behavior on the article, but an article label preference for the user is nevertheless inferred from the preferences of associated users. An offline label is a label preference for which the user has direct behavior; the program that computes it runs once per period of time, hence the name "offline".

In a preferred embodiment, the required features can be selected from the data wide tables and loaded according to actual business needs. The selected features are those related to the target variable, and they are loaded into the training set data. For example, only the following features may be loaded: article commodity material score, history material score, publisher quality score, material score, label of the article material, user gender, material similarity feature score, popular article material score, popular label feature, real-time label preference score, real-time extended label preference score, offline label preference score, offline extended label preference score, real-time material score, and total article material score.

Step S203: divide the parsed source data into a training set, a validation set, and a test set.
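The division in step S203 can be sketched in plain Python as below. The text does not state split ratios or a splitting procedure, so the 70/15/15 fractions, the shuffle, and the function name `split_dataset` are assumptions made purely for illustration; in a Spark program the same idea would be applied to a distributed dataset rather than a Python list.

```python
import random


def split_dataset(rows, train_frac=0.7, valid_frac=0.15, seed=1234):
    """Shuffle rows and cut them into training, validation, and test lists.

    The fractions are illustrative assumptions; whatever remains after the
    training and validation cuts becomes the test set.
    """
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Fixing the seed keeps the three sets stable across runs, which makes later evaluation results comparable.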

Step S204: create a click-through rate prediction ensemble gradient boosting tree model in MLlib and set its parameters.

As an embodiment, the click-through rate prediction ensemble gradient boosting tree model combines multiple weak learners to obtain generalization performance that is significantly better than that of a single weak learner. Further, the criteria for choosing weak learners are that each weak learner must have a certain accuracy, so that its predictive ability is not too poor, and that the weak learners must be diverse.

Preferably, the parameters of the weak learners may be set as follows. The weak learner base_estimator of the click-through rate prediction ensemble gradient boosting tree model is set to gbtree. The number of trees n_estimators is set to 20; the gradient boosting tree model is quite robust to over-fitting. The learning rate learning_rate shrinks the step size of each step, so that an overly large step does not overshoot the extremum, and is set to 0.01. The maximum depth max_depth of each decision tree is set to 6. The node-splitting minimum min_sample_split of each decision tree uses the MLlib default value. The parameter subsample, which specifies the subset of the original training set drawn to train each base decision tree, is set to 0.7. The maximum number of nodes max_leaf_nodes in a decision tree uses the default value.
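For reference, the weak-learner settings listed above can be collected in a single configuration object. The parameter names and values are taken from the text; the dictionary layout, the use of `None` for "default value", and the helper `validate_params` are hypothetical and are not part of any Spark MLlib API.

```python
# Weak-learner settings as stated in the text. None stands for
# "use the library default" (an illustrative convention, not an MLlib one).
GBT_PARAMS = {
    "base_estimator": "gbtree",
    "n_estimators": 20,
    "learning_rate": 0.01,
    "max_depth": 6,
    "min_sample_split": None,
    "subsample": 0.7,
    "max_leaf_nodes": None,
}


def validate_params(params):
    """Basic sanity checks on the configuration (hypothetical helper)."""
    if params["n_estimators"] < 1:
        raise ValueError("n_estimators must be at least 1")
    if not 0.0 < params["learning_rate"] <= 1.0:
        raise ValueError("learning_rate must be in (0, 1]")
    if params["max_depth"] < 1:
        raise ValueError("max_depth must be at least 1")
    if not 0.0 < params["subsample"] <= 1.0:
        raise ValueError("subsample must be in (0, 1]")
    return True
```

Keeping the values in one place makes it easier to log the exact configuration alongside each trained model.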

Preferably, randomness can be introduced into the learning process in order to increase the diversity of the weak learners. Further, the click-through rate prediction ensemble gradient boosting tree model uses the following perturbation methods:

(1) Sample perturbation of the data set. This perturbation is very effective for "unstable weak learners", that is, learners that change significantly when the training samples change slightly. A specific implementation includes: the training set data of each weak learner is sampled at random from the source data, so that every weak learner has a different training set, which provides the perturbation.

(2) Perturbation of the input attributes of the weak learners. Training samples are described by a set of features; different data subsets can be produced from different combinations of these features, and the subsets are then used to train different weak learners. A specific implementation includes: different weak learners attend to different features and are good at different features, so the data attributes input to each weak learner are set to be different, which provides attribute perturbation. For example, weak learner 1 predicts well from the article history material score, while weak learner 2 predicts well from the material similarity feature score.

(3) Parameter perturbation during the iteration of the weak learners. That is, the parameters are updated in each round of iteration, which provides perturbation. For example, a small random perturbation is added to the parameters so as to produce weak learners that differ more from one another.
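Perturbations (1) and (2) can be sketched together in plain Python as below. This is a minimal illustration under stated assumptions, not the implementation used in the text: the function name `make_perturbed_sets`, the row-as-dictionary layout, the `label` key, and the feature names in the usage note are all hypothetical. The 0.7 row fraction mirrors the subsample value given earlier.

```python
import random


def make_perturbed_sets(rows, feature_names, n_learners, n_features,
                        fraction=0.7, seed=1234, label_key="label"):
    """Build one training subset per weak learner.

    Each subset draws a random sample of rows (sample perturbation) and a
    random subset of feature columns (attribute perturbation). Rows are
    dictionaries that hold the features plus a label.
    """
    rng = random.Random(seed)
    n_rows = max(1, round(len(rows) * fraction))
    result = []
    for _ in range(n_learners):
        # Attribute perturbation: pick which features this learner sees.
        features = sorted(rng.sample(feature_names, n_features))
        # Sample perturbation: pick which rows this learner sees.
        sampled = rng.sample(rows, n_rows)
        subset = [{key: row[key] for key in features + [label_key]}
                  for row in sampled]
        result.append((features, subset))
    return result
```

For example, calling `make_perturbed_sets(rows, ["history_score", "similar_score", "hot_score"], n_learners=3, n_features=2)` yields three subsets, each restricted to two of the three (hypothetical) feature columns and to 70% of the rows.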

Preferably, the click-through rate prediction ensemble gradient boosting tree model uses supervised machine learning, and supervision is realized by labeling the data in the training set of the model. Further, in the exposure data shown to the user, a record is labeled 1 if the user clicked it and 0 if the user did not. For example, if the user clicks an article, that article is labeled 1; otherwise it is labeled 0.
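The labeling rule can be illustrated with a few lines of Python. This is only a sketch: the function name `label_exposures` and the representation of the exposure and click logs as lists of article ids are assumptions made for the example.

```python
def label_exposures(exposed_ids, clicked_ids):
    """Label each exposed article 1 if the user clicked it, otherwise 0."""
    clicked = set(clicked_ids)  # set membership keeps the lookup cheap
    return [(article_id, 1 if article_id in clicked else 0)
            for article_id in exposed_ids]
```

The output preserves the exposure order, so each labeled pair can be joined back to the features recorded for that exposure.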

In a specific embodiment, when the click-through rate prediction ensemble gradient boosting tree model performs ensemble learning, weak learner 1 is first trained on the training set with initial weights. The weights of the training samples are then updated according to the learning error of weak learner 1, so that training samples on which weak learner 1 has a high learning error receive higher weights and these high-error points get more attention from weak learner 2. Weak learner 2 is then trained on the training set with the adjusted weights, and so on, until the number of weak learners reaches the preset number n. Finally, the n weak learners are integrated through a combination strategy to obtain the final strong learner. Integrating the n weak learners through a combination strategy means taking a weighted sum of the predictions of the weak learners: all gradient boosting trees are taken into account, the strengths of each tree are absorbed, and a powerful ensemble learner (the strong learner) is built, that is, the predicted scores of the weak learners are added up serially.
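The reweighting loop described above can be sketched in plain Python. The text gives no formulas, so this sketch fills them in with the classic AdaBoost update, using one-level decision trees (stumps) as weak learners; the weight formula `alpha`, the stump search, and all function names are assumptions for illustration. Note also that standard gradient boosting fits each new tree to the gradient of the loss rather than reweighting samples, so this follows the wording of the text rather than any particular library's algorithm.

```python
import math


def train_stump(X, y, w):
    """Find the (error, feature, threshold, polarity) stump with the lowest
    weighted error. Labels y are -1 or +1."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] >= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] >= thr else -pol


def boost(X, y, n_learners):
    """Train n_learners stumps, raising the weights of misclassified samples
    after each round. Returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n                        # initial weights
    ensemble = []
    for _ in range(n_learners):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # learner weight
        # Misclassified samples (y * prediction = -1) get larger weights.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble


def predict_score(ensemble, row):
    """Combination strategy: weighted sum of the weak learners' outputs."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
```

A positive score from `predict_score` corresponds to a predicted click and a negative score to a predicted non-click; mapping the 0/1 click labels to -1/+1 before training is left to the caller.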

Preferably, when the n weak learners are combined into the strong learner, the parameter lambda determines the regularization term, which can reduce over-fitting of the click-through rate prediction ensemble gradient boosting tree model, and is set to 0.1. The parameter loss specifies the loss function, which may be a logarithmic loss function or an exponential loss function, and is preferably set to the log-likelihood function. The random seed seed is set to 1234. The evaluation metric eval_metric for the validation data is AUC. When a node is split, the node is split only if the value of the loss function decreases after the split. The parameter gamma is related to the value of the loss function and uses the default value. The feature-selection criterion Impurity is set to information-gain entropy.

In addition, in an embodiment, the relevant parameters of the Spark program are set as follows:

driver-cores=2, executor-cores=6, num-executors=90, driver-memory=4g, executor-memory=16g. The training set data path train_path, the validation set data path eval_path, and the test set data path test_path are each set to an online HDFS path. model_path is the path where the model is saved; the model is saved as an object.

Step S205: train the click-through rate prediction ensemble gradient boosting tree model in Spark.

As an embodiment, using the training set, validation set, and test set obtained by the division, the click-through rate prediction ensemble gradient boosting tree model is trained on the training set, the model's effect is verified on the validation set, and predictions are made on the test set.

Step S206: evaluate the trained click-through rate prediction ensemble gradient boosting tree model in Spark MLlib.

In an embodiment, the evaluation can be made from two aspects. On the one hand, a loss function can be used, with evaluation metrics such as mean squared error, absolute error, exponential loss, and log-likelihood loss. On the other hand, the average loss over the training set data can be used as an approximation. Preferably, the evaluation metric used in the embodiment is AUC, which can be obtained from the log. AUC stands for Area Under Curve and is defined as the area under the ROC curve, where ROC stands for Receiver Operating Characteristic. AUC can be interpreted in two ways. The first interpretation is the area under the curve, which ranges from 0 to 1; the closer it is to 1, the better the model performs. The second interpretation is: given one positive sample and one negative sample, AUC is the probability that the model scores the positive sample higher than the negative sample. The larger the AUC the better, and it reflects the ranking ability of the model.
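The second interpretation of AUC translates directly into code. The sketch below compares every positive sample with every negative sample; the function name `auc` and the convention of counting ties as half a win are assumptions for illustration. It is quadratic in the number of samples and is meant to show the definition, not to replace the evaluation utilities of a distributed library.

```python
def auc(scores, labels):
    """Probability that a random positive sample is scored above a random
    negative sample. Labels are 1 (clicked) or 0 (not clicked)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half (a common convention)
    return wins / (len(pos) * len(neg))
```

A model that ranks every clicked article above every unclicked one reaches the maximum value of 1, which matches the first interpretation given above.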

It is also worth noting that online follow-up testing can be performed on the click-through rate prediction ensemble gradient boosting tree model. Preferably, the test results can be analyzed offline, or the effect data can be viewed in real time. Preferably, the offline effect can be analyzed statistically with Hive. Further, the tests can include black-box testing and white-box testing: in black-box testing, an analyst checks whether the commodity or article recommendation results for whitelisted users meet expectations; in white-box testing, the model side and the server side verify the code implementation process.

Step S207: save the click-through rate prediction ensemble gradient boosting tree model to an HDFS path.

In an embodiment, the trained click-through rate prediction ensemble gradient boosting tree model is saved to an HDFS path. When click-through rate prediction is needed, the saved HDFS path can be loaded in order to call the model.

As can be seen from the referable embodiments described above, the click-through rate prediction method provides an ensemble learning algorithm that discovers tacit knowledge in massive data and predicts the potential interest preferences of users. The feature data can be configured automatically, which is flexible and intelligent and, compared with the traditional manual analysis of feature coefficients, frees analysts from that work. Moreover, combining the strengths of multiple learners gives the click-through rate prediction ensemble gradient boosting tree model stronger generalization ability. Meanwhile, on a big-data scheduling platform and in the face of massive data, Spark is used to implement click-through rate prediction with low time complexity, ensuring that the daily recommendation results are reliable and timely. Further, against the background of the rapid development of artificial intelligence, a personalized recommendation algorithm is built that predicts click-through rates for users and improves the user experience.

In addition, the specific implementation of the click-through rate estimation method referred to in the embodiments of the present invention has been described in detail above, so the repeated content is not described again here.

Fig. 3 shows a click-through rate estimation device according to an embodiment of the present invention. As shown in Fig. 3, the click-through rate estimation device 300 includes a starting module 301, a data acquisition module 302 and a model forming module 303. The starting module 301 starts the Spark cluster in order to import the relevant classes of the machine learning library Mllib. The data acquisition module 302 then obtains and parses source data, and divides the source data into a training set, a validation set and a test set. Finally, the model forming module 303 creates the click-through rate estimation ensemble-learning gradient boosted tree model in Mllib and then trains the model in Spark.

Preferably, the starting module 301 starts the Spark cluster development environment, that is, opens the connection environment of the Spark cluster. The relevant classes of the machine learning library Mllib are then imported; that is, the machine learning packages in Spark need to be called, and these packages are all in Mllib.

As one embodiment, the data acquisition module 302 reads and parses the source data in the Spark program. Because the source data is stored in an HDFS directory of the cluster, the Spark program needs to be connected to the HDFS storage directory. Preferably, the data acquisition module 302 loads the source data stored in the HDFS directory, parses it, and divides the parsed source data into a training set, a validation set and a test set.
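
The parse-and-split step described above can be sketched in plain Python as follows. The record layout (a comma-separated line whose first field is a 0/1 click label followed by numeric features), the 60/20/20 split ratio and the function names parse_line and split_dataset are assumptions made for illustration; the source does not specify them. In a Spark program the same split is commonly done with RDD.randomSplit(weights, seed).

```python
import random


def parse_line(line):
    """Parse 'label,f1,f2,...' into (label, [features]).

    The comma-separated layout is an assumed format, not one given in the source.
    """
    parts = line.strip().split(",")
    return int(parts[0]), [float(value) for value in parts[1:]]


def split_dataset(rows, weights=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows reproducibly and cut them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * weights[0])
    n_val = int(len(shuffled) * weights[1])
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after the first two cuts
    return training, validation, test
```

A fixed seed keeps the three sets stable across runs, which makes later comparisons between model versions meaningful.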

Further, the click-through rate estimation ensemble-learning gradient boosted tree model formed by the model forming module 303 combines multiple weak learners to obtain generalization performance significantly superior to that of a single weak learner. Further, the model may use decision trees as weak learners, and the model, that is, the strong learner, can be expressed as an additive model of decision trees.
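
As one common way of writing such an additive model (the symbols below are illustrative assumptions and do not appear in the source), let $T_k$ denote the $k$-th decision tree, $\alpha_k$ its weight in the combination, and $n$ the number of weak learners. The strong learner $F_n$ is then

$$F_n(x) = \sum_{k=1}^{n} \alpha_k \, T_k(x)$$

Setting every $\alpha_k$ to 1 gives a plain sum of trees; other weightings correspond to other combination strategies.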

Preferably, when the model forming module 303 creates the click-through rate estimation ensemble-learning gradient boosted tree model, it may first train weak learner 1 on the training set using initial weights and calculate the learning error of weak learner 1. The weights of the training samples are then updated according to the learning error of weak learner 1, and weak learner 2 is trained on the re-weighted training set. This training is repeated until the number of weak learners reaches the preset number n. Finally, the n weak learners are integrated through a combination strategy to obtain the strong learner.
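
The loop described above (train with initial weights, compute the learning error, re-weight the samples, repeat n times, then combine) can be sketched as follows. The source gives no formulas, so this sketch swaps in the AdaBoost weight update and uses one-level decision trees (stumps) as weak learners; both choices, the -1/+1 label encoding, and the names train_stump, stump_predict, boost and strong_predict are assumptions for illustration rather than the patented implementation.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best one-level tree."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # Predict `sign` when the feature is <= thr, otherwise `-sign`.
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign


def boost(X, y, n):
    """Train n weak learners, re-weighting samples after each; labels are -1 or +1."""
    w = [1.0 / len(X)] * len(X)  # initial weights: uniform
    learners = []
    for _ in range(n):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)  # learner weight from its error
        learners.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest, then normalize.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return learners


def strong_predict(learners, row):
    """Combine the weak learners by a weighted vote."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in learners)
    return 1 if score >= 0 else -1
```

The weighted vote in strong_predict plays the role of the combination strategy; a plain sum or an average of tree outputs would be alternative choices.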

As another embodiment, after the model forming module 303 trains the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the trained model needs to be evaluated in Spark Mllib. Further, the model forming module 303 may perform follow-up online testing on the model.
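
The source does not state which metric is used for this evaluation. As one hedged illustration, the area under the ROC curve (AUC), a metric often used for click-through rate models, can be computed from 0/1 click labels and predicted scores as below; the function name auc and the choice of metric are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve from 0/1 labels and predicted scores.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of samples and is meant for small validation sets; a rank-based computation would be the usual choice at scale.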

In addition, the model forming module 303 may also store the trained click-through rate estimation ensemble-learning gradient boosted tree model under an HDFS path. When click-through rate estimation is needed, the saved HDFS path can be loaded in order to invoke the model.

It should be noted that the specific implementation of the click-through rate estimation device of the present invention has been described in detail in the click-through rate estimation method above, so the repeated content is not described again here.

Fig. 4 shows an exemplary system architecture 400 to which the click-through rate estimation method or the click-through rate estimation device of the embodiments of the present invention can be applied.

As shown in Fig. 4, the system architecture 400 may include terminal devices 401, 402 and 403, a network 404 and a server 405. The network 404 is the medium that provides communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various connection types, such as wired or wireless communication links or fiber optic cables.

A user can use the terminal devices 401, 402, 403 to interact with the server 405 over the network 404 in order to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 401, 402, 403, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (examples only).

The terminal devices 401, 402, 403 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.

The server 405 may be a server that provides various services, for example a back-end management server that supports shopping websites browsed by users on the terminal devices 401, 402, 403 (example only). The back-end management server may analyze and otherwise process received data such as information query requests, and feed the processing result (for example, target push information or product information; examples only) back to the terminal device.

It should be noted that the click-through rate estimation method provided by the embodiments of the present invention is generally performed by the server 405; correspondingly, the click-through rate estimation device is generally provided in the server 405.

It should be understood that the numbers of terminal devices, networks and servers in Fig. 4 are merely illustrative. There may be any number of terminal devices, networks and servers, as required by the implementation.

Referring now to Fig. 5, it shows a schematic structural diagram of a computer system 500 suitable for implementing a terminal device of an embodiment of the present invention. The terminal device shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in Fig. 5, the computer system 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage portion 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the system 500. The CPU 501, the ROM 502 and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse and the like; an output portion 507 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a speaker; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card or a modem. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage portion 508 as needed.

In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 509, and/or installed from the removable medium 511. When the computer program is executed by the central processing unit (CPU) 501, the above functions defined in the system of the present invention are performed.

It should be noted that the computer-readable medium described in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus or device. In the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that medium can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, electric wire, optical cable, RF and the like, or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and any combination of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, this may be described as: a processor includes a starting module, a data acquisition module and a model forming module, where the names of these modules do not, under certain circumstances, constitute a limitation on the modules themselves.

In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: start a Spark cluster to import the relevant classes of the machine learning library Mllib; obtain and parse source data, and then divide the source data into a training set, a validation set and a test set; and create a click-through rate estimation ensemble-learning gradient boosted tree model in Mllib, and then train the model in Spark.

According to the technical solution of the embodiments of the present invention, because the technical means of forming the click-through rate estimation ensemble-learning gradient boosted tree model in a Spark environment is adopted, the technical problem that an analyst traditionally has to provide a click-through rate estimation formula based on business experience is overcome, and the technical effect of making the entire online business largely intelligent is achieved.

The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions may occur. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims

1. A click-through rate estimation method, characterized by comprising:

starting a Spark cluster to import the relevant classes of the machine learning library Mllib;

obtaining and parsing source data, and then dividing the source data into a training set, a validation set and a test set;

creating a click-through rate estimation ensemble-learning gradient boosted tree model in Mllib, and then training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark.
2. The method according to claim 1, characterized in that obtaining and parsing the source data comprises:

loading, by Spark, the source data stored in an HDFS directory, and then parsing the source data.
3. The method according to claim 1, characterized in that the click-through rate estimation ensemble-learning gradient boosted tree model comprises a strong learner formed by combining at least two weak learners, with decision trees used as the weak learners.
4. The method according to claim 3, characterized in that creating the click-through rate estimation ensemble-learning gradient boosted tree model comprises:

training weak learner 1 on the training set using initial weights, and calculating the learning error of weak learner 1;

updating the weights of the training samples according to the learning error of weak learner 1;

training weak learner 2 on the re-weighted training set, and repeating the training until the number of weak learners reaches a preset number n;

integrating the n weak learners through a combination strategy to obtain the strong learner.
5. The method according to claim 1, characterized in that, after training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the method comprises:

evaluating the trained click-through rate estimation ensemble-learning gradient boosted tree model in Spark Mllib.
6. The method according to any one of claims 1-5, characterized in that, after training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the method further comprises:

saving the click-through rate estimation ensemble-learning gradient boosted tree model to an HDFS path.
7. A click-through rate estimation device, characterized by comprising:

a starting module, configured to start a Spark cluster to import the relevant classes of the machine learning library Mllib;

a data acquisition module, configured to obtain and parse source data, and then divide the source data into a training set, a validation set and a test set;

a model forming module, configured to create a click-through rate estimation ensemble-learning gradient boosted tree model in Mllib, and then train the click-through rate estimation ensemble-learning gradient boosted tree model in Spark.
8. The device according to claim 7, characterized in that the data acquisition module obtaining and parsing the source data comprises:

loading, by Spark, the source data stored in an HDFS directory, and then parsing the source data.
9. The device according to claim 7, characterized in that the click-through rate estimation ensemble-learning gradient boosted tree model comprises a strong learner formed by combining at least two weak learners, with decision trees used as the weak learners.
10. The device according to claim 9, characterized in that the model forming module creating the click-through rate estimation ensemble-learning gradient boosted tree model comprises:

training weak learner 1 on the training set using initial weights, and calculating the learning error of weak learner 1;

updating the weights of the training samples according to the learning error of weak learner 1;

training weak learner 2 on the re-weighted training set, and repeating the training until the number of weak learners reaches a preset number n;

integrating the n weak learners through a combination strategy to obtain the strong learner.
11. The device according to claim 7, characterized in that, after training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the model forming module further performs:

evaluating the trained click-through rate estimation ensemble-learning gradient boosted tree model in Spark Mllib.
12. The device according to any one of claims 7-11, characterized in that, after training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the model forming module further performs:

saving the click-through rate estimation ensemble-learning gradient boosted tree model to an HDFS path.
13. An electronic device, characterized by comprising:

one or more processors;

a storage device, configured to store one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-6.
14. A computer-readable medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the method according to any one of claims 1-6 is implemented.