CN107622086A - Click-through rate prediction method and device - Google Patents
- Publication number
- CN107622086A; CN201710701071.8A
- Authority
- CN
- China
- Prior art keywords
- click-through rate
- model
- spark
- ensemble learning
- tree
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a click-through rate prediction method and device, and relates to the field of computer technology. One embodiment of the method includes: starting a Spark cluster and importing the relevant classes of the machine learning library MLlib; obtaining and parsing source data, then dividing the source data into a training set, a validation set and a test set; creating a click-through rate prediction ensemble-learning gradient boosted tree model in MLlib, then training that model in Spark. The embodiment enables intelligent click-through rate prediction.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a click-through rate prediction method and device.
Background
At present, with the development of the Internet, data volumes keep growing. Finding useful information in large and heterogeneous data, and helping users find content of interest on e-commerce websites, has become a new challenge. Take the personalized ranking of Discover articles as an example. The main logic of the prior art is as follows: on top of an underlying article-label model (in which articles are clustered under different labels), a data analyst uses each user's real-time and offline label preferences for articles to filter out an article material recall pool for that user (when Discover article material is ranked, article material is first pulled according to the user's label preferences, category preferences and the like, and all the material pulled in this way constitutes the recall pool). The analyst then provides a scoring formula based on business experience and ranks the article material with it.
In the course of implementing the present invention, the inventors found at least the following problems in the prior art: when specific business data (for example, articles or goods) is ranked in a personalized way, a recall pool must first be built for the user, and then, within the recall pool, an analyst provides a scoring formula based on business experience. As the business data changes, this formula often has to be revised accordingly; the analyst must invest considerable effort in analysis, and the server side must also update its code frequently. In short, the prior art consumes relatively many resources and is not an artificial-intelligence recommendation algorithm.
Summary of the invention
In view of this, embodiments of the present invention provide a click-through rate prediction method and device that enable intelligent click-through rate prediction.
To achieve the above object, according to one aspect of the embodiments of the present invention, a click-through rate prediction method is provided, including: starting a Spark cluster and importing the relevant classes of the machine learning library MLlib; obtaining and parsing source data, then dividing the source data into a training set, a validation set and a test set; creating a click-through rate prediction ensemble-learning gradient boosted tree model in MLlib, then training that model in Spark.
Optionally, obtaining and parsing the source data includes: Spark loads the source data stored in an HDFS directory and then parses the source data.
Optionally, the click-through rate prediction ensemble-learning gradient boosted tree model includes a strong learner formed by combining at least two weak learners, with decision trees as the weak learners.
Optionally, creating the click-through rate prediction ensemble-learning gradient boosted tree model includes: training weak learner 1 on the training set with initial weights, and calculating the learning error of weak learner 1; updating the weights of the training samples according to the learning error of weak learner 1; training weak learner 2 on the training set with the adjusted weights, and repeating the training until the number of weak learners reaches a preset number n; and integrating the n weak learners through an aggregation strategy to obtain the strong learner.
Optionally, after the click-through rate prediction ensemble-learning gradient boosted tree model is trained in Spark, the method includes: evaluating the trained model in Spark MLlib.
Optionally, after the click-through rate prediction ensemble-learning gradient boosted tree model is trained in Spark, the method further includes: saving the model under an HDFS path.
In addition, according to one aspect of the embodiments of the present invention, a click-through rate prediction device is provided, including: a starting module, configured to start a Spark cluster and import the relevant classes of the machine learning library MLlib; a data acquisition module, configured to obtain and parse source data, then divide the source data into a training set, a validation set and a test set; and a model forming module, configured to create a click-through rate prediction ensemble-learning gradient boosted tree model in MLlib, then train that model in Spark.
Optionally, the data acquisition module obtaining and parsing the source data includes: Spark loads the source data stored in an HDFS directory and then parses the source data.
Optionally, the click-through rate prediction ensemble-learning gradient boosted tree model includes a strong learner formed by combining at least two weak learners, with decision trees as the weak learners.
Optionally, the model forming module creating the click-through rate prediction ensemble-learning gradient boosted tree model includes: training weak learner 1 on the training set with initial weights, and calculating the learning error of weak learner 1; updating the weights of the training samples according to the learning error of weak learner 1; training weak learner 2 on the training set with the adjusted weights, and repeating the training until the number of weak learners reaches a preset number n; and integrating the n weak learners through an aggregation strategy to obtain the strong learner.
Optionally, after the model forming module trains the click-through rate prediction ensemble-learning gradient boosted tree model in Spark, the following is included: evaluating the trained model in Spark MLlib.
Optionally, after the model forming module trains the click-through rate prediction ensemble-learning gradient boosted tree model in Spark, the following is further included: saving the model under an HDFS path.
According to another aspect of the embodiments of the present invention, an electronic device is also provided, including:
one or more processors;
a storage device, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any of the above embodiments.
According to another aspect of the embodiments of the present invention, a computer-readable medium is also provided, on which a computer program is stored; when the program is executed by a processor, the method described in any of the above embodiments is implemented.
One embodiment of the above invention has the following advantages or beneficial effects: because the technical means of forming a click-through rate prediction ensemble-learning gradient boosted tree model in a Spark environment is adopted, the technical problem that an analyst traditionally has to provide a click-through rate prediction formula based on business experience is overcome, and the technical effect that the whole online business becomes essentially intelligent is achieved.
Further effects of the above non-conventional optional modes are described below in conjunction with the specific embodiments.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the main flow of a click-through rate prediction method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main flow of a click-through rate prediction method according to a referenceable embodiment of the present invention;
Fig. 3 is a schematic diagram of the main modules of a click-through rate prediction device according to an embodiment of the present invention;
Fig. 4 is an exemplary system architecture diagram to which embodiments of the present invention can be applied;
Fig. 5 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 shows a click-through rate prediction method according to an embodiment of the present invention. As shown in Fig. 1, the click-through rate prediction method includes:
Step S101: start a Spark cluster and import the relevant classes of the machine learning library MLlib.
Spark is a general-purpose cluster computing system for processing big data workflows. Spark is superior to MapReduce in speed, ease of use and analytical capability. Spark also provides a unified runtime interface for connecting to various big data storage sources, such as HDFS, as well as a large number of top-level libraries for different big data computing tasks, such as the machine learning library MLlib.
In an embodiment, the Spark cluster development environment is started, that is, a connection to the Spark cluster is opened. The relevant classes of the machine learning library MLlib are then imported: the machine learning packages in Spark need to be called, and these packages are all located in MLlib.
Step S102: obtain and parse source data, then divide the source data into a training set, a validation set and a test set.
As an embodiment, the source data is read and parsed in the Spark program. The source data is stored in a directory of the cluster's HDFS (the Hadoop Distributed File System, a distributed file system designed to run on commodity hardware), so the Spark program needs to be connected to the HDFS storage directory. Preferably, the source data stored in the HDFS directory is loaded and parsed, and the parsed source data is divided into a training set, a validation set and a test set.
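The division into training, validation and test sets can be illustrated with a minimal plain-Python sketch. The 80/10/10 ratio, the seed and the function name `split_dataset` are assumptions made for illustration; the text does not specify them. In a Spark program the same step would more likely rely on the `randomSplit` method of an RDD or DataFrame.

```python
import random

def split_dataset(records, ratios=(0.8, 0.1, 0.1), seed=1234):
    """Shuffle records and cut them into training, validation and test sets.

    The ratios and the seed are illustrative assumptions, not values from the text.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test

train_set, val_set, test_set = split_dataset(range(100))
```

Every record ends up in exactly one of the three sets, so the validation and test sets are never seen during training.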
In addition, the server-side exposure log is parsed and persisted. That is, the server records, from first to last, the ranked list of articles shown to a user; this record is called the exposure log. The log is originally in JSON format and, after being parsed by data developers, is stored in Hive tables.
Step S103: create a click-through rate prediction ensemble-learning gradient boosted tree model in MLlib, then train that model in Spark.
As an embodiment, the click-through rate prediction ensemble-learning gradient boosted tree model can combine multiple individual learners (an individual learner is called a weak learner in the description below, and the combination of multiple weak learners is called a strong learner) to obtain generalization performance significantly better than that of a single individual learner. Further, the criterion for choosing individual learners is that each must have a certain accuracy, so its predictive ability cannot be too poor, while the individual learners must also be diverse.
Further, the click-through rate prediction ensemble-learning gradient boosted tree model can use decision trees as weak learners, and the model, that is, the strong learner, can be expressed as an additive model of decision trees.
Preferably, when the click-through rate prediction ensemble-learning gradient boosted tree model is created, weak learner 1 can first be trained on the training set with initial weights, and the learning error of weak learner 1 is calculated. The weights of the training samples are then updated according to the learning error of weak learner 1, and weak learner 2 is trained on the training set with the adjusted weights. The training is repeated until the number of weak learners reaches a preset number n. Finally, the n weak learners are integrated through an aggregation strategy to obtain the strong learner.
As another embodiment, after the click-through rate prediction ensemble-learning gradient boosted tree model is trained in Spark, the trained model needs to be evaluated in Spark MLlib. Further, online follow-up testing can be carried out on the model.
Furthermore, the trained click-through rate prediction ensemble-learning gradient boosted tree model can be saved under an HDFS path. When click-through rate prediction is needed, the saved HDFS path can be loaded to call the model.
From the embodiments described above, it can be seen that, compared with the traditional approach in which an analyst provides a click-through rate prediction formula based on business experience, the click-through rate prediction method of the present invention makes the whole online business essentially intelligent and saves the time analysts spend analyzing feature coefficients. Secondly, the input features can be configured automatically, and the algorithm is more comprehensive in mining latent cross features, so that the predicted ranking makes fuller use of the effective information. Meanwhile, against the background of the rapid development of Internet artificial intelligence and big data, the click-through rate prediction method is of great significance in promoting the intelligent construction of personalized recommendation platforms.
Fig. 2 is a schematic diagram of the main flow of a click-through rate prediction method according to a referenceable embodiment of the present invention. The click-through rate prediction method may include:
Step S201: start a Spark cluster and import the relevant classes of the machine learning library MLlib.
In an embodiment, the Spark cluster development environment is started, that is, a connection to the Spark cluster is opened. The relevant classes of the machine learning library MLlib are then imported: the machine learning packages in Spark need to be called, and these packages are all located in MLlib.
Step S202: load the source data stored in the HDFS directory, then parse the source data.
In an embodiment, the source data is read and parsed in the Spark program. The source data is stored in a directory of the cluster's HDFS (the Hadoop Distributed File System, a distributed file system designed to run on commodity hardware), so the Spark program needs to be connected to the HDFS storage directory.
Further, wide data tables are stored in the HDFS directory (a wide table is a database table that joins together the metrics, dimensions and attributes related to a business subject). For example, if the source data stored in the HDFS directory is article channel data, the specific wide tables may include: a user attribute wide table, a category dimension wide table, an article material dimension wide table and a user behavior preference dimension wide table.
Further, the user attribute wide table includes:
- the user's gender;
- the user's usual shipping province;
- the user's usual shipping city;
- whether the user has children;
- whether the user is a PLUS member (PLUS membership is one of the user-level tiers: a PLUS member receives 5 free-shipping coupons per month and also gets discounts on some goods);
- the user's purchasing power level;
- the user's browsing loyalty to goods (after browsing an item, the user browses similar goods in the same category again some time later, which shows that the click was not accidental);
- the user's order loyalty to goods (after buying an item, the user buys it again some time later, which shows that the user is loyal);
- the user's spending power;
- the user's age;
- the user's value score (the value score judges, from dimensions such as the orders and amounts a user can bring, whether the user is one who browses often and is willing to place orders).
The category dimension wide table includes: first-level category id, second-level category id, third-level category id, third-level category quality score, third-level category customer unit price and third-level category shopping cycle.
It should be noted that goods have a first-level category, a second-level category and a third-level category. For example, "large household appliances" is a first-level category, "Haier" is a second-level category and "Haier refrigerators" is a third-level category; each third-level category has a numeric ID corresponding to its name. Further, the "third-level category customer unit price" is the price obtained by deduplicating all goods under the third-level category and dividing the total amount by the third-level category count. In addition, the "shopping cycle" is the interval between a user buying an item and buying that item again.
The article material dimension wide table includes:
- article goods material score (if goods are recommended in an article shown to the user, the quality score of those goods);
- article history material score (a composite score for an article recommended to users, based on its performance over a past period);
- article publisher quality score (an overall evaluation score for the author of the article);
- the label of the article material (the label the article belongs to after articles are clustered under labels);
- the gender of the goods in the article material (the probability that the goods recommended in the article are male items and the probability that they are female items);
- whether the goods in the article material are new products (whether the goods recommended in the article material were listed only recently);
- whether the goods in the article material belong to a premium brand (whether the brand of the goods recommended in the article is in JD.com's premium brand library);
- whether the goods in the article material are PLUS goods (whether the goods recommended in the article are designated as discounted for PLUS members);
- material similarity feature score (whether the article currently shown to the user is similar to articles the user clicked on recently; 1 if so, otherwise 0);
- popular article material score (whether the article currently shown to the user is a popular article; 1 if so, otherwise 0);
- popular label (whether the label of the article shown to the user is the label of a currently popular article; 1 if so, otherwise 0).
The user behavior preference dimension wide table includes the user's real-time label preference for articles, the user's real-time extended label preference for articles, the user's offline label preference for articles and the user's offline extended label preference for articles. An extended label covers the case where the user has not interacted with an article, but the user's article label preference is inferred from the preferences of associated users. An offline label is a label preference derived from the user's direct behavior; the program that computes it runs once per fixed period, which is why it is called "offline".
In addition, the server-side exposure log is parsed and persisted. That is, the server records, from first to last, the ranked list of articles shown to a user; this record is called the exposure log. The log is originally in JSON format and, after being parsed by data developers, is stored in Hive tables.
In a preferred embodiment, the required features can be selected from the wide data tables and loaded according to actual business needs. The features selected are those related to the target variable, and the selected features are loaded into the training set data. For example, only the following features may be loaded: article goods material score, history material score, publisher quality score, material score, label of the article material, user gender, material similarity feature score, popular article material score, popular label feature, real-time label preference score, real-time extended label preference score, offline label preference score, offline extended label preference score, real-time material score and total article material score.
Step S203: divide the parsed source data into a training set, a validation set and a test set.
Step S204: create a click-through rate prediction ensemble-learning gradient boosted tree model in MLlib and set its parameters.
As an embodiment, the click-through rate prediction ensemble-learning gradient boosted tree model can combine multiple weak learners to obtain generalization performance significantly better than that of a single weak learner. Further, the criterion for choosing weak learners is that each must have a certain accuracy, so its predictive ability cannot be too poor, while the weak learners must also be diverse.
Further, the click-through rate prediction ensemble-learning gradient boosted tree model can use decision trees as weak learners, and the model, that is, the strong learner, can be expressed as an additive model of decision trees.
Preferably, the parameters of the weak learners can be set as follows. The weak learner base_estimator of the click-through rate prediction ensemble-learning gradient boosted tree model is set to gbtree. The number of gradient boosted trees, n_estimators, is set to 20; gradient boosted tree models are quite robust to over-fitting. The learning rate learning_rate shrinks the step size of each step, to prevent an overly large step from jumping over the extremum, and is set to 0.01. The maximum depth of each decision tree, max_depth, is set to 6. The minimum for splitting a node of each decision tree, min_sample_split, uses the MLlib default. The parameter subsample, which specifies the subset drawn from the original training set to train each base decision tree, is set to 0.7. The maximum number of nodes in a decision tree, max_leaf_nodes, uses the default value.
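The parameter settings listed above can be collected into a single configuration object, as in the sketch below. The parameter names are the ones given in the text; they resemble scikit-learn or XGBoost naming rather than MLlib's own argument names, so the helper `to_mllib_kwargs` and its mapping onto `numIterations`, `learningRate` and `maxDepth` (argument names accepted by PySpark's `GradientBoostedTrees.trainClassifier`) are assumptions for illustration, not something the text states.

```python
# Parameter names and values as listed in the text.
# None is a placeholder meaning "use the library default".
weak_learner_params = {
    "base_estimator": "gbtree",
    "n_estimators": 20,        # number of boosted trees
    "learning_rate": 0.01,     # shrinks each boosting step
    "max_depth": 6,            # maximum depth of each decision tree
    "min_sample_split": None,  # library default
    "subsample": 0.7,          # subset of the training set used per tree
    "max_leaf_nodes": None,    # library default
}

def to_mllib_kwargs(params):
    """Map the listed names onto MLlib-style argument names (assumed mapping)."""
    return {
        "numIterations": params["n_estimators"],
        "learningRate": params["learning_rate"],
        "maxDepth": params["max_depth"],
    }

mllib_kwargs = to_mllib_kwargs(weak_learner_params)
```

Keeping the settings in one dictionary makes it easy to log them next to each trained model and to change them without touching the training code.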
Preferably, to increase the diversity of the weak learners, randomness can be introduced into the learning process. Further, the click-through rate prediction ensemble-learning gradient boosted tree model uses the following perturbation methods:
(1) Sample perturbation of the data set. This perturbation is very effective for "unstable weak learners", that is, learners that change significantly when the training samples change slightly. A specific implementation includes: the training set data of each weak learner can be randomly sampled from the source data, so that each weak learner receives different training set data and perturbation is introduced.
(2) Perturbation of the input attributes of the weak learners. Training samples are described by a group of features; different data subsets can be produced from different combinations of these features and then used to train different weak learners. A specific implementation includes: different weak learners focus on different features and are good at different features, so the data attributes input to each weak learner are set to be different, which provides attribute perturbation. For example, weak learner 1 predicts well from the article history material score, while weak learner 2 predicts well from the material similarity feature score.
(3) Parameter perturbation during the iteration of the weak learners, that is, the parameters can be updated in each round of iteration, which introduces perturbation. For example, a small random perturbation is added to the parameters to produce weak learners that differ more from one another.
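Sample perturbation (1) and attribute perturbation (2) can be sketched in plain Python as follows. The function name `perturbed_training_sets`, the feature names and the choice of two features per learner are hypothetical; only the subsample fraction 0.7 and the seed 1234 echo values mentioned elsewhere in the text.

```python
import random

def perturbed_training_sets(rows, feature_names, n_learners,
                            subsample=0.7, n_features=2, seed=1234):
    """Build one (features, rows) training subset per weak learner."""
    rng = random.Random(seed)
    n_rows = max(1, round(len(rows) * subsample))
    training_sets = []
    for _ in range(n_learners):
        # Sample perturbation: each learner sees a different random row subset.
        sampled_rows = rng.sample(rows, n_rows)
        # Attribute perturbation: each learner sees a different feature subset.
        features = sorted(rng.sample(feature_names, n_features))
        subset = [{name: row[name] for name in features} for row in sampled_rows]
        training_sets.append((features, subset))
    return training_sets

# Hypothetical feature names, loosely based on the wide tables described above.
feature_names = ["history_score", "similarity_score", "hot_score", "publisher_score"]
rows = [{name: i for name in feature_names} for i in range(10)]
training_sets = perturbed_training_sets(rows, feature_names, n_learners=3)
```

Each weak learner would then be trained only on its own subset, which is what gives the ensemble its diversity.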
Preferably, the click-through rate prediction ensemble-learning gradient boosted tree model uses supervised machine learning, and supervision can be realized by labelling the data in the model's training set. Further, within the exposure data shown to a user, a data item is labelled 1 if the user clicked on it and 0 if the user did not. For example, an article that the user clicked on is labelled 1; otherwise it is labelled 0.
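The labelling rule can be sketched as follows, assuming the exposure log is available as JSON lines. The field names `user_id` and `articles`, and the set of clicked (user, article) pairs, are hypothetical; the text only says that the log is in JSON format and that clicked items are labelled 1 and unclicked items 0.

```python
import json

def label_exposures(exposure_lines, clicked):
    """Return (user_id, article_id, label) for every exposed article.

    The label is 1 if the user clicked the exposed article, otherwise 0.
    """
    samples = []
    for line in exposure_lines:
        record = json.loads(line)
        user_id = record["user_id"]
        for article_id in record["articles"]:   # articles in displayed order
            label = 1 if (user_id, article_id) in clicked else 0
            samples.append((user_id, article_id, label))
    return samples

exposure_lines = ['{"user_id": "u1", "articles": ["a1", "a2", "a3"]}']
clicked = {("u1", "a2")}
samples = label_exposures(exposure_lines, clicked)
```

Articles that were shown but not clicked become negative samples, so the exposure log is what makes negative labels available at all.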
In a specific embodiment, when the click-through rate prediction ensemble-learning gradient boosted tree model performs ensemble learning, weak learner 1 is first trained on the training set with initial weights, and the weights of the training samples are updated according to the learning error of weak learner 1, so that the training samples on which weak learner 1 has a high learning error receive higher weights and are given more attention by the subsequent weak learner 2. Weak learner 2 is then trained on the training set with the adjusted weights. This is repeated until the number of weak learners reaches the preset number n, and finally the n weak learners are integrated through an aggregation strategy to obtain the final strong learner. Here, integrating the n weak learners through an aggregation strategy means taking a weighted sum of the weak learners' predictions, that is, considering all the gradient boosted trees together and absorbing the strengths of each to build a powerful ensemble learner (the strong learner); in other words, the prediction scores of the weak learners are added up serially.
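The re-weighting loop and the weighted-sum aggregation can be illustrated with a small AdaBoost-style sketch that uses one-dimensional decision stumps as weak learners. This is an assumption-laden illustration of the procedure as described, not the MLlib implementation: gradient boosted trees in libraries such as MLlib typically fit each new tree to the gradient of the loss rather than re-weighting samples. The toy data, the stump learner and all function names are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: vote `polarity` at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def boost(xs, ys, n_learners):
    """Train n_learners stumps, re-weighting the samples after each one."""
    weights = [1.0 / len(xs)] * len(xs)           # initial weights
    ensemble = []
    for _ in range(n_learners):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this learner's vote
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Aggregate the weak learners by a weighted sum of their votes."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]      # no single stump separates these labels
model = boost(xs, ys, n_learners=3)
predictions = [predict(model, x) for x in xs]
```

No single stump fits this toy data, but the weighted sum of three stumps does, which is the sense in which the weak learners combine into a strong learner.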
Preferably, when the n weak learners are combined into the strong learner, the parameter lambda determines the regularization part, which can reduce over-fitting of the click-through rate prediction ensemble-learning gradient boosted tree model, and is set to 0.1. The parameter loss specifies the loss function, which can be the logarithmic loss function or the exponential loss function, and is preferably set to the log-likelihood function. The random number seed, seed, is set to 1234. The evaluation metric for validation data, eval_metric, uses AUC. When a node is split, the node is split only if the value of the loss function decreases after the split. The value of the gamma parameter is related to the loss function value, and the default value is used. The feature selection criterion Impurity is set to information gain (entropy).
In addition, in an embodiment, the relevant parameters of the Spark program are set as follows:
driver-cores=2, executor-cores=6, num-executors=90, driver-memory=4g, executor-memory=16g. The training set data path train_path is set to an online HDFS path, the validation set data path eval_path is set to an online HDFS path, and the test set data path test_path is set to an online HDFS path. The model-saving path model_path saves the model as an object.
Step S205: train the click-through rate prediction ensemble-learning gradient boosted tree model in Spark.
As an embodiment, based on the training set, validation set and test set already divided, the click-through rate prediction ensemble-learning gradient boosted tree model can be trained on the training set, its effect verified on the validation set, and predictions made on the test set.
Step S206: evaluate the trained click-through rate prediction ensemble-learning gradient boosted tree model in Spark MLlib.
In an embodiment, the evaluation can be carried out from two aspects. On the one hand, through the loss function: evaluation metrics such as mean squared error, absolute error, exponential loss and log-likelihood loss can be used. On the other hand, the average loss over the training set data is used as an approximation. Preferably, the evaluation metric used in the embodiment is AUC, which can be obtained from the log. The full name of AUC is Area Under Curve, defined as the area under the ROC curve; the full name of ROC is Receiver Operating Characteristic. AUC has two interpretations. The first is the area under the curve, with a value range of 0 to 1; the closer the value is to 1, the better the model. The second is: given a positive sample and a negative sample, the probability that the model scores the positive sample higher than the negative sample. The larger the AUC the better; it reflects the ranking ability of the model.
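The second interpretation of AUC translates directly into code: count, over all positive/negative pairs, how often the positive sample receives the higher score. The sketch below is a plain-Python illustration with made-up labels and scores; counting ties as half a win is a common convention and an assumption here, and in a Spark program the metric would normally come from the library's own evaluator.

```python
def auc(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half (assumed convention)
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 0]            # 1 = clicked, 0 = not clicked
scores = [0.9, 0.8, 0.4, 0.1]    # made-up predicted click-through rates
result = auc(labels, scores)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. The pairwise loop is quadratic in the number of samples and is meant only to make the definition concrete.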
In addition, it is worth noting that online follow-up testing can also be carried out on the click-through rate prediction ensemble-learning gradient boosted tree model. Preferably, the test results can be analyzed offline, and the effect data can also be viewed in real time. Preferably, the offline effect can be analyzed statistically with Hive. Further, the testing can use black-box testing and white-box testing: in black-box testing, analysts check on whitelisted users whether the goods or article recommendation results meet expectations; in white-box testing, the model side and the server side verify the code implementation process.
Step S207, preserve clicking rate and estimate under integrated study gradient lifting tree-model to HDFS paths.
In an embodiment, the trained click-through rate estimation ensemble-learning gradient boosted tree model is
saved under an HDFS path. When click-through rate estimation is needed, the saved HDFS path can be loaded in
order to call the click-through rate estimation ensemble-learning gradient boosted tree model.
As can be seen from the various embodiments described above, the click-through rate estimation method uses an
ensemble learning algorithm that can discover tacit knowledge in massive data and predict the potential
interest preferences of users. Moreover, the feature data can be configured automatically, which is flexible
and intelligent and, compared with the traditional approach in which analysts manually analyze feature
coefficients, frees the analysts from that work. Furthermore, combining the advantages of multiple learners
gives the click-through rate estimation ensemble-learning gradient boosted tree model stronger generalization
ability. Meanwhile, on a big data scheduling platform and in the face of massive data, Spark is used to
implement click-through rate estimation with low time complexity, ensuring that the daily recommendation
results are reliable and timely. Further, against the background of the rapid development of artificial
intelligence, a personalized recommendation algorithm is built and click-through rate estimation is performed
for users, improving the user experience.
In addition, the specific implementation of the click-through rate estimation method referred to in the
embodiments of the present invention has been described in detail in the click-through rate estimation method
above, so the repeated content is not described again here.
Fig. 3 shows a click-through rate estimation device according to an embodiment of the present invention. As
shown in Fig. 3, the click-through rate estimation device 300 includes a starting module 301, a data
acquisition module 302 and a model forming module 303. The starting module 301 starts the Spark cluster in
order to import the relevant classes of the machine learning library MLlib. The data acquisition module 302
then obtains and parses the source data, and divides the source data into a training set, a validation set and
a test set. Finally, the model forming module 303 creates the click-through rate estimation ensemble-learning
gradient boosted tree model in MLlib and then trains the model in Spark.
Preferably, the starting module 301 starts the Spark cluster development environment, that is, opens the
connection environment of the Spark cluster. The relevant classes of the machine learning library MLlib are
then imported; that is, the machine learning packages in Spark need to be called, and these machine learning
packages are all in MLlib.
As an embodiment, the data acquisition module 302 reads and parses the source data in a Spark program. Because
the source data is stored in an HDFS directory of the cluster, the Spark program needs to be connected to the
HDFS storage directory. Preferably, the data acquisition module 302 loads the source data stored in the HDFS
directory, then parses the source data and divides the parsed source data into a training set, a validation
set and a test set.
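The parse-then-split step can be sketched in plain Python as follows. This is only an illustration, not the Spark program of the embodiment: the record layout ("label,f1,f2,..."), the 0.8/0.1/0.1 ratios, the seed and the function names are all assumptions, since the text does not specify them. In a Spark program, the RDD method `randomSplit` can play the same role.

```python
import random


def parse_line(line):
    """Parse one hypothetical 'label,f1,f2,...' line into (label, features)."""
    parts = line.strip().split(",")
    return int(parts[0]), [float(x) for x in parts[1:]]


def split_dataset(records, weights=(0.8, 0.1, 0.1), seed=42):
    """Randomly assign each record to the training, validation or test set."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    train_w, valid_w, _ = weights
    total = sum(weights)
    train, valid, test = [], [], []
    for record in records:
        r = rng.random() * total
        if r < train_w:
            train.append(record)
        elif r < train_w + valid_w:
            valid.append(record)
        else:
            test.append(record)
    return train, valid, test


# Hypothetical source lines standing in for data read from the HDFS directory.
lines = ["1,0.5,2.0", "0,0.1,3.5", "0,0.9,1.2", "1,0.7,0.4"]
parsed = [parse_line(line) for line in lines]
train_set, valid_set, test_set = split_dataset(parsed)
print(len(train_set), len(valid_set), len(test_set))
```

Because each record is assigned independently, the three sets only approximate the requested ratios; the sizes converge to them as the data grows.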
Further, the click-through rate estimation ensemble-learning gradient boosted tree model formed by the model
forming module 303 can combine multiple weak learners to obtain generalization performance significantly
better than that of a single weak learner. Further, the model can use decision trees as the weak learners, and
the click-through rate estimation ensemble-learning gradient boosted tree model, that is, the strong learner,
can be expressed as an additive model of decision trees.
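In commonly used notation, which is not taken from the original text, an additive model of M decision trees can be written as follows, where T(x; Θ_m) denotes the m-th decision tree with parameters Θ_m and α_m denotes its weight in the combination:

```latex
f_M(x) = \sum_{m=1}^{M} \alpha_m \, T(x; \Theta_m)
```

The symbols M, α_m and Θ_m are introduced here for illustration only; the text itself refers to the number of weak learners as n.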
Preferably, when the model forming module 303 creates the click-through rate estimation ensemble-learning
gradient boosted tree model, weak learner 1 can first be trained on the training set using the initial
weights, and the learning error of weak learner 1 is calculated. The weights of the training samples are then
updated according to the learning error of weak learner 1, and weak learner 2 is trained on the training set
with the adjusted weights. The training is repeated until the number of weak learners reaches the preset
number n. Finally, the n weak learners are integrated by an aggregation strategy to obtain the strong learner.
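The weight-update loop described above can be sketched in plain Python. The sketch makes several assumptions that are not in the text: it uses the AdaBoost weight formulas, single-threshold decision stumps stand in for decision trees, labels are +1/-1, and the aggregation strategy is a weighted vote. Gradient boosted trees as implemented in Spark MLlib instead fit each new tree to the gradient of the loss, so this is an illustration of the described reweighting procedure rather than of the MLlib algorithm.

```python
import math


def train_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] <= threshold else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, threshold, sign)
    return best


def stump_predict(stump, row):
    _, j, threshold, sign = stump
    return sign if row[j] <= threshold else -sign


def train_boosted(X, y, n_learners):
    """Train n_learners weak learners, reweighting the samples after each one."""
    w = [1.0 / len(X)] * len(X)                      # initial weights
    ensemble = []
    for _ in range(n_learners):
        stump = train_stump(X, y, w)                 # weak learner on current weights
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # learning error, clamped
        alpha = 0.5 * math.log((1 - err) / err)      # weight of this learner
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Aggregate the weak learners by a weighted vote (the strong learner)."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: one feature, labels +1 (click) and -1 (no click).
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = train_boosted(X, y, n_learners=3)
print([predict(model, row) for row in X])
```

No single stump can separate this toy data, but the weighted vote of three stumps classifies every sample correctly, which is the point of combining weak learners into a strong learner.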
As another embodiment, after training the click-through rate estimation ensemble-learning gradient boosted
tree model in Spark, the model forming module 303 needs to evaluate the trained model in Spark MLlib. Further,
the model forming module 303 can perform online follow-up testing on the click-through rate estimation
ensemble-learning gradient boosted tree model.
In addition, the model forming module 303 can also save the trained click-through rate estimation
ensemble-learning gradient boosted tree model under an HDFS path. When click-through rate estimation is
needed, the saved HDFS path can be loaded in order to call the click-through rate estimation
ensemble-learning gradient boosted tree model.
It should be noted that the specific implementation of the click-through rate estimation device of the present
invention has been described in detail in the click-through rate estimation method above, so the repeated
content is not described again here.
Fig. 4 shows an exemplary system architecture 400 to which the click-through rate estimation method or the
click-through rate estimation device of the embodiments of the present invention can be applied.
As shown in Fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404 and a
server 405. The network 404 is the medium that provides communication links between the terminal devices 401,
402, 403 and the server 405. The network 404 may include various connection types, such as wired or wireless
communication links, or fiber optic cables.
A user may use the terminal devices 401, 402, 403 to interact with the server 405 through the network 404, in
order to receive or send messages and the like. Various communication client applications may be installed on
the terminal devices 401, 402, 403, such as shopping applications, web browser applications, search
applications, instant messaging tools, mailbox clients and social platform software (examples only).
The terminal devices 401, 402, 403 may be various electronic devices that have a display screen and support
web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop
computers.
The server 405 may be a server that provides various services, for example a back-end management server
(example only) that supports shopping websites browsed by users with the terminal devices 401, 402, 403. The
back-end management server may analyze and otherwise process received data such as information query
requests, and feed the processing results (for example, targeted push information or product information;
examples only) back to the terminal devices.
It should be noted that the click-through rate estimation method provided by the embodiments of the present
invention is generally executed by the server 405; accordingly, the click-through rate estimation device is
generally provided in the server 405.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 4 are merely
illustrative. There may be any number of terminal devices, networks and servers, according to implementation
needs.
Referring now to Fig. 5, it shows a schematic structural diagram of a computer system 500 suitable for
implementing a terminal device of the embodiments of the present invention. The terminal device shown in
Fig. 5 is only an example, and should not impose any limitation on the functions and scope of use of the
embodiments of the present invention.
As shown in Fig. 5, the computer system 500 includes a central processing unit (CPU) 501, which can perform
various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a
program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores
various programs and data required for the operation of the system 500. The CPU 501, the ROM 502 and the RAM
503 are connected to one another by a bus 504. An input/output (I/O) interface 505 is also connected to the
bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a
mouse and the like; an output section 507 including a cathode ray tube (CRT), a liquid crystal display (LCD)
and the like, as well as a speaker and the like; a storage section 508 including a hard disk and the like; and
a communication section 509 including a network interface card such as a LAN card or a modem. The
communication section 509 performs communication processing via a network such as the Internet. A drive 510 is
also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical
disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 510 as needed, so that a
computer program read from it can be installed into the storage section 508 as needed.
In particular, according to the embodiments disclosed by the present invention, the processes described above
with reference to the flowcharts may be implemented as computer software programs. For example, the
embodiments disclosed by the present invention include a computer program product, which includes a computer
program carried on a computer-readable medium, the computer program containing program code for executing the
method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed
from a network through the communication section 509, and/or installed from the removable medium 511. When the
computer program is executed by the central processing unit (CPU) 501, the above-described functions defined
in the system of the present invention are performed.
It should be noted that the computer-readable medium shown in the present invention may be a computer-readable
signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable
storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic,
infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples
of computer-readable storage media may include, but are not limited to: an electrical connection with one or
more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an
erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc
read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination
of the above. In the present invention, a computer-readable storage medium may be any tangible medium that
contains or stores a program, which may be used by or in connection with an instruction execution system,
apparatus or device. In the present invention, a computer-readable signal medium may include a data signal
propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such
a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an
optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any
computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit
a program for use by or in connection with an instruction execution system, apparatus or device. Program code
contained on a computer-readable medium may be transmitted by any appropriate medium, including but not
limited to wireless, wire, optical cable, RF and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architectures, functions and operations of
possible implementations of systems, methods and computer program products according to various embodiments
of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a
program segment or a portion of code, which contains one or more executable instructions for implementing the
specified logical functions. It should also be noted that, in some alternative implementations, the functions
noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks
shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in
the reverse order, depending on the functions involved. It should also be noted that each block in the block
diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, may be implemented by
a dedicated hardware-based system that performs the specified functions or operations, or by a combination of
dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in
hardware. The described modules may also be provided in a processor, which may, for example, be described as:
a processor including a starting module, a data acquisition module and a model forming module. The names of
these modules do not, under certain circumstances, constitute a limitation on the modules themselves.
As another aspect, the present invention also provides a computer-readable medium. The computer-readable
medium may be included in the device described in the above embodiments, or it may exist separately without
being assembled into the device. The computer-readable medium carries one or more programs which, when
executed by the device, cause the device to: start a Spark cluster in order to import the relevant classes of
the machine learning library MLlib; obtain and parse source data, then divide the source data into a training
set, a validation set and a test set; and create a click-through rate estimation ensemble-learning gradient
boosted tree model in MLlib, then train the model in Spark.
According to the technical solution of the embodiments of the present invention, the technical means of
forming the click-through rate estimation ensemble-learning gradient boosted tree model in a Spark environment
is adopted, which overcomes the technical problem that, traditionally, analysts had to provide a click-through
rate estimation formula based on business experience, and thereby achieves the technical effect that the
entire online business is essentially intelligent.
The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in
the art should understand that various modifications, combinations, sub-combinations and substitutions may
occur depending on design requirements and other factors. Any modification, equivalent substitution,
improvement and the like made within the spirit and principles of the present invention shall be included
within the scope of protection of the present invention.
Claims (14)
- 1. A click-through rate estimation method, characterized by comprising: starting a Spark cluster in order to import the relevant classes of the machine learning library MLlib; obtaining and parsing source data, then dividing the source data into a training set, a validation set and a test set; and creating a click-through rate estimation ensemble-learning gradient boosted tree model in MLlib, then training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark.
- 2. The method according to claim 1, characterized in that obtaining and parsing the source data comprises: loading, by Spark, the source data stored in an HDFS directory, then parsing the source data.
- 3. The method according to claim 1, characterized in that the click-through rate estimation ensemble-learning gradient boosted tree model comprises a strong learner that uses decision trees as weak learners and is formed by combining at least two weak learners.
- 4. The method according to claim 3, characterized in that creating the click-through rate estimation ensemble-learning gradient boosted tree model comprises: training weak learner 1 on the training set using initial weights, in order to calculate the learning error of weak learner 1; updating the weights of the training samples according to the learning error of weak learner 1; training weak learner 2 on the training set with the adjusted weights, and repeating the training until the number of weak learners reaches a preset number n; and integrating the n weak learners by an aggregation strategy to obtain the strong learner.
- 5. The method according to claim 1, characterized in that, after training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the method comprises: evaluating the trained click-through rate estimation ensemble-learning gradient boosted tree model in Spark MLlib.
- 6. The method according to any one of claims 1-5, characterized in that, after training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the method further comprises: saving the click-through rate estimation ensemble-learning gradient boosted tree model to an HDFS path.
- 7. A click-through rate estimation device, characterized by comprising: a starting module for starting a Spark cluster in order to import the relevant classes of the machine learning library MLlib; a data acquisition module for obtaining and parsing source data, then dividing the source data into a training set, a validation set and a test set; and a model forming module for creating a click-through rate estimation ensemble-learning gradient boosted tree model in MLlib, then training the click-through rate estimation ensemble-learning gradient boosted tree model in Spark.
- 8. The device according to claim 7, characterized in that the data acquisition module obtaining and parsing the source data comprises: loading, by Spark, the source data stored in an HDFS directory, then parsing the source data.
- 9. The device according to claim 7, characterized in that the click-through rate estimation ensemble-learning gradient boosted tree model comprises a strong learner that uses decision trees as weak learners and is formed by combining at least two weak learners.
- 10. The device according to claim 9, characterized in that the model forming module creating the click-through rate estimation ensemble-learning gradient boosted tree model comprises: training weak learner 1 on the training set using initial weights, in order to calculate the learning error of weak learner 1; updating the weights of the training samples according to the learning error of weak learner 1; training weak learner 2 on the training set with the adjusted weights, and repeating the training until the number of weak learners reaches a preset number n; and integrating the n weak learners by an aggregation strategy to obtain the strong learner.
- 11. The device according to claim 7, characterized in that, after the model forming module trains the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the following is included: evaluating the trained click-through rate estimation ensemble-learning gradient boosted tree model in Spark MLlib.
- 12. The device according to any one of claims 7-11, characterized in that, after the model forming module trains the click-through rate estimation ensemble-learning gradient boosted tree model in Spark, the following is further included: saving the click-through rate estimation ensemble-learning gradient boosted tree model to an HDFS path.
- 13. An electronic device, characterized by comprising: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-6.
- 14. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710701071.8A CN107622086A (en) | 2017-08-16 | 2017-08-16 | A kind of clicking rate predictor method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107622086A true CN107622086A (en) | 2018-01-23 |
Family
ID=61088864
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710701071.8A Pending CN107622086A (en) | 2017-08-16 | 2017-08-16 | A kind of clicking rate predictor method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107622086A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104717124A (en) * | 2013-12-13 | 2015-06-17 | 腾讯科技(深圳)有限公司 | Friend recommendation method, device and server |
CN106056427A (en) * | 2016-05-25 | 2016-10-26 | 中南大学 | Spark-based big data hybrid model mobile recommending method |
CN106250461A (en) * | 2016-07-28 | 2016-12-21 | 北京北信源软件股份有限公司 | A kind of algorithm utilizing gradient lifting decision tree to carry out data mining based on Spark framework |
Non-Patent Citations (1)
Title |
---|
张兴: "基于Spark大数据平台的火电厂节能分析", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110309406A (en) * | 2018-03-12 | 2019-10-08 | 阿里巴巴集团控股有限公司 | Clicking rate predictor method, device, equipment and storage medium |
CN110309406B (en) * | 2018-03-12 | 2023-06-09 | 阿里巴巴集团控股有限公司 | Click rate estimation method, device, equipment and storage medium |
CN110322039B (en) * | 2018-03-29 | 2022-12-02 | 腾讯科技(深圳)有限公司 | Click rate estimation method, server and computer readable storage medium |
CN110322039A (en) * | 2018-03-29 | 2019-10-11 | 腾讯科技(深圳)有限公司 | A kind of clicking rate predictor method, server and computer readable storage medium |
CN110619447A (en) * | 2018-06-20 | 2019-12-27 | 广州虎牙信息科技有限公司 | Anchor evaluation method, device, equipment and storage medium |
CN110619447B (en) * | 2018-06-20 | 2023-03-24 | 广州虎牙信息科技有限公司 | Anchor evaluation method, device, equipment and storage medium |
WO2020047819A1 (en) * | 2018-09-07 | 2020-03-12 | 深圳大学 | Click rate prediction method, electronic apparatus and computer-readable storage medium |
CN109615060A (en) * | 2018-11-27 | 2019-04-12 | 深圳前海微众银行股份有限公司 | CTR predictor method, device and computer readable storage medium |
CN109615060B (en) * | 2018-11-27 | 2023-06-30 | 深圳前海微众银行股份有限公司 | CTR estimation method, CTR estimation device and computer-readable storage medium |
CN112306846A (en) * | 2019-07-31 | 2021-02-02 | 北京大学 | Mobile application black box testing method based on deep learning |
CN112306846B (en) * | 2019-07-31 | 2022-02-11 | 北京大学 | Mobile application black box testing method based on deep learning |
CN111476658A (en) * | 2020-04-13 | 2020-07-31 | 中国工商银行股份有限公司 | Loan continuous overdue prediction method and device |
CN112148919A (en) * | 2020-09-30 | 2020-12-29 | 哈尔滨理工大学 | Music click rate prediction method and device based on gradient lifting tree algorithm |
CN112651790A (en) * | 2021-01-19 | 2021-04-13 | 恩亿科(北京)数据科技有限公司 | OCPX self-adaptive learning method and system based on user reach in fast-moving industry |
CN112651790B (en) * | 2021-01-19 | 2024-04-12 | 恩亿科(北京)数据科技有限公司 | OCPX self-adaptive learning method and system based on user touch in quick-elimination industry |
CN112949864A (en) * | 2021-02-01 | 2021-06-11 | 北京三快在线科技有限公司 | Training method and device for pre-estimation model |
CN112949864B (en) * | 2021-02-01 | 2022-04-22 | 海南两心科技有限公司 | Training method and device for pre-estimation model |
CN116611497A (en) * | 2023-07-20 | 2023-08-18 | 深圳须弥云图空间科技有限公司 | Click rate estimation model training method and device |
CN116611497B (en) * | 2023-07-20 | 2023-10-03 | 深圳须弥云图空间科技有限公司 | Click rate estimation model training method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107622086A (en) | A kind of clicking rate predictor method and device | |
Qin et al. | The impact of AI on the advertising process: The Chinese experience | |
Cillo et al. | Niche tourism destinations’ online reputation management and competitiveness in big data era: Evidence from three Italian cases | |
CN104462593B (en) | A kind of method and apparatus that the push of user individual message related to resources is provided | |
JP6286549B2 (en) | Recommended results display method and equipment | |
CN103329151B (en) | Recommendation based on topic cluster | |
CN109840730B (en) | Method and device for data prediction | |
Liu et al. | Balancing between accuracy and fairness for interactive recommendation with reinforcement learning | |
RU2720954C1 (en) | Search index construction method and system using machine learning algorithm | |
CN109636430A (en) | Object identifying method and its system | |
CN110175895A (en) | A kind of item recommendation method and device | |
CN108932625A (en) | Analysis method, device, medium and the electronic equipment of user behavior data | |
CN109992715A (en) | Information displaying method, device, medium and calculating equipment | |
WO2023142520A1 (en) | Information recommendation method and apparatus | |
Zhang et al. | A new three-dimensional manufacturing service composition method under various structures using improved Flower Pollination Algorithm | |
Benmesbah et al. | An improved constrained learning path adaptation problem based on genetic algorithm | |
KR102274914B1 (en) | Method and device for online advertisement using early adopters | |
US10242069B2 (en) | Enhanced template curating | |
CN116308640A (en) | Recommendation method and related device | |
KR102238438B1 (en) | System for providing commercial product transaction service using price standardization | |
CN117651950A (en) | Interpreted natural language artifact recombination with context awareness | |
Li et al. | Web-scale personalized real-time recommender system on Suumo | |
CN108053260A (en) | A kind of method and system that extending user is determined according to statistics interest-degree | |
CN113449175A (en) | Hot data recommendation method and device | |
CN113327147A (en) | Method and device for displaying article information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180123 |