CN103914478B

CN103914478B - Webpage training method and system, webpage prediction method and system

Info

Publication number: CN103914478B
Application number: CN201310003765.6A
Authority: CN
Inventors: 陈俊波; 薛贵荣; 李玉龙; 严孝伟; 李华康; 韩定; 韩定一
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2013-01-06
Filing date: 2013-01-06
Publication date: 2018-05-08
Anticipated expiration: 2033-01-06
Also published as: CN103914478A

Abstract

This application relates to a webpage training method and system and a webpage prediction method and system. The training method includes: obtaining a prior probability table of classification keywords from existing data associated with the classification keywords; preprocessing a webpage to be trained to obtain the text of that webpage; extracting features from the text according to the prior probability table to obtain a feature-vector representation F1 of the association between the webpage to be trained and a specified category; and performing model training on the association feature vector F1 to obtain a classification result for the webpage to be trained. The application can handle highly heterogeneous category systems and can handle very large category systems with relatively little training data. In addition, by collecting users' browsing and search behavior across the whole web, rather than only within a single website, it largely solves the data sparsity problem.

Description

Webpage training method and system, webpage prediction method and system

Technical field

This application relates to the Internet field, and in particular to the classification and prediction of users' Internet access behavior.

Background

With the continued spread of computer technology, modern society relies heavily on the convenience that information technology brings. As computer and network technologies become more efficient, secure, and reliable, more and more wholesalers, retailers, and consumers choose to trade products over the Internet. Specific websites are becoming the most commercially valuable service providers on the Internet.

On a specific website, a user can perform a series of actions such as browsing, searching, comparing prices, purchasing, paying, and reviewing in order to buy goods that match his or her commercial intent.

At the same time, the scale and number of specific websites keep growing. Examples include Taobao, Tmall, JD.com (Jingdong), Amazon, Dangdang, and numerous small and medium-sized specific websites.

The results of commercial intent analysis allow a specific website to provide each user with personalized advertisements, personalized recommendation results, and a personalized site entrance. Providing products and services that match the user's commercial intent can greatly increase the number and value of transactions. Commercial intent analysis is therefore extremely important for a specific website.

For a specific website, analyzing the commercial intent of its users presents the following difficulties:

(1) Relative to a user's overall Internet access behavior, the user's behavior on any one specific website is very sparse. This sparsity makes it impossible to judge a user's commercial intent comprehensively. In addition, with few data samples the confidence intervals of the estimated probability distribution parameters become too wide, so accurate predictions cannot be made.

(2) The commodity category systems of specific websites are highly heterogeneous, i.e., the category systems of any two specific websites differ greatly. If a classical webpage classification method is used, then for every specific website it is necessary to collect a large amount of labeled data (the amount of labeled data is proportional to the number of categories), extract differently structured features according to the category system, and finally learn a machine learning classification model. This is expensive.

(3) The commodity category system of a specific website is often very large. In one example, the commodity category system is defined by a five-level category tree with about 20,000 leaf nodes. An overly large category system causes three kinds of problems. First, the amount of training data that must be labeled is very large. Second, the accuracy of category prediction drops sharply. Third, multi-label category predictions must be provided for each webpage, and multi-label prediction in turn lowers accuracy further.

In summary, a solution is needed that analyzes users' commercial intent from user behavior and overcomes the three problems above.

Summary of the invention

The main purpose of this application is to provide a webpage training scheme and a webpage prediction scheme that address problems the prior art struggles to overcome: website heterogeneity, very large category systems, and data sparsity.

To solve the above technical problems, the purpose of this application is achieved through the following technical solutions:

This application discloses a webpage training method for classifying webpages, comprising the following steps: obtaining a prior probability table of classification keywords from existing data associated with the classification keywords; preprocessing a webpage to be trained to obtain the text of the webpage to be trained; extracting features from that text according to the prior probability table to obtain a feature-vector representation F1 of the association between the webpage to be trained and a specified category; and performing model training on the association feature vector F1 to obtain a classification result for the webpage to be trained.

Preferably, in the method according to this application, the existing data associated with the classification keywords includes the website's category system and user click logs.

Preferably, in the method according to this application, the step of obtaining the prior probability table of classification keywords from the existing data associated with the classification keywords includes: obtaining search keywords from the user click logs and filtering them to obtain classification keywords; calculating the frequency with which each classification keyword occurs in each category; and traversing every non-leaf category node in the website's category system bottom-up, calculating the information gain and the chi-square test value of each classification keyword.

Preferably, in the method according to this application, preprocessing the webpage to be trained includes one or more of the following: webpage parsing; body-text extraction; word segmentation and stop-word filtering.

Preferably, in the method according to this application, the step of extracting features from the text of the webpage to be trained according to the prior probability table to obtain the association feature vector F1 between the webpage to be trained and the specified category includes: calculating the importance score of each classification keyword in the webpage to be trained; discretizing the importance scores and the prior probabilities of the classification keywords; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and calculating the two-dimensional feature matrix separately for the different parts of the text of the webpage to be trained, then arranging all matrix elements in order to obtain a fixed-length feature vector.

Preferably, in the method according to this application, the different parts of the text of the webpage to be trained include the webpage title, the webpage body text, and the webpage hyperlink text.

Preferably, in the method according to this application, the step of performing model training on the association feature vector F1 to obtain the classification result of the webpage to be trained includes: training a regression model on the labeled data using a preset regression model algorithm; and calculating the corresponding ROC curve for the trained regression model and setting different cut-off thresholds based on the ROC curve to obtain the classification result.

This application also discloses a webpage training system for classifying webpages, comprising: a device for obtaining a prior probability table of classification keywords from existing data associated with the classification keywords; a device for preprocessing a webpage to be trained to obtain the text of the webpage to be trained; a device for extracting features from that text according to the prior probability table to obtain a feature-vector representation F1 of the association between the webpage to be trained and a specified category; and a device for performing model training on the association feature vector F1 to obtain a classification result for the webpage to be trained.

Preferably, in the webpage training system, the existing data associated with the classification keywords includes the website's category system and user click logs.

Preferably, in the webpage training system, the device for obtaining the prior probability table of classification keywords from the existing data associated with the classification keywords includes: a device for obtaining search keywords from the user click logs and filtering them to obtain classification keywords; a device for calculating the frequency with which each classification keyword occurs in each category; and a device for traversing every non-leaf category node in the website's category system bottom-up and calculating the information gain and the chi-square test value of each classification keyword.

Preferably, in the webpage training system, the device for preprocessing the webpage to be trained includes: a webpage parsing device; a body-text extraction device; and a word segmentation/stop-word filtering device.

Preferably, in the webpage training system, the device for extracting features from the text of the webpage to be trained according to the prior probability table to obtain the association feature vector F1 between the webpage to be trained and the specified category includes: a device for calculating the importance score of each classification keyword in the webpage to be trained; a device for discretizing the importance scores and the prior probabilities of the classification keywords; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for calculating the two-dimensional feature matrix separately for the different parts of the text of the webpage to be trained and arranging all matrix elements in order to obtain a fixed-length feature vector.

Preferably, in the webpage training system, the different parts of the text of the webpage to be trained include the webpage title, the webpage body text, and the webpage hyperlink text.

Preferably, in the webpage training system, the device for performing model training on the association feature vector F1 to obtain the classification result of the webpage to be trained includes: a regression modeling device for training a regression model on the labeled data using a preset regression model algorithm; and an ROC analysis device for calculating the corresponding ROC curve for the trained regression model and setting different cut-off thresholds based on the ROC curve to obtain the classification result.

This application discloses a webpage prediction method that uses the classification results obtained by training with the webpage training method to make category predictions of user intent, comprising: collecting user access logs; obtaining webpage html source code from the user access logs; preprocessing the webpage html source code to obtain webpage text; extracting features from the webpage text according to the prior probability table to obtain a feature-vector representation F2 of the association between the webpage and a specified category; and obtaining a user intent category prediction result from the classification results and the association feature vector F2.

Preferably, in the method according to this application, the step of obtaining webpage html source code from the user access logs includes: screening the urls in the user access logs; and crawling the screened urls to obtain webpage html source code.

Preferably, in the method according to this application, the step of screening the urls in the user access logs includes: computing the deduplicated set of urls in the user access logs; calculating a webpage importance score for each url; and sorting all deduplicated urls by webpage importance score and taking the top several urls as the set of urls to crawl.
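The url screening steps above (deduplicate, score, sort, take the top urls) can be sketched as follows. This is only an illustration: the text does not say how the webpage importance score is computed, so the `importance` callable and the use of the visit count as a stand-in score are assumptions, as are the function name and the log row layout.

```python
from collections import Counter

def select_urls_to_crawl(access_log, importance, top_n):
    """Return the top_n distinct urls ranked by an importance score.

    access_log: iterable of (user_id, url) rows (assumed layout).
    importance: hypothetical callable(url, visit_count) -> float.
    """
    visit_counts = Counter(url for _, url in access_log)  # deduplicated urls with counts
    ranked = sorted(
        visit_counts,
        key=lambda url: importance(url, visit_counts[url]),
        reverse=True,
    )
    return ranked[:top_n]

# Stand-in score for the example: more visits means more important.
log = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u3", "c"), ("u3", "a"), ("u2", "c")]
to_crawl = select_urls_to_crawl(log, lambda url, count: count, top_n=2)
print(to_crawl)
```

Any other scoring function with the same signature could be swapped in without changing the screening logic.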

Preferably, in the method according to this application, the step of preprocessing the webpage html source code includes: webpage parsing; body-text extraction; word segmentation and stop-word filtering.

Preferably, in the method according to this application, the step of extracting features from the webpage text according to the prior probability table to obtain the association feature vector F2 between the webpage and the specified category includes: calculating the importance score of each classification keyword in the webpage; discretizing the importance scores and the prior probabilities of the classification keywords; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and calculating the two-dimensional feature matrix separately for the different parts of the webpage text, then arranging all matrix elements in order to obtain a fixed-length feature vector.

Preferably, in the method according to this application, the step of obtaining the user intent category prediction result from the classification results and the association feature vector F2 includes: obtaining page-level analysis results from the classification results and the association feature vector F2; and joining the page-level analysis results with the user access logs to obtain the intent indicated by each of the user's page visits, then aggregating along the user dimension to obtain the user intent category prediction result.
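The join and the user-level aggregation can be illustrated with a small sketch. The text does not specify the aggregation rule, so counting one vote per page visit and returning the most frequent categories is an assumption made here; the function name and data layouts are likewise hypothetical.

```python
from collections import Counter, defaultdict

def user_intent(access_log, page_category, top_k=1):
    """Join page-level results with the access log, then aggregate per user.

    access_log:    iterable of (user_id, url) rows (assumed layout).
    page_category: {url: category_id} produced by the page-level analysis.
    """
    per_user = defaultdict(Counter)
    for user_id, url in access_log:
        category = page_category.get(url)     # join on url
        if category is not None:              # pages without a result are skipped
            per_user[user_id][category] += 1  # assumed rule: one vote per visit
    return {
        user: [category for category, _ in votes.most_common(top_k)]
        for user, votes in per_user.items()
    }

log = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "c"), ("u2", "x")]
pages = {"a": "4.3", "b": "4.3", "c": "3.1"}
intents = user_intent(log, pages)
print(intents)
```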

This application also discloses a webpage prediction system that uses the classification results obtained by training with the webpage training system to make category predictions of user intent, comprising: a device for collecting user access logs; a device for obtaining webpage html source code from the user access logs; a device for preprocessing the webpage html source code to obtain webpage text; a device for extracting features from the webpage text according to the prior probability table to obtain a feature-vector representation F2 of the association between the webpage and a specified category; and a device for obtaining a user intent category prediction result from the classification results and the association feature vector F2.

Preferably, in the system, the device for obtaining webpage html source code from the user access logs includes: a device for screening the urls in the user access logs; and a device for crawling the screened urls to obtain webpage html source code.

Preferably, in the system, the device for screening the urls in the user access logs includes: a device for computing the deduplicated set of urls in the user access logs; a device for calculating a webpage importance score for each url; and a device for sorting all deduplicated urls by webpage importance score and taking the top several urls as the set of urls to crawl.

Preferably, in the system, the device for preprocessing the webpage html source code includes: a webpage parsing device; a body-text extraction device; and a word segmentation/stop-word filtering device.

Preferably, in the system, the device for extracting features from the webpage text according to the prior probability table to obtain the association feature vector F2 between the webpage and the specified category includes: a device for calculating the importance score of each classification keyword in the webpage; a device for discretizing the importance scores and the prior probabilities of the classification keywords; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for calculating the two-dimensional feature matrix separately for the different parts of the webpage text and arranging all matrix elements in order to obtain a fixed-length feature vector.

Preferably, in the system, the device for obtaining the user intent category prediction result from the classification results and the association feature vector F2 includes: a device for obtaining page-level analysis results from the classification results and the association feature vector F2; and a device for joining the page-level analysis results with the user access logs to obtain the intent indicated by each of the user's page visits and aggregating along the user dimension to obtain the user intent category prediction result.

Compared with the prior art, the technical solution of this application can handle highly heterogeneous category systems and can handle very large category systems with relatively little training data. In addition, by collecting users' browsing and search behavior across the whole web, rather than only within a single website, it largely solves the data sparsity problem.

Brief description of the drawings

The drawings described here are provided for further understanding of this application and form part of it. The illustrative embodiments of this application and their description serve to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the webpage training method of an embodiment of this application;

Fig. 2 is an ROC curve of an embodiment of this application;

Fig. 3 is a block diagram of the webpage training system of an embodiment of this application;

Fig. 4 is a flow chart of the webpage prediction method of an embodiment of this application;

Fig. 5 is a block diagram of the webpage prediction system of an embodiment of this application;

Fig. 6 is an architecture diagram of an exemplary system for effectively identifying users' commercial intent according to an embodiment of this application.

Detailed description

The main idea of this application is that, by uniformly processing users' browsing and search behavior on the Internet, and by using a general data interface and a classification algorithm with automatic adaptation capability, the application can effectively solve three important problems: data sparsity, heterogeneous category systems, and an excessive number of categories. It can also serve many websites simultaneously within a unified workflow.

To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.

The intent behind a user's browsing and search behavior on the Internet may be commercial or non-commercial. Commercial intent can be further subdivided according to the commodity category system of a particular specific website.

Identifying a user's online commercial intent is a very important function for a specific website. Through recommendations and advertisements, it can directly increase the website's transaction count and turnover.

On the other hand, the access behavior inside any one specific website accounts for only a small fraction of users' browsing and search behavior across the web; a large amount of browsing and search behavior occurs outside the specific website. Identifying users' commercial intent from their behavior across the whole web can bring great value to a specific website.

Every specific website has its own commodity category system. For a specific website, the user commercial intent obtained by analysis must be expressed in the same category system as its commodities; otherwise, consistent recommendations and advertisements cannot be provided. Note in particular that a commodity category system can be very large. In the example given here, the commodity category system is given by a category tree of up to five levels, with 20,000 leaf nodes in total.

This application is a platform system for commercial intent analysis. The platform collects users' browsing and search behavior on the Internet and exposes an open interface to specific websites. A specific website uploads its commodity category system, user search click logs, and training data samples through the data interface. Combining users' behavior records across the whole web with the website-specific information, the platform extracts features related to commercial intent and builds machine learning classifiers from these features that automatically detect a user's online commercial intent (OCI) in the context of any given specific website.

According to an embodiment of this application, a webpage training method for classifying webpages is provided.

Referring to Fig. 1, Fig. 1 is a flow chart of the webpage training method 100 of an embodiment of this application. The goal of the training flow is to take the category system data, user search click data, and labeled data uploaded by any specific website through the open interface, and to train commercial intent classification results for that website.

In step S102, the prior probability table of classification keywords is obtained from existing data associated with the classification keywords. This existing data may include the website's category system and user click logs. The step is in effect a log-processing flow that builds the prior probability table from the user click logs and the website's category system.

For example, the category system of the specific website, the user search click logs, and the labeled training data are first obtained through a "specific website open interface" (for example, an e-commerce website open interface). The category system and the user search click logs are associated with the classification keywords and are used to obtain the prior probability table of classification keywords. The labeled training data is used for model training in a later step.

Category system: for example, a tree-structured category system, which can be described in any structured description language capable of expressing a tree. In one application example, Extensible Markup Language (XML) is used to describe the category system.
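As an illustration only, a tree-structured category system might be written in XML and loaded as follows. The element name `category`, the `id` and `name` attributes, and the sample categories are assumptions made for this sketch, not the format used by the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: element and attribute names are illustrative only.
CATEGORY_XML = """
<category id="0" name="root">
  <category id="4" name="Mother and baby">
    <category id="4.1" name="Milk powder"/>
    <category id="4.3" name="Diapers"/>
  </category>
  <category id="3" name="Digital">
    <category id="3.1" name="Mobile phones"/>
  </category>
</category>
"""

def parse_category_tree(xml_text):
    """Return {category_id: [child_ids]} for every node in the tree."""
    root = ET.fromstring(xml_text)
    children = {}
    stack = [root]
    while stack:
        node = stack.pop()
        children[node.get("id")] = [child.get("id") for child in node]
        stack.extend(node)  # iterating an Element yields its direct children
    return children

def leaf_ids(children):
    """Leaf categories are the nodes without children."""
    return sorted(cid for cid, kids in children.items() if not kids)

tree = parse_category_tree(CATEGORY_XML)
print(leaf_ids(tree))
```

A child-list mapping like this is enough for the bottom-up traversal of non-leaf categories used later in step S102.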

User click logs: for example, in tabular format, where each row contains the following fields: the user's search keyword, and the category of the commodity clicked in the search results.

Training data: for example, more than a certain number of webpages are provided, each labeled with a certain number of matching and non-matching categories from the category system. Each row of training data contains the following fields: url, html source code, category id, and annotation result. For example:

http://www.**.com.cn/mama/a10102/	(html source code)	4.3	match

http://www.**.com.cn/mama/a10102/	(html source code)	3.1	mismatch

http://www.**.com.cn/mama/a10102/	(html source code)	4.1	partial match

Here the annotation result "match" marks a fully matching category, "mismatch" marks a completely unrelated category, and "partial match" marks a partially related category.

Step S102 may include the following steps: obtaining search keywords from the user click logs and filtering them to obtain classification keywords; calculating the frequency with which each classification keyword occurs in each category; and traversing every non-leaf category node in the website's category system bottom-up, calculating the information gain and the chi-square test value of each classification keyword.

For example, the user-supplied data is preprocessed: traditional/simplified Chinese conversion, case conversion, conversion of the encoding to UTF-8, and stop-word filtering.

Search keywords are then restricted by length: a keyword that does not belong to the entity vocabulary collected in advance (consisting of brand words, product words, product attribute descriptors, and so on) is filtered out if its length is <= 4 bytes. (Explanation: in UTF-8 encoding, >= 4 bytes corresponds to >= 2 Chinese characters or >= 4 English characters.)
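A minimal sketch of the length filter, assuming the entity vocabulary is available as a Python set and that length is measured on the UTF-8 encoded bytes; the function name and the sample vocabulary are hypothetical.

```python
def keep_keyword(keyword, entity_vocabulary, max_short_bytes=4):
    """Keep entity words unconditionally; drop other keywords of <= 4 UTF-8 bytes."""
    if keyword in entity_vocabulary:
        return True
    return len(keyword.encode("utf-8")) > max_short_bytes

entity_words = {"LG"}  # hypothetical brand word collected in advance
candidates = ["LG", "abcd", "abcde", "手", "手机"]
kept = [k for k in candidates if keep_keyword(k, entity_words)]
print(kept)
```

In this sketch a single Chinese character occupies 3 bytes in UTF-8, so it is dropped, while a two-character word (6 bytes) is kept.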

The frequency with which each keyword occurs in each category is calculated, and the results are aggregated with the keyword as the primary key, yielding a data format in which the keyword is the key and its category distribution is the attribute.
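The aggregation can be sketched as follows, assuming each click-log row has already been reduced to a (search keyword, clicked category id) pair; the names and sample rows are illustrative.

```python
from collections import Counter, defaultdict

def keyword_category_distribution(click_log_rows):
    """Map each keyword to a Counter of {category_id: frequency}."""
    distribution = defaultdict(Counter)
    for keyword, category_id in click_log_rows:
        distribution[keyword][category_id] += 1
    return distribution

rows = [("diaper", "4.3"), ("diaper", "4.3"), ("diaper", "4.1"), ("phone", "3.1")]
dist = keyword_category_distribution(rows)
print(dict(dist["diaper"]))
```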

Each non-leaf node (in other words, category) in the category structure is traversed bottom-up, and the information gain and chi-square test value of each keyword are calculated. For example, the information gain (Information Gain) dimensionality-reduction weight of each keyword is calculated. If this weight is below a certain threshold, all frequencies of the keyword in the subcategories are aggregated into the current category, and the keyword's frequencies in all subcategories are cleared. If the current node is the root node, the keyword is deleted.

For example, let K be a keyword and C the set of all categories under the current node. The information gain IG of K is calculated as follows:

$$IG(K) = H(C) - H(C \mid K)$$

where H is the information entropy. With $P(x_i)$ defined as the probability that the variable X takes the value $x_i$, H is given by:

$$H(X) = -\sum_{i} P(x_i) \log P(x_i)$$
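The information gain above can be computed from simple counts. In this sketch, the representation of the data as two count lists (records that contain K and records that do not, per category) and the base-2 logarithm are assumptions; the text fixes neither.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(with_k, without_k):
    """IG(K) = H(C) - H(C | K).

    with_k[i] / without_k[i]: number of records in category i that
    do / do not contain keyword K (assumed input layout).
    """
    n_with, n_without = sum(with_k), sum(without_k)
    n = n_with + n_without
    if n == 0:
        return 0.0
    h_c = entropy([a + b for a, b in zip(with_k, without_k)])
    h_c_given_k = (n_with / n) * entropy(with_k) + (n_without / n) * entropy(without_k)
    return h_c - h_c_given_k

print(information_gain([10, 0], [0, 10]))  # keyword fully determines the category
print(information_gain([5, 5], [5, 5]))    # keyword carries no information
```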

The chi-square (χ²) dimensionality-reduction weight is calculated for each keyword that satisfies the given conditions. If this weight is below a certain threshold, all frequencies of the keyword in the subcategories are aggregated into the current category, and the keyword's frequencies in all subcategories are cleared. If the current node is the root node, the keyword is deleted.

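Let K be a keyword, $\bar{K}$ all other keywords, and $\{C_i\}$ the given category system.

The following contingency table (Contingency Table) is compiled, where $N_{i,j}$ denotes the observed frequency in row i and column j:

	C_1	C_2	...	C_n
K	N_{1,1}	N_{1,2}	...	N_{1,n}
K̄	N_{2,1}	N_{2,2}	...	N_{2,n}

The chi-square value is calculated as:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(N_{i,j} - T_{i,j})^2}{T_{i,j}}$$

where $T_{i,j}$ is the theoretical value of row i, column j:

$$T_{i,j} = \frac{\left(\sum_{j'} N_{i,j'}\right)\left(\sum_{i'} N_{i',j}\right)}{\sum_{i'}\sum_{j'} N_{i',j'}}$$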
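As a sketch, Pearson's chi-square statistic over a 2 × n contingency table (row 0 for keyword K, row 1 for all other keywords, one column per category) can be computed as below. The table layout and the choice to return `None` when a theoretical value is zero are assumptions of this example.

```python
def chi_square(observed):
    """Pearson chi-square for a contingency table given as a list of rows.

    Returns None when the table is empty or any theoretical (expected)
    value is 0, since the statistic is undefined in that case.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if total == 0:
        return None
    # Theoretical value of each cell: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    if any(t == 0 for row in expected for t in row):
        return None
    return sum(
        (o - t) ** 2 / t
        for obs_row, exp_row in zip(observed, expected)
        for o, t in zip(obs_row, exp_row)
    )

print(chi_square([[10, 20], [20, 10]]))
```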

Conditions that a keyword must satisfy to take part in the calculation:

And

$$\left|\{(i, j) \mid T_{i,j} = 0\}\right| = 0$$
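The bottom-up pruning used with both weights (merge a keyword's subcategory frequencies into the parent when its weight is below the threshold, and delete the keyword at the root) can be sketched as follows. This is a simplified illustration: the `weight` callable stands in for the information gain or chi-square weight, it only sees the keyword's counts in the direct child categories, and the `peak_share` function used in the example is a made-up stand-in, not a measure from the text.

```python
def roll_up(children, root, freq, weight, threshold):
    """Prune keyword frequencies bottom-up, modifying freq in place.

    children: {category_id: [child_ids]}
    freq:     {keyword: {category_id: count}}
    weight:   hypothetical callable(list_of_child_counts) -> float
    """
    def visit(node):
        kids = children.get(node, [])
        for kid in kids:  # post-order: children are handled before their parent
            visit(kid)
        if not kids:
            return        # only non-leaf categories are examined
        for keyword in list(freq):
            counts = [freq[keyword].get(kid, 0) for kid in kids]
            if sum(counts) == 0 or weight(counts) >= threshold:
                continue
            if node == root:
                del freq[keyword]  # not discriminative even at the top level
                continue
            freq[keyword][node] = freq[keyword].get(node, 0) + sum(counts)
            for kid in kids:
                freq[keyword].pop(kid, None)  # clear the subcategory frequencies

    visit(root)
    return freq

def peak_share(counts):
    """Stand-in weight: share of the most frequent child category."""
    return max(counts) / sum(counts)

category_children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
keyword_freq = {
    "cheap": {"A1": 5, "A2": 5, "B1": 5, "B2": 5},  # spread evenly everywhere
    "diaper": {"A1": 10},                           # concentrated in one leaf
}
pruned = roll_up(category_children, "root", keyword_freq, peak_share, threshold=0.6)
print(pruned)
```

In the example, "cheap" is merged upward at A and B and then deleted at the root, while "diaper" keeps its leaf-level frequency.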

In step S104, the webpage to be trained is preprocessed to obtain its text. The purposes of preprocessing are: 1) to parse the html source code into structured data; 2) to filter out non-body noise data; 3) to segment the body text into words and filter out stop words as noise data.

In other words, step S104 may include one or more of the following: webpage parsing; body-text extraction; word segmentation and stop-word filtering.

In the webpage parsing step, the input of the parsing flow is the html source code. The html source code is parsed to obtain the position information and semantic information of each element. For example, given html source code, one learns which text appears in hyperlinks, which hyperlink text appears in the header, which hyperlink text appears in the footer, and so on. The output of the parsing flow is a Document Object Model tree (Document Object Model Tree).
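A much-reduced illustration of this parsing step, using Python's standard `html.parser` module: instead of building a full Document Object Model tree, the sketch only records whether each piece of text appears in the title, inside a hyperlink, or elsewhere. The class name, the three section names, and the sample page are assumptions of the example.

```python
from html.parser import HTMLParser

class SectionTextParser(HTMLParser):
    """Collect text by where it appears: <title>, inside <a>, or elsewhere."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.sections = {"title": [], "anchor": [], "other": []}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed elements.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or "script" in self.open_tags or "style" in self.open_tags:
            return
        if "title" in self.open_tags:
            self.sections["title"].append(text)
        elif "a" in self.open_tags:
            self.sections["anchor"].append(text)
        else:
            self.sections["other"].append(text)

page = (
    "<html><head><title>Baby shop</title></head>"
    "<body><p>Soft diapers</p><a href='/x'>More deals</a></body></html>"
)
parser = SectionTextParser()
parser.feed(page)
print(parser.sections)
```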

In the body-text extraction step, the input of the extraction flow is the Document Object Model tree. Based on the output of webpage parsing, the header, footer, navigation, advertisements, and sidebars are filtered out to obtain the body text of the webpage. The output of the body-text extraction flow is a Document Object Model tree carrying label information: each element in the tree carries a label indicating whether the element belongs to the body text.

In the word segmentation/stop-word filtering step, the input is the Document Object Model tree with body-text labels. A preset Chinese word segmentation algorithm is applied. Many such algorithms exist, for example CDWS, n-gram, hidden Markov models, and CRF (Conditional Random Field). In one application example, CRF may be used as the Chinese word segmentation algorithm. Function words occur with high frequency in text yet carry almost no meaning for classification; they interfere with classification and inflate the dimensionality of the text, which reduces classification efficiency. These words are filtered out using a stop-word list. The output of the word segmentation/stop-word filtering flow is the Document Object Model tree in which the text of each element has been segmented and stripped of stop words.
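The stop-word filtering part of this step can be illustrated on text that has already been segmented. The segmentation itself (for example with a CRF model) is outside the scope of this sketch, and the token list and stop-word list below are made-up examples.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word list, keeping the order."""
    return [token for token in tokens if token not in stop_words]

# Hypothetical output of a word segmenter and a tiny stop-word list.
segmented = ["这", "是", "一款", "柔软", "的", "纸尿裤"]
stop_list = {"这", "是", "的"}
content_words = remove_stop_words(segmented, stop_list)
print(content_words)
```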

In step S106, features are extracted from the text of the webpage to be trained according to the prior probability table, yielding the association feature vector representation F1 between the webpage to be trained and a specified category. This step is the feature extraction flow: the category prior probability table produced by the log processing flow is applied to the webpage that has passed through the preprocessing flow (structuring, word segmentation, noise filtering, and so on) to obtain a feature vector.

Step S106 may include the following steps: computing the importance score of each classification keyword in the webpage to be trained; discretizing the importance scores and the prior probabilities of the classification keywords; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; computing the two-dimensional feature matrix separately on different parts of the text of the webpage to be trained, and arranging all matrix elements in order to obtain a fixed-length feature vector. The different parts of the text of the webpage to be trained may include the webpage title, the webpage body text, and the webpage hyperlink text.

For example, the importance score of each keyword in the webpage is computed first. Taking into account the keyword's TF/IDF information, the positions at which the keyword occurs, and the keyword's font and color information, a regression model computes the keyword's importance score. The score lies in the range [0, 1].

The importance scores and the prior probabilities of the keywords are then discretized: with appropriate thresholds, the importance score of a keyword is discretized into four possible values, "high, relatively high, relatively low, low"; with appropriate thresholds, the prior probability score of a keyword is likewise discretized into four possible values, "high, relatively high, relatively low, low".
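The discretization step can be illustrated with a short sketch. The source only states that "corresponding thresholds" are set, so the cut points 0.25, 0.5, and 0.75 below are assumptions chosen for illustration, as are the names `LEVELS` and `discretize`.

```python
from bisect import bisect_right

# Ordered from lowest to highest level.
LEVELS = ("low", "relatively low", "relatively high", "high")
# Assumed cut points; the source does not give concrete threshold values.
THRESHOLDS = (0.25, 0.5, 0.75)


def discretize(score, thresholds=THRESHOLDS):
    """Map a score in [0, 1] to one of the four ordered levels."""
    return LEVELS[bisect_right(thresholds, score)]
```

The same function can be applied to both the importance score and the prior probability, possibly with a different threshold tuple for each.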

Then, from the importance level of each keyword in the webpage and the prior probability table produced by the log processing flow, a two-dimensional feature matrix can be computed, as follows:

Importance \ Prior probability	High	Relatively high	Relatively low	Low
High	n_1,1	n_1,2	n_1,3	n_1,4
Relatively high	n_2,1	n_2,2	n_2,3	n_2,4
Relatively low	n_3,1	n_3,2	n_3,3	n_3,4
Low	n_4,1	n_4,2	n_4,3	n_4,4

For example, n_2,3 is the number of keywords whose importance is relatively high and whose prior probability is relatively low.

Finally, the feature matrix above is computed separately for the webpage title, the webpage body text, and the webpage hyperlink text. This yields three 4 × 4 matrices, with 3 × 4 × 4 = 48 elements in total. Arranging these 48 elements in order gives a fixed-length feature vector, which is the output of the feature extraction flow.
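A minimal sketch of how the 48-element vector could be assembled is given below. The threshold values, the section names, and the input layout (a list of (importance, prior probability) pairs per page section) are assumptions made for illustration; only the 3 × 4 × 4 layout comes from the text.

```python
from bisect import bisect_right

SECTIONS = ("title", "body", "anchor_text")  # assumed section names
THRESHOLDS = (0.25, 0.5, 0.75)               # assumed cut points


def level(score):
    """Return 0 (high) to 3 (low), matching the row/column order of the table."""
    return 3 - bisect_right(THRESHOLDS, score)


def feature_matrix(keywords):
    """Count keywords per (importance level, prior level) cell.

    `keywords` is an iterable of (importance, prior) pairs for one section.
    """
    matrix = [[0] * 4 for _ in range(4)]
    for importance, prior in keywords:
        matrix[level(importance)][level(prior)] += 1
    return matrix


def feature_vector(page):
    """Flatten the three 4 x 4 matrices row by row into 48 counts."""
    vector = []
    for section in SECTIONS:
        for row in feature_matrix(page.get(section, [])):
            vector.extend(row)
    return vector
```

Because the vector only counts keywords per cell, its length does not depend on which category is being considered, which is what allows one model to serve many categories.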

In step S108, model training is performed on the association feature vector representation F1 to obtain the classification result of the webpage to be trained.

Step S108 may include the following steps: regression modeling and ROC analysis. Regression modeling may include training a regression model on labeled data with a preset regression algorithm. ROC analysis may include computing the ROC curve of the trained regression model and setting different cut thresholds based on the ROC curve to obtain the classification result. In the application scenario here (a very large number of categories), combining a regression model with ROC analysis can solve the multi-category prediction problem and improve the precision and recall of category prediction.

For example, using the labeled data uploaded by the specific website together with the feature vectors produced by the feature extraction module, a model of commercial intent is trained with a preset regression algorithm. Many regression algorithms are available, such as SVM (Support Vector Machine) and Logistic Regression. In one example, an SVM regression algorithm is used for training.
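To make the regression step concrete, the sketch below fits a logistic regression by full-batch gradient descent in plain Python. Note that the example in the text uses SVM regression; logistic regression is swapped in here only because the text lists it as an alternative and it fits in a few lines. The learning rate, epoch count, and function names are assumptions, and real feature vectors (48 counts) would likely need scaling first.

```python
import math


def predict(w, b, x):
    """Return a score in (0, 1) for the page-category association."""
    z = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by full-batch gradient descent on log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
            for k in range(d):
                grad_w[k] += err * xi[k] / n
            grad_b += err / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

The continuous score returned by `predict` is what the subsequent ROC analysis thresholds into a yes/no decision.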

After the regression model has been fitted, the corresponding ROC (Receiver Operating Characteristic) curve is computed for the model on the test set. In one example, the resulting ROC curve is shown in Fig. 2.

The horizontal axis of the ROC curve is FPR (False Positive Rate) and the vertical axis is TPR (True Positive Rate). They are defined as follows:

FPR = FP / (FP + TN)

TPR = TP / (TP + FN)

where:

FP (False Positive): the number of data points in the test set whose true label is false but which the model judges to be true.

TP (True Positive): the number of data points in the test set whose true label is true and which the model judges to be true.

FN (False Negative): the number of data points in the test set whose true label is true but which the model judges to be false.

TN (True Negative): the number of data points in the test set whose true label is false and which the model judges to be false.

FPR reflects the precision of the model: the lower it is, the better. TPR reflects the recall of the model: the higher it is, the better. Different cut thresholds can be set for different application scenarios, converting the regression problem back into a classification problem and yielding a webpage classification result that meets the requirements.

For example, in an advertising scenario a higher recall is desired while somewhat lower precision can be tolerated; in that case a threshold can be set that yields a higher TPR while tolerating a relatively high FPR. Conversely, in an anti-cheating application very high precision is required; a threshold can then be set that guarantees a very low FPR while tolerating a certain number of misses.
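The threshold selection just described can be sketched as follows. The code computes (threshold, FPR, TPR) points from test-set scores and labels using the standard definitions FPR = FP / (FP + TN) and TPR = TP / (TP + FN), then picks a cut threshold under either constraint. The function names and the two selection rules are illustrative assumptions, and the sketch assumes the test set contains both positive and negative examples.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) for each distinct score used as a cutoff.

    A sample is predicted positive when its score >= threshold.
    Assumes at least one positive and one negative label.
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
        points.append((thr, fp / neg, tp / pos))
    return points


def threshold_for_max_fpr(points, max_fpr):
    """Anti-cheating style: best TPR among cutoffs with FPR <= max_fpr."""
    ok = [p for p in points if p[1] <= max_fpr]
    return max(ok, key=lambda p: p[2])[0] if ok else None


def threshold_for_min_tpr(points, min_tpr):
    """Advertising style: lowest FPR among cutoffs with TPR >= min_tpr."""
    ok = [p for p in points if p[2] >= min_tpr]
    return min(ok, key=lambda p: p[1])[0] if ok else None
```

With the same set of ROC points, the two rules generally return different thresholds, which is how one trained model can serve scenarios with different precision and recall requirements.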

According to an embodiment of the present application, a webpage training system for classifying webpages is provided.

Referring to Fig. 3, Fig. 3 is a block diagram of the webpage training system 300 of an embodiment of the present application. The webpage training system 300 includes: a device 302 for obtaining the prior probability table of classification keywords according to existing data associated with the classification keywords; a device 304 for preprocessing the webpage to be trained to obtain the text of the webpage to be trained; a device 306 for extracting features from the text of the webpage to be trained according to the prior probability table to obtain the association feature vector representation F1 between the webpage to be trained and a specified category; and a device 308 for performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained.

The existing data associated with the classification keywords may include the website category system and user click logs. The device 302 for obtaining the prior probability table of classification keywords according to the existing data associated with the classification keywords may include: a device for obtaining search keywords from the user click logs and filtering them to obtain classification keywords; a device for computing the frequency with which each classification keyword occurs in different categories; and a device for traversing each non-leaf category node in the website category system bottom-up and computing the information gain and chi-square test of each classification keyword.

The device 304 for preprocessing the webpage to be trained may include: a webpage parsing device; a main text extraction device; and a word segmentation/stop-word filtering device.

The device 306 for extracting features from the text of the webpage to be trained according to the prior probability table to obtain the association feature vector representation F1 between the webpage to be trained and a specified category may include: a device for computing the importance score of each classification keyword in the webpage to be trained; a device for discretizing the importance scores and the prior probabilities of the classification keywords; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for computing the two-dimensional feature matrix separately on different parts of the text of the webpage to be trained and arranging all matrix elements in order to obtain a fixed-length feature vector. The different parts of the text of the webpage to be trained may include the webpage title, the webpage body text, and the webpage hyperlink text.

The device for performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained may include a regression modeling device and an ROC analysis device. The regression modeling device may be used to train a regression model on labeled data with a preset regression algorithm; the ROC analysis device may be used to compute the ROC curve of the trained regression model and to set different cut thresholds based on the ROC curve to obtain the classification result.

The training flow in the present application solves the problems of category-system heterogeneity and excessively large category counts mentioned in the background through the following innovations:

(1) The category system of the specific website and the users' search click logs are mined to obtain the keyword prior probability table.

(2) The keyword prior probability table is combined with webpage text features to generate the association feature vector representation between a webpage and a specified category.

(3) The classification problem is converted into a regression problem on the association between a webpage and a specified category, and a regression model is trained on the resulting association feature vector representations.

(4) The ROC curve is computed from the regression model outputs and the category labels, reasonable thresholds are chosen, and the regression problem is converted back into a classification problem.

Because the association feature vector representation between a webpage and a category is independent of the specific category information, the application can in this way handle highly heterogeneous category systems and can cover a very large category system with relatively little training data.

According to an embodiment of the present application, a webpage prediction method is provided for performing category prediction of user intent using the classification results obtained by training with the above webpage training method 100. The goal of the prediction flow is to provide, for any given specific website, a comprehensive commercial intent analysis result based on the user's browsing/search behavior across the whole network.

Referring to Fig. 4, Fig. 4 is a flow chart of the webpage prediction method 400 of an embodiment of the present application.

In step S402, user access logs are collected.

For example, a global cookie is first deployed across the whole network. A cookie is a technique that lets a website server store a small amount of data on the client's hard disk or in its memory, or read data from the client's hard disk. A cookie is a very small text file placed on your hard disk by a Web server when you browse a website; it can record information such as your user ID, password, pages browsed, and dwell time. Cookies are commonly used to identify users.

Then, when a user visits a webpage on which the global cookie is deployed, the webpage reports the user's access log to the server back end.

The back-end server then collects and stores the user access logs. Access log format: an access log is a record of the user's internet browsing and includes the following fields: CookieId, IP, UserAgent, visited Url, Refer Url, access time, search keyword.
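The access log record can be pictured with a small sketch. The field list comes from the text; the tab-separated layout, the timestamp format, and the names `AccessLogRecord` and `parse_line` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AccessLogRecord:
    cookie_id: str
    ip: str
    user_agent: str
    url: str
    refer_url: str
    access_time: datetime
    search_keyword: str


def parse_line(line):
    """Parse one log line; delimiter and time format are assumed, not from the source."""
    cookie_id, ip, user_agent, url, refer_url, ts, keyword = line.rstrip("\n").split("\t")
    return AccessLogRecord(
        cookie_id, ip, user_agent, url, refer_url,
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), keyword,
    )
```

Keeping the CookieId on every record is what later allows page-level results to be grouped per user.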

In step S404, the webpage HTML source code is obtained from the user access logs.

Step S404 may include the following steps: screening URLs from the user access logs; crawling the screened URLs to obtain the webpage HTML source code. The step of screening URLs from the user access logs may include the following steps: computing the deduplicated URL set from the user access logs; computing a webpage importance score for each URL; sorting all deduplicated URLs by webpage importance score and taking the top several URLs as the set of URLs to be crawled. The URL screening flow filters the URL list extracted from the user access log table and selects the URLs considered to be of high quality.

For example, the deduplicated URL set is first computed from the collected user access logs.

Then the following properties are computed for each URL: PV (Page View, the number of times the webpage was visited), UV (User View, the number of people who visited the webpage), PageRank (a webpage importance index computed from the hyperlink relations between webpages), and PageType (one of: 404 page, landing page, list page, content page).

Then the webpage importance score is computed:

S = (α × sigmoid(PV) + β × sigmoid(UV)) × PR × PT

where:

α, β: linear adjustment coefficients satisfying the constraints α > 0, β > 0, α + β = 1

PR: PageRank, with a value range of 0 to 10

PT: PageType, a discrete value: 0 when PageType is a 404 page or a landing page, 0.5 when PageType is a list page, and 1 when PageType is a content page.

The importance score is computed with the above formula, all URLs are sorted by it, and the top N URLs are taken as the set of URLs to be crawled.
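The scoring and top-N selection can be sketched directly from the formula. The choice α = β = 0.5, the dictionary layout, and the function names are assumptions. The source applies the sigmoid to PV and UV without stating any scaling, so this sketch does the same, which means large raw counts saturate close to 1.

```python
import math

# PT values as given in the text.
PAGE_TYPE_WEIGHT = {"404": 0.0, "landing": 0.0, "list": 0.5, "content": 1.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def importance(pv, uv, pagerank, page_type, alpha=0.5, beta=0.5):
    """S = (alpha * sigmoid(PV) + beta * sigmoid(UV)) * PR * PT."""
    assert alpha > 0 and beta > 0 and abs(alpha + beta - 1.0) < 1e-9
    return (alpha * sigmoid(pv) + beta * sigmoid(uv)) * pagerank * PAGE_TYPE_WEIGHT[page_type]


def top_n_urls(stats, n):
    """Rank URLs by importance, highest first, and keep the first n.

    `stats` maps each (already deduplicated) URL to (pv, uv, pagerank, page_type).
    """
    ranked = sorted(stats, key=lambda url: importance(*stats[url]), reverse=True)
    return ranked[:n]
```

Because PT is 0 for 404 and landing pages, those URLs always score 0 and fall to the bottom of the ranking regardless of traffic.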

Finally, according to the given URL list, the corresponding HTML source code is crawled and stored.

In step S406, the webpage HTML source code is preprocessed to obtain the webpage text.

Similar to step S104, step S406 may include the following steps: webpage parsing; main text extraction; word segmentation/stop-word filtering.

In the webpage parsing step, the HTML source code is parsed to obtain the position information and semantic information of each element. For example, given the HTML source code, it is known which text appears in hyperlinks, which hyperlink text appears in the header, which hyperlink text appears in the footer, and so on. The output of the webpage parsing flow is a document object model structure tree (Document Object Model Tree).

In the main text extraction step, based on the output of webpage parsing, the header, footer, navigation, advertisements, and sidebars are filtered out to obtain the main text of the webpage.

In the word segmentation/stop-word filtering step, a preset Chinese word segmentation algorithm is applied. Many such algorithms exist, for example CDWS, n-gram, hidden Markov models, and CRF (Conditional Random Field). In one application example, CRF can be used as the Chinese word segmentation algorithm. Function words occur very frequently in an article yet carry almost no meaning for classification: they interfere with classification and inflate the text dimensionality, which reduces classification efficiency. These words are filtered out using a stop-word list.

In step S408, features are extracted from the webpage text according to the prior probability table, yielding the association feature vector representation F2 between the webpage and a specified category. Preferably, the regression model and cut thresholds obtained from the training flow are combined with the feature vector representation F2 obtained in this step to produce the classification result of the webpage. Note that a webpage may be classified into multiple categories, or into no category at all.
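The observation that a page may fall into several categories or none follows from thresholding each category independently. A minimal sketch, assuming the regression scores and per-category cut thresholds are already available as dictionaries (the names and layout are hypothetical):

```python
def classify_page(scores, thresholds):
    """Return every category whose regression score reaches its cut threshold.

    `scores` maps category -> regression output for this page;
    `thresholds` maps category -> cutoff chosen from the ROC analysis.
    The result may contain several categories or be empty.
    """
    return sorted(c for c, s in scores.items() if s >= thresholds[c])
```

Each category is decided on its own, so no normalization across categories is needed and adding a category does not affect the others.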

Similar to step S106, step S408 may include the following steps: computing the importance score of each classification keyword in the webpage; discretizing the importance scores and the prior probabilities of the classification keywords; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; computing the two-dimensional feature matrix separately on different parts of the webpage text, and arranging all matrix elements in order to obtain a fixed-length feature vector.

In step S410, the user intent category prediction result is obtained according to the classification results and the association feature vector representation F2.

Step S410 may include the following steps: obtaining page-level analysis results from the classification results and the association feature vector representation F2; joining the page-level analysis results with the user access logs to obtain the intent indicated by each of the user's page visits; and aggregating along the user dimension to obtain the user intent category prediction result.

For example, machine learning classification is performed first: combining the feature vectors from the feature extraction module with the model file output by the training flow, the commercial intent preference of each webpage is classified with a preset classifier algorithm.

Next comes aggregation along the user dimension.

First, the user access logs are converted into a table in a relational database with the fields CookieId, visited Url, and access time. The output of the machine learning classifier module is likewise converted into a table in a relational database with the fields Url and commercial intent analysis result. That is, the user-dimension aggregation flow has two inputs: (1) the webpage analysis result table, which has two fields: webpage url and webpage commercial intent classification result; (2) the user access log table, which has three fields: user id, visited url, and access time.

The two tables are then joined with Url as the key, producing a joined table with the fields CookieId, visited Url, access time, and commercial intent analysis result.
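The join on Url can be sketched in memory as below. The text performs it as a relational database join; the tuple and dictionary layouts used here, and the choice of an inner join that drops visits to unanalyzed pages, are assumptions for illustration.

```python
def join_on_url(access_log, page_results):
    """Inner join of the access log with the page analysis results, keyed on url.

    `access_log` is an iterable of (cookie_id, url, access_time) rows;
    `page_results` maps url -> commercial intent analysis result.
    Visits to pages without an analysis result are dropped.
    """
    joined = []
    for cookie_id, url, access_time in access_log:
        if url in page_results:
            joined.append((cookie_id, url, access_time, page_results[url]))
    return joined
```

Each output row carries the CookieId, so the subsequent aggregation can group rows per user.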

Finally, an aggregation is performed with CookieId as the key, so that the commercial intent analysis results sharing the same CookieId are aggregated together. The aggregation algorithm takes a time decay factor into account. The following formula gives the aggregated weight of the i-th commercial category for a given CookieId:

where:

W_i: the aggregated weight of the i-th commercial intent category for the given CookieId;

W_{i,j}: the weight of the i-th commercial intent category on the j-th page visited by that user;

D_j: the number of days between the time the given CookieId visited the j-th page and the current time.

That is, the output of the user-dimension aggregation flow is the user commercial intent analysis result table, which has two fields: user id and user commercial intent classification result.
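A sketch of a time-decayed aggregation over the joined rows is shown below. The exact decay function is not reproduced in this text, so the exponential form W_i = Σ_j W_{i,j} × decay^{D_j} and the default decay value are assumptions chosen only to illustrate the idea that older visits contribute less; the row layout and function name are likewise hypothetical.

```python
from collections import defaultdict


def aggregate_user_intent(joined_rows, decay=0.9):
    """Aggregate page-level category weights per CookieId with time decay.

    `joined_rows` is an iterable of (cookie_id, days_ago, {category: weight}).
    Assumed form: W_i = sum over pages j of W_ij * decay ** D_j.
    """
    result = defaultdict(lambda: defaultdict(float))
    for cookie_id, days_ago, weights in joined_rows:
        factor = decay ** days_ago  # older visits get a smaller factor
        for category, weight in weights.items():
            result[cookie_id][category] += weight * factor
    return {cookie: dict(cats) for cookie, cats in result.items()}
```

The returned mapping corresponds to the two-field output table: one entry per user id, holding that user's aggregated category weights.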

In a preferred embodiment, the training flow produces, from known training data and the processing flow described above, a fitted regression model file and the cut thresholds derived from the ROC curve analysis. In the "machine learning classifier" module of the prediction flow, the regression model file and cut thresholds obtained from the training flow are applied to obtain page-level prediction results. The prediction flow then aggregates the page-level prediction results by user to obtain the user-level commercial intent analysis result.

According to an embodiment of the present application, a webpage prediction system is provided for performing category prediction of user intent using the classification results obtained by training with the above webpage training system.

Referring to Fig. 5, Fig. 5 is a block diagram of the webpage prediction system 500 of an embodiment of the present application. The webpage prediction system 500 includes: a device 502 for collecting user access logs; a device 504 for obtaining webpage HTML source code from the user access logs; a device 506 for preprocessing the webpage HTML source code to obtain webpage text; a device 508 for extracting features from the webpage text according to the prior probability table to obtain the association feature vector representation F2 between the webpage and a specified category; and a device 510 for obtaining the user intent category prediction result according to the classification results and the association feature vector representation F2.

The device 504 for obtaining webpage HTML source code from the user access logs may include: a device for screening URLs from the user access logs; and a device for crawling the screened URLs to obtain the webpage HTML source code. The device for screening URLs from the user access logs may include: a device for computing the deduplicated URL set from the user access logs; a device for computing a webpage importance score for each URL; and a device for sorting all deduplicated URLs by webpage importance score and taking the top several URLs as the set of URLs to be crawled.

The device 506 for preprocessing the webpage HTML source code may include: a webpage parsing device; a main text extraction device; and a word segmentation/stop-word filtering device.

The device 508 for extracting features from the webpage text according to the prior probability table to obtain the association feature vector representation F2 between the webpage and a specified category may include: a device for computing the importance score of each classification keyword in the webpage; a device for discretizing the importance scores and the prior probabilities of the classification keywords; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for computing the two-dimensional feature matrix separately on different parts of the webpage text and arranging all matrix elements in order to obtain a fixed-length feature vector.

The device 510 for obtaining the user intent category prediction result according to the classification results and the association feature vector representation F2 may include: a device for obtaining page-level analysis results from the classification results and the association feature vector representation F2; and a device for joining the page-level analysis results with the user access logs to obtain the intent indicated by each of the user's page visits, and aggregating along the user dimension to obtain the user intent category prediction result.

The prediction flow in the present application solves the data sparsity problem mentioned in the background through the following innovations:

(1) The user's browsing/search behavior across the whole network is collected, rather than only the behavior on a single website, which largely resolves the data sparsity problem. In one application example, 5 billion user browsing log records can be collected every day.

(2) Webpages with commercial value that users browsed are extracted from the user browsing logs, and their content is analyzed in depth.

(3) An embedding interface is provided, so that the customized requirements of multiple specific websites can be embedded into the prediction flow.

(4) Using the same user data, a customized commercial intent analysis result is adaptively produced for any given specific website.

Fig. 6 shows the architecture of an exemplary system of the present application for effectively identifying users' commercial intent. From the system-layer perspective, the system is divided into three layers: a data collection layer, a data preprocessing layer, and a data analysis layer. From the data-flow perspective, the system can be divided into two flows: the training flow and the prediction flow.

In summary, the present application addresses the three difficulties enumerated in the background. Specifically:

(1) The application collects and analyzes all of a user's browsing and search records across the whole network and builds a complete user profile from the analysis results, which largely resolves the data sparsity problem.

(2) The application provides a unified data interface to any potential specific website. Through the unified interface and processing flow, it adapts well to highly heterogeneous category systems and can quickly build accurate classifiers for different category systems with comparatively little training data.

(3) The application is applicable to very large category systems, does not require the training data to grow linearly with the number of categories, handles the multi-category and no-category cases well, and ensures the accuracy of category prediction.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The foregoing is merely embodiments of the present application and is not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the application shall fall within the scope of the claims of the application.

Claims

1. A webpage training method for classifying webpages, characterized by comprising:

obtaining a category prior probability table of classification keywords according to existing data associated with the classification keywords, the existing data associated with the classification keywords including a website category system;

preprocessing a webpage to be trained to obtain text of the webpage to be trained;

extracting features from the text of the webpage to be trained according to the category prior probability table to obtain an association feature vector representation F1 between the webpage to be trained and a specified category;

performing model training on the association feature vector representation F1 to obtain a classification result of the webpage to be trained.
2. The method according to claim 1, characterized in that the existing data associated with the classification keywords includes user click logs.
3. The method according to claim 2, characterized in that the step of obtaining the category prior probability table of classification keywords according to the existing data associated with the classification keywords comprises:

obtaining search keywords from the user click logs, and filtering them to obtain classification keywords;

computing the frequency with which each classification keyword occurs in different categories;

traversing each non-leaf category node in the website category system bottom-up, and computing the information gain and chi-square test of each classification keyword.
4. The method according to claim 1, characterized in that preprocessing the webpage to be trained comprises one or more of the following:

webpage parsing;

main text extraction;

word segmentation/stop-word filtering.
5. The method according to claim 1, characterized in that the step of extracting features from the text of the webpage to be trained according to the category prior probability table to obtain the association feature vector representation F1 between the webpage to be trained and a specified category comprises:

computing the importance score of each classification keyword in the webpage to be trained;

discretizing the importance scores and the category prior probabilities of the classification keywords;

obtaining a two-dimensional feature matrix from the discretized importance and the discretized category prior probability of the classification keywords;

computing the two-dimensional feature matrix separately on different parts of the text of the webpage to be trained, and arranging all matrix elements in order to obtain a fixed-length feature vector.
6. The method according to claim 5, characterized in that the different parts of the text of the webpage to be trained include the webpage title, the webpage body text, and the webpage hyperlink text.
7. The method according to claim 1, characterized in that the step of performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained comprises:

training a regression model on labeled data with a preset regression algorithm; and

computing the ROC curve of the trained regression model, and setting different cut thresholds based on the ROC curve to obtain the classification result.
8. A webpage training system for classifying webpages, characterized by comprising:

a device for obtaining a category prior probability table of classification keywords according to existing data associated with the classification keywords, the existing data associated with the classification keywords including a website category system;

a device for preprocessing a webpage to be trained to obtain text of the webpage to be trained;

a device for extracting features from the text of the webpage to be trained according to the category prior probability table to obtain an association feature vector representation F1 between the webpage to be trained and a specified category;

a device for performing model training on the association feature vector representation F1 to obtain a classification result of the webpage to be trained.
9. A webpage prediction method for performing category prediction of user intent using the classification results obtained by training with the webpage training method according to any one of claims 1-7, characterized by comprising:

collecting user access logs;

obtaining webpage HTML source code from the user access logs;

preprocessing the webpage HTML source code to obtain webpage text;

extracting features from the webpage text according to the category prior probability table to obtain an association feature vector representation F2 between the webpage and a specified category;

obtaining a user intent category prediction result according to the classification results and the association feature vector representation F2.
10. The method according to claim 9, characterized in that the step of obtaining webpage HTML source code from the user access logs comprises:

screening URLs from the user access logs;

crawling the screened URLs to obtain the webpage HTML source code.
11. The method according to claim 10, characterized in that the step of screening URLs from the user access logs comprises:

computing the deduplicated URL set from the user access logs;

computing a webpage importance score for each URL;

sorting all deduplicated URLs by webpage importance score, and taking the top several URLs as the set of URLs to be crawled.
12. The method according to claim 9, characterized in that the step of extracting features from the webpage text according to the category prior probability table to obtain the association feature vector representation F2 between the webpage and a specified category comprises:

computing the importance score of each classification keyword in the webpage;

discretizing the importance scores and the category prior probabilities of the classification keywords;

obtaining a two-dimensional feature matrix from the discretized importance and the discretized category prior probability of the classification keywords;

computing the two-dimensional feature matrix separately on different parts of the webpage text, and arranging all matrix elements in order to obtain a fixed-length feature vector.
13. The method according to claim 9, characterized in that the step of obtaining the user intent category prediction result according to the classification results and the association feature vector representation F2 comprises:

obtaining page-level analysis results from the classification results and the association feature vector representation F2;

joining the page-level analysis results with the user access logs to obtain the intent indicated by each of the user's page visits, and aggregating along the user dimension to obtain the user intent category prediction result.
14. A webpage prediction system for performing category prediction of user intent using the classification results obtained by training with the webpage training system according to claim 8, characterized by comprising:

a device for collecting user access logs;

a device for obtaining webpage HTML source code from the user access logs;

a device for preprocessing the webpage HTML source code to obtain webpage text;

a device for extracting features from the webpage text according to the category prior probability table to obtain an association feature vector representation F2 between the webpage and a specified category;

a device for obtaining a user intent category prediction result according to the classification results and the association feature vector representation F2.