CN103914478B - Webpage training method and system, webpage prediction method and system - Google Patents
- Publication number
- CN103914478B (application CN201310003765.6A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- classification
- trained
- key word
- sort key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
This application relates to a webpage training method and system and a webpage prediction method and system. The training method includes: obtaining a prior probability table of classification keywords from existing data associated with the classification keywords; preprocessing a webpage to be trained to obtain the text of the webpage to be trained; extracting features from that text according to the prior probability table to obtain an association feature vector representation F1 between the webpage to be trained and a specified category; and performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained. The application can handle strongly heterogeneous category systems at the same time and can handle very large category systems with relatively little training data. In addition, by collecting users' browsing/search behavior across the entire web, rather than only within a single website, it largely resolves the data sparsity problem.
Description
Technical field
This application relates to the Internet field, and in particular to the classification and prediction of users' Internet access behavior.
Background
With the continuing spread of computer technology, modern society relies heavily on the convenience that information technology brings. As computer and network technologies become more efficient, secure, and reliable, more and more wholesalers, retailers, and consumers choose to trade goods over the Internet. Specific websites are becoming the most commercially valuable service providers on the Internet. On a specific website, a user can browse, search, compare prices, purchase, pay, and post reviews, among other actions, in order to buy goods that match his or her commercial intent.
At the same time, the scale and number of specific websites keep growing. Examples include Taobao, Tmall, Jingdong, Amazon, and Dangdang, as well as many small and medium-sized specific websites.
The results of commercial intent analysis allow a specific website to provide each user with personalized advertisements, personalized recommendations, and a personalized site entry point. Providing products and services that match the user's commercial intent can greatly increase transaction volume and transaction value. Commercial intent analysis of users is therefore extremely important for specific websites.
For a specific website, analyzing the commercial intent of its users presents the following difficulties:
(1) Relative to a user's overall Internet access behavior, the amount of behavior that user exhibits on any one specific website is very small. This data sparsity makes it impossible to judge a user's commercial intent comprehensively. In addition, with few data samples, the confidence intervals of the estimated probability distribution parameters become too wide, so accurate predictions cannot be made.
(2) The commodity category systems of specific websites are strongly heterogeneous; that is, the category systems of any two specific websites differ greatly. With a classical webpage classification method, every specific website would need to collect large-scale labeled data (the amount of labeled data is proportional to the number of categories), extract differently shaped features according to its own category system, and finally learn a machine learning classification model. This is expensive.
(3) The commodity category system of a specific website is often very large. In one example, the commodity category system is defined by a five-level category tree whose leaf nodes number about 20,000. An excessively large category system causes three kinds of problems. First, the amount of training data that must be labeled is very large. Second, the accuracy of category prediction drops substantially. Third, multiple category labels must be predicted for each webpage, and multi-label prediction in turn lowers accuracy further.
In summary, a solution is needed that analyzes users' commercial intent from user behavior and overcomes the three problems above.
Summary of the invention
The main purpose of the application is to provide a webpage training scheme and a webpage prediction scheme that solve the prior art's difficulty in overcoming website heterogeneity, very large category systems, and data sparsity.
To solve the above technical problems, the purpose of the application is achieved through the following technical solutions:
The application discloses a webpage training method for classifying webpages, comprising the following steps: obtaining a prior probability table of classification keywords from existing data associated with the classification keywords; preprocessing a webpage to be trained to obtain the text of the webpage to be trained; extracting features from the text of the webpage to be trained according to the prior probability table to obtain an association feature vector representation F1 between the webpage to be trained and a specified category; and performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained.
Preferably, in the method according to the application, the existing data associated with the classification keywords includes a website category system and user click logs.
Preferably, in the method according to the application, the step of obtaining the prior probability table of classification keywords from the existing data associated with the classification keywords includes: obtaining search keywords from the user click logs and filtering them to obtain classification keywords; calculating the frequency with which each classification keyword occurs in different categories; and traversing each non-leaf category node in the website category system bottom-up, calculating the information gain and the chi-square test of each classification keyword.
Preferably, in the method according to the application, preprocessing the webpage to be trained includes one or more of the following: webpage parsing; main text extraction; word segmentation/stop word filtering.
Preferably, in the method according to the application, the step of extracting features from the text of the webpage to be trained according to the prior probability table to obtain the association feature vector representation F1 between the webpage to be trained and the specified category includes: calculating the importance score of each classification keyword in the webpage to be trained; discretizing the importance score and the prior probability of the classification keyword; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and calculating the two-dimensional feature matrix separately for different parts of the text of the webpage to be trained and arranging all matrix elements in order to obtain a fixed-length feature vector.
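The feature construction in the preceding paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the application: the bin edges, the number of bins, the function names, and the choice that each matrix cell counts the keywords falling into one (importance bin, prior bin) pair are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical bin edges; the text does not say how scores are discretized.
IMPORTANCE_EDGES = [0.1, 0.3, 0.6]   # yields 4 importance bins
PRIOR_EDGES = [0.2, 0.5, 0.8]        # yields 4 prior-probability bins


def discretize(value, edges):
    """Map a continuous score to a bin index in [0, len(edges)]."""
    return bisect_right(edges, value)


def part_matrix(importance, prior):
    """Build the 2-D matrix for one page part (title, body, or link text).

    importance: dict keyword -> importance score within this page part.
    prior: dict keyword -> prior probability for the specified category.
    Assumption: each cell counts the keywords that fall into that cell.
    """
    rows, cols = len(IMPORTANCE_EDGES) + 1, len(PRIOR_EDGES) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for keyword, score in importance.items():
        if keyword not in prior:
            continue  # not a classification keyword for this category
        i = discretize(score, IMPORTANCE_EDGES)
        j = discretize(prior[keyword], PRIOR_EDGES)
        matrix[i][j] += 1
    return matrix


def feature_vector(parts, prior):
    """Concatenate the matrices of all page parts into one fixed-length vector."""
    vector = []
    for name in ("title", "body", "anchor"):
        for row in part_matrix(parts.get(name, {}), prior):
            vector.extend(row)
    return vector
```

With three page parts and a 4 x 4 matrix per part, the vector always has 48 elements, regardless of how many keywords a page contains.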
Preferably, in the method according to the application, the different parts of the text of the webpage to be trained include the webpage title, the webpage body, and the webpage hyperlink text.
Preferably, in the method according to the application, the step of performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained includes: performing regression model training on labeled data according to a preset regression model algorithm; and calculating the corresponding ROC curve for the trained regression model and setting different cut-off thresholds based on the ROC curve to obtain the classification result.
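The ROC step above can be illustrated with a small Python sketch. The selection rule used here (take the lowest threshold whose false positive rate stays within a budget) is an assumption made for illustration; the text only says that cut-off thresholds are set based on the ROC curve. The function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) triples, one per distinct score.

    A page is predicted positive when its score is >= the threshold.
    Assumes labels are 0/1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def pick_threshold(points, max_fpr):
    """Pick the lowest threshold whose false positive rate is within max_fpr."""
    allowed = [p for p in points if p[1] <= max_fpr]
    return min(allowed)[0] if allowed else None
```

Calling `pick_threshold` with several different `max_fpr` budgets yields several cut-off thresholds, which could then separate, for example, matching, partially matching, and non-matching categories.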
The application also discloses a webpage training system for classifying webpages, including: a device for obtaining a prior probability table of classification keywords from existing data associated with the classification keywords; a device for preprocessing a webpage to be trained to obtain the text of the webpage to be trained; a device for extracting features from the text of the webpage to be trained according to the prior probability table to obtain an association feature vector representation F1 between the webpage to be trained and a specified category; and a device for performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained.
Preferably, in the webpage training system, the existing data associated with the classification keywords includes a website category system and user click logs.
Preferably, in the webpage training system, the device for obtaining the prior probability table of classification keywords from the existing data associated with the classification keywords includes: a device for obtaining search keywords from the user click logs and filtering them to obtain classification keywords; a device for calculating the frequency with which each classification keyword occurs in different categories; and a device for traversing each non-leaf category node in the website category system bottom-up and calculating the information gain and the chi-square test of each classification keyword.
Preferably, in the webpage training system, the device for preprocessing the webpage to be trained includes: a webpage parsing device; a main text extraction device; and a word segmentation/stop word filtering device.
Preferably, in the webpage training system, the device for extracting features from the text of the webpage to be trained according to the prior probability table to obtain the association feature vector representation F1 between the webpage to be trained and the specified category includes: a device for calculating the importance score of each classification keyword in the webpage to be trained; a device for discretizing the importance score and the prior probability of the classification keyword; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for calculating the two-dimensional feature matrix separately for different parts of the text of the webpage to be trained and arranging all matrix elements in order to obtain a fixed-length feature vector.
Preferably, in the webpage training system, the different parts of the text of the webpage to be trained include the webpage title, the webpage body, and the webpage hyperlink text.
Preferably, in the webpage training system, the device for performing model training on the association feature vector representation F1 to obtain the classification result of the webpage to be trained includes: a regression modeling device for performing regression model training on labeled data according to a preset regression model algorithm; and an ROC analysis device for calculating the corresponding ROC curve for the trained regression model and setting different cut-off thresholds based on the ROC curve to obtain the classification result.
The application discloses a webpage prediction method for performing classification prediction of user intent using the classification result obtained by training with the webpage training method, including: collecting user access logs; obtaining webpage HTML source code from the user access logs; preprocessing the webpage HTML source code to obtain webpage text; extracting features from the webpage text according to the prior probability table to obtain an association feature vector representation F2 between the webpage and a specified category; and obtaining a user intent classification prediction result from the classification result and the association feature vector representation F2.
Preferably, in the method according to the application, the step of obtaining webpage HTML source code from the user access logs includes: screening the URLs in the user access logs; and crawling the screened URLs to obtain the webpage HTML source code.
Preferably, in the method according to the application, the step of screening the URLs in the user access logs includes: computing the deduplicated URL set for the user access logs; computing a webpage importance score for each URL; and sorting all deduplicated URLs by webpage importance score and taking the top several URLs as the set of URLs to be crawled.
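The URL screening steps above can be sketched as follows. The text does not define the webpage importance score, so this sketch assumes, purely for illustration, that importance is the number of visits recorded in the access log; the record layout and function name are also hypothetical.

```python
from collections import Counter


def select_urls_to_crawl(access_log, top_n):
    """Deduplicate URLs, score them, and keep the top_n highest-scoring ones.

    access_log: iterable of (user_id, url) records.
    Assumption: importance score = visit count (not specified in the text).
    """
    visits = Counter(url for _, url in access_log)  # keys are the deduplicated URLs
    # Sort by descending score; break ties by URL so the result is deterministic.
    ranked = sorted(visits, key=lambda url: (-visits[url], url))
    return ranked[:top_n]
```

Any other per-URL score could be substituted for the visit count without changing the deduplicate, score, sort, and truncate structure.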
Preferably, in the method according to the application, the step of preprocessing the webpage HTML source code includes: webpage parsing; main text extraction; and word segmentation/stop word filtering.
Preferably, in the method according to the application, the step of extracting features from the webpage text according to the prior probability table to obtain the association feature vector representation F2 between the webpage and the specified category includes: calculating the importance score of each classification keyword in the webpage; discretizing the importance score and the prior probability of the classification keyword; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and calculating the two-dimensional feature matrix separately for different parts of the webpage text and arranging all matrix elements in order to obtain a fixed-length feature vector.
Preferably, in the method according to the application, the step of obtaining the user intent classification prediction result from the classification result and the association feature vector representation F2 includes: obtaining webpage-level analysis results from the classification result and the association feature vector representation F2; and joining the webpage-level analysis results with the user access logs to obtain the intent indicated by each of the user's webpage visits, then aggregating along the user dimension to obtain the user intent classification prediction result.
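The join and aggregation described above can be sketched in Python. The data layout (a list of `(user_id, url)` records and a dictionary from URL to predicted category ids) and the use of simple visit counts as the aggregate are assumptions for illustration only.

```python
from collections import defaultdict


def user_intent(access_log, page_categories):
    """Join page-level results with the access log, then aggregate per user.

    access_log: iterable of (user_id, url) records.
    page_categories: dict url -> list of predicted category ids (page level).
    Returns dict user_id -> dict category id -> number of visits.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for user_id, url in access_log:
        # Join on url: visits to pages without a prediction contribute nothing.
        for category in page_categories.get(url, []):
            totals[user_id][category] += 1
    return {user: dict(counts) for user, counts in totals.items()}
```

The per-user counts could then be ranked or normalized to pick each user's most likely commercial intent categories.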
The application also discloses a webpage prediction system for performing classification prediction of user intent using the classification result obtained by training with the webpage training system, including: a device for collecting user access logs; a device for obtaining webpage HTML source code from the user access logs; a device for preprocessing the webpage HTML source code to obtain webpage text; a device for extracting features from the webpage text according to the prior probability table to obtain an association feature vector representation F2 between the webpage and a specified category; and a device for obtaining a user intent classification prediction result from the classification result and the association feature vector representation F2.
Preferably, in the system, the device for obtaining webpage HTML source code from the user access logs includes: a device for screening the URLs in the user access logs; and a device for crawling the screened URLs to obtain the webpage HTML source code.
Preferably, in the system, the device for screening the URLs in the user access logs includes: a device for computing the deduplicated URL set for the user access logs; a device for computing a webpage importance score for each URL; and a device for sorting all deduplicated URLs by webpage importance score and taking the top several URLs as the set of URLs to be crawled.
Preferably, in the system, the device for preprocessing the webpage HTML source code includes: a webpage parsing device; a main text extraction device; and a word segmentation/stop word filtering device.
Preferably, in the system, the device for extracting features from the webpage text according to the prior probability table to obtain the association feature vector representation F2 between the webpage and the specified category includes: a device for calculating the importance score of each classification keyword in the webpage; a device for discretizing the importance score and the prior probability of the classification keyword; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for calculating the two-dimensional feature matrix separately for different parts of the webpage text and arranging all matrix elements in order to obtain a fixed-length feature vector.
Preferably, in the system, the device for obtaining the user intent classification prediction result from the classification result and the association feature vector representation F2 includes: a device for obtaining webpage-level analysis results from the classification result and the association feature vector representation F2; and a device for joining the webpage-level analysis results with the user access logs to obtain the intent indicated by each of the user's webpage visits and aggregating along the user dimension to obtain the user intent classification prediction result.
Compared with the prior art, the technical solution of the application can handle strongly heterogeneous category systems at the same time and can handle very large category systems with relatively little training data. In addition, by collecting users' browsing/search behavior across the entire web, rather than only within a single website, it largely resolves the data sparsity problem.
Brief description of the drawings
The drawings described here provide a further understanding of the application and form part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the webpage training method of an embodiment of the application;
Fig. 2 is the ROC curve of an embodiment of the application;
Fig. 3 is a block diagram of the webpage training system of an embodiment of the application;
Fig. 4 is a flowchart of the webpage prediction method of an embodiment of the application;
Fig. 5 is a block diagram of the webpage prediction system of an embodiment of the application;
Fig. 6 is an architecture diagram of an exemplary system for effectively identifying user commercial intent according to an embodiment of the application.
Detailed description of the embodiments
The main idea of the application is that, through unified processing of users' browsing/search behavior on the Internet, a general data interface, and a classification algorithm with automatic adaptation capability, the application can deal well with three important problems, namely data sparsity, heterogeneous category systems, and an excessively large number of categories, and can serve many websites at the same time in a unified workflow.
To make the purpose, technical solutions, and advantages of the application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.
The intent behind a user's browsing and search behavior on the Internet may be commercial or non-commercial. Commercial intent can be further divided according to the specific commodity category system of a specific website.
Identifying a user's online commercial intent is a very important function for a specific website. Through recommendations and advertisements, it can directly increase the specific website's transaction volume and transaction value.
On the other hand, access behavior within any one specific website accounts for only a small fraction of users' browsing/search behavior across the entire web; a large amount of browsing/search behavior occurs outside the specific website. Identifying users' commercial intent from their behavior across the entire web can bring great value to a specific website.
Every specific website has its own commodity category system. A specific website requires that the category system of the analyzed user commercial intent be identical to its own commodity category system; otherwise, consistent recommendations and advertisements cannot be provided. It should be noted in particular that the commodity category system can be very large. In the example provided, the commodity category system is defined by a category tree of up to five levels, with about 20,000 leaf nodes in total.
The application is a platform system for commercial intent analysis. The platform collects users' browsing/search behavior on the Internet and exposes an open interface to specific websites. Through the data interface, a specific website uploads its commodity category system, user search click logs, and training data samples. Combining users' behavior records across the entire web with the website-specific information, the platform extracts features related to commercial intent and builds machine learning classifiers from these features to automatically detect a user's online commercial intent (OCI) in the context of any given specific website.
According to an embodiment of the application, a webpage training method is provided for classifying webpages.
Referring to Fig. 1, Fig. 1 is a flowchart of the webpage training method 100 of an embodiment of the application. The goal of the training flow is to train, from the category system data, user search click data, and labeled data uploaded by any specific website through the open interface, a commercial intent classification result for that website.
In step S102, the prior probability table of classification keywords is obtained from existing data associated with the classification keywords. This existing data may include the website category system and user click logs. In practice, this step is the log processing flow that builds the prior probability table from the user click logs and the website category system.
For example, the category system of the specific website, the user search click logs, and the labeled training data are first obtained through the "specific website open interface" (which may be, for example, an e-commerce website open interface). The category system and the user search click logs are associated with the classification keywords and are used to obtain the prior probability table of classification keywords. The labeled training data is used for model training in a later step.
Category system: for example, a tree-structured category system, which can be described in any structured description language capable of expressing a tree structure. For example, in one application example, Extensible Markup Language (XML) is used to describe the category system.
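No XML sample is reproduced in this text. As an illustration only, the following Python sketch parses a hypothetical XML layout of a category tree and lists its leaf categories. The element name `category`, the attributes `id` and `name`, and the category names are assumptions, not the format used by the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; element names, attributes, and names are illustrative.
CATEGORY_XML = """
<category id="0" name="root">
  <category id="4" name="Mother and baby">
    <category id="4.1" name="Strollers"/>
    <category id="4.3" name="Baby formula"/>
  </category>
  <category id="3" name="Electronics">
    <category id="3.1" name="Mobile phones"/>
  </category>
</category>
"""


def leaf_categories(xml_text):
    """Return the ids of all leaf categories in document order."""
    root = ET.fromstring(xml_text)
    # An element with no child elements is a leaf of the category tree.
    return [node.get("id") for node in root.iter("category") if len(node) == 0]


def parent_map(xml_text):
    """Return dict child id -> parent id, useful for bottom-up traversal."""
    root = ET.fromstring(xml_text)
    return {child.get("id"): parent.get("id")
            for parent in root.iter("category") for child in parent}
```

A parent map of this kind makes it straightforward to walk from leaf categories up to the root, which is the direction of traversal used when keyword frequencies are merged into parent categories.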
User click log: for example, in tabular format, where each row contains the following fields: the user's search keyword, and the category of the commodity clicked in the search results.
Training data: for example, more than a certain number of webpages are provided, and each webpage is labeled with a certain number of matching and non-matching categories from the category system. Each row of training data contains the following fields: url, html source code, category id, annotation result. For example:
http://www.**.com.cn/mama/a10102/ html source code 4.3 match
http://www.**.com.cn/mama/a10102/ html source code 3.1 mismatch
http://www.**.com.cn/mama/a10102/ html source code 4.1 partial match
Here, the annotation result "match" marks a fully matching category, "mismatch" marks a completely unrelated category, and "partial match" marks a partially related category.
Step S102 may include the following steps: obtaining search keywords from the user click logs and filtering them to obtain classification keywords; calculating the frequency with which each classification keyword occurs in different categories; and traversing each non-leaf category node in the website category system bottom-up, calculating the information gain and the chi-square test of each classification keyword.
For example, the data entered by users is preprocessed: traditional/simplified Chinese conversion, case conversion, conversion of the encoding to UTF-8, and stop word filtering.
Length restriction on search keywords: if a keyword does not belong to the pre-collected entity vocabulary (consisting of brand words, product words, product attribute descriptors, and so on), all keywords with length <= 4 bytes are filtered out. (Note: in UTF-8 encoding, >= 4 bytes represents >= 2 Chinese characters or >= 4 English characters.)
The frequency with which each keyword occurs in different categories is calculated, and the results are aggregated with the keyword as the primary key, giving a data format in which the keyword is the primary key and the category distribution is the attribute.
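The aggregation just described can be sketched in Python as follows. The row layout `(search_keyword, clicked_category)` follows the click log fields named earlier; the function names, and the idea of normalizing counts into a per-keyword distribution, are assumptions for illustration rather than the exact form of the prior probability table.

```python
from collections import Counter, defaultdict


def keyword_category_counts(click_log):
    """Aggregate click-log rows into keyword -> category frequency table.

    click_log: iterable of (search_keyword, clicked_category) rows.
    The keyword acts as the primary key; the Counter is its category distribution.
    """
    counts = defaultdict(Counter)
    for keyword, category in click_log:
        counts[keyword][category] += 1
    return counts


def to_distribution(category_counts):
    """Normalize one keyword's category counts so that they sum to 1.

    Assumption: a normalized frequency is one plausible form of prior probability.
    """
    total = sum(category_counts.values())
    return {category: n / total for category, n in category_counts.items()}
```

For instance, a keyword clicked twice in category 4.1 and once in category 4.3 would map to roughly two thirds and one third respectively.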
Each non-leaf node (in other words, category) in the category structure is traversed bottom-up, and the information gain and the chi-square test of each keyword are calculated. For example, the information gain dimensionality-reduction weight of each keyword is calculated. If this weight is below a certain threshold, all frequencies of the keyword in the subcategories are aggregated into the current category, and the keyword's frequencies in all subcategories are cleared. If the current node is the root node, the keyword is deleted.
For example, let K be a keyword and C be all categories under the current node. The information gain IG of K is then calculated as follows:
$$IG(K) = H(C) - H(C \mid K)$$
where H is the information entropy. With $P(x_i)$ defined as the probability that variable X takes the value $x_i$, H is given by:
$$H(X) = -\sum_i P(x_i) \log P(x_i)$$
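As a hedged illustration of the information gain calculation, the Python sketch below computes IG(K) = H(C) - H(C | K) from counts. It assumes that conditioning on K means splitting observations into those that contain the keyword and those that do not, and it uses base-2 logarithms; neither choice is stated in the text, and the function names are hypothetical.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(with_kw, without_kw):
    """IG(K) = H(C) - H(C | K) over the child categories of the current node.

    with_kw[i]: observations in category i that contain keyword K.
    without_kw[i]: observations in category i that do not contain K.
    Assumes at least one observation overall.
    """
    n_with, n_without = sum(with_kw), sum(without_kw)
    n = n_with + n_without
    h_c = entropy([a + b for a, b in zip(with_kw, without_kw)])
    h_c_given_k = 0.0
    if n_with:
        h_c_given_k += n_with / n * entropy(with_kw)
    if n_without:
        h_c_given_k += n_without / n * entropy(without_kw)
    return h_c - h_c_given_k
```

A keyword that appears only in one subcategory has high information gain and is kept at that level; a keyword spread evenly across subcategories has gain near zero and, under the rule above, would have its frequencies merged into the parent category.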
For keywords that meet the specified conditions, the chi-square ($\chi^2$) dimensionality-reduction weight is calculated; if this weight is below a certain threshold, all frequencies with which the keyword appears in the sub-categories are aggregated into the current category, and the keyword's frequencies in all sub-categories are cleared. If the current node is the root node, the keyword is deleted.
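Let K be a keyword, $\bar{K}$ be all other keywords, and $\{C_i\}$ be the given category system.
The following contingency table (Contingency Table) is counted, with one row each for K and $\bar{K}$ and one column per category; $N_{i,j}$ denotes the observed count in row i, column j.
The chi-square statistic is calculated as:
$$\chi^2 = \sum_{i}\sum_{j} \frac{(N_{i,j} - T_{i,j})^2}{T_{i,j}}$$
where $T_{i,j}$ is the theoretical value of row i, column j:
$$T_{i,j} = \frac{\left(\sum_{j'} N_{i,j'}\right)\left(\sum_{i'} N_{i',j}\right)}{\sum_{i',j'} N_{i',j'}}$$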
Conditions a keyword must meet to take part in the calculation:
and
$$\left|\{(i, j) \mid T_{i,j} = 0\}\right| = 0$$
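A minimal sketch of the chi-square weight, assuming the standard Pearson statistic over a contingency table whose rows are "keyword K" and "all other keywords" and whose columns are categories. Only the condition that no theoretical value is zero is checked here, since that is the only condition recoverable from the text; the function name and table layout are illustrative.

```python
from typing import List, Optional


def chi_square(table: List[List[float]]) -> Optional[float]:
    """Pearson chi-square of a contingency table, or None if not applicable."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        return None
    # Theoretical value of each cell: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    if any(t == 0 for row in expected for t in row):
        return None  # the keyword does not qualify for the calculation
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
```

For example, `chi_square([[10, 0], [0, 10]])` evaluates to 20.0 (every theoretical value is 5), while a table with an empty column returns `None`.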
In step S104, the web page to be trained is pre-processed to obtain the to-be-trained web page text. The purposes of pre-processing are: 1) to parse the html source code into structured data; 2) to filter out noise data that is not body text; 3) to perform word segmentation on the body text and filter out stop words as noise data.
In other words, step S104 may include one or more of the following processes: web page parsing; body text extraction; word segmentation/stop-word filtering.
In the web page parsing step, the input of the parsing flow is html source code. The html source code is parsed to obtain the positional and semantic information of each element. For example, given html source code, it becomes known which text appears in hyperlinks, which hyperlink text appears in the header, which hyperlink text appears in the footer, and so on. The output of the parsing flow is a Document Object Model tree (Document Object Model Tree).
In the body text extraction step, the input of the extraction flow is the Document Object Model tree. Based on the output of web page parsing, the header, footer, navigation, advertisements, and sidebar are filtered out to obtain the body text of the web page. The output of the extraction flow is a Document Object Model tree carrying label information: each element in the tree carries a label indicating whether the element belongs to the body text of the web page.
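The parsing and body text extraction steps can be approximated with Python's standard `html.parser`. This sketch is a simplification under stated assumptions: instead of building and labeling a full DOM tree, it treats text inside a fixed set of container tags (`NOISE_TAGS`, an assumed list) as non-body content and separates hyperlink text from other body text. Real pages would need richer heuristics for advertisements and sidebars.

```python
from html.parser import HTMLParser

# Assumed set of container tags treated as non-body regions.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class BodyTextExtractor(HTMLParser):
    """Collects body text and hyperlink text outside noise regions."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # > 0 while inside a noise container
        self.link_depth = 0    # > 0 while inside an <a> element
        self.body_text = []
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.noise_depth > 0:
            return  # skip whitespace and anything inside a noise region
        target = self.link_text if self.link_depth else self.body_text
        target.append(text)


def extract(html: str):
    """Return (body_text, link_text) lists for an html string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.body_text, parser.link_text
```

Keeping hyperlink text separate from the rest of the body matches the later feature extraction step, which treats title, body text, and hyperlink text as distinct parts of the page.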
In the word segmentation/stop-word filtering step, the input of the flow is the Document Object Model tree with body-text labels. A preset Chinese word segmentation algorithm is used; many such algorithms exist, for example CDWS, n-gram, hidden Markov models, CRF (Conditional Random Field), and so on. In one application example, CRF may be used as the Chinese word segmentation algorithm. Function words, which serve a grammatical role in the text, not only occur frequently in an article but are also almost meaningless for classification; they interfere with classification and make the text dimensionality too large, which reduces classification efficiency. Such words are filtered out using a stop-word list. The output of the flow is the Document Object Model tree after the text in each element has been segmented and stop-word filtered.
In step S106, features are extracted from the to-be-trained web page text according to the prior probability table to obtain the association feature vector representation F1 between the web page to be trained and the specified category. This step is in effect a feature extraction flow: using the category prior probability table produced by the log processing flow, the pre-processed web page (which has already undergone structuring, word segmentation, noise filtering, and so on) is analyzed to obtain a feature vector.
Step S106 may include the following steps: calculating the importance score of each classification keyword in the web page to be trained; discretizing the importance score and the prior probability of each classification keyword; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and calculating the two-dimensional feature matrix in each of the different parts of the to-be-trained web page text and arranging all matrix elements in order to obtain a fixed-length feature vector. The different parts of the to-be-trained web page text may include the page title, the body text, and the hyperlink text.
For example, the importance score of each keyword in the web page is calculated first: the keyword's TF/IDF information, the position at which it appears, and its font color information are considered together, and a regression model computes the importance score. The score lies in the range [0, 1].
The importance score and the prior probability of each keyword are then discretized: with corresponding thresholds set, the importance score is discretized into four possible values, "high, relatively high, relatively low, low"; with corresponding thresholds set, the prior probability score is likewise discretized into four possible values, "high, relatively high, relatively low, low".
Then, from the importance of the keywords in the web page and the prior probability table obtained by the log processing flow, a two-dimensional feature matrix can be calculated, as follows:
Importance \ Prior probability | High | Relatively high | Relatively low | Low |
High | n1,1 | n1,2 | n1,3 | n1,4 |
Relatively high | n2,1 | n2,2 | n2,3 | n2,4 |
Relatively low | n3,1 | n3,2 | n3,3 | n3,4 |
Low | n4,1 | n4,2 | n4,3 | n4,4 |
For example, n2,3 is the number of keywords whose importance is relatively high and whose prior probability is relatively low.
Finally, the above feature matrix is calculated separately for the page title, the body text, and the hyperlink text. This yields three 4 × 4 matrices with 3 × 4 × 4 = 48 elements in total. Arranging these 48 elements in order gives a fixed-length feature vector, which is the output of the feature extraction flow.
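The construction of the 48-element feature vector can be sketched as follows. The cut points in `THRESHOLDS` are hypothetical (the text only says that corresponding thresholds are set), the same cut points are used for importance and prior probability purely for brevity, and the section names are illustrative labels for title, body text, and hyperlink text.

```python
from typing import Dict, List, Tuple

LEVELS = 4  # high, relatively high, relatively low, low
# Hypothetical cut points on [0, 1]: >= 0.75 is "high", >= 0.5 is
# "relatively high", >= 0.25 is "relatively low", anything else is "low".
THRESHOLDS = (0.75, 0.5, 0.25)
SECTIONS = ("title", "body", "link_text")


def discretize(score: float) -> int:
    """Map a score in [0, 1] to a level index 0 (high) .. 3 (low)."""
    for level, cut in enumerate(THRESHOLDS):
        if score >= cut:
            return level
    return LEVELS - 1


def section_matrix(keywords: List[Tuple[float, float]]) -> List[List[int]]:
    """Count keywords per (importance level, prior probability level) cell."""
    matrix = [[0] * LEVELS for _ in range(LEVELS)]
    for importance, prior in keywords:
        matrix[discretize(importance)][discretize(prior)] += 1
    return matrix


def feature_vector(page: Dict[str, List[Tuple[float, float]]]) -> List[int]:
    """Concatenate the three 4 x 4 matrices into one 48-element vector."""
    vector: List[int] = []
    for section in SECTIONS:
        for row in section_matrix(page.get(section, [])):
            vector.extend(row)
    return vector
```

Because the vector only counts keywords per cell, its length stays at 48 regardless of how many keywords or categories exist, which is what makes the representation independent of any specific category.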
In step S108, model training is performed on the association feature vector representation F1 to obtain the classification result of the web page to be trained.
Step S108 may include the following steps: regression modeling and ROC analysis. Regression modeling may include training a regression model on labeled data according to a preset regression model algorithm. ROC analysis may include calculating the corresponding ROC curve for the trained regression model and setting different cut-off thresholds based on the ROC curve to obtain the classification result. In the application scenario described here (a very large number of categories), combining a regression model with ROC analysis makes it possible to solve the multi-category prediction problem and improve classification precision and recall.
For example, the labeled data uploaded by a specific website is combined with the feature vectors produced by the feature extraction module, and a model of commercial intent is trained according to a preset regression model algorithm. Many regression algorithms exist, for example SVM (support vector machine), Logistic Regression, and so on. In one example, an SVM regression algorithm is used for training.
After the regression model has been fitted, the corresponding ROC (Receiver Operating Characteristic) curve is calculated for the model on the test set. In one example, the resulting ROC curve is shown in Fig. 2.
The horizontal axis of the ROC curve is FPR (False Positive Rate) and the vertical axis is TPR (True Positive Rate). They are defined as follows:
$$FPR = \frac{FP}{FP + TN}, \qquad TPR = \frac{TP}{TP + FN}$$
where:
FP (False Positive): the number of data points in the test set whose true label is false but which the model judges to be true.
TP (True Positive): the number of data points in the test set whose true label is true and which the model judges to be true.
FN (False Negative): the number of data points in the test set whose true label is true but which the model judges to be false.
TN (True Negative): the number of data points in the test set whose true label is false and which the model judges to be false.
FPR reflects the precision of the model; the lower this indicator, the better. TPR reflects the recall of the model; the higher this indicator, the better. Different cut-off thresholds can be set for different application scenarios, converting the regression problem back into a classification problem and yielding web page classification results that meet the requirements.
For example, an advertising scenario calls for higher recall while tolerating somewhat lower precision; in such a scenario a threshold can be set that demands a higher TPR while tolerating a relatively high FPR. Conversely, an anti-cheating application calls for very high precision, so a threshold can be set that guarantees a very low FPR while tolerating a certain number of missed detections.
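One way to turn regression scores into a cut-off threshold is sketched below. The selection rule (highest TPR subject to an FPR budget) is an assumption chosen to mirror the advertising and anti-cheating examples; the text does not prescribe a specific rule, and the function names are illustrative.

```python
from typing import List, Optional, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for each distinct score used as a cut-off."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A data point is predicted positive when its score >= threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fpr = fp / negatives if negatives else 0.0
        tpr = tp / positives if positives else 0.0
        points.append((threshold, fpr, tpr))
    return points


def pick_threshold(points: List[Tuple[float, float, float]], max_fpr: float) -> Optional[float]:
    """Pick the threshold with the highest TPR whose FPR stays within max_fpr."""
    allowed = [p for p in points if p[1] <= max_fpr]
    if not allowed:
        return None
    return max(allowed, key=lambda p: (p[2], p[0]))[0]
```

A strict FPR budget suits the anti-cheating case, while a looser budget trades precision for recall as in the advertising case.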
According to an embodiment of the present application, a web page training system is provided for classifying web pages.
Referring to Fig. 3, Fig. 3 is a block diagram of the web page training system 300 of an embodiment of the present application. The web page training system 300 includes: a device 302 for obtaining the prior probability table of classification keywords from existing data associated with the classification keywords; a device 304 for pre-processing the web page to be trained to obtain the to-be-trained web page text; a device 306 for extracting features from the to-be-trained web page text according to the prior probability table to obtain the association feature vector representation F1 between the web page to be trained and the specified category; and a device 308 for performing model training on the association feature vector representation F1 to obtain the classification result of the web page to be trained.
The existing data associated with the classification keywords may include the website category system and user click logs. The device 302 for obtaining the prior probability table of classification keywords from the existing data associated with the classification keywords may include: a device for obtaining search keywords from the user click logs and filtering them to obtain classification keywords; a device for calculating the frequency with which each classification keyword occurs in the different categories; and a device for traversing each non-leaf category node in the website category system bottom-up and calculating the information gain and chi-square test of each classification keyword.
The device 304 for pre-processing the web page to be trained may include: a web page parsing device; a body text extraction device; and a word segmentation/stop-word filtering device.
The device 306 for extracting features from the to-be-trained web page text according to the prior probability table to obtain the association feature vector representation F1 between the web page to be trained and the specified category may include: a device for calculating the importance score of each classification keyword in the web page to be trained; a device for discretizing the importance score and the prior probability of each classification keyword; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for calculating the two-dimensional feature matrix in each of the different parts of the to-be-trained web page text and arranging all matrix elements in order to obtain a fixed-length feature vector. The different parts of the to-be-trained web page text may include the page title, the body text, and the hyperlink text.
The device 308 for performing model training on the association feature vector representation F1 to obtain the classification result of the web page to be trained may include a regression modeling device and an ROC analysis device. The regression modeling device may be used to train a regression model on labeled data according to a preset regression model algorithm; the ROC analysis device may be used to calculate the corresponding ROC curve for the trained regression model and to set different cut-off thresholds based on the ROC curve to obtain the classification result.
The training flow of the present application solves two problems mentioned in the background, namely the heterogeneity of category systems and the excessively large number of categories, through the following innovations:
(1) The category system of a specific website and its users' search click logs are mined to obtain the prior probability table of keywords.
(2) The keyword prior probability table is combined with web page text features to generate an association feature vector representation between the web page and the specified category.
(3) The classification problem is converted into a regression problem on the association between the web page and the specified category, and a regression model is trained on the resulting association feature vector representations.
(4) The ROC curve is calculated from the regression model results and the classification labels, and the threshold is chosen appropriately to convert the regression problem back into a classification problem.
Because the feature vector representation of the association between a web page and a category is independent of the specific category information, the present application can in this way handle highly heterogeneous category systems and, at the same time, handle very large category systems using relatively little training data.
According to an embodiment of the present application, a web page prediction method is provided, which uses the classification results obtained by training with the above web page training method 100 to perform classification prediction of user intent. The goal of the prediction flow is to provide, for any given specific website, a comprehensive commercial intent analysis result based on the user's browsing and search behavior across the entire web.
Referring to Fig. 4, Fig. 4 is a flowchart of the web page prediction method 400 of an embodiment of the present application.
In step S402, user access logs are collected.
For example, a global cookie is first deployed across the entire web. Cookies are a technique that allows a website server to store a small amount of data on the client's hard disk or in its memory, or to read data from the client's hard disk. A cookie is a very small text file that a Web server places on your hard disk when you browse a website; it can record information such as your user ID, password, pages browsed, and dwell time. Cookies are commonly used to identify users.
Then, when a user visits a web page on which the global cookie is deployed, the page returns the user's access log to the back-end server.
The back-end server is then responsible for collecting and storing the user access logs. Access log format: an access log is a record of the user's browsing of the internet and includes the following fields: CookieId, IP, UserAgent, access Url, Refer Url, access time, search keyword.
In step S404, web page html source code is obtained from the user access logs.
Step S404 may include the following steps: screening the urls in the user access logs; and crawling the screened urls to obtain the web page html source code. The step of screening urls in the user access logs may include the following steps: computing the de-duplicated url set from the user access logs; calculating a web page importance score for each url; and sorting all de-duplicated urls by web page importance score and taking the top several urls as the set of urls to be crawled. The url screening flow filters the url list extracted from the user access log table and selects the urls considered to be of high quality.
For example, the de-duplicated url set is first computed from the collected user access logs.
Then the following attributes are calculated for each url: PV (Page View, the number of times the page was accessed), UV (User View, the number of people who accessed the page), PageRank (an index of page importance calculated from the hyperlink relations between web pages), and PageType (one of 404 page, landing page, list page, or content page).
The web page importance score is then calculated:
$$S = (\alpha \times \mathrm{sigmoid}(PV) + \beta \times \mathrm{sigmoid}(UV)) \times PR \times PT$$
where:
α, β: linear adjustment coefficients satisfying the constraints α > 0, β > 0, α + β = 1.
PR: PageRank, with a value range of 0 to 10.
PT: PageType, a discrete value: 0 when PageType is a 404 page or a landing page, 0.5 when PageType is a list page, and 1 when PageType is a content page.
The importance score is calculated with the above formula, all urls are sorted by it, and the top N urls are taken as the set of urls to be crawled.
Finally, the corresponding html source code is crawled and stored according to the resulting url list.
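The url screening step can be sketched directly from the formula S = (α × sigmoid(PV) + β × sigmoid(UV)) × PR × PT. The record layout, the dictionary keys, and the default α = 0.5 are assumptions for illustration. The sigmoid is applied to the raw counts exactly as the formula is written, so any rescaling of PV and UV before the sigmoid would be an additional design choice not stated in the text.

```python
import math
from typing import Dict, Iterable, List

# PageType weights as given in the description.
PAGE_TYPE_WEIGHT = {"404": 0.0, "landing": 0.0, "list": 0.5, "content": 1.0}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def importance(pv: float, uv: float, pagerank: float, page_type: str,
               alpha: float = 0.5) -> float:
    """S = (alpha * sigmoid(PV) + beta * sigmoid(UV)) * PR * PT, beta = 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1 so that beta > 0")
    beta = 1.0 - alpha
    return (alpha * sigmoid(pv) + beta * sigmoid(uv)) * pagerank * PAGE_TYPE_WEIGHT[page_type]


def top_urls(log_rows: Iterable[Dict[str, str]],
             stats: Dict[str, Dict],
             n: int) -> List[str]:
    """De-duplicate the logged urls, score each one, and keep the top n."""
    urls = {row["url"] for row in log_rows}   # de-duplicated url set
    ranked = sorted(urls, key=lambda u: importance(**stats[u]), reverse=True)
    return ranked[:n]
```

Because PT is 0 for 404 pages and landing pages, such urls always receive a score of 0 and fall to the bottom of the ranking regardless of their traffic.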
In step S406, the web page html source code is pre-processed to obtain the web page text.
Similar to step S104, step S406 may include the following steps: web page parsing; body text extraction; word segmentation/stop-word filtering.
In the web page parsing step, the html source code is parsed to obtain the positional and semantic information of each element. For example, given html source code, it becomes known which text appears in hyperlinks, which hyperlink text appears in the header, which hyperlink text appears in the footer, and so on. The output of the parsing flow is a Document Object Model tree (Document Object Model Tree).
In the body text extraction step, based on the output of web page parsing, the header, footer, navigation, advertisements, and sidebar are filtered out to obtain the body text of the web page.
In the word segmentation/stop-word filtering step, a preset Chinese word segmentation algorithm is used; many such algorithms exist, for example CDWS, n-gram, hidden Markov models, CRF (Conditional Random Field), and so on. In one application example, CRF may be used as the Chinese word segmentation algorithm. Function words, which serve a grammatical role in the text, not only occur frequently in an article but are also almost meaningless for classification; they interfere with classification and make the text dimensionality too large, which reduces classification efficiency. Such words are filtered out using a stop-word list.
In step S408, features are extracted from the web page text according to the prior probability table to obtain the association feature vector representation F2 between the web page and the specified category. Preferably, the regression model and cut-off thresholds obtained from the training flow are combined with the feature vector representation F2 obtained in the above step to obtain the classification result of the web page. Note that a web page may be classified into several categories, or into none.
Similar to step S106, step S408 may include the following steps: calculating the importance score of each classification keyword in the web page; discretizing the importance score and the prior probability of each classification keyword; obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and calculating the two-dimensional feature matrix in each of the different parts of the web page text and arranging all matrix elements in order to obtain a fixed-length feature vector.
For example, the importance score of each keyword in the web page is calculated first: the keyword's TF/IDF information, the position at which it appears, and its font color information are considered together, and a regression model computes the importance score. The score lies in the range [0, 1].
The importance score and the prior probability of each keyword are then discretized: with corresponding thresholds set, the importance score is discretized into four possible values, "high, relatively high, relatively low, low"; with corresponding thresholds set, the prior probability score is likewise discretized into four possible values, "high, relatively high, relatively low, low".
Then, from the importance of the keywords in the web page and the prior probability table obtained by the log processing flow, a two-dimensional feature matrix can be calculated, as follows:
Importance \ Prior probability | High | Relatively high | Relatively low | Low |
High | n1,1 | n1,2 | n1,3 | n1,4 |
Relatively high | n2,1 | n2,2 | n2,3 | n2,4 |
Relatively low | n3,1 | n3,2 | n3,3 | n3,4 |
Low | n4,1 | n4,2 | n4,3 | n4,4 |
For example, n2,3 is the number of keywords whose importance is relatively high and whose prior probability is relatively low.
Finally, the above feature matrix is calculated separately for the page title, the body text, and the hyperlink text. This yields three 4 × 4 matrices with 3 × 4 × 4 = 48 elements in total. Arranging these 48 elements in order gives a fixed-length feature vector, which is the output of the feature extraction flow.
In step S410, the user intent classification prediction result is obtained from the classification results and the association feature vector representation F2.
Step S410 may include the following steps: obtaining page-level analysis results from the classification results and the association feature vector representation F2; joining (join) the page-level analysis results with the user access logs to obtain the intent indicated by each of the user's page visits; and aggregating along the user dimension to obtain the user intent classification prediction result.
For example, machine learning classification is performed first: the feature vectors obtained by the feature extraction module are combined with the model file output by the training flow, and the commercial intent preference of each web page is classified according to a preset classifier algorithm.
Aggregation along the user dimension follows.
The user access logs are first converted into a Table in a relational database containing the fields CookieId, access Url, and access time. The output of the machine learning classifier module is likewise converted into a Table in the relational database containing the fields Url and commercial intent analysis result. That is, the user-dimension aggregation flow has two inputs: (1) the web page analysis result table, which has two fields: web page url and web page commercial intent classification result; (2) the user access log table, which has three fields: user id, access url, and access time.
The two Tables are then joined (Join) with Url as the Key, giving a joined table with the fields CookieId, access Url, access time, and commercial intent analysis result.
Finally, an aggregation operation is performed with CookieId as the Key, bringing together the commercial intent analysis results of the same CookieId. The aggregation algorithm takes a time decay factor into account; the following formula describes the aggregated weight of the i-th commercial category for a given CookieId:
where:
$W_i$: the aggregated weight of the i-th commercial intent category for the given CookieId;
$W_{i,j}$: the weight of the i-th commercial intent category of the given CookieId on the j-th page the user visited;
$D_j$: the number of days between the time the given CookieId visited the j-th page and the current time.
That is, the output of the user-dimension aggregation flow is the user commercial intent analysis result table, which has two fields: user id and user commercial intent classification result.
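The join and user-dimension aggregation can be sketched with plain dictionaries. The exact time decay formula is not recoverable from the text, so this sketch assumes an exponential decay `decay ** D_j` purely for illustration; the function name, the integer day numbers, and the default `decay=0.9` are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_intent(
    access_log: Iterable[Tuple[str, str, int]],   # (cookie_id, url, access_day)
    page_results: Dict[str, Dict[str, float]],    # url -> {category: weight}
    today: int,
    decay: float = 0.9,
) -> Dict[str, Dict[str, float]]:
    """Join the access log with page results on url, then aggregate per user."""
    user_weights = defaultdict(lambda: defaultdict(float))
    for cookie_id, url, access_day in access_log:
        categories = page_results.get(url)        # inner join on url
        if categories is None:
            continue                              # page was not analyzed
        days_ago = today - access_day             # D_j
        factor = decay ** days_ago                # assumed form of the time decay
        for category, weight in categories.items():       # W_{i,j}
            user_weights[cookie_id][category] += weight * factor
    return {user: dict(weights) for user, weights in user_weights.items()}
```

Whatever decay function is actually used, the structure stays the same: each page visit contributes its page-level category weights to the user's totals, scaled down the longer ago the visit took place.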
In a preferred embodiment, the training flow uses known training data and the processing flow described above to obtain a fitted regression model file, together with cut-off thresholds derived from ROC curve analysis. In the "machine learning classifier" module of the prediction flow, the regression model file and cut-off thresholds obtained by the training flow are applied to obtain page-level prediction results. The prediction flow then aggregates the page-level prediction results by user to obtain user-level commercial intent analysis results.
According to an embodiment of the present application, a web page prediction system is provided, which uses the classification results obtained by training with the above web page training system to perform classification prediction of user intent.
Referring to Fig. 5, Fig. 5 is a block diagram of the web page prediction system 500 of an embodiment of the present application. The web page prediction system 500 includes: a device 502 for collecting user access logs; a device 504 for obtaining web page html source code from the user access logs; a device 506 for pre-processing the web page html source code to obtain the web page text; a device 508 for extracting features from the web page text according to the prior probability table to obtain the association feature vector representation F2 between the web page and the specified category; and a device 510 for obtaining the user intent classification prediction result from the classification results and the association feature vector representation F2.
The device 504 for obtaining web page html source code from the user access logs may include: a device for screening the urls in the user access logs; and a device for crawling the screened urls to obtain the web page html source code. The device for screening the urls in the user access logs may include: a device for computing the de-duplicated url set from the user access logs; a device for calculating a web page importance score for each url; and a device for sorting all de-duplicated urls by web page importance score and taking the top several urls as the set of urls to be crawled.
The device 506 for pre-processing the web page html source code may include: a web page parsing device; a body text extraction device; and a word segmentation/stop-word filtering device.
The device 508 for extracting features from the web page text according to the prior probability table to obtain the association feature vector representation F2 between the web page and the specified category may include: a device for calculating the importance score of each classification keyword in the web page; a device for discretizing the importance score and the prior probability of each classification keyword; a device for obtaining a two-dimensional feature matrix from the discretized importance and the discretized prior probability of the classification keywords; and a device for calculating the two-dimensional feature matrix in each of the different parts of the web page text and arranging all matrix elements in order to obtain a fixed-length feature vector.
The device 510 for obtaining the user intent classification prediction result from the classification results and the association feature vector representation F2 may include: a device for obtaining page-level analysis results from the classification results and the association feature vector representation F2; and a device for joining the page-level analysis results with the user access logs to obtain the intent indicated by each of the user's page visits, and aggregating along the user dimension to obtain the user intent classification prediction result.
The prediction flow of the present application solves the data sparsity problem mentioned in the background through the following innovations:
(1) The user's browsing and search behavior across the entire web is collected, rather than only the behavior on a single website, which largely solves the data sparsity problem. In one application example, 5 billion items of user browsing log information can be collected per day.
(2) The web pages with commercial value that users have browsed are extracted from the user browsing logs, and their content is analyzed in depth.
(3) An embedded interface is provided so that the customized requirements of multiple specific websites can be embedded in the prediction flow.
(4) From the same user data, a customized commercial intent analysis result is adaptively produced for any given specific website.
Fig. 6 shows the architecture of an exemplary system of the present application for effectively identifying users' commercial intent. Viewed by system layer, the system is divided into three layers: a data collection layer, a data pre-processing layer, and a data analysis layer. Viewed by data flow, the system can be divided into two flows: the training flow and the prediction flow.
In summary, the present application solves the three difficulties listed in the background. Specifically:
(1) The application collects and analyzes all of a user's browsing and search records across the entire web and builds a complete user profile from the analysis results, which effectively solves the data sparsity problem.
(2) The application provides a unified data interface to any potential specific website. Through the unified interface and processing flow it adapts well to highly heterogeneous category systems and can quickly build an accurate classifier for different category systems even when the training data is comparatively small.
(3) The application is suitable for very large category systems: it does not require the training data to grow linearly with the number of categories, handles the multi-category and no-category cases well, and ensures the accuracy of classification prediction.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The foregoing is merely embodiments of the present application and is not intended to limit it. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of its claims.
Claims (14)
- A kind of 1. webpage training method, for classifying to webpage, it is characterised in that including:The classification prior probability table of sort key word is obtained according to the data with existing associated with sort key word, it is described with classifying The data with existing that keyword is associated includes website bibliography system;Trained webpage is treated to be pre-processed to obtain web page text to be trained;Feature is extracted in the web page text to be trained according to classification prior probability table and draws webpage to be trained and specified classification Between incidence relation feature vector represent F1;Represent that the incidence relation feature vector F1 carries out model training to draw the classification results of the webpage to be trained.
- 2. according to the method described in claim 1, it is characterized in that, the data with existing associated with sort key word includes User's click logs.
- 3. according to the method described in claim 2, it is characterized in that, obtained according to the data with existing associated with sort key word The step of classification prior probability table of sort key word, includes:Search key is obtained from user's click logs, and sort key word is obtained by filtration;Calculate the frequency that each sort key word occurs in different classifications;Each non-leaf nodes classification in the website bibliography system is traveled through using bottom-up method, calculates each The information gain of sort key word and Chi-square Test.
- 4. The method according to claim 1, characterized in that preprocessing the webpage to be trained includes one or more of the following: webpage parsing; text extraction; word segmentation / stop-word filtering.
- 5. The method according to claim 1, characterized in that the step of extracting features from the text of the webpage to be trained according to the classification prior probability table to obtain the association feature vector representation F1 between the webpage to be trained and the specified category includes:
  calculating the importance score of each classification keyword in the webpage to be trained;
  discretizing the importance score of the classification keyword and the classification prior probability of the classification keyword;
  obtaining a two-dimensional feature matrix from the discretized importance and the discretized classification prior probability of the classification keywords;
  computing the two-dimensional feature matrix separately on different parts of the text of the webpage to be trained, and arranging all matrix elements in order to obtain a fixed-length feature vector.
- 6. The method according to claim 5, characterized in that the different parts of the text of the webpage to be trained include the webpage title, the webpage body text, and the webpage hyperlink text.
- 7. The method according to claim 1, characterized in that the step of performing model training on the association feature vector representation F1 to obtain the classification result for the webpage to be trained includes:
  training a regression model on labeled data according to a preset regression model algorithm; and
  computing the corresponding ROC curve for the trained regression model, and setting different cut-off thresholds based on the ROC curve to obtain the classification result.
- 8. A webpage training system for classifying webpages, characterized by comprising:
  a device for obtaining a classification prior probability table of classification keywords from existing data associated with the classification keywords, the existing data associated with the classification keywords including a website directory system;
  a device for preprocessing a webpage to be trained to obtain the text of the webpage to be trained;
  a device for extracting features from the text of the webpage to be trained according to the classification prior probability table to obtain an association feature vector representation F1 between the webpage to be trained and a specified category;
  a device for performing model training on the association feature vector representation F1 to obtain a classification result for the webpage to be trained.
- 9. A webpage prediction method for performing category prediction of user intent using the classification result obtained by training with the webpage training method according to any one of claims 1-7, characterized by comprising:
  collecting user access logs;
  obtaining webpage HTML source code from the user access logs;
  preprocessing the webpage HTML source code to obtain webpage text;
  extracting features from the webpage text according to the classification prior probability table to obtain an association feature vector representation F2 between the webpage and a specified category;
  obtaining a user intent category prediction result according to the classification result and the association feature vector representation F2.
- 10. The method according to claim 9, characterized in that the step of obtaining webpage HTML source code from the user access logs includes:
  screening the URLs in the user access logs;
  crawling the screened URLs to obtain the webpage HTML source code.
- 11. The method according to claim 10, characterized in that the step of screening the URLs in the user access logs includes:
  computing the deduplicated set of URLs for the user access logs;
  calculating a webpage importance score for each URL;
  sorting all deduplicated URLs by webpage importance score and taking the top several URLs as the set of URLs to be crawled.
- 12. The method according to claim 9, characterized in that the step of extracting features from the webpage text according to the classification prior probability table to obtain the association feature vector representation F2 between the webpage and the specified category includes:
  calculating the importance score of each classification keyword in the webpage;
  discretizing the importance score of the classification keyword and the classification prior probability of the classification keyword;
  obtaining a two-dimensional feature matrix from the discretized importance and the discretized classification prior probability of the classification keywords;
  computing the two-dimensional feature matrix separately on different parts of the webpage text, and arranging all matrix elements in order to obtain a fixed-length feature vector.
- 13. The method according to claim 9, characterized in that the step of obtaining the user intent category prediction result according to the classification result and the association feature vector representation F2 includes:
  obtaining a webpage-level analysis result according to the classification result and the association feature vector representation F2;
  joining the webpage-level analysis result with the user access logs to obtain the intent indicated by each of the user's webpage access behaviors, and aggregating along the user dimension to obtain the user intent category prediction result.
- 14. A webpage prediction system for performing category prediction of user intent using the classification result obtained by training with the webpage training system according to claim 8, characterized by comprising:
  a device for collecting user access logs;
  a device for obtaining webpage HTML source code from the user access logs;
  a device for preprocessing the webpage HTML source code to obtain webpage text;
  a device for extracting features from the webpage text according to the classification prior probability table to obtain an association feature vector representation F2 between the webpage and a specified category;
  a device for obtaining a user intent category prediction result according to the classification result and the association feature vector representation F2.
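The feature construction recited in claims 5 and 12 is stated only in outline. The following is a minimal Python sketch of one way it could be realized, not the patented implementation. Several details are assumptions made purely for illustration: the importance score is taken to be a keyword's relative frequency within a page part, the bin edges in `IMPORTANCE_EDGES` and `PRIOR_EDGES` are arbitrary, each matrix cell counts the distinct keywords that fall into it, and all function names and the `page` dictionary layout are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges; the claims do not say how discretization is done.
IMPORTANCE_EDGES = [0.05, 0.2]  # yields 3 importance bins
PRIOR_EDGES = [0.3, 0.7]        # yields 3 prior-probability bins
# Page parts named in claim 6: title, body text, hyperlink (anchor) text.
PARTS = ("title", "body", "anchor")


def discretize(value, edges):
    """Map a real value to a bin index in 0..len(edges)."""
    return bisect_right(edges, value)


def part_matrix(tokens, prior_table):
    """Build the 2-D matrix for one page part.

    Cell (i, j) counts the distinct classification keywords whose
    discretized importance is i and whose discretized prior is j.
    """
    rows, cols = len(IMPORTANCE_EDGES) + 1, len(PRIOR_EDGES) + 1
    matrix = [[0] * cols for _ in range(rows)]
    counts = Counter(t for t in tokens if t in prior_table)
    total = len(tokens)
    for word, n in counts.items():
        importance = n / total  # assumed score: relative frequency
        i = discretize(importance, IMPORTANCE_EDGES)
        j = discretize(prior_table[word], PRIOR_EDGES)
        matrix[i][j] += 1
    return matrix


def feature_vector(page, prior_table):
    """Concatenate the flattened per-part matrices into one fixed-length vector."""
    vector = []
    for part in PARTS:
        for row in part_matrix(page.get(part, []), prior_table):
            vector.extend(row)
    return vector


# Usage with made-up data: prior probabilities of three keywords for one category.
prior = {"loan": 0.9, "bank": 0.5, "news": 0.1}
page = {
    "title": ["bank", "loan"],
    "body": ["loan", "rates", "news", "loan"],
    "anchor": [],
}
vec = feature_vector(page, prior)
print(len(vec))  # 3 parts x 3 x 3 cells = 27 entries
```

Because the number of parts and bins is fixed, the vector length does not depend on page length or vocabulary size, which is what allows the same regression model to be applied to every page.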
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310003765.6A CN103914478B (en) | 2013-01-06 | 2013-01-06 | Webpage training method and system, webpage Forecasting Methodology and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103914478A (en) | 2014-07-09 |
CN103914478B (en) | 2018-05-08 |
Family
ID=51040170
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201310003765.6A | Expired - Fee Related | CN103914478B (en) | 2013-01-06 | 2013-01-06 | Webpage training method and system, webpage Forecasting Methodology and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103914478B (en) |
Families Citing this family (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217016B (en) * | 2014-09-22 | 2018-02-02 | 北京国双科技有限公司 | Webpage search keyword statistical method and device |
CN104199969B (en) * | 2014-09-22 | 2017-10-03 | 北京国双科技有限公司 | Web data analysis method and device |
CN105262812A (en) * | 2015-10-16 | 2016-01-20 | 浪潮(北京)电子信息产业有限公司 | Log data processing method based on cloud computing platform, log data processing device and log data processing system |
CN106951422B (en) * | 2016-01-07 | 2021-05-28 | 腾讯科技(深圳)有限公司 | Webpage training method and device, and search intention identification method and device |
CN110442417A (en) * | 2016-01-08 | 2019-11-12 | 第四范式(北京)技术有限公司 | Feature Extraction Method, machine learning method and its device |
CN107180035A (en) * | 2016-03-09 | 2017-09-19 | 阿里巴巴集团控股有限公司 | A kind of training pattern information output method and device |
CN107291723B (en) * | 2016-03-30 | 2021-04-30 | 阿里巴巴集团控股有限公司 | Method and device for classifying webpage texts and method and device for identifying webpage texts |
CN107292412A (en) * | 2016-03-31 | 2017-10-24 | 阿里巴巴集团控股有限公司 | A kind of problem Forecasting Methodology and forecasting system |
CN107463573B (en) * | 2016-06-02 | 2020-10-13 | 阿里巴巴(中国)有限公司 | Content information providing method, device, browser, electronic device and server |
CN106600356B (en) * | 2016-10-27 | 2019-12-24 | 杭州王道科技有限公司 | Multi-platform e-commerce information aggregation method and system |
CN107006389B (en) * | 2016-11-04 | 2021-06-22 | 深圳市沃特沃德股份有限公司 | Terminal and pet action signal identification method and device |
CN108874823A (en) * | 2017-05-12 | 2018-11-23 | 阿里巴巴集团控股有限公司 | The implementation method and device of intelligent customer service |
CN108038124B (en) * | 2017-11-06 | 2020-08-28 | 广东广业开元科技有限公司 | PDF document acquisition and processing method, system and device based on big data |
CN108763313A (en) * | 2018-04-26 | 2018-11-06 | 平安科技(深圳)有限公司 | On-line training method, server and the storage medium of model |
CN110955772B (en) * | 2018-09-26 | 2023-06-06 | 阿里巴巴集团控股有限公司 | Text structured model component deployment method, device, equipment and storage medium |
CN109508394A (en) * | 2018-10-18 | 2019-03-22 | 青岛聚看云科技有限公司 | A kind of training method and device of multi-medium file search order models |
RU2743898C1 (en) | 2018-11-16 | 2021-03-01 | Общество С Ограниченной Ответственностью "Яндекс" | Method for performing tasks |
CN109597890B (en) * | 2018-11-23 | 2022-09-09 | 福建榕基软件股份有限公司 | Data perception automatic distribution method and storage device |
CN111324831A (en) * | 2018-12-17 | 2020-06-23 | 中国移动通信集团北京有限公司 | Method and device for detecting fraudulent website |
RU2744032C2 (en) * | 2019-04-15 | 2021-03-02 | Общество С Ограниченной Ответственностью "Яндекс" | Method and system for determining result of task execution in crowdsourced environment |
RU2744038C2 (en) | 2019-05-27 | 2021-03-02 | Общество С Ограниченной Ответственностью «Яндекс» | Method and a system for determining the result of a task in the crowdsourcing environment |
RU2019128272A (en) | 2019-09-09 | 2021-03-09 | Общество С Ограниченной Ответственностью «Яндекс» | Method and System for Determining User Performance in a Computer Crowdsourced Environment |
RU2019135532A (en) | 2019-11-05 | 2021-05-05 | Общество С Ограниченной Ответственностью «Яндекс» | Method and system for selecting a label from a plurality of labels for a task in a crowdsourced environment |
CN111242318A (en) * | 2020-01-13 | 2020-06-05 | 拉扎斯网络科技(上海)有限公司 | Business model training method and device based on heterogeneous feature library |
RU2020107002A (en) | 2020-02-14 | 2021-08-16 | Общество С Ограниченной Ответственностью «Яндекс» | METHOD AND SYSTEM FOR RECEIVING A LABEL FOR A DIGITAL PROBLEM PERFORMED IN A CROWDSORING ENVIRONMENT |
CN111611508B (en) * | 2020-05-28 | 2020-12-15 | 江苏易安联网络技术有限公司 | Identification method and device for actual website access of user |
TWI819247B (en) * | 2020-10-26 | 2023-10-21 | 國立臺灣大學 | Prediction method based on unstructured data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101814083A (en) * | 2010-01-08 | 2010-08-25 | 上海复歌信息科技有限公司 | Automatic webpage classification method and system |
CN102207961A (en) * | 2011-05-25 | 2011-10-05 | 盛乐信息技术(上海)有限公司 | Automatic web page classification method and device |
CN102402566A (en) * | 2011-08-09 | 2012-04-04 | 江苏欣网视讯科技有限公司 | Web user behavior analysis method based on Chinese webpage automatic classification technology |
CN102521248A (en) * | 2011-11-14 | 2012-06-27 | 北京亿赞普网络技术有限公司 | Network user classification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN103914478A (en) | 2014-07-09 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20180508 |