CN103020067B - A kind of method and apparatus determining type of webpage - Google Patents


Info

Publication number
CN103020067B
Authority
CN
China
Prior art keywords
identified
type
webpage
preset
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110282850.1A
Other languages
Chinese (zh)
Other versions
CN103020067A (en)
Inventor
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110282850.1A
Publication of CN103020067A
Application granted
Publication of CN103020067B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for determining the type of a webpage. The method comprises: S1, acquiring all queries for which the webpage to be identified was clicked in a search log; S2, determining the n-grams of the queries acquired in step S1 to form a feature vector of the webpage to be identified, where n is one or more preset positive integers; S3, determining the type of the webpage to be identified based on the correlation between its feature vector and the feature vector of each preset type. The invention improves the efficiency and speed of webpage type determination, resists cheating, and applies to a wider range of webpages.

Description

Method and device for determining webpage type
[ technical field ]
The invention relates to the technical field of computers, in particular to a method and a device for determining webpage types.
[ background of the invention ]
With the rapid development of network technology and the continuous enrichment of network information, users have become accustomed to obtaining information of interest from the network through search engines. In search engine technology, demand analysis, search result ranking, and personalized search may all involve determining the type of a webpage. For example, in demand analysis, the search demand of a query can be determined by analyzing the types of the webpages clicked for that query in a search log; in search result ranking, the position of a webpage in the search results is determined according to the consistency between the webpage type and the search demand of the query; in personalized search, the search habits or search interests of a user are determined by analyzing the types of the webpages the user clicked and browsed in the search log, so that personalized search results matching those habits or interests can be provided.
The existing way of determining the webpage type mainly extracts a text feature vector from the webpage text and uses a classifier to classify each webpage. To extract the text feature vector, the webpage content must be downloaded and its text analyzed, and core words and their weights are extracted to form the text feature vector. This approach has the following defects:
Defect one: the webpage content must be downloaded and analyzed, so efficiency and speed are low for massive data.
Defect two: to improve their ranking in a search engine, many websites artificially add a large number of category keywords to their webpages, and this cheating greatly reduces the accuracy of webpage type determination.
Defect three: the network contains a large number of webpages in different forms, and this variety makes the analysis of webpage content difficult.
[ summary of the invention ]
In view of the above, the present invention provides a method and an apparatus for determining a type of a web page, so as to solve the above-mentioned drawbacks in the prior art.
The specific technical scheme is as follows:
A method of determining a type of a web page, the method comprising:
S1, acquiring all queries for which the webpage to be identified was clicked in the search log;
S2, determining the n-grams of the queries acquired in step S1 to form a feature vector of the webpage to be identified, wherein n is one or more preset positive integers;
S3, determining the type of the webpage to be identified based on the correlation between the feature vector of the webpage to be identified and the feature vector of each preset type.
According to a preferred embodiment of the present invention, the step S1 further includes: acquiring the title of the webpage to be identified;
the step S2 further includes: and determining each n-gram of the title of the webpage to be identified, and forming the feature vector of the webpage to be identified by using each n-gram of the title of the webpage to be identified and each n-gram of the query obtained in the step S1.
According to a preferred embodiment of the present invention, the feature vectors of the preset types are formed in advance based on n-grams of the corpus of each preset type.
According to a preferred embodiment of the present invention, the method for obtaining the corpus of the preset type includes:
a1, acquiring the seed query of the preset type;
a2, acquiring clicked web pages corresponding to the seed query in the search log, and reserving web pages with the clicked times larger than a set clicked time threshold;
A3, determining all queries for which the web pages retained in step A2 were clicked in the search log, and recording the number of clicks on the web pages for each query, to obtain the training corpus of the preset type; or determining all queries and web page titles corresponding to the web pages retained in step A2 when clicked in the search log, and recording the number of clicks on the web pages for each query and the number of occurrences of each web page title, to obtain the training corpus of the preset type.
According to a preferred embodiment of the present invention, the step S3 specifically includes:
calculating the overlapping rate between the feature vector of the webpage to be identified and the feature vector of each preset type, and determining the type of the webpage to be identified according to the calculated overlapping rate; or,
calculating the similarity between the feature vector of the webpage to be identified and the feature vector of each preset type, and determining the type of the webpage to be identified according to the calculated similarity; or,
training a classifier by taking the feature vector of each preset type as a feature, taking the feature vector of the webpage to be recognized as the input of the classifier, and determining the type of the webpage to be recognized according to the classification result of the classifier.
According to a preferred embodiment of the present invention, calculating the overlapping ratio between the feature vector of the to-be-identified web page and the feature vector of the preset type includes:
for each n-gram that appears in both the feature vector of the webpage to be identified and the feature vector of the preset type, multiplying its number of occurrences in the feature vector of the webpage to be identified by its weight in the feature vector of the preset type, summing these products, and dividing the sum by the total number of occurrences of all n-grams in the feature vector of the webpage to be identified;
wherein the weight of the n-gram in the feature vector of the preset type is as follows: the ratio of the number of occurrences of the n-gram to the total number of occurrences of all n-grams in the corpus of the preset type.
According to a preferred embodiment of the present invention, calculating the similarity between the feature vector of the to-be-identified web page and each of the feature vectors of the preset types includes:
calculating cosine similarity between the feature vector of the webpage to be identified and feature vectors of all preset types;
wherein the weight of each n-gram in the feature vector of the preset type is the product of the term frequency (tf) and the inverse document frequency (idf) of that n-gram, i.e. tf*idf; the weight of each n-gram in the feature vector of the webpage to be identified is likewise the tf*idf of that n-gram.
According to a preferred embodiment of the present invention, when training a classifier by using the feature vectors of each preset type as features, the weight of an n-gram in the feature vector of the preset type is: the ratio of the number of occurrences of the n-gram to the total number of occurrences of all n-grams, or the tf*idf of the n-gram.
According to a preferred embodiment of the present invention, the classifier is: a maximum entropy classifier or a Support Vector Machine (SVM) classifier.
According to a preferred embodiment of the present invention, the determining the type of the web page to be identified according to the calculated overlapping rate includes: determining a preset type with an overlap rate larger than a set overlap rate threshold value as the type of the webpage to be identified; or determining N1 preset types with the top overlapping rates as the types of the web pages to be identified, wherein N1 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between a preset overlapping rate value and the type grade;
the determining the type of the webpage to be identified according to the calculated similarity comprises the following steps: determining a preset type with the similarity larger than a set similarity threshold as the type of the webpage to be identified; or determining N2 preset types with the top similarity as the types of the web pages to be identified, wherein N2 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between the preset similarity value and the type grade.
An apparatus for determining a type of a web page, the apparatus comprising:
the query acquisition unit is used for acquiring all queries corresponding to the clicked webpages to be identified in the search logs;
the first vector determining unit is used for determining the n-grams of the queries acquired by the query acquisition unit to form a feature vector of the webpage to be identified, wherein n is one or more preset positive integers;
and the type determining unit is used for determining the type of the webpage to be identified based on the correlation between the feature vector of the webpage to be identified and the feature vectors of the preset types.
According to a preferred embodiment of the present invention, the apparatus further comprises: the title acquisition unit is used for acquiring the title of the webpage to be identified;
the first vector determining unit is further configured to determine each n-gram of the title of the webpage to be identified, and form the feature vector of the webpage to be identified by using each n-gram of the title of the webpage to be identified and each n-gram of the query acquired by the query acquiring unit together.
According to a preferred embodiment of the present invention, the apparatus further comprises: and the second vector determining unit is used for forming the feature vectors of the preset types in advance based on the n-grams of the training corpora of the preset types.
According to a preferred embodiment of the present invention, the apparatus further comprises: the corpus acquiring unit is used for acquiring the seed query of the preset type; acquiring clicked web pages corresponding to the seed query in the search log, and reserving web pages with the clicked times larger than a set clicked time threshold; determining all the queries corresponding to the clicked retained web pages, and recording the clicked times of the web pages corresponding to the queries to obtain the training corpus of the preset type, or determining all the queries and the web page titles corresponding to the clicked retained web pages, and recording the clicked times of the web pages and the occurrence times of the web page titles corresponding to the queries to obtain the training corpus of the preset type.
According to a preferred embodiment of the present invention, the type determining unit calculates an overlap ratio between the feature vector of the web page to be identified and each of the feature vectors of the preset types, and determines the type of the web page to be identified according to the calculated overlap ratio; or,
calculating the similarity between the feature vector of the webpage to be identified and the feature vector of each preset type, and determining the type of the webpage to be identified according to the calculated similarity; or,
training a classifier by taking the feature vector of each preset type as a feature, taking the feature vector of the webpage to be recognized as the input of the classifier, and determining the type of the webpage to be recognized according to the classification result of the classifier.
According to a preferred embodiment of the present invention, when calculating the overlap rate between the feature vector of the webpage to be identified and the feature vector of each preset type, the type determining unit specifically, for each n-gram that appears in both feature vectors, multiplies its number of occurrences in the feature vector of the webpage to be identified by its weight in the feature vector of the preset type, sums these products, and divides the sum by the total number of occurrences of all n-grams in the feature vector of the webpage to be identified;
wherein the weight of the n-gram in the feature vector of the preset type is as follows: the ratio of the number of occurrences of the n-gram to the total number of occurrences of all n-grams in the corpus of the preset type.
According to a preferred embodiment of the present invention, when the type determining unit calculates the similarity between the feature vector of the to-be-identified web page and each of the feature vectors of the preset types, specifically calculates the cosine similarity between the feature vector of the to-be-identified web page and each of the feature vectors of the preset types;
wherein the weight of each n-gram in the feature vector of the preset type is the product of the term frequency (tf) and the inverse document frequency (idf) of that n-gram, i.e. tf*idf; the weight of each n-gram in the feature vector of the webpage to be identified is likewise the tf*idf of that n-gram.
According to a preferred embodiment of the present invention, when the type determining unit trains the classifier using the feature vectors of the preset types as features, the weight of an n-gram in the feature vector of the preset type is: the ratio of the number of occurrences of the n-gram to the total number of occurrences of all n-grams, or the tf*idf of the n-gram.
According to a preferred embodiment of the present invention, the classifier is: a maximum entropy classifier or a Support Vector Machine (SVM) classifier.
According to a preferred embodiment of the present invention, when determining the type of the web page to be identified according to the calculated overlap ratio, the type determining unit determines a preset type, in which the overlap ratio is greater than a set overlap ratio threshold, as the type of the web page to be identified; or determining N1 preset types with the top overlapping rates as the types of the web pages to be identified, wherein N1 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between a preset overlapping rate value and the type grade;
the type determining unit determines a preset type with the similarity larger than a set similarity threshold as the type of the webpage to be identified when determining the type of the webpage to be identified according to the calculated similarity; or determining N2 preset types with the top similarity as the types of the web pages to be identified, wherein N2 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between the preset similarity value and the type grade.
According to the technical scheme, the method and the device provided by the invention have the following advantages:
1) The feature vector comes from the search log, so webpage content does not need to be downloaded or analyzed. This improves efficiency and speed, and the benefit is most pronounced when a massive number of webpages must be identified.
2) Because the characteristic vector is from the search log instead of the webpage content, the identification of the webpage type cannot be influenced by cheating means of artificially adding a large number of category keywords in the webpage, and the identification accuracy is improved.
3) The webpage category determining mode of the invention is irrelevant to the webpage content and the webpage form, so the invention has wider application range.
[ description of the drawings ]
FIG. 1 is a flow chart of a main method according to a first embodiment of the present invention;
fig. 2 is a flowchart of a method for obtaining corpus of a preset type according to a second embodiment of the present invention;
fig. 3 is a block diagram of an apparatus for determining a type of a web page according to a sixth embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Embodiment One
After the search behavior of the user is analyzed, it is found that after the user submits the query to search, the webpage clicked in the search result can usually reflect the requirement of the user, and conversely, the query corresponding to the clicked webpage can also reflect the type of the webpage. Based on this, the method provided by the invention is shown in fig. 1, and mainly comprises the following steps:
Step 101: Acquire all queries for which the webpage to be identified was clicked in the search log.
In this embodiment, all queries that led to clicks on the webpage to be identified are collected from the search log. These queries reflect the type of the webpage to be identified, so its feature vector is determined from them.
In addition, which webpage a user clicks after searching is largely influenced by the webpage title, so the title usually carries important information about the webpage. The title of the webpage to be identified can therefore also be acquired and used in forming the feature vector of the webpage to be identified.
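Step 101 can be sketched as a simple filter over click records. This is only an illustration: the log format (one `(query, clicked_url)` pair per click), the sample queries, and the `example.com` urls are assumptions and do not come from the patent.

```python
from collections import Counter

# Hypothetical log format: one (query, clicked_url) pair per click event.
SEARCH_LOG = [
    ("simple home-style recipes", "http://example.com/recipes"),
    ("home-style dish recipes", "http://example.com/recipes"),
    ("simple home-style recipes", "http://example.com/recipes"),
    ("free novel download", "http://example.com/novels"),
]

def queries_for_page(search_log, url):
    """Map each query to the number of clicks it sent to `url`."""
    return Counter(query for query, clicked in search_log if clicked == url)

clicked_queries = queries_for_page(SEARCH_LOG, "http://example.com/recipes")
print(clicked_queries)
```

Keeping the click count per query, rather than a bare set of queries, preserves the occurrence information that the later steps record in the feature vector.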
Step 102: Determine the n-grams of the queries acquired in step 101 to form the feature vector of the webpage to be identified.
Here the concept of the n-gram is briefly introduced: an n-gram is a sequence of n consecutive minimum-granularity words, where n is one or more preset positive integers. For example, take a query that, after word segmentation and stop-word removal, consists of the four words "simple", "home-style dish", "recipes", and "collection". Assuming n takes the values 1, 2, 3, and 4, the n-grams are as follows:
1-gram: simple; home-style dish; recipes; collection
2-gram: simple home-style dish; home-style dish recipes; recipes collection
3-gram: simple home-style dish recipes; home-style dish recipes collection
4-gram: simple home-style dish recipes collection
If the title of the webpage to be identified was acquired in step 101, the n-grams of the title can be determined in the same way, and they form the feature vector of the webpage to be identified together with the n-grams of the queries.
In addition, in the feature vector of the webpage to be identified, the occurrence times of each n-gram in the query and title acquired in step 101 are recorded simultaneously.
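A minimal sketch of step 102 follows, assuming the queries (and optionally the title) have already been segmented into words with stop words removed. The function names and the sample tokens are illustrative; word segmentation itself is outside the sketch.

```python
from collections import Counter

def ngrams(tokens, sizes=(1, 2, 3, 4)):
    """Yield every run of n consecutive tokens, for each n in `sizes`."""
    for n in sizes:
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])

def page_feature_vector(token_lists, sizes=(1, 2, 3, 4)):
    """Count n-gram occurrences over all segmented queries and titles of one page."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, sizes))
    return counts

# One segmented query; a title would simply be one more token list.
query_tokens = ["simple", "home-style dish", "recipes", "collection"]
vector = page_feature_vector([query_tokens])
for gram, count in vector.items():
    print(gram, count)
```

A query of four words yields 4 + 3 + 2 + 1 = 10 n-grams for n = 1 to 4. When n exceeds the number of words, that size simply contributes nothing.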
Step 103: Determine the type of the webpage to be identified based on the correlation between the feature vector of the webpage to be identified and the feature vector of each preset type.
In this step, the feature vectors of each preset type are formed in advance based on the n-grams of the corpus of each preset type, and the preset types include but are not limited to: software, pictures, videos, maps, games, novels, music, etc.
The corpus of each preset type comprises the set of queries for which webpages of that type were clicked in the search log, and may further comprise the titles of those webpages; the number of times the webpages were clicked in the search log is also recorded. The process of forming the corpus of each preset type is described in detail in Embodiment Two.
The n-grams of the corpus of each preset type are then determined, and the weight of each n-gram is determined based on its number of occurrences in the corpus, forming the feature vector of that preset type.
In this step, the correlation between the feature vector of the web page to be identified and the feature vector of each preset type may be determined in three ways:
First, the overlap rate between the feature vector of the webpage to be identified and the feature vector of each preset type is calculated, and the overlap rate characterizes the correlation between the two; see Embodiment Three for details.
Second, the similarity between the feature vector of the webpage to be identified and the feature vector of each preset type is calculated, and the similarity characterizes the correlation between the two; see Embodiment Four for details.
Third, a classifier is trained by taking the feature vector of each preset type as a feature, and the classifier determines the correlation between the feature vector of the webpage to be identified and the feature vector of each preset type; see Embodiment Five for details.
Webpages are usually identified by urls, and urls are used in the following embodiments of the present invention. The process of obtaining the corpus of each preset type is first described through Embodiment Two.
Embodiment Two
Fig. 2 is a flowchart of a method for acquiring a predetermined type of corpus according to a second embodiment of the present invention, and as shown in fig. 2, the method for acquiring a certain type of corpus includes the following steps:
Step 201: Acquire the seed queries of this type.
A seed query should fully reflect the demand of the type. Because only a small number of seed queries is needed, usually a few dozen, they can be configured manually.
Taking the recipe type as an example, the configured seed queries may be: "home-style dish recipes", "recipes", "common recipes", "Sichuan recipes", and the like. For ease of understanding, two seed queries, "home-style dish recipes" and "complete collection of home-style dish recipes", are used as the example here.
Step 202: Acquire the urls clicked for the seed queries in the search log, and retain the urls whose number of clicks is larger than a set click threshold.
For example, Table 1 shows the urls, among those clicked for the seed queries "home-style dish recipes" and "complete collection of home-style dish recipes", whose number of clicks exceeds the click threshold:
TABLE 1
url | Number of clicks
http://www.meishij.net/chufang/diy/jiangchangcaipu/ | 127
http://www.ukdyw.cn/jiachang/ | 19
http://www.fancai.com/ | 17
http://www.scccjm.com/Get/jiaoninyishou/ | 12
Step 203: Acquire all the queries and titles corresponding to the urls retained in step 202 when those urls were clicked in the search log, and record the total number of clicks on the urls for each query and the number of occurrences of each title, to form the corpus of this type.
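Steps 201 to 203 can be sketched as two passes over the click log. The sketch below is hedged in several ways: the `(query, clicked_url)` log format, the urls `u1` to `u3`, and the sample queries are assumptions; the click threshold is applied to clicks coming from seed queries, which is one reading of step 202; and titles are left out for brevity.

```python
from collections import Counter

def build_corpus(search_log, seed_queries, min_clicks):
    """search_log: list of (query, clicked_url) click events (assumed format)."""
    seeds = set(seed_queries)
    # Step 202: count the clicks seed queries sent to each url, keep frequent urls.
    seed_clicks = Counter(url for query, url in search_log if query in seeds)
    kept_urls = {url for url, clicks in seed_clicks.items() if clicks > min_clicks}
    # Step 203: every query that led to a kept url, with its total click count.
    corpus = Counter(query for query, url in search_log if url in kept_urls)
    return kept_urls, corpus

log = (
    [("home-style dish recipes", "u1")] * 3
    + [("home-style dish recipes", "u2")]
    + [("Sichuan recipes", "u1")] * 2
    + [("weather today", "u3")]
)
kept, corpus = build_corpus(log, ["home-style dish recipes"], min_clicks=2)
print(kept)
print(corpus)
```

The second pass is what lets the corpus grow beyond the seed queries: "Sichuan recipes" enters the corpus only because it led to a url that the seed queries had already marked as a recipe page.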
Continuing with the above example, the obtained corpus includes all the queries for which the retained urls were clicked, together with the titles of those urls. The corpus may also include only the queries; here the case that includes both queries and titles is taken as the example, as shown in Table 2.
TABLE 2
Three ways of determining the type of the webpage to be identified are described in detail below through a third embodiment, a fourth embodiment and a fifth embodiment, respectively.
Embodiment Three
In the embodiment, the type of the webpage to be identified is determined by calculating the overlapping rate between the feature vector of the webpage to be identified and each preset type of feature vector.
In this case, the manner of obtaining the feature vector of each preset type from the corpus of each preset type is as follows: determining each n-gram of the corpus of the preset type, counting the occurrence times of each n-gram, and determining the weight of each n-gram based on the occurrence times of each n-gram, thereby obtaining the feature vector of each preset type. Wherein the weight of the n-gram may be a ratio of the number of occurrences of the n-gram to the total number of occurrences of all n-grams.
In determining each n-gram of the corpus, in order to prevent ambiguity problems caused by excessively small granularity, n-grams with larger granularity or even the whole query, for example, 3-grams, 4-grams, etc., can be used.
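The occurrence-ratio weighting described for this embodiment can be sketched as follows. The n-gram counts are invented for illustration and do not come from the patent's tables.

```python
from collections import Counter

def type_feature_vector(corpus_ngram_counts):
    """Weight each n-gram by its share of all n-gram occurrences in the corpus."""
    total = sum(corpus_ngram_counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in corpus_ngram_counts.items()}

# Hypothetical counts for a recipe-type corpus.
recipe_counts = Counter({
    "home-style dish recipes": 30,
    "simple recipes": 15,
    "Sichuan recipes": 5,
})
recipe_vector = type_feature_vector(recipe_counts)
print(recipe_vector)
```

Because each weight is a share of the total, the weights of one preset type always sum to 1, which keeps overlap rates comparable across types whose corpora differ in size.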
Assuming that 4-grams and the entire query are used, the resulting feature vector of the recipe type can be as shown in Table 3.
TABLE 3
Assume that the url of the webpage to be identified is "http://key.yaolan.com/long/97534/", and that its feature vector, obtained according to steps 101 and 102 of Embodiment One, is as shown in Table 4.
TABLE 4
n-gram | Number of occurrences
Simple menu | 68
Home vegetable making home vegetable menu | 18
Large full cradle mother-infant knowledge base of simple homely menu | 2
Simple homely menu | 1
Home vegetable making method of home vegetable | 1
Large full cradle for family menu | 1
Menu big full cradle mother and baby | 1
Mother-infant knowledge base of large full cradle | 1
The overlap rate between the feature vector of the webpage to be identified and the feature vector of each preset type is then calculated. For each overlapping n-gram, its number of occurrences in the feature vector of the webpage to be identified is multiplied by its weight in the feature vector of the preset type; the products are summed, and the sum is divided by the total number of occurrences of all n-grams in the feature vector of the webpage to be identified.
Taking the example shown in Table 3 and Table 4, the overlap rate between the feature vector of the webpage to be identified and the feature vector of the recipe type is: (68 × 0.0341 + 18 × 0.0012 + 1 × 0.0012) / (68 + 18 + 2 + 1 + 1 + 1 + 1 + 1) = 0.0252
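The overlap-rate arithmetic above can be reproduced with a short sketch. The occurrence counts and the three weights are the numbers used in the worked example; the keys `g1` to `g8` are placeholders, because the text does not say which n-grams of Table 4 overlap with Table 3.

```python
def overlap_rate(page_counts, type_weights):
    """Sum of count * weight over shared n-grams, divided by all page n-gram counts."""
    total = sum(page_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(
        count * type_weights[gram]
        for gram, count in page_counts.items()
        if gram in type_weights
    )
    return overlap / total

# Counts as in Table 4; g1..g8 are placeholder n-gram names.
page_counts = {"g1": 68, "g2": 18, "g3": 2, "g4": 1, "g5": 1, "g6": 1, "g7": 1, "g8": 1}
# Weights of the three overlapping n-grams, taken from the worked example.
type_weights = {"g1": 0.0341, "g2": 0.0012, "g4": 0.0012}

rate = overlap_rate(page_counts, type_weights)
print(round(rate, 4))  # 0.0252
```

The numerator is 2.3416 and the denominator is 93, which gives about 0.02518 before rounding.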
The type of the webpage to be identified is then determined according to the overlap rate between its feature vector and the feature vector of each preset type, in ways including but not limited to the following:
1) Determine each preset type whose overlap rate is larger than a set overlap rate threshold as a type of the webpage to be identified.
2) Determine the N1 preset types with the highest overlap rates as the types of the webpage to be identified, where N1 is a preset positive integer.
3) Determine the grade of the webpage to be identified on each type according to a preset correspondence between overlap rate values and type grades. For example, the type grades may be divided by overlap rate value into high confidence, medium confidence, and low confidence, and the overlap rate then determines which grade the webpage to be identified has on each type.
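The three decision options listed above can be sketched as three small functions over a mapping from preset type to score. All thresholds, grade boundaries, and scores other than the recipe overlap rate of 0.0252 are assumed values for illustration.

```python
def types_above_threshold(scores, threshold):
    """Option 1: every preset type whose score exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]

def top_n_types(scores, n):
    """Option 2: the n preset types with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def confidence_grade(score, high=0.02, medium=0.005):
    """Option 3: map a score to a grade using preset value ranges (assumed bounds)."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"

# Only the recipe score comes from the worked example; the others are made up.
scores = {"recipe": 0.0252, "novel": 0.0031, "software": 0.0007}
print(types_above_threshold(scores, 0.01))
print(top_n_types(scores, 2))
print({name: confidence_grade(score) for name, score in scores.items()})
```

The same functions apply unchanged when the score is a cosine similarity instead of an overlap rate; only the thresholds would differ.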
Embodiment Four
In the embodiment, the type of the webpage to be identified is determined by calculating the similarity between the feature vector of the webpage to be identified and the feature vector of the preset type.
In this case, the feature vector of each preset type is obtained from the corpus of that type as follows: each n-gram of the corpus of the preset type is determined, and the weight of each n-gram is determined based on its tf*idf (the weight may also be determined in other ways, such as from the number of occurrences of each n-gram, which are not listed exhaustively here), giving the feature vector of each preset type.
In the feature vector of the webpage to be identified, the tf*idf of each n-gram, calculated from its number of occurrences, serves as its weight. The similarity between the feature vector of the webpage to be identified and the feature vector of each preset type is calculated as the cosine similarity, which reflects the semantic similarity of the two feature vectors, and the type of the webpage to be identified is determined according to the calculated similarity, in ways including but not limited to the following:
1) Determine each preset type whose similarity is larger than a set similarity threshold as a type of the webpage to be identified.
2) Determine the N2 preset types with the highest similarities as the types of the webpage to be identified, where N2 is a preset positive integer.
3) Determine the grade of the webpage to be identified on each type according to a preset correspondence between similarity values and type grades. For example, the type grades may be divided by similarity value into high confidence, medium confidence, and low confidence, and the similarity then determines which grade the webpage to be identified has on each type.
Assuming that the preset type with the highest similarity is determined as the type of the webpage to be identified, Table 5 shows several examples of urls and their identified types.
TABLE 5
URL of webpage to be identified | Cosine similarity | Category
http://www.27txt.com/txt-xx/13/txt-47950.htm | 0.532006 | Novel
http://netatm.cn/html/kehuantxt/soft679.htm | 0.551953 | Novel
http://www.xiaoshuo8.cc/88/88183/ | 0.371882 | Novel
http://softbbs.pconline.com.cn/9076400.html | 0.343795 | Software
http://iask.sina.com.cn/b/5988818.html | 0.228622 | Software
http://download.pchome.net/php/dl.php?sid=5001 | 0.209444 | Software
http://game.ce.cn/wy/jy/200809/16/t20080916_16817429.shtml | 0.369045 | Game machine
http://www.9u.com/game/longOL/2009/1230/24385.html | 0.138906 | Game machine
http://3dmgame.chnren.com/bbs/showtopic-820330.html | 0.091144 | Game machine
Embodiment five
In this embodiment, a classifier is trained in advance using the feature vectors of the preset types as features, and the classifier is then used to determine the type of the webpage to be identified.
In this case, the feature vector of each preset type is obtained from that type's training corpus as follows: determine each n-gram of the training corpus of the preset type, and determine the weight of each n-gram based on its occurrence count or its tf*idf. When the weight is based on the occurrence count, the ratio of the n-gram's occurrence count to the total occurrence count of all n-grams can be used. When the weight is based on tf*idf, the tf*idf value of the n-gram can be used directly as its weight. Other ways of determining the weight may also be used and are not enumerated here.
Then, the feature vector of the webpage to be identified, determined in steps 101 and 102, is used as the input of the classifier, and the type of the webpage to be identified is determined by a classifier such as a maximum entropy classifier or a support vector machine (SVM) classifier. Since maximum entropy and SVM classifiers are mature existing technologies, they are not described in detail here.
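Classifier training and prediction as described above could look roughly like the following. It is a minimal sketch, not the patent's implementation: a small softmax-regression (maximum entropy) model written in plain Python stands in for a production maximum entropy or SVM classifier, and the function names, hyperparameters, and toy vectors are all assumptions made for illustration.

```python
import math

def predict_proba(weights, x):
    """Softmax over per-type linear scores for a sparse feature vector x."""
    scores = {c: sum(w.get(g, 0.0) * v for g, v in x.items())
              for c, w in weights.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(samples, labels, epochs=200, lr=0.5):
    """Train a multinomial logistic regression by stochastic gradient ascent.

    samples: list of sparse feature vectors (dict: n-gram -> weight).
    labels:  list of type names, parallel to samples.
    """
    classes = sorted(set(labels))
    weights = {c: {} for c in classes}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = predict_proba(weights, x)
            for c in classes:
                # Log-likelihood gradient: (indicator - probability) * feature value.
                err = (1.0 if c == y else 0.0) - probs[c]
                for g, v in x.items():
                    weights[c][g] = weights[c].get(g, 0.0) + lr * err * v
    return weights

def classify(weights, x):
    """Return the most probable type for feature vector x."""
    probs = predict_proba(weights, x)
    return max(probs, key=probs.get)

# One toy feature vector per preset type (weights are occurrence ratios).
samples = [
    {"novel": 0.5, "txt download": 0.5},
    {"recipe": 0.6, "home cooking": 0.4},
    {"game": 0.7, "online game": 0.3},
]
labels = ["novel", "recipe", "game"]
model = train_maxent(samples, labels)

# Feature vector of a webpage to be identified; "free novel" is unseen in training.
page = {"novel": 0.4, "free novel": 0.4, "txt download": 0.2}
predicted = classify(model, page)
```

N-grams that never occurred in training simply contribute nothing to the score, so the decision rests on the n-grams the page shares with the preset types.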
Table 6 shows the results obtained after the classifier classifies the URLs to be identified.
TABLE 6
URL to be identified | Class determined by classifier
http://www.ttmeishi.com/CaiPu/869e1e21321a1856.htm | Menu category
http://caipuwu.com.cn/zt/jiachangcai/09295B62009/ | Menu category
http://www.socaipu.com/6/13721.html | Menu category
http://www.97book.cc/ | Novel class
http://www.wenxuewu.com/files/article/fulltext/3/3537.html | Novel class
http://www.hongxiu.com/x/80888/ | Novel class
http://play.zol.com.cn/detail/97783_1.html | Game class
http://www.7k7k.com/tag/11 | Game class
http://games.qq.com/zt/2009/zwjs/ | Game class
The above is a detailed description of the method provided by the present invention, and the following is a detailed description of the apparatus provided by the present invention with reference to the sixth embodiment.
Embodiment six
Fig. 3 is a block diagram of an apparatus for determining the type of a webpage according to the sixth embodiment of the present invention. As shown in Fig. 3, the apparatus may include a query obtaining unit 301, a first vector determining unit 302, and a type determining unit 303.
The query obtaining unit 301 obtains, from the search log, all queries for which the webpage to be identified was clicked.
The first vector determining unit 302 determines each n-gram of the queries obtained by the query obtaining unit 301 to constitute the feature vector of the webpage to be identified, where n is one or more preset positive integers.
The type determining unit 303 determines the type of the webpage to be identified based on the correlation between the feature vector of the webpage to be identified and the feature vector of each preset type.
Because a user deciding which webpage to click after a search is influenced to a great extent by the webpage titles, and a title also carries important information about its webpage, the title of the webpage to be identified can additionally be obtained and used in forming the feature vector of the webpage to be identified. In this case, the apparatus further includes a title obtaining unit 304, configured to obtain the title of the webpage to be identified.
Correspondingly, the first vector determining unit 302 is further configured to determine each n-gram of the title of the webpage to be identified, which, together with each n-gram of the queries obtained by the query obtaining unit 301, forms the feature vector of the webpage to be identified.
The first vector determining unit 302 records, in the feature vector of the webpage to be identified, the occurrence count of each n-gram in the queries obtained by the query obtaining unit 301 and in the title obtained by the title obtaining unit 304, for subsequent use by the type determining unit 303.
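The behavior of the first vector determining unit can be sketched as below. This is an illustration under assumptions rather than the patent's implementation: queries and titles are split on whitespace (real Chinese text would need word segmentation), n is taken as the set {1, 2}, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_feature_vector(queries, title=None, ns=(1, 2)):
    """Count n-gram occurrences over the clicked queries and, optionally, the title.

    queries: list of query strings, one entry per click.
    title:   optional title of the webpage to be identified.
    ns:      the preset values of n.
    """
    counts = Counter()
    texts = list(queries) + ([title] if title else [])
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        for n in ns:
            counts.update(ngrams(tokens, n))
    return counts

vec = build_feature_vector(
    ["free novel download", "novel download"],
    title="novel download site",
    ns=(1, 2),
)
```

The resulting `Counter` maps each n-gram to its occurrence count, which is the form the type determining unit needs for the overlap-rate and similarity calculations.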
Furthermore, the apparatus may include a second vector determining unit 305, configured to form the feature vector of each preset type in advance based on the n-grams of that type's training corpus.
The preset types include, but are not limited to: software, pictures, videos, maps, games, novels, music, etc.
The training corpus is obtained by a corpus obtaining unit 306 in the apparatus. The corpus obtaining unit 306 is configured to: obtain seed queries of a preset type; obtain, from the search log, the webpages clicked for the seed queries, and retain those webpages whose click count is greater than a set click-count threshold; and then either (a) determine all queries for which the retained webpages were clicked and record the number of webpage clicks corresponding to each query, to obtain the training corpus of the preset type, or (b) determine all such queries together with the titles of the retained webpages, and record the number of webpage clicks corresponding to each query and the occurrence count of each webpage title, to obtain the training corpus of the preset type.
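The corpus-building procedure (variant without titles) can be sketched as follows. The search-log layout, one `(query, url)` pair per click, and the function name are assumptions for illustration. The sketch also assumes that the click-count threshold is applied to clicks that came from seed queries, which the text does not state explicitly.

```python
from collections import Counter

def build_training_corpus(click_log, seed_queries, min_clicks):
    """Return (retained_urls, corpus) for one preset type.

    click_log:    iterable of (query, url) pairs, one pair per click.
    seed_queries: seed queries of the preset type.
    min_clicks:   click-count threshold; pages must exceed it to be retained.
    """
    click_log = list(click_log)
    seeds = set(seed_queries)
    # 1) Count clicks on each webpage that came from a seed query.
    seed_clicks = Counter(url for query, url in click_log if query in seeds)
    # 2) Retain only webpages clicked more often than the threshold.
    retained = {url for url, c in seed_clicks.items() if c > min_clicks}
    # 3) Collect every query that led to a click on a retained webpage,
    #    together with the number of such clicks.
    corpus = Counter(query for query, url in click_log if url in retained)
    return retained, corpus

log = [
    ("novel", "a.example"), ("novel", "a.example"), ("novel", "b.example"),
    ("free novel txt", "a.example"),
    ("txt download", "a.example"), ("txt download", "a.example"),
    ("weather", "c.example"),
]
retained, corpus = build_training_corpus(log, ["novel"], min_clicks=1)
```

Starting from a single seed query, the corpus grows to include other queries ("free novel txt", "txt download") that users issued before clicking the same retained page.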
The type determining unit 303 may adopt the following three ways when determining the type of the web page to be identified:
In the first mode, the overlap rate between the feature vector of the webpage to be identified and the feature vector of each preset type is calculated, and the type of the webpage to be identified is determined according to the calculated overlap rate.
When calculating the overlap rate between the feature vector of the webpage to be identified and the feature vector of a preset type, the type determining unit 303 specifically takes the n-grams that overlap between the two feature vectors, multiplies the occurrence count of each overlapping n-gram in the feature vector of the webpage to be identified by that n-gram's weight in the feature vector of the preset type, sums these products, and divides the sum by the total occurrence count of all n-grams in the feature vector of the webpage to be identified.
The weight of an n-gram in the feature vector of a preset type is the ratio of the n-gram's occurrence count to the total occurrence count of all n-grams in the training corpus of that preset type. This weight may be determined by the second vector determining unit 305.
In this mode, to avoid the ambiguity caused by too fine a granularity, the second vector determining unit 305 may use n-grams of larger granularity, for example 3-grams or 4-grams, or even the entire query.
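The overlap-rate calculation can be sketched as below. The sketch reads the description as a sum of count-times-weight products over the overlapping n-grams, divided by the page's total n-gram count; that reading, the function names, and the toy counts are assumptions.

```python
def type_weights_from_corpus(corpus_counts):
    """Weight of each n-gram = its count / total count of all n-grams in the corpus."""
    total = sum(corpus_counts.values())
    return {g: c / total for g, c in corpus_counts.items()}

def overlap_rate(page_counts, type_weights):
    """Weighted share of the page's n-gram occurrences that overlap with the type."""
    total = sum(page_counts.values())
    if total == 0:
        return 0.0
    weighted = sum(
        count * type_weights[g]
        for g, count in page_counts.items()
        if g in type_weights  # only overlapping n-grams contribute
    )
    return weighted / total

# Toy 2-gram counts from the "novel" training corpus (illustrative values).
novel_weights = type_weights_from_corpus(
    {"novel download": 6, "free novel": 2, "txt novel": 2}
)
# 2-gram counts of the webpage to be identified.
page_counts = {"novel download": 3, "cheap flights": 1}
rate = overlap_rate(page_counts, novel_weights)
```

In this toy case only "novel download" overlaps, so the rate is 3 × 0.6 divided by the 4 total occurrences on the page, roughly 0.45.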
In the second mode, the similarity between the feature vector of the webpage to be identified and the feature vector of each preset type is calculated, and the type of the webpage to be identified is determined according to the calculated similarity.
When calculating the similarity between the feature vector of the webpage to be identified and the feature vector of each preset type, the type determining unit 303 specifically calculates the cosine similarity between the two feature vectors; this similarity represents the semantic similarity between them.
The weight of each n-gram in the feature vector of a preset type is the tf*idf of that n-gram, and the weight of each n-gram in the feature vector of the webpage to be identified is likewise the tf*idf of that n-gram. Each n-gram in the feature vectors of the preset types can be determined by the second vector determining unit 305, and the weight of each n-gram in the feature vector of the webpage to be identified can be determined by the first vector determining unit 302.
In the third mode, a classifier is trained in advance using the feature vectors of the preset types as features, the feature vector of the webpage to be identified is used as the input of the classifier, and the type of the webpage to be identified is determined according to the classification result of the classifier.
When the type determining unit 303 trains the classifier using the feature vectors of the preset types as features, the weight of an n-gram in the feature vector of a preset type is either the ratio of the n-gram's occurrence count to the total occurrence count of all n-grams, or the tf*idf of the n-gram. Each n-gram in the feature vectors of the preset types may be determined by the second vector determining unit 305.
The classifier may be a maximum entropy classifier, a support vector machine (SVM) classifier, or the like.
Corresponding to the first mode, when determining the type of the webpage to be identified according to the calculated overlap rate, the type determining unit 303 determines each preset type whose overlap rate is greater than a set overlap-rate threshold as a type of the webpage to be identified; or determines the N1 preset types with the highest overlap rates as the types of the webpage to be identified, where N1 is a preset positive integer; or determines the grade of the webpage to be identified on each type according to a preset correspondence between overlap-rate values and type grades.
Corresponding to the second mode, when determining the type of the webpage to be identified according to the calculated similarity, the type determining unit 303 determines each preset type whose similarity is greater than a set similarity threshold as a type of the webpage to be identified; or determines the N2 preset types with the highest similarity as the types of the webpage to be identified, where N2 is a preset positive integer; or determines the grade of the webpage to be identified on each type according to a preset correspondence between similarity values and type grades.
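The three decision rules apply equally to overlap rates and similarities, and can be sketched as follows. The threshold, the value of N, and the grade cut-offs are illustrative assumptions; the text leaves all of them as preset parameters.

```python
def types_above_threshold(scores, threshold):
    """Rule 1: every preset type whose score exceeds the threshold."""
    return [t for t, s in scores.items() if s > threshold]

def top_n_types(scores, n):
    """Rule 2: the n preset types with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def confidence_grades(scores, high=0.5, medium=0.2):
    """Rule 3: map each score to a grade using assumed cut-offs."""
    grades = {}
    for t, s in scores.items():
        if s >= high:
            grades[t] = "high"
        elif s >= medium:
            grades[t] = "medium"
        else:
            grades[t] = "low"
    return grades

# Toy scores of one webpage against three preset types.
scores = {"novel": 0.53, "software": 0.21, "game": 0.09}
above = types_above_threshold(scores, 0.3)
top2 = top_n_types(scores, 2)
grades = confidence_grades(scores)
```

Rules 1 and 2 can assign several types to one webpage, while rule 3 keeps a graded judgment for every preset type.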
Once the type of a webpage has been determined by the above method and apparatus, the result can be used for, but is not limited to, the following applications:
1) Search result ranking. When search results are ranked, the ranking weight of webpages whose type matches the search need of the user's query can be increased, so that webpages meeting the user's search need are placed as near the top of the search results as possible.
2) Demand analysis. The search need behind a query is analyzed according to the types of the URLs clicked for that query in the search log, so that search results that better meet the user's search need can be returned in search ranking or vertical search.
3) Personalized search. The user's search habits or interests are determined by analyzing the types of the webpages the user clicked and browsed, as recorded in the user's search log, so that personalized search results matching those habits or interests can be provided to the user.
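As one hedged illustration of application 1), the sketch below raises the ranking score of results whose webpage type matches the type the query is looking for. The multiplicative boost, its value, and the result layout are assumptions; the text only says that the ranking weight of matching webpages can be increased.

```python
def rerank(results, wanted_type, boost=1.5):
    """Re-sort results, boosting pages whose type matches the wanted type.

    results: list of (url, base_score, page_type) tuples.
    """
    def boosted(item):
        url, score, page_type = item
        return score * boost if page_type == wanted_type else score
    return sorted(results, key=boosted, reverse=True)

results = [
    ("a.example", 0.9, "software"),
    ("b.example", 0.7, "novel"),
    ("c.example", 0.5, "novel"),
]
ranked = rerank(results, wanted_type="novel")
order = [url for url, _, _ in ranked]
```

With these toy scores the first "novel" page overtakes the "software" page, while the weaker "novel" page stays below it.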
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (18)

1. A method for determining a type of a web page, the method comprising:
S1, acquiring, from a search log, all queries for which the webpage to be identified was clicked;
S2, determining each n-gram of the queries acquired in step S1 to form a feature vector of the webpage to be identified, wherein n is one or more preset positive integers;
S3, determining the type of the webpage to be identified based on the correlation between the feature vector of the webpage to be identified and the feature vector of each preset type;
the feature vectors of the preset types are formed in advance based on n-grams of training corpora of each preset type; the preset type of corpus obtaining mode comprises:
acquiring the seed query of the preset type; acquiring clicked web pages corresponding to the seed query in the search log, and reserving web pages with the clicked times larger than a set clicked time threshold; and determining all the queries corresponding to the reserved webpages when being clicked, and recording the clicked times of the webpages corresponding to the queries to obtain the training corpus of the preset type.
2. The method according to claim 1, wherein the step S1 further comprises: acquiring the title of the webpage to be identified;
the step S2 further includes: and determining each n-gram of the title of the webpage to be identified, and forming the feature vector of the webpage to be identified by using each n-gram of the title of the webpage to be identified and each n-gram of the query obtained in the step S1.
3. The method according to claim 1, wherein the method for obtaining the corpus of the preset type further comprises:
determining the webpage titles of the reserved webpages, recording the occurrence times of the webpage titles, and forming the corpus of the preset type by the recorded webpage clicked times corresponding to the queries and the occurrence times of the webpage titles.
4. The method according to any one of claims 1 to 3, wherein the step S3 specifically comprises:
calculating the overlapping rate between the feature vector of the webpage to be identified and the feature vector of each preset type, and determining the type of the webpage to be identified according to the calculated overlapping rate; or,
calculating the similarity between the feature vector of the webpage to be identified and the feature vector of each preset type, and determining the type of the webpage to be identified according to the calculated similarity; or,
training a classifier by taking the feature vector of each preset type as a feature, taking the feature vector of the webpage to be recognized as the input of the classifier, and determining the type of the webpage to be recognized according to the classification result of the classifier.
5. The method according to claim 4, wherein calculating the overlapping rate between the feature vector of the webpage to be identified and the feature vector of the preset type comprises:
calculating the value obtained by multiplying the occurrence times of the overlapped n-grams between the feature vector of the webpage to be identified and the feature vector of the preset type in the feature vector of the webpage to be identified by the weight sum of the overlapped n-grams in the feature vector of the preset type and then dividing the value by the sum of the occurrence times of all the n-grams in the feature vector of the webpage to be identified;
wherein the weight of the n-gram in the feature vector of the preset type is as follows: the ratio of the number of occurrences of the n-gram to the total number of occurrences of all n-grams in the corpus of the preset type.
6. The method according to claim 4, wherein calculating the similarity between the feature vector of the webpage to be identified and each preset type of feature vector comprises:
calculating cosine similarity between the feature vector of the webpage to be identified and feature vectors of all preset types;
wherein the weight of each n-gram in the feature vector of the preset type is the tf*idf of the n-gram, that is, the product of its term frequency (tf) and its inverse document frequency (idf); and the weight of each n-gram in the feature vector of the webpage to be identified is the tf*idf of the n-gram.
7. The method according to claim 4, wherein, when the classifier is trained using the feature vectors of the preset types as features, the weight of an n-gram in the feature vector of a preset type is: the ratio of the occurrence count of the n-gram to the total occurrence count of all n-grams, or the tf*idf of the n-gram.
8. The method of claim 4, wherein the classifier is: a maximum entropy classifier or a Support Vector Machine (SVM) classifier.
9. The method of claim 4, wherein determining the type of the web page to be identified according to the calculated overlap ratio comprises: determining a preset type with an overlap rate larger than a set overlap rate threshold value as the type of the webpage to be identified; or determining N1 preset types with the top overlapping rates as the types of the web pages to be identified, wherein N1 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between a preset overlapping rate value and the type grade;
the determining the type of the webpage to be identified according to the calculated similarity comprises the following steps: determining a preset type with the similarity larger than a set similarity threshold as the type of the webpage to be identified; or determining N2 preset types with the top similarity as the types of the web pages to be identified, wherein N2 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between the preset similarity value and the type grade.
10. An apparatus for determining a type of a web page, the apparatus comprising:
the query acquisition unit is used for acquiring all queries corresponding to the clicked webpages to be identified in the search logs;
the first vector determining unit is used for determining that each n-gram of the query acquired by the query acquiring unit forms a feature vector of the webpage to be identified, wherein n is a preset positive integer or a plurality of positive integers;
the type determining unit is used for determining the type of the webpage to be identified based on the correlation between the characteristic vector of the webpage to be identified and the characteristic vectors of all preset types;
the second vector determining unit is used for forming the feature vectors of the preset types in advance based on the n-grams of the training corpora of the preset types;
the corpus acquiring unit is used for acquiring the seed query of the preset type; acquiring clicked web pages corresponding to the seed query in the search log, and reserving web pages with the clicked times larger than a set clicked time threshold; and determining all the queries corresponding to the reserved webpages when being clicked, and recording the clicked times of the webpages corresponding to the queries to obtain the training corpus of the preset type.
11. The apparatus of claim 10, further comprising: the title acquisition unit is used for acquiring the title of the webpage to be identified;
the first vector determining unit is further configured to determine each n-gram of the title of the webpage to be identified, and form the feature vector of the webpage to be identified by using each n-gram of the title of the webpage to be identified and each n-gram of the query acquired by the query acquiring unit together.
12. The apparatus according to claim 10, wherein the corpus acquiring unit further records the number of occurrences of each web page title by determining the web page title of the reserved web page, and the recorded number of times that the web page corresponding to each query is clicked and the number of occurrences of each web page title together form the corpus of the preset type.
13. The apparatus according to any one of claims 10 to 12, wherein the type determining unit calculates an overlap ratio between the feature vector of the web page to be identified and each feature vector of a preset type, and determines the type of the web page to be identified according to the calculated overlap ratio; or,
calculating the similarity between the feature vector of the webpage to be identified and the feature vector of each preset type, and determining the type of the webpage to be identified according to the calculated similarity; or,
training a classifier by taking the feature vector of each preset type as a feature, taking the feature vector of the webpage to be recognized as the input of the classifier, and determining the type of the webpage to be recognized according to the classification result of the classifier.
14. The apparatus according to claim 13, wherein the type determining unit, when calculating the overlap ratio between the feature vector of the web page to be identified and each of the feature vectors of the preset types, specifically calculates a value obtained by multiplying the number of occurrences of the n-gram overlapped between the feature vector of the web page to be identified and the feature vector of the preset type in the feature vector of the web page to be identified by the sum of the weights of the overlapped n-gram in the feature vector of the preset type, and dividing the value by the sum of the number of occurrences of all n-grams in the feature vector of the web page to be identified;
wherein the weight of the n-gram in the feature vector of the preset type is as follows: the ratio of the number of occurrences of the n-gram to the total number of occurrences of all n-grams in the corpus of the preset type.
15. The apparatus according to claim 13, wherein the type determining unit specifically calculates cosine similarity between the feature vector of the web page to be identified and each preset type of feature vector when calculating similarity between the feature vector of the web page to be identified and each preset type of feature vector;
wherein the weight of each n-gram in the feature vector of the preset type is the tf*idf of the n-gram, that is, the product of its term frequency (tf) and its inverse document frequency (idf); and the weight of each n-gram in the feature vector of the webpage to be identified is the tf*idf of the n-gram.
16. The apparatus according to claim 13, wherein, when the type determining unit trains the classifier using the feature vectors of the preset types as features, the weight of an n-gram in the feature vector of a preset type is: the ratio of the occurrence count of the n-gram to the total occurrence count of all n-grams, or the tf*idf of the n-gram.
17. The apparatus of claim 13, wherein the classifier is: a maximum entropy classifier or a Support Vector Machine (SVM) classifier.
18. The apparatus according to claim 13, wherein the type determining unit determines a preset type having an overlap ratio greater than a set overlap ratio threshold as the type of the web page to be identified when determining the type of the web page to be identified according to the calculated overlap ratio; or determining N1 preset types with the top overlapping rates as the types of the web pages to be identified, wherein N1 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between a preset overlapping rate value and the type grade;
the type determining unit determines a preset type with the similarity larger than a set similarity threshold as the type of the webpage to be identified when determining the type of the webpage to be identified according to the calculated similarity; or determining N2 preset types with the top similarity as the types of the web pages to be identified, wherein N2 is a preset positive integer; or determining the grade of the webpage to be identified on each type according to the corresponding relation between the preset similarity value and the type grade.
CN201110282850.1A 2011-09-21 2011-09-21 A kind of method and apparatus determining type of webpage Active CN103020067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110282850.1A CN103020067B (en) 2011-09-21 2011-09-21 A kind of method and apparatus determining type of webpage

Publications (2)

Publication Number Publication Date
CN103020067A (en) 2013-04-03
CN103020067B (en) 2016-07-13

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6138129A (en) * 1997-12-16 2000-10-24 World One Telecom, Ltd. Method and apparatus for providing automated searching and linking of electronic documents
CN101038596A (en) * 2007-04-29 2007-09-19 北京搜狗科技发展有限公司 Method and system for classifying website
CN101178728A (en) * 2007-11-21 2008-05-14 北京搜狗科技发展有限公司 Web side navigation method and system
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN102081667A (en) * 2011-01-23 2011-06-01 浙江大学 Chinese text classification method based on Base64 coding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
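Multi-topic Web text classification method based on the vector space model; Zhou Yantao et al.; Application Research of Computers (《计算机应用研究》); 2008-01-15; Vol. 25, No. 1; pp. 142-144 *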

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant