CN101650715A - Method and device for screening links on web pages - Google Patents

Method and device for screening links on web pages

Info

Publication number
CN101650715A
Authority
CN
China
Prior art keywords
website
link
search mission
domain name
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200810071574A
Other languages
Chinese (zh)
Other versions
CN101650715B (en)
Inventor
陈奋
腾达
吴鸿伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd
Priority to CN 200810071574
Publication of CN101650715A
Application granted
Publication of CN101650715B
Legal status: Active
Anticipated expiration

Abstract

An embodiment of the invention provides a method for screening links on web pages, comprising the following steps: extracting links from the home page and some related pages of a search task website; splitting each link into a domain name part and a relative path part; performing intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website; and, according to the result of the intelligent analysis, extracting feature expressions of the links relevant to the search task website and/or filtering out feature expressions of the links irrelevant to the search task website. An embodiment of the invention also provides a device for screening links on web pages. By extracting the links relevant to the search task website and filtering out the irrelevant ones, the embodiments improve the working efficiency of web crawlers.

Description

A method and device for screening links on web pages
Technical field
The present invention relates to the field of communications, and in particular to a method and device for search filtering.
Background
With the development of information networks, large amounts of electronic data are stored and transmitted over them, and information networks have become the main way of obtaining information. A search engine is an important tool for finding information that meets a particular need within massive amounts of data, and it plays an increasingly important role in present-day information retrieval. The web crawler is an important component of a search engine system: it parses web pages, obtains links, and downloads pages.
Web pages contain many links that are useless for a given search purpose. Filtering out these useless links is a key technique for improving crawler efficiency, and it is especially important in vertical search.
In the prior art, web crawlers filter useless links in two main ways: (1) filtering by manually set rules; (2) content filtering. Manually set filter rules achieve some filtering effect, but the workload is large and the approach is inflexible: if the website changes, the filter rules must be changed as well. Content filtering requires downloading and analyzing the page content, which adds to the workload of the crawler and of the back-end analysis and reduces search efficiency.
Summary of the invention
Embodiments of the present invention disclose a method and a device for screening links on web pages.
The method for screening links on web pages disclosed by an embodiment of the present invention comprises:
extracting links from the home page and some related pages of a search task website;
splitting each link into a domain name part and a relative path part;
performing intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website;
according to the result of the intelligent analysis, extracting feature expressions of links relevant to the search task website, and/or filtering out feature expressions of links irrelevant to the search task website.
The device for screening links on web pages disclosed by an embodiment of the present invention comprises:
a link extraction module, configured to extract links from the home page and some pages of a search task website;
a link splitting module, configured to split each link into a domain name part and a relative path part;
an intelligent analysis module, configured to perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website;
a link feature screening module, connected to the intelligent analysis module and configured, according to the result of the intelligent analysis of the relative paths, to extract feature expressions of links relevant to the search task website and/or to filter out feature expressions of links irrelevant to the search task website.
By extracting the links relevant to the search task website and filtering out the links irrelevant to it, embodiments of the present invention improve the working efficiency of web crawlers.
Brief description of the drawings
Fig. 1 is a schematic diagram of a method for screening links on web pages provided by one embodiment of the present invention;
Fig. 2 is a schematic diagram, provided by another embodiment of the present invention, of a method for extracting web page features and filtering links irrelevant to the search task;
Fig. 3 is a structural diagram of a device for screening links on web pages disclosed by another embodiment of the present invention;
Fig. 4 is a schematic diagram of the decision tree for distinguishing website categories.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. Embodiments of the invention extract features from web page links and screen the links on a page, extracting links to valuable information and filtering out useless ones, thereby improving the working efficiency and search accuracy of the web crawler.
One embodiment of the present invention provides a method for screening links on web pages. As shown in Fig. 1, it comprises the following steps:
Step 101: extract links from the home page and some related pages of the search task website.
In this step, links can be extracted by a link extraction algorithm. The related pages here are the pages pointed to by the links contained in the home page. The rules of the various website types can also be summarized in advance; the home page and some related page content of the search task website are then downloaded, and the website type of the search task website is determined from that content and the rules. The rules of a website type reflect the layout of links in its pages and the characteristics and expression forms of the features of valuable information links. Links are then extracted according to the website type of the search task website, which improves the efficiency of link extraction.
Step 102: split each link into a domain name part and a relative path part.
In this step, links whose domain name differs from that of the search task website can first be filtered out; such links can be regarded as directly irrelevant to the search task website.
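As a minimal sketch of step 102, the split and the same-domain filter might look as follows in Python. The function names and the example.com / example.net URLs are illustrative assumptions, not part of the patent; the standard-library `urllib.parse` functions are used to resolve relative links before splitting.

```python
from urllib.parse import urljoin, urlsplit


def split_link(link, base_url):
    """Return (domain, relative_path) for a link found on the page at base_url."""
    parts = urlsplit(urljoin(base_url, link))  # resolve relative links first
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.netloc.lower(), path


def same_domain_paths(links, base_url):
    """Keep only the relative paths of links on the task website's own domain."""
    task_domain = urlsplit(base_url).netloc.lower()
    paths = []
    for link in links:
        domain, path = split_link(link, base_url)
        if domain == task_domain:
            paths.append(path)
    return paths
```

Comparing the full host name is one possible reading of "same domain name"; a crawler that should treat subdomains as the same site would need a looser comparison.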
Step 103: perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website.
In this step, the intelligent analysis can use various statistical or clustering algorithms. Its purpose is to classify or group the relative paths of the links, so that a later step can extract or filter out the groups that match a screening rule. The specific screening rules and process are described in detail in the example below.
An example of the intelligent analysis method is as follows:
First, the relative path of each link whose domain name is the same as that of the search task website is taken as one row, forming a link matrix.
Second, the link matrix is converted into a numeric matrix. The length of the longest link is taken as the row width of the matrix, and shorter rows are padded with 0.
Then, by an intelligent algorithm, the numeric matrix is normalized into a fuzzy matrix, and the similarity between each pair of rows of the fuzzy matrix is calculated. The rows of the fuzzy matrix are grouped according to this similarity. Because the rows of the fuzzy matrix, the numeric matrix, and the link matrix correspond one to one, this indirectly groups the rows of the numeric matrix and the link matrix into different classes.
Step 104: according to the result of the intelligent analysis, extract feature expressions of links relevant to the search task website, and/or filter out feature expressions of links irrelevant to the search task website.
In this step, the clustering result has grouped the rows of the matrix. Whether the links corresponding to a group are relevant to the search task website can be judged from the number of rows in the group.
When the web crawler is used, it is possible to perform only the extraction of feature expressions of links relevant to the search task website. These feature expressions correspond to useful links, and the crawler continues its work using those links.
To make the technical solution disclosed in this embodiment clearer, another embodiment of the invention describes in detail, by way of example, the extraction of web page features and the method of filtering links irrelevant to the search task. As shown in Fig. 2, it mainly comprises the following steps:
Step 201: summarize the rules of the various website types.
Because the link features of different types of websites take different expression forms, different strategies are needed for different types of websites when extracting link features. Table 1 shows the guidance strategies for extracting the link features of different types of websites.
Table 1
| Website type | Number of features of valuable information links | Expression form of the features |
| BBS forum website | 2 to 3 (list-page feature and post-information feature) | Generally an identical character string |
| Blog website | 2 to 3 (list-page feature and blog-content feature) | Generally an identical character string, or conforming to a certain type of regular-expression rule |
| News website | 3 or more | Generally conforming to a certain type of regular-expression rule |
As Table 1 shows, a link feature expression can be a continuous character string or can follow a certain rule, so each feature can be represented in the form of a feature expression. For the guidance strategies listed in Table 1, Table 2 gives examples of link feature rules for different types of websites.
Table 2
| Website name | Website type | Features of valuable information links | Expression form of the features |
| * Net forum | BBS forum website | tableforum/App/index.php?bbsid ; App/view.php?bbsid= | Identical character string |
| * Net blog | Blog website | [0-9]{1,10}.html | Numeric rule |
| * News Network | News website | Content[0-9]{1,10}.htm ; www.inetcop.com.cn | (1) rule combining characters and digits; (2) continuous characters |
Therefore, before extracting the link features of a website, we can first determine the type of the website, so as to obtain the guidance strategy for extracting its link features.
Each website type, such as BBS forum websites, blog websites, news portal websites, enterprise websites, and others, has its own compositional regularities, for example containing certain significant characters or certain specific components. In this step, algorithms such as data mining and machine learning are used to summarize the rules of each website type. In subsequent steps, the web crawler uses these website-type rules to determine the type of the task website and thereby obtain the guidance strategy for link feature extraction.
The detailed process of this step is as follows:
a) Collect a number of web pages from websites of each type as training material for machine learning.
b) By analyzing a large number of web pages of various types, extract the content attribute features of the pages and represent them with attribute identifiers, as in the correspondence between attribute identifiers and attribute features shown in Table 3. The pages of a given kind of website can then be represented by a combination of one or more attribute identifiers.
Table 3
| Attribute identifier | Description of the attribute feature |
| A1 | Whether the URL contains the characters "bbs" |
| A2 | Whether the URL contains the characters "blog" |
| A3 | Whether the URL contains date characters |
| A4 | Whether a Meta tag in the page content contains the characters "bbs" or "community" |
| A5 | Whether a Meta tag in the page content contains the characters "blog" |
| A6 | Navigation-bar information length |
| A7 | Body text length |
| A8 | Whether the page contains time features |
| A9 | Whether the page contains replies or comments |
| A10 | Whether the page contains related links |
c) Use a machine learning algorithm or a data mining classification algorithm, such as a neural network or a decision tree classifier, to learn from the collected training material and obtain guidance rules describing each website type. A rule can be expressed in terms of content attribute features, specifically as a combination of the attribute identifiers that represent those features, for example:
if (A6 and A7 and A10) then (the website belongs to the news website type);
This expression means: if a website has the attribute features corresponding to all three identifiers A6, A7, and A10, that is, according to Table 3, the navigation-bar information length, the body text length, and the presence of related links, then the website is of the news type.
As an example, the list of rules obtained by a decision tree is given here; the tree is shown in Fig. 4.
C1 denotes the news website type;
C2 denotes the forum website type;
C3 denotes the blog website type;
C4 denotes other types.
The extracted rules are as follows:
C1 rules:
if (A6 = 1 and A7 = 1 and A10 = 1) then C1
if (A6 = 1 and A3 = 1) then C1
C2 rules:
if (A8 = 1 and A1 = 1) then C2
if (A8 = 1 and A4 = 1) then C2
C3 rules:
if (A5 = 1) then C3
if (A2 = 1) then C3
C4 rules:
if (A6 = 1 and A7 = 1 and A10 = 0) then C4
if (A6 = 1 and A7 = 0 and A3 = 0) then C4
if (A6 = 0 and A2 = 0 and A5 = 0 and A8 = 0) then C4
if (A6 = 0 and A2 = 0 and A5 = 0 and A8 = 1 and A1 = 0 and A4 = 0) then C4
Here, 1 means the website has the feature and 0 means it does not.
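The rules above can be read as a small decision tree. The following Python sketch is one possible encoding, not the tree of Fig. 4 itself (which is not reproduced here): the branch order is an assumption inferred from the conditions in the C4 rules, the function name `classify_site` is hypothetical, and the attributes are assumed to be already reduced to 0/1 values.

```python
def classify_site(attrs):
    """Classify a website from 0/1 attribute values keyed "A1".."A10".

    Returns "C1" (news), "C2" (forum), "C3" (blog), or "C4" (other).
    Branch order is an assumption reconstructed from the listed rules.
    """
    def has(key):
        return attrs.get(key, 0) == 1  # a missing attribute counts as 0

    if has("A6"):
        if has("A7"):
            return "C1" if has("A10") else "C4"
        return "C1" if has("A3") else "C4"
    if has("A2") or has("A5"):
        return "C3"
    if has("A8") and (has("A1") or has("A4")):
        return "C2"
    return "C4"
```

Under this reading, the second C1 rule applies only when A7 = 0, which is what the second C4 rule implies.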
Step 202: download the home page and some page content of the search task website, and use the website-type rules from step 201 to determine the type of the search task website.
In this step, the attribute features of the pages are first extracted with reference to Table 3, and the website-type rules obtained in step 201 are then used to determine the type of the task website.
Step 203: according to the type of the search task website, extract all links from its home page and some of its pages by a link extraction algorithm.
In this step, the link extraction algorithm can use regular expressions or character-feature matching. Links can appear in a web page mainly in the following places:
1. <BASE href=URL>: the base URL. In HTML, path information is often specified by URL, and relative URLs are resolved against the base URL;
2. <A href=URL>: the most common form of link, used to link to another web page or to another place in the same page;
3. <LINK href=URL>: used to link to the CSS style sheet or JavaScript page used by the web page;
4. <FRAME src=URL>: in a frameset page, links to the page the frame points to;
5. <IFRAME src=URL>: IFRAME is a form of frame; unlike FRAME, an IFRAME can be embedded anywhere in a page.
If character-feature matching is used, the algorithm simply looks for the link that follows the string "href" or "src".
If regular expressions are used, we construct the regular expression <.*?(src|href)\s*=\s*["|']?(?<uri>[^'"\s]+). With this regular expression, all the links in a page can be obtained.
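A hedged Python sketch of the regular-expression approach is shown below. It is an adaptation rather than the patent's exact expression: Python's `re` module writes named groups as `(?P<uri>...)`, and the pattern here is additionally restricted so that a match cannot run past the closing `>` of a tag. The name `extract_links` and the sample HTML are illustrative.

```python
import re

# Matches a src= or href= attribute inside a single tag and captures its value.
LINK_RE = re.compile(
    r"""<[^>]*?\b(?:src|href)\s*=\s*["']?(?P<uri>[^'"\s>]+)""",
    re.IGNORECASE,
)


def extract_links(html):
    """Return the attribute values of src/href found in the HTML text."""
    return [m.group("uri") for m in LINK_RE.finditer(html)]


sample = (
    '<a href="/forum-1.html">x</a> '
    "<IFRAME src=/ads.php?id=3></IFRAME> "
    "<img SRC='/logo.png'>"
)
print(extract_links(sample))
```

A regular expression of this kind captures at most one attribute per tag and does not handle every HTML quirk; a production crawler would more likely use an HTML parser.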
Step 204: split each link into a domain name part and a relative path part.
Step 205: filter out the links whose domain name differs from that of the search task website.
Step 206: perform statistical clustering on the relative paths, extract feature expressions of links relevant to the search task website, and filter out links irrelevant to the search task website.
Here, links irrelevant to the search task website are generally junk-information links that have very low correlation with the useful links.
In this step, the relative path parts of the links whose domain name is the same as that of the search task website form a link matrix, which is converted into a numeric matrix using a character encoding such as ASCII or Unicode. The conversion is illustrated here using ASCII.
A web page contains multiple links. The character-form relative paths of the links form a matrix, as shown in Table 4:
Table 4
According to the decimal ASCII value of each character, the matrix is converted into the numeric matrix shown in Table 5. The length of the longest link is taken as the row width of the matrix, and shorter rows are padded with 0. X1 to X7 denote the first to seventh rows of the matrix.
Table 5
For this numeric matrix, an intelligent algorithm generalizes X1, X2, ..., X7 into several classes; these classes are the link feature classes of the website that we want to extract. The algorithm used to analyze the matrix can be a statistical inductive algorithm, an artificial intelligence algorithm such as a genetic algorithm, or a clustering algorithm from data mining.
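The conversion from relative paths to a zero-padded numeric matrix, as described for Tables 4 and 5, can be sketched in Python as follows. The function name `to_numeric_matrix` and the sample paths are assumptions for illustration; the contents of the patent's own tables are not reproduced.

```python
def to_numeric_matrix(paths):
    """Convert relative paths to rows of character codes, zero-padded to equal width."""
    width = max(len(p) for p in paths)  # the longest link sets the row width
    return [[ord(c) for c in p] + [0] * (width - len(p)) for p in paths]


# Hypothetical paths: each row is one link, each column one character position.
matrix = to_numeric_matrix(["/a", "/ab"])
print(matrix)
```

For ASCII characters `ord` returns the decimal ASCII value; for other characters it returns the Unicode code point, which matches the alternative encoding the text mentions.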
Here we use a fuzzy clustering algorithm from data mining as an illustration. The key steps are as follows:
a) Normalize the matrix, converting it into a fuzzy matrix so that clustering can be performed. There are many normalization methods, such as the weighting method and the extreme-value method. Here the extreme-value (maximum) method is used as an example, with the following formula:
$$x_{ij} = \frac{x_{ij}}{\max(x_{i1}, \ldots, x_{in})}$$
where $x_{ij}$ denotes the j-th element of row i.
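A minimal Python sketch of this row-wise maximum normalization follows. The function name `normalize_rows` is an assumption, and the handling of an all-zero row (left as zeros) is a choice made here to avoid division by zero; the source does not discuss that case.

```python
def normalize_rows(matrix):
    """Divide each element by the maximum of its row, giving values in [0, 1]."""
    fuzzy = []
    for row in matrix:
        row_max = max(row)
        if row_max == 0:
            fuzzy.append([0.0] * len(row))  # assumption: keep all-zero rows as zeros
        else:
            fuzzy.append([x / row_max for x in row])
    return fuzzy
```

Since character codes are non-negative, every normalized value lies between 0 and 1, which is what makes the result usable as a fuzzy matrix.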
The converted matrix is shown in Table 6:
Table 6
b) Cluster the fuzzy matrix above using a similarity coefficient method. There are many similarity coefficient formulas, such as the scalar product method, the angle cosine method, the correlation coefficient method, and the closeness method. Here the angle cosine formula is used as an example:
$$r_{ij} = \frac{\sum_{k=1}^{m} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{m} x_{ik}^{2} \sum_{k=1}^{m} x_{jk}^{2}}}$$
where $r_{ij}$ denotes the similarity between row i and row j;
$x_{ik}$ denotes the k-th element of row i;
$x_{jk}$ denotes the k-th element of row j.
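The angle cosine similarity can be sketched in plain Python as below. The names `cosine` and `similarity_matrix` are illustrative, and returning 0.0 for an all-zero row is an assumption made to keep the function total.

```python
import math


def cosine(row_i, row_j):
    """Angle cosine between two equally long rows."""
    dot = sum(a * b for a, b in zip(row_i, row_j))
    norm = math.sqrt(sum(a * a for a in row_i) * sum(b * b for b in row_j))
    return dot / norm if norm else 0.0  # assumption: zero rows are dissimilar to all


def similarity_matrix(fuzzy):
    """Pairwise similarity r[i][j] between all rows of the fuzzy matrix."""
    return [[cosine(ri, rj) for rj in fuzzy] for ri in fuzzy]
```

The resulting matrix is symmetric with ones on the diagonal, as in Table 7.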
c) After the calculation in step b), we obtain the similarity between each pair of rows. The result is shown in Table 7.
Table 7
|    | X1    | X2    | X3    | X4    | X5    | X6    | X7    |
| X1 | 1     | 0.734 | 0.735 | 0.732 | 0.717 | 0.720 | 0.719 |
| X2 | 0.734 | 1     | 0.999 | 0.999 | 0.729 | 0.732 | 0.731 |
| X3 | 0.735 | 0.999 | 1     | 0.999 | 0.730 | 0.733 | 0.731 |
| X4 | 0.732 | 0.999 | 0.999 | 1     | 0.729 | 0.732 | 0.730 |
| X5 | 0.717 | 0.729 | 0.730 | 0.729 | 1     | 0.999 | 0.999 |
| X6 | 0.720 | 0.732 | 0.733 | 0.732 | 0.999 | 1     | 0.999 |
| X7 | 0.719 | 0.731 | 0.731 | 0.730 | 0.999 | 0.999 | 1     |
According to the similarity results in Table 7 and the equivalence closure of the similarity relation, the clustering result is three classes:
S1: X1
S2: X2, X3, X4
S3: X5, X6, X7
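One way to obtain this grouping from Table 7 is sketched below in Python. Cutting the transitive closure of a fuzzy similarity relation at a level λ gives the same classes as the connected components of the graph that joins rows whose similarity is at least λ, so the sketch uses a small union-find. The cut level 0.9 is an assumption chosen for illustration; the source does not state the level it used.

```python
def cluster_rows(sim, lam):
    """Group row indices whose similarity is >= lam, directly or through a chain."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= lam:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Similarity values from Table 7 (rows X1..X7 are indices 0..6).
TABLE_7 = [
    [1, 0.734, 0.735, 0.732, 0.717, 0.720, 0.719],
    [0.734, 1, 0.999, 0.999, 0.729, 0.732, 0.731],
    [0.735, 0.999, 1, 0.999, 0.730, 0.733, 0.731],
    [0.732, 0.999, 0.999, 1, 0.729, 0.732, 0.730],
    [0.717, 0.729, 0.730, 0.729, 1, 0.999, 0.999],
    [0.720, 0.732, 0.733, 0.732, 0.999, 1, 0.999],
    [0.719, 0.731, 0.731, 0.730, 0.999, 0.999, 1],
]
print(cluster_rows(TABLE_7, 0.9))  # [[0], [1, 2, 3], [4, 5, 6]]
```

With this cut level the three groups correspond to S1, S2, and S3 above.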
d) According to the size of the matrix, set a threshold on the number of elements in a class, to distinguish which classes are feature classes of valuable information links and which are feature classes of links irrelevant to the search task website.
Here, the threshold can be set as an interval, for example (m × 0.4, m × 0.6), where m is the number of rows of the matrix. In this example m = 7, so the threshold interval is (2.8, 4.2). S1 has 1 element, which is outside the threshold interval, so it is judged to be a feature class of links irrelevant to the search task website. S2 and S3 each have 3 elements, which is within the interval, so they are judged to be classes of valuable information links.
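The threshold test can be sketched in Python as follows. The function name `split_by_size` and the default factors are illustrative; the factors 0.4 and 0.6 are the example interval given in the text, and the interval is treated as open.

```python
def split_by_size(groups, m, low=0.4, high=0.6):
    """Separate groups whose size lies inside the open interval (m*low, m*high)."""
    valuable, irrelevant = [], []
    for group in groups:
        if m * low < len(group) < m * high:
            valuable.append(group)
        else:
            irrelevant.append(group)
    return valuable, irrelevant


# Groups S1, S2, S3 from the example, as row indices of a 7-row matrix.
valuable, irrelevant = split_by_size([[0], [1, 2, 3], [4, 5, 6]], m=7)
print(valuable)    # [[1, 2, 3], [4, 5, 6]]
print(irrelevant)  # [[0]]
```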
In addition, in this step the judgment can also be based on the already determined type of the search task website. For example, if the number of features of valuable information links for a type of website is 2 to 3, then X1 in this example, which forms a class by itself, can be judged to be a feature of a valueless link.
e) Convert the elements of each class back to their original characters according to the ASCII values. Then, according to the type of the website, apply the guidance strategy for link feature extraction for different types of websites proposed in step 201 and combine the results of the preceding steps. The following conclusions can be drawn for the search task website:
1. The website type is: BBS forum website;
2. Features of valuable information links: /forum- and /thread-;
3. Feature of links irrelevant to the search task website: /ads.php?
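The source does not spell out how a feature expression is derived from the paths in a class. As one hedged possibility, the Python sketch below takes the longest common prefix of the paths in a class as its feature and then uses the features as a prefix filter. The helper names and the sample paths are hypothetical; only the resulting features /forum- and /thread- come from the text.

```python
import os


def feature_expression(paths):
    """Longest common prefix of the paths in one class (an assumed derivation)."""
    return os.path.commonprefix(paths)


def is_relevant(path, features):
    """True if the path starts with any feature of a valuable information link."""
    return any(path.startswith(f) for f in features)


# Hypothetical class members; the patent's Table 4 is not reproduced here.
features = [
    feature_expression(["/forum-1.html", "/forum-23.html", "/forum-7.html"]),
    feature_expression(["/thread-10.html", "/thread-256.html", "/thread-3.html"]),
]
print(features)  # ['/forum-', '/thread-']
```

For a website type whose features follow a regular-expression rule, as Table 1 suggests for news sites, a prefix would not be enough and a pattern would have to be induced instead.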
In this embodiment, the web crawler starts the crawling task for the whole website using the filtered website link features obtained in step 206. In this way, link features are used to filter out useless links, improving working efficiency and search accuracy.
Another embodiment of the present invention discloses a device for screening links on web pages. As shown in Fig. 3, it comprises a link extraction module 301, a link splitting module 302, an intelligent analysis module 303, and a link feature screening module 304, wherein:
The link extraction module 301 is configured to extract links from the home page and some pages of the search task website.
The link splitting module 302 is configured to split each link into a domain name part and a relative path part.
The intelligent analysis module 303 is configured to perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website. The intelligent analysis can use a variety of algorithms, such as artificial intelligence algorithms, statistical algorithms, or clustering algorithms.
The link feature screening module 304 is connected to the intelligent analysis module 303 and is configured, according to the result of the intelligent analysis of the relative paths by the intelligent analysis module 303, to extract feature expressions of links relevant to the search task website and/or to filter out feature expressions of links irrelevant to the search task website. Once the feature expressions extracted by the link feature screening module 304 are restored to links, the web crawler can continue its work using those links. Filtering out the feature expressions of irrelevant links removes useless links before the crawler continues, improving the crawler's working efficiency.
Preferably, the device can further comprise a filtering module 305, configured to filter out links whose domain name differs from that of the search task website. This module plays an auxiliary role, filtering out obviously irrelevant link information when the extracted links are processed.
Preferably, the device can further comprise a website-type rule base 306, a download module 307, and a website-type judgment module 308, wherein:
The website-type rule base 306 stores the rules of the various website types summarized in advance. These rules are derived by analyzing a large number of websites and web pages with algorithms such as data mining and machine learning. Because the link features of different types of websites take different expression forms, as shown in Table 1, the type of a website must be determined before its link features are extracted, in order to obtain the guidance strategy shown in Table 1.
The download module 307 is configured to download the home page and some page content of the search task website.
The website-type judgment module 308 is connected to the website-type rule base 306 and the download module 307, and is configured to determine the type of the search task website by querying the website-type rule base 306, based on the home page and page content downloaded by the download module 307.
In this case, the link extraction module 301 is further configured to extract links according to the website type of the search task website. Once the type of the search task website is known, the link extraction module 301 can obtain the rules of that website type from the website-type rule base 306. These rules reflect information such as where links are located on the pages of the search task website and what form they take. Using this information, the link extraction module 301 can extract links more efficiently.
Preferably, the link feature screening module 304 of the device can comprise a combination module 3041, a conversion module 3042, a grouping module 3043, and a screening submodule 3044, wherein:
The combination module 3041 is configured to take the relative path of each link whose domain name is the same as that of the search task website as one row, forming a link matrix.
The conversion module 3042 is configured to convert the link matrix into a numeric matrix, taking the length of the longest link as the row width of the matrix and padding shorter rows with 0.
The grouping module 3043 is configured to group the rows of the numeric matrix into different classes by an intelligent algorithm.
The screening submodule 3044 is configured to judge, according to the number of rows in each group, whether the links corresponding to the group are relevant to the search task website, and to extract feature expressions of links relevant to the search task website and/or filter out feature expressions of links irrelevant to the search task website.
Here, every row of the numeric matrix corresponds to a row of the link matrix, so the rows of each group also correspond to links in the link matrix. When a group is judged to be relevant to the search task website, the screening submodule 3044 can extract the feature expressions of the relevant links through this simple correspondence.
In this way, web page links are split into a domain name part and a relative path part; links whose domain name differs from that of the search task website are filtered out first; and the features of the website's links are then obtained by intelligent analysis of the relative paths. Links to information irrelevant to the search website, such as junk-information links, are thereby filtered out, improving the working efficiency and search accuracy of the web crawler.
In summary, the technical solution proposed by the embodiments of the present invention extracts features from web page links and filters out links that are useless for the search. This improves the working efficiency and search accuracy of web crawlers in the search engine field, especially in vertical search, reduces manual workload, and improves the flexibility of the system. It has broad application potential for various types of websites, such as BBS websites, blog websites, and news websites.
The above are only embodiments of the present invention, given as concrete illustrations of the inventive concept, and do not limit the invention. For those skilled in the art, any change, modification, or equivalent replacement made within the spirit and principles of the invention falls within its scope of protection.

Claims (11)

1, a kind of method of screening links on web pages is characterized in that, comprising:
From the homepage of search mission website and the part correlation page, extract link;
Split described domain name part and the relative path part of being linked as;
Relative path to the domain name link identical with the domain name of search mission website carries out intellectual analysis;
According to the result of described intellectual analysis, the feature representation formula of extraction and described search mission website peer link, and/or the irrelevant feature representation formula that links of filtration and described search mission website.
2, method according to claim 1 is characterized in that, also comprises:
Filter out the different link of domain name of domain name and described search mission website.
3, method according to claim 1 and 2 is characterized in that, also comprises:
Sum up the rule of the various Types of website in advance;
Download the homepage and the part correlation content of pages of described search mission website,, judge the Type of website that described search mission website is affiliated according to the rule of the described various Types of website;
Described from the search mission website homepage and the part correlation page extract link step be specially: according to the Type of website of described search mission website, extract link.
4. The method according to claim 1, characterized in that the step of performing intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website is specifically:
forming a link matrix in which the relative path of each link whose domain name is the same as that of the search task website is one row;
converting the link matrix into a numerical matrix;
grouping the rows of the numerical matrix into different categories by means of an intelligent algorithm.
5. The method according to claim 4, characterized in that the intelligent algorithm is one of a statistical induction algorithm, a genetic algorithm in artificial intelligence, and a clustering algorithm in data mining.
6. The method according to claim 4 or 5, characterized by further comprising:
determining, according to the number of rows in a group, whether the links corresponding to that group are links relevant to the search task website.
7. The method according to claim 4 or 5, characterized in that the step of grouping the rows of the numerical matrix into different categories by means of an intelligent algorithm is specifically:
formatting the numerical matrix into a fuzzy matrix;
obtaining the similarity between the rows of the fuzzy matrix;
grouping the rows of the fuzzy matrix according to the similarity.
8. A device for screening links on web pages, characterized by comprising:
a link extraction module, configured to extract links from the home page and some pages of a search task website;
a link splitting module, configured to split each link into a domain name part and a relative path part;
an intelligent analysis module, configured to perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website;
a link feature screening module, connected to the intelligent analysis module and configured to, according to the result of the intelligent analysis performed on the relative paths by the intelligent analysis module, extract feature expressions of links relevant to the search task website, and/or filter out feature expressions of links irrelevant to the search task website.
9. The device according to claim 8, characterized by further comprising:
a filtering module, configured to filter out links whose domain name differs from that of the search task website.
10. The device according to claim 8 or 9, characterized by further comprising:
a website type rule base, configured to store rules for various website types summarized in advance;
a download module, configured to download the content of the home page and some pages of the search task website;
a website type judging module, connected to the website type rule base and the download module, and configured to determine the type to which the search task website belongs by querying the website type rule base according to the home page and partial page content of the search task website downloaded by the download module;
wherein the link extraction module is further configured to extract links according to the website type of the search task website.
11. The device according to claim 8 or 9, characterized in that the link feature screening module comprises:
a combination module, configured to form a link matrix in which the relative path of each link whose domain name is the same as that of the search task website is one row;
a conversion module, configured to convert the link matrix into a numerical matrix;
a grouping module, configured to group the rows of the numerical matrix into different categories by means of an intelligent algorithm;
a screening submodule, configured to determine, according to the number of rows in a group, whether the links corresponding to that group are links relevant to the search task website, and to extract feature expressions of links relevant to the search task website, and/or filter out feature expressions of links irrelevant to the search task website.
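By way of a non-limiting illustration, the chain of steps recited in the claims above (link matrix, numerical matrix, fuzzy matrix, row similarity, grouping, judgment by number of rows, feature expression) might be sketched as follows. Every concrete choice here is an assumption made for the sketch and is not taken from the claims: encoding each character by its code point, scaling by the largest entry to obtain the fuzzy matrix, measuring similarity as one minus the mean absolute difference, greedy threshold grouping, the threshold 0.9, treating groups of at least `MIN_ROWS` rows as relevant, and generalizing digit runs to `\d+` as the feature expression. All function names and example paths are hypothetical.

```python
import re


def to_numeric_row(path, width):
    """Encode one relative path as a fixed-width row of character codes (0 = padding)."""
    codes = [ord(ch) for ch in path[:width]]
    return codes + [0] * (width - len(codes))


def to_fuzzy_matrix(rows):
    """Scale every entry into [0, 1] by dividing by the largest entry."""
    peak = max(max(row) for row in rows) or 1
    return [[value / peak for value in row] for row in rows]


def similarity(a, b):
    """One minus the mean absolute difference of two rows; 1.0 means identical."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def group_rows(fuzzy, threshold):
    """Greedy grouping: a row joins the first group whose first row is similar enough."""
    groups = []
    for index, row in enumerate(fuzzy):
        for group in groups:
            if similarity(fuzzy[group[0]], row) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


def feature_expression(paths):
    """Generalize paths into regular expressions by replacing digit runs with \\d+."""
    return sorted({re.sub(r"\d+", r"\\d+", re.escape(p)) for p in paths})


# Link matrix: one relative path per row (hypothetical same-domain links).
paths = [
    "/thread-1001-1.html",
    "/thread-1002-1.html",
    "/thread-1003-1.html",
    "/forum-5-1.html",
    "/about.html",
]
width = max(len(p) for p in paths)
numeric = [to_numeric_row(p, width) for p in paths]  # numerical matrix
fuzzy = to_fuzzy_matrix(numeric)                     # fuzzy matrix
groups = group_rows(fuzzy, threshold=0.9)
# groups -> [[0, 1, 2], [3], [4]]

MIN_ROWS = 2  # assumed rule: groups with at least this many rows count as relevant
relevant = [g for g in groups if len(g) >= MIN_ROWS]
expressions = [feature_expression([paths[i] for i in g]) for g in relevant]
```

Under these assumptions the three thread pages fall into one group and yield a single regular expression that also matches thread pages not yet seen, while the forum index and the "about" page each form a one-row group and are not treated as relevant.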
CN 200810071574 2008-08-12 2008-08-12 Method and device for screening links on web pages Active CN101650715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200810071574 CN101650715B (en) 2008-08-12 2008-08-12 Method and device for screening links on web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200810071574 CN101650715B (en) 2008-08-12 2008-08-12 Method and device for screening links on web pages

Publications (2)

Publication Number Publication Date
CN101650715A (en) 2010-02-17
CN101650715B (en) 2011-06-29

Family

ID=41672954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200810071574 Active CN101650715B (en) 2008-08-12 2008-08-12 Method and device for screening links on web pages

Country Status (1)

Country Link
CN (1) CN101650715B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298616A (en) * 2011-07-29 2011-12-28 百度在线网络技术(北京)有限公司 Method and device for providing related sub links in search result
CN102567337A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Method and system for quickly recognizing webpage types through links
CN103581347A (en) * 2012-07-23 2014-02-12 深圳市世纪光速信息技术有限公司 Inundation sub-domain identification method and system
CN103577547A (en) * 2013-10-12 2014-02-12 优视科技有限公司 Webpage type identification method and device
CN104035940A (en) * 2013-03-07 2014-09-10 腾讯科技(深圳)有限公司 Webpage link storage method and server
CN104376000A (en) * 2013-08-13 2015-02-25 阿里巴巴集团控股有限公司 Webpage attribute determination method and webpage attribute determination device
CN105070124A (en) * 2015-08-05 2015-11-18 河南工业大学 Interactive commercial law basic teaching system
CN106095979A (en) * 2016-06-20 2016-11-09 百度在线网络技术(北京)有限公司 URL merging treatment method and apparatus
CN106202320A (en) * 2016-06-30 2016-12-07 广东小天才科技有限公司 The control method of a kind of browser Web side navigation and device, mobile device
WO2017000659A1 (en) * 2015-06-30 2017-01-05 北京奇虎科技有限公司 Enriched uniform resource locator (url) identification method and apparatus
CN106547851A (en) * 2016-10-19 2017-03-29 贵州大学 Based on the webpage content extracting method that fuzzy frequent episodes are excavated
CN107291727A (en) * 2016-03-31 2017-10-24 北京国双科技有限公司 The crawling method and device of a kind of reptile
CN109446445A (en) * 2018-10-23 2019-03-08 乐蜜有限公司 A kind of resource acquiring method and device
CN110837909A (en) * 2018-08-17 2020-02-25 北京京东尚科信息技术有限公司 Method and device for predicting order quantity
CN110866166A (en) * 2019-11-14 2020-03-06 北京京航计算通讯研究所 Distributed web crawler performance optimization system for mass data acquisition
CN110874429A (en) * 2019-11-14 2020-03-10 北京京航计算通讯研究所 Distributed web crawler performance optimization method oriented to mass data acquisition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7685296B2 (en) * 2003-09-25 2010-03-23 Microsoft Corporation Systems and methods for client-based web crawling

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567337A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Method and system for quickly recognizing webpage types through links
CN102298616A (en) * 2011-07-29 2011-12-28 百度在线网络技术(北京)有限公司 Method and device for providing related sub links in search result
CN103581347B (en) * 2012-07-23 2019-03-26 深圳市世纪光速信息技术有限公司 The recognition methods and system of inundation sub-domain
CN103581347A (en) * 2012-07-23 2014-02-12 深圳市世纪光速信息技术有限公司 Inundation sub-domain identification method and system
CN104035940A (en) * 2013-03-07 2014-09-10 腾讯科技(深圳)有限公司 Webpage link storage method and server
CN104035940B (en) * 2013-03-07 2018-07-06 腾讯科技(深圳)有限公司 The storage method and server of web page interlinkage
CN104376000A (en) * 2013-08-13 2015-02-25 阿里巴巴集团控股有限公司 Webpage attribute determination method and webpage attribute determination device
CN103577547A (en) * 2013-10-12 2014-02-12 优视科技有限公司 Webpage type identification method and device
WO2017000659A1 (en) * 2015-06-30 2017-01-05 北京奇虎科技有限公司 Enriched uniform resource locator (url) identification method and apparatus
CN105070124A (en) * 2015-08-05 2015-11-18 河南工业大学 Interactive commercial law basic teaching system
CN107291727A (en) * 2016-03-31 2017-10-24 北京国双科技有限公司 The crawling method and device of a kind of reptile
CN106095979A (en) * 2016-06-20 2016-11-09 百度在线网络技术(北京)有限公司 URL merging treatment method and apparatus
CN106095979B (en) * 2016-06-20 2020-05-08 百度在线网络技术(北京)有限公司 URL merging processing method and device
CN106202320A (en) * 2016-06-30 2016-12-07 广东小天才科技有限公司 The control method of a kind of browser Web side navigation and device, mobile device
CN106547851A (en) * 2016-10-19 2017-03-29 贵州大学 Based on the webpage content extracting method that fuzzy frequent episodes are excavated
CN106547851B (en) * 2016-10-19 2020-04-07 贵州大学 Webpage content extraction method based on fuzzy sequence mode mining
CN110837909A (en) * 2018-08-17 2020-02-25 北京京东尚科信息技术有限公司 Method and device for predicting order quantity
CN109446445A (en) * 2018-10-23 2019-03-08 乐蜜有限公司 A kind of resource acquiring method and device
CN109446445B (en) * 2018-10-23 2022-03-22 北京乐我无限科技有限责任公司 Resource acquisition method and device
CN110866166A (en) * 2019-11-14 2020-03-06 北京京航计算通讯研究所 Distributed web crawler performance optimization system for mass data acquisition
CN110874429A (en) * 2019-11-14 2020-03-10 北京京航计算通讯研究所 Distributed web crawler performance optimization method oriented to mass data acquisition

Also Published As

Publication number Publication date
CN101650715B (en) 2011-06-29

Similar Documents

Publication Publication Date Title
CN101650715B (en) Method and device for screening links on web pages
US8868621B2 (en) Data extraction from HTML documents into tables for user comparison
CN103823883B (en) Analysis method and system for website user access path
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN112287273B (en) Method, system and storage medium for classifying website list pages
US20090070366A1 (en) Method and system for web document clustering
CN102622443A (en) Customized screening system and method for microblog
CN102411587A (en) Webpage classification method and device
CN103294781A (en) Method and equipment used for processing page data
CN102542061B (en) Intelligent product classification method
CN103823824A (en) Method and system for automatically constructing text classification corpus by aid of internet
CN103605738A (en) Webpage access data statistical method and webpage access data statistical device
CN104484431A (en) Multi-source individualized news webpage recommending method based on field body
CN109800350A (en) A kind of Personalize News recommended method and system, storage medium
CN102567494A (en) Website classification method and device
CN102681994A (en) Webpage information extracting method and system
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN102043862A (en) Directional web data extraction method
CN102236654A (en) Web useless link filtering method based on content relevancy
CN103488746A (en) Method and device for acquiring business information
CN108733813A (en) Information extracting method, system towards BBS forum Web pages contents and medium
CN103078854A (en) Message filtering method and device
CN102004805B (en) Webpage denoising system and method based on maximum similarity matching
CN101576933A (en) Fully-automatic grouping method of WEB pages based on title separator

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20100217

Assignee: Xiaoma Baoli (Xiamen) Network Technology Co.,Ltd.

Assignor: XIAMEN MEIYA PICO INFORMATION Co.,Ltd.

Contract record no.: X2023350000070

Denomination of invention: Method and device for filtering links on web pages

Granted publication date: 20110629

License type: Common License

Record date: 20230313

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20100217

Assignee: XIAMEN SECURITY INTELLIGENCE INFORMATION TECHNOLOGY CO.,LTD.

Assignor: XIAMEN MEIYA PICO INFORMATION Co.,Ltd.

Contract record no.: X2023350000068

Denomination of invention: Method and device for filtering links on web pages

Granted publication date: 20110629

License type: Common License

Record date: 20230317