CN101650715A

CN101650715A - Method and device for screening links on web pages

Info

Publication number: CN101650715A
Application number: CN200810071574A
Authority: CN
Inventors: 陈奋; 腾达; 吴鸿伟
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2008-08-12
Filing date: 2008-08-12
Publication date: 2010-02-17
Anticipated expiration: 2028-08-12
Also published as: CN101650715B

Abstract

An embodiment of the invention provides a method for screening links on web pages, comprising the following steps: extracting links from the home page and some related pages of a search task website; splitting each link into a domain name part and a relative path part; performing intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website; and, according to the result of the intelligent analysis, extracting feature expressions of links relevant to the search task website and/or filtering out feature expressions of links irrelevant to the search task website. An embodiment of the invention also provides a device for screening links on web pages. By extracting the links relevant to the search task website and filtering out the irrelevant ones, the embodiment improves the working efficiency of web crawlers.

Description

A method and device for screening links on web pages

Technical field

The present invention relates to the field of communications, and in particular to a method and device for search filtering.

Background art

With the development of information networks, large amounts of electronic data are stored and transmitted over them, and information networks have become the main way of obtaining information. A search engine is an important tool for finding information that meets a given need within massive amounts of data, and it plays an increasingly important role in today's information retrieval applications. The web crawler is an important component of a search engine system: it parses web pages, obtains links, and downloads pages.

Web pages contain many links that are useless for a given search purpose. Filtering out these useless links is a key technique for improving the running efficiency of a web crawler, and it is especially important in the field of vertical search.

In the prior art, web crawlers filter useless links mainly in two ways: (1) filtering by manually set filtering rules; (2) content filtering. Manually set rules achieve some filtering effect, but the approach is labor-intensive and inflexible: whenever the website changes, the filtering rules must be changed as well. Content filtering requires downloading and analyzing the content of each page, which increases the workload of both the web crawler and the back-end analysis and so reduces search efficiency.

Summary of the invention

Embodiments of the present invention disclose a method and a device for screening links on web pages.

The method for screening links on web pages disclosed in an embodiment of the invention comprises:

Extracting links from the home page and some related pages of a search task website;

Splitting each link into a domain name part and a relative path part;

Performing intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website;

According to the result of the intelligent analysis, extracting the feature expressions of links relevant to the search task website and/or filtering out the feature expressions of links irrelevant to the search task website.

The device for screening links on web pages disclosed in an embodiment of the invention comprises:

A link extraction module, used to extract links from the home page and some pages of the search task website;

A link splitting module, used to split each link into a domain name part and a relative path part;

An intelligent analysis module, used to perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website;

A link feature screening module, connected to the intelligent analysis module and used, according to the result of the intelligent analysis of the relative paths by the intelligent analysis module, to extract the feature expressions of links relevant to the search task website and/or to filter out the feature expressions of links irrelevant to the search task website.

By extracting the links relevant to the search task website and filtering out the links irrelevant to it, embodiments of the invention improve the working efficiency of the web crawler.

Brief description of the drawings

Fig. 1 is a schematic diagram of a method for screening links on web pages provided by one embodiment of the invention;

Fig. 2 is a schematic diagram of the method, provided by another embodiment of the invention, for extracting web page features and filtering links irrelevant to the search task;

Fig. 3 is a structural diagram of a device for screening links on web pages disclosed in another embodiment of the invention;

Fig. 4 is a schematic diagram of the decision tree for distinguishing website categories.

Detailed description of the embodiments

In order to make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. Embodiments of the invention perform feature extraction on web page links and screen the links on a page, extracting links to valuable information and filtering out useless links, thereby improving the working efficiency and search accuracy of the web crawler.

One embodiment of the invention provides a method for screening links on web pages which, as shown in Fig. 1, comprises the following steps:

Step 101: extract links from the home page and some related pages of the search task website;

In this step, links can be extracted by a link extraction algorithm. The related pages here are the pages that the links on the home page point to. It is also possible to summarize the rules of the various website types in advance, download the home page and some related page content of the search task website, and determine the website type of the search task website from that content and the rules. The rules of a website type reflect the layout of the links in its pages and the characteristics and expression forms of the features of valuable information links. Links are then extracted according to the website type of the search task website, which improves the efficiency of link extraction.

Step 102: split each link into a domain name part and a relative path part;

In this step, links whose domain name differs from that of the search task website can also be filtered out first, since such links can be regarded as directly irrelevant to the search task website.
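A minimal sketch of step 102, assuming Python's standard `urllib.parse` is an acceptable way to split links. The function names `split_link` and `same_domain_paths` and the example URLs are hypothetical and are not taken from the patent.

```python
from urllib.parse import urljoin, urlsplit


def split_link(link, base_url):
    """Resolve a link against the page URL and split it into (domain, relative path)."""
    parts = urlsplit(urljoin(base_url, link))
    relative = parts.path or "/"
    if parts.query:
        relative += "?" + parts.query
    return parts.netloc.lower(), relative


def same_domain_paths(links, site_url):
    """Keep only the relative paths of links whose domain matches the task site."""
    site_domain = urlsplit(site_url).netloc.lower()
    paths = []
    for link in links:
        domain, relative = split_link(link, site_url)
        if domain == site_domain:
            paths.append(relative)
    return paths


# Hypothetical links found on the home page of a forum site.
links = [
    "http://bbs.example.com/forum-12-1.html",
    "/thread-345-1.html",
    "http://ads.example.net/banner?id=7",
]
print(same_domain_paths(links, "http://bbs.example.com/"))
```

Only the relative paths returned here would be passed on to the intelligent analysis; the link on the foreign domain is dropped before any clustering takes place.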

Step 103: perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website;

In this step, the intelligent analysis can use various statistical or clustering algorithms. Its purpose is to classify or group the relative paths of the links so that later steps can extract or filter out the groups that satisfy a screening rule. The specific screening rules and process are described in detail in a later example.

One intelligent analysis method is given here as an example:

First, take the relative path of each link whose domain name matches that of the search task website as one row and form a link matrix;

Second, convert the link matrix to a numerical matrix, taking the length of the longest link as the row width of the matrix and padding shorter rows with 0;

Then, using an intelligent algorithm, convert the numerical matrix into a fuzzy matrix and compute the similarity between its rows. Group the rows of the fuzzy matrix according to this similarity. Because the rows of the fuzzy matrix, the numerical matrix, and the link matrix correspond one to one, this indirectly groups the rows of the numerical matrix and of the link matrix into categories as well.

Step 104: according to the result of the intelligent analysis, extract the feature expressions of links relevant to the search task website and/or filter out the feature expressions of links irrelevant to the search task website.

In this step, the clustering result has grouped the rows of the matrix, and the number of rows in each group can be used to judge whether the links in that group are relevant to the search task website.

When used in a web crawler, it is possible to perform only the extraction of feature expressions of links relevant to the search task website; these feature expressions correspond to useful links, which the web crawler then uses to continue its work.

To make the technical solution disclosed in this embodiment clearer, another embodiment of the invention describes in detail, by way of example, how web page features are extracted and how links irrelevant to the search task are filtered. As shown in Fig. 2, it mainly comprises the following steps:

Step 201: summarize the rules of the various website types.

Because the link features of different types of websites take different expression forms, different strategies are needed for different types of websites when extracting link features. Table 1 shows the guidance strategies for extracting link features from different types of websites.

Table 1

Website type	Number of features of valuable information links	Form of the feature expression
BBS forum website	2 to 3 (list page feature and post information feature)	Generally an identical character string
Blog website	2 to 3 (list page feature and blog content feature)	Generally an identical character string, or a string matching a certain type of regular expression rule
News website	3 or more	Generally a string matching a certain type of regular expression rule

As Table 1 shows, a link feature expression can be a continuous character string or a form that follows a certain rule, so each feature can be represented as a feature expression. Following the guidance strategies of Table 1, Table 2 gives example link feature rules for different types of websites.

Table 2

Website name	Website type	Features of valuable information links	Form of the feature expression
* net forum	BBS forum website	tableforum/App/index.php?bbsid; App/view.php?bbsid=	Identical character string
* net blog	Blog website	[0-9]{1,10}.html	Numeric rule
* News Network	News website	Content[0-9]{1,10}.htm; www.inetcop.com.cn	(1) Rule combining characters and digits; (2) continuous characters

Therefore, before extracting a website's link features, we can first determine the type of the website in order to obtain the guidance strategy for link feature extraction.

Each website type, such as BBS forum websites, blog websites, news portal websites, enterprise websites, and others, has its own specific composition rules, for example containing certain significant characters or certain specific content components. In this step, algorithms such as data mining and machine learning are used to summarize the rules of each website type. In subsequent steps, the web crawler uses these website type rules to determine the type of the task website and thereby obtains the guidance strategy for link feature extraction.

The detailed process of this step is as follows:

a) Collect a number of web pages from websites of each type as training material for machine learning.

b) By analyzing a large number of web pages of various types, extract the content attribute features of the pages and represent each with an attribute identifier; Table 3 shows the correspondence between attribute identifiers and attribute features. A page of a website can then be represented by a combination of one or more attribute identifiers.

Table 3

Attribute identifier	Attribute feature description
A₁	Whether the URL contains the characters "bbs"
A₂	Whether the URL contains the characters "blog"
A₃	Whether the URL contains date feature characters
A₄	Whether the Meta tag in the page content contains the characters "bbs" or "community"
A₅	Whether the Meta tag in the page content contains the characters "blog" (in Latin or Chinese characters)
A₆	Navigation bar information length
A₇	Body text length
A₈	Whether the page contains time features
A₉	Whether the page contains replies or comments
A₁₀	Whether the page contains related links

c) Use a machine learning or data mining classification algorithm, such as a neural network or a decision tree classifier, to learn from the collected training material and obtain guidance rules describing each website type. Such a rule is expressed in terms of content attribute features, specifically as a combination of the attribute identifiers that represent them, for example:

If (A₆ and A₇ and A₁₀) then (this website belongs to the news website type);

This expression means: if a website has all three attribute features identified by A₆, A₇, and A₁₀, that is, according to Table 3, navigation bar information length, body text length, and related links, then the website is of the news type.

As an example, the rules obtained with a decision tree are listed here, as shown in Fig. 4.

C₁ denotes the news website type;

C₂ denotes the forum website type;

C₃ denotes the blog website type;

C₄ denotes other types;

The extracted rules are as follows:

C₁ rules: if (A₆=1 and A₇=1 and A₁₀=1) then C₁

if (A₆=1 and A₃=1) then C₁

C₂ rules: if (A₈=1 and A₁=1) then C₂

if (A₈=1 and A₄=1) then C₂

C₃ rules: if (A₅=1) then C₃

if (A₂=1) then C₃

C₄ rules: if (A₆=1 and A₇=1 and A₁₀=0) then C₄

if (A₆=1 and A₇=0 and A₃=0) then C₄

if (A₆=0 and A₂=0 and A₅=0 and A₈=0) then C₄

if (A₆=0 and A₂=0 and A₅=0 and A₈=1 and A₁=0 and A₄=0) then C₄

Here, 1 indicates that the feature is present and 0 that it is absent.
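The rule list above can be sketched as a small lookup table. This is only an illustration: the rules are copied from the text, but evaluating them in the listed order with the first match winning, treating missing attributes as 0, and falling back to C₄ when no rule matches are all assumptions, since the decision tree of Fig. 4 is not reproduced here. The names `RULES`, `CLASS_NAMES`, and `classify` are hypothetical.

```python
# Each entry is (required attribute values, resulting class), copied from the rules above.
RULES = [
    ({"A6": 1, "A7": 1, "A10": 1}, "C1"),
    ({"A6": 1, "A3": 1}, "C1"),
    ({"A8": 1, "A1": 1}, "C2"),
    ({"A8": 1, "A4": 1}, "C2"),
    ({"A5": 1}, "C3"),
    ({"A2": 1}, "C3"),
    ({"A6": 1, "A7": 1, "A10": 0}, "C4"),
    ({"A6": 1, "A7": 0, "A3": 0}, "C4"),
    ({"A6": 0, "A2": 0, "A5": 0, "A8": 0}, "C4"),
    ({"A6": 0, "A2": 0, "A5": 0, "A8": 1, "A1": 0, "A4": 0}, "C4"),
]

CLASS_NAMES = {"C1": "news", "C2": "forum", "C3": "blog", "C4": "other"}


def classify(attrs):
    """Return the class of the first rule whose conditions all hold.

    Attributes missing from `attrs` count as 0 (assumption), and C4 is
    returned when no rule matches (assumption).
    """
    for conditions, label in RULES:
        if all(attrs.get(name, 0) == value for name, value in conditions.items()):
            return label
    return "C4"


# Hypothetical site: URL contains "bbs", pages carry time features and replies.
site = {"A1": 1, "A8": 1, "A9": 1}
label = classify(site)
print(label, CLASS_NAMES[label])
```

Keeping the rules as data rather than as nested `if` statements makes it easy to replace them when the training step produces a different decision tree.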

Step 202: download the home page and some page content of the search task website, and use the website type rules from step 201 to determine the type of the search task website.

In this step, the attribute features of the pages are first extracted with reference to Table 3, and then the website type rules obtained in step 201 are used to determine the type of the task website.

Step 203: according to the type of the search task website, extract all links from its home page and some of its pages using a link extraction algorithm.

In this step, the link extraction algorithm can use regular expressions or character feature matching. Links mainly appear in the following places in a web page:

1. <BASE href=URL>: the base URL. In HTML, path information is often specified by URLs, and relative URLs are resolved against the base URL;

2. <A href=URL>: the most common form of link, used to link to another web page or to another place in the same page;

3. <LINK href=URL>: used to link to the address of a CSS style sheet or JavaScript page used by the web page;

4. <FRAME src=URL>: in a frameset page, links to the address of the page the frame points to;

5. <IFRAME src=URL>: IFRAME is a kind of frame; unlike FRAME, an IFRAME can be embedded anywhere in the page.

With the character feature matching algorithm, the task is simply to find the links that follow the strings "href" and "src";

With regular expressions, we construct the regular expression <.*?(src|href)\s*=\s*["|']?(?<uri>[^'"\s]+). With this regular expression we can obtain all the links in a page.
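The regular-expression approach can be sketched in Python as follows. The pattern is an adaptation, not the patent's exact expression: Python spells the named group as `(?P<uri>...)`, and this sketch additionally restricts the match to a single tag with `[^>]*?` and stops an unquoted URL at `>`. The function name `extract_links` and the sample HTML are hypothetical.

```python
import re

# Adapted pattern: find src= or href= inside one tag and capture the URL that follows.
LINK_RE = re.compile(
    r"""<[^>]*?\b(?:src|href)\s*=\s*["']?(?P<uri>[^'"\s>]+)""",
    re.IGNORECASE,
)


def extract_links(html):
    """Return every src/href value found in the page, in document order."""
    return [match.group("uri") for match in LINK_RE.finditer(html)]


# Hypothetical page fragment covering quoted, single-quoted, and unquoted attributes.
html = """<base href="http://bbs.example.com/">
<a href="/forum-12-1.html">Board</a>
<A HREF='/thread-345-1.html'>Post</A>
<iframe src=/ads.php?id=7></iframe>"""

for uri in extract_links(html):
    print(uri)
```

A regular expression is enough for a sketch like this, but it will also pick up attributes it was not meant to (for example ones ending in `-href`); a real crawler would more likely rely on an HTML parser.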

Step 204: split each link into a domain name part and a relative path part.

Step 205: filter out links whose domain name differs from that of the search task website.

Step 206: perform statistical clustering on the relative paths, extract the feature expressions of links relevant to the search task website, and filter out links irrelevant to the search task website.

Here, links irrelevant to the search task website are generally junk information links that have very low relevance to the useful links.

In this step, the relative path parts of the links whose domain name matches that of the search task website form a link matrix, which is converted to a numerical matrix using the corresponding character encoding, such as ASCII or Unicode. ASCII is used here as an example to describe the conversion.

A web page contains multiple links; the character-type relative paths of these links form a matrix, as shown in Table 4:

Table 4

Using the decimal ASCII value of each character, the matrix is converted to the numerical matrix shown in Table 5. The length of the longest link is taken as the row width of the matrix, and shorter rows are padded with 0. X₁ to X₇ denote the first to seventh rows of the matrix, respectively.
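The conversion to a zero-padded numerical matrix can be sketched as below. The relative paths are hypothetical stand-ins, since the contents of Table 4 are not reproduced here, and `to_numeric_matrix` is an illustrative name.

```python
def to_numeric_matrix(paths):
    """Encode each relative path as a row of character codes, zero-padded to the longest path."""
    width = max(len(path) for path in paths)
    return [[ord(ch) for ch in path] + [0] * (width - len(path)) for path in paths]


# Hypothetical relative paths; each one becomes a row of the matrix.
paths = ["/ads.php?", "/forum-1", "/thread-2"]
for row in to_numeric_matrix(paths):
    print(row)
```

For ASCII input, `ord` returns exactly the decimal ASCII value, so every row has the same width and row i still corresponds to link i.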

Table 5

An intelligent algorithm groups the rows X₁, X₂, ..., X₇ of this numerical matrix into several categories, which are the website link feature categories to be extracted. The algorithm used to analyze the matrix can be a statistical inductive algorithm, an artificial intelligence algorithm such as a genetic algorithm, or a clustering algorithm from data mining.

Here the fuzzy clustering algorithm from data mining is used as an illustration. The main steps are as follows:

a) Normalize the matrix to convert it into a fuzzy matrix suitable for clustering. There are many normalization methods, such as the weighting method and the maximum-value method. The maximum-value method is used here as an example, with the following formula:

x_{ij} = \frac{x_{ij}}{\max(x_{i1}, \ldots, x_{in})}

where x_{ij} denotes the j-th element of row i and n is the row width of the matrix.
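A minimal sketch of the maximum-value normalization in the formula above: every element is divided by the largest element of its own row. The function name `normalize_rows` and the sample values are hypothetical, and mapping an all-zero row to zeros is an assumption added to avoid division by zero.

```python
def normalize_rows(matrix):
    """Divide every element by the maximum of its row, giving values in [0, 1]."""
    fuzzy = []
    for row in matrix:
        row_max = max(row)
        # An all-zero row is left as zeros (assumption; the text does not cover it).
        fuzzy.append([value / row_max if row_max else 0.0 for value in row])
    return fuzzy


# Hypothetical zero-padded rows of character codes.
numeric = [[47, 97, 100, 0], [47, 102, 0, 0]]
for row in normalize_rows(numeric):
    print(row)
```

After this step the largest element of each non-zero row is exactly 1, so the rows can be treated as rows of a fuzzy matrix.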

The converted matrix is shown in Table 6 below:

Table 6

b) Cluster the fuzzy matrix above using the similarity coefficient method. There are many formulas for the similarity coefficient, such as the scalar product method, the angle cosine method, the correlation index method, and the closeness degree method. The angle cosine formula is used here as an example:

r_{ij} = \frac{\sum_{k=1}^{m} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{m} x_{ik}^{2}} \sqrt{\sum_{k=1}^{m} x_{jk}^{2}}}

where r_{ij} denotes the similarity between row i and row j;

x_{ik} denotes the k-th element of row i;

x_{jk} denotes the k-th element of row j.
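The angle cosine formula can be sketched directly from its definition. The function names and the sample fuzzy rows are hypothetical; the sketch assumes no row is entirely zero, which holds when every relative path starts with "/".

```python
import math


def cosine_similarity(row_i, row_j):
    """Angle cosine r_ij between two rows of the fuzzy matrix."""
    dot = sum(a * b for a, b in zip(row_i, row_j))
    norm_i = math.sqrt(sum(a * a for a in row_i))
    norm_j = math.sqrt(sum(b * b for b in row_j))
    return dot / (norm_i * norm_j)


def similarity_matrix(fuzzy):
    """Pairwise similarity of all rows; the result is symmetric."""
    return [[cosine_similarity(a, b) for b in fuzzy] for a in fuzzy]


# Hypothetical fuzzy rows (already normalized to [0, 1]).
fuzzy = [[0.4, 0.9, 1.0, 0.0], [0.4, 1.0, 0.9, 0.1], [1.0, 0.1, 0.0, 0.5]]
for row in similarity_matrix(fuzzy):
    print([round(value, 3) for value in row])
```

Each row's similarity to itself is 1 up to floating-point rounding, which is why a similarity table such as Table 7 has 1 on its diagonal.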

c) After the calculation in step b), we obtain the similarity between every pair of rows; the result is shown in Table 7.

Table 7

	X₁	X₂	X₃	X₄	X₅	X₆	X₇
X₁	1	0.734	0.735	0.732	0.717	0.720	0.719
X₂	0.734	1	0.999	0.999	0.729	0.732	0.731
X₃	0.735	0.999	1	0.999	0.730	0.733	0.731
X₄	0.732	0.999	0.999	1	0.729	0.732	0.730
X₅	0.717	0.729	0.730	0.729	1	0.999	0.999
X₆	0.720	0.732	0.733	0.732	0.999	1	0.999
X₇	0.719	0.731	0.731	0.730	0.999	0.999	1

According to the similarity results in Table 7 and the equivalence closure of the similarity relation, the clustering yields three classes:

S₁: X₁

S₂: X₂, X₃, X₄

S₃: X₅, X₆, X₇
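One way to reproduce this grouping from the values of Table 7 is the standard fuzzy-clustering recipe: compute the max-min transitive closure of the similarity matrix, then cut it at a level λ. The text does not state which cut level was used; λ = 0.9 below is an assumption, and any value above 0.735 and up to 0.999 gives the same three classes. The function names are hypothetical.

```python
# Similarity values from Table 7 (rows and columns X1..X7).
R = [
    [1.000, 0.734, 0.735, 0.732, 0.717, 0.720, 0.719],
    [0.734, 1.000, 0.999, 0.999, 0.729, 0.732, 0.731],
    [0.735, 0.999, 1.000, 0.999, 0.730, 0.733, 0.731],
    [0.732, 0.999, 0.999, 1.000, 0.729, 0.732, 0.730],
    [0.717, 0.729, 0.730, 0.729, 1.000, 0.999, 0.999],
    [0.720, 0.732, 0.733, 0.732, 0.999, 1.000, 0.999],
    [0.719, 0.731, 0.731, 0.730, 0.999, 0.999, 1.000],
]


def transitive_closure(r):
    """Max-min transitive closure, computed by repeated squaring until stable."""
    n = len(r)
    while True:
        squared = [
            [max(min(r[i][k], r[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        if squared == r:
            return r
        r = squared


def lambda_cut_clusters(r, lam):
    """Group row indices whose closure similarity is at least lam."""
    closure = transitive_closure(r)
    clusters, seen = [], set()
    for i in range(len(closure)):
        if i in seen:
            continue
        group = [j for j in range(len(closure)) if closure[i][j] >= lam]
        seen.update(group)
        clusters.append(group)
    return clusters


# lam = 0.9 is an assumed cut level.
for group in lambda_cut_clusters(R, 0.9):
    print(["X%d" % (index + 1) for index in group])
```

Because the closure is an equivalence relation at every cut level, the groups never overlap, which is what allows each row (and therefore each link) to be assigned to exactly one class.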

d) According to the scale of the matrix, set a threshold on the number of elements in a class to distinguish which classes are feature classes of valuable information links and which are feature classes of links irrelevant to the search task website.

Here the threshold can be an interval, for example (m × 0.4, m × 0.6), where m is the number of rows of the matrix. In this example m = 7, so the threshold interval is (2.8, 4.2). S₁ has 1 element, which is outside the interval, so it is judged to be a feature class of links irrelevant to the search task website; S₂ and S₃ each have 3 elements, which is inside the interval, so they are judged to be classes of valuable information links.

In addition, the judgment in this step can also take into account the type of the search task website already determined. For example, if a website type has 2 to 3 features of valuable information links, then X₁ in this example, which forms a class by itself, can be judged to be a feature of a valueless link.
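The interval test of step d) can be sketched as follows, using the three classes and m = 7 from this example. The function name `split_by_group_size` and its parameter names are hypothetical; the 0.4 and 0.6 factors are the ones given in the text, and the interval is treated as open.

```python
def split_by_group_size(clusters, m, low=0.4, high=0.6):
    """Classes whose size lies strictly inside (m*low, m*high) count as valuable."""
    lower, upper = m * low, m * high
    valuable = [name for name, rows in clusters.items() if lower < len(rows) < upper]
    irrelevant = [name for name in clusters if name not in valuable]
    return valuable, irrelevant


# Classes from the example above; m is the number of rows of the matrix.
clusters = {"S1": ["X1"], "S2": ["X2", "X3", "X4"], "S3": ["X5", "X6", "X7"]}
valuable, irrelevant = split_by_group_size(clusters, m=7)
print("valuable:", valuable)
print("irrelevant:", irrelevant)
```

With m = 7 the interval is (2.8, 4.2), so only classes of size 3 or 4 would be kept as valuable information link classes.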

e) Convert the elements of each class back to their original characters using the corresponding ASCII values. Then, according to the type of the website, apply the guidance strategy for link feature extraction proposed in step 201 and combine it with the results of the preceding steps. This gives the following conclusions for the search task website:

1. Website type: BBS forum website;

2. Features of valuable information links: /forum- and /thread-;

3. Feature of links irrelevant to the search task website: /ads.php?

In this embodiment, the web crawler starts crawling the whole website using the filtered website link features obtained in step 206. Using link features to filter out useless links in this way improves working efficiency and search accuracy.
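A minimal sketch of how a crawler might apply the features from this example when deciding which relative paths to follow. The three feature strings come from the conclusions above; plain substring matching, the rule that irrelevant features take precedence, the function name `should_crawl`, and the sample paths are all assumptions.

```python
# Feature expressions taken from the example conclusions above.
VALUABLE_FEATURES = ["/forum-", "/thread-"]
IRRELEVANT_FEATURES = ["/ads.php?"]


def should_crawl(relative_path):
    """Follow a link only if it matches a valuable feature and no irrelevant one."""
    if any(feature in relative_path for feature in IRRELEVANT_FEATURES):
        return False
    return any(feature in relative_path for feature in VALUABLE_FEATURES)


# Hypothetical relative paths found while crawling.
for path in ["/forum-12-1.html", "/thread-345-1.html", "/ads.php?id=7", "/help.html"]:
    print(path, should_crawl(path))
```

Because the decision uses only the link text, no page has to be downloaded to reject a link, which is the efficiency gain the embodiment describes compared with content filtering.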

Another embodiment of the invention discloses a device for screening links on web pages which, as shown in Fig. 3, comprises a link extraction module 301, a link splitting module 302, an intelligent analysis module 303, and a link feature screening module 304, wherein:

The link extraction module 301 is used to extract links from the home page and some pages of the search task website.

The link splitting module 302 is used to split each link into a domain name part and a relative path part.

The intelligent analysis module 303 is used to perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website. The intelligent analysis can use a variety of algorithms, such as artificial intelligence algorithms, statistical algorithms, or clustering algorithms.

The link feature screening module 304 is connected to the intelligent analysis module 303 and is used, according to the result of the intelligent analysis of the relative paths by the intelligent analysis module 303, to extract the feature expressions of links relevant to the search task website and/or to filter out the feature expressions of links irrelevant to the search task website. Once the feature expressions extracted by the link feature screening module 304 are restored to links, the web crawler can use these links to continue its work. Filtering out the feature expressions of irrelevant links removes useless links before the crawler continues, which improves its working efficiency.

Preferably, the device may also comprise a filtering module 305 for filtering out links whose domain name differs from that of the search task website. This module plays an auxiliary role by removing obviously irrelevant links while the extracted links are being processed.

Preferably, the device may also comprise a website type rule base 306, a download module 307, and a website type judgment module 308, wherein:

The website type rule base 306 is used to store the rules of the various website types summarized in advance. These rules are summarized by analyzing a large number of websites and web pages with algorithms such as data mining and machine learning. Because the link features of different types of websites take different expression forms, as shown in Table 1, the type of a website must be determined before its link features are extracted, in order to obtain the guidance strategy shown in Table 1.

The download module 307 is used to download the home page and some page content of the search task website.

The website type judgment module 308 is connected to the website type rule base 306 and the download module 307, and is used to determine the type of the search task website by querying the website type rule base 306 according to the home page and page content downloaded by the download module 307.

In this way, the link extraction module 301 is also used to extract links according to the website type of the search task website. Once the type of the search task website is known, the link extraction module 301 can obtain the rules of that website type from the website type rule base 306; these rules reflect information such as the position and form of links on the pages of the search task website. Using this information, the link extraction module 301 can extract links more efficiently.

Preferably, the link feature screening module 304 of the device may comprise a combination module 3041, a conversion module 3042, a grouping module 3043, and a screening submodule 3044, wherein:

The combination module 3041 is used to take the relative path of each link whose domain name matches that of the search task website as one row and form a link matrix.

The conversion module 3042 is used to convert the link matrix to a numerical matrix, taking the length of the longest link as the row width of the matrix and padding shorter rows with 0.

The grouping module 3043 is used to group the rows of the numerical matrix into categories by means of an intelligent algorithm.

The screening submodule 3044 is used to judge, according to the number of rows in each group, whether the links in the group are relevant to the search task website, and to extract the feature expressions of links relevant to the search task website and/or filter out the feature expressions of links irrelevant to the search task website.

Here, the rows of the numerical matrix and of the link matrix correspond one to one, so the rows of each group also correspond to links in the link matrix. When a group is judged to be relevant to the search task website, the screening submodule 3044 can use this simple correspondence to extract the feature expressions of the links relevant to the search task website.

In this way, web page links are split into a domain name part and a relative path part; links whose domain name differs from that of the search task website are filtered out first; and intelligent analysis of the relative paths then yields the website's link features. This filters out links to information irrelevant to the searched website, such as junk information links, and improves the working efficiency and search accuracy of the web crawler.

In summary, the technical solution proposed in the embodiments of the invention extracts features from web page links and filters out links that are useless for the search. This considerably improves the working efficiency and search accuracy of web crawlers in the search engine field, particularly in vertical search, reduces manual workload, and improves the flexibility of the system. It has potential for wide application to various types of websites, such as BBS websites, blog websites, and news websites.

The above is only an embodiment of the invention and a concrete illustration of the inventive concept; it does not limit the invention. For those skilled in the art, any modification, variation, or equivalent replacement made within the spirit and principles of the invention falls within the scope of protection of the invention.

Claims

1. A method for screening links on web pages, characterized by comprising:

Splitting said links into a domain name part and a relative path part;

2. The method according to claim 1, characterized by further comprising:

Filtering out links whose domain name differs from that of said search task website.

3. The method according to claim 1 or 2, characterized by further comprising:

Summarizing the rules of the various website types in advance;

Downloading the home page and some related page content of said search task website, and determining the website type of said search task website according to the rules of said various website types;

Said step of extracting links from the home page and some related pages of the search task website is specifically: extracting links according to the website type of said search task website.

4. The method according to claim 1, characterized in that said step of performing intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website is specifically:

Taking the relative path of each link whose domain name is the same as that of the search task website as one row to form a link matrix;

Converting said link matrix to a numerical matrix;

Grouping the rows of said numerical matrix into categories by means of an intelligent algorithm.

5. The method according to claim 4, characterized in that said intelligent algorithm is one of a statistical inductive algorithm, a genetic algorithm from artificial intelligence, and a clustering algorithm from data mining.

6. The method according to claim 4 or 5, characterized by further comprising:

Judging, according to the number of rows in each said group, whether the links corresponding to the group are links relevant to said search task website.

7. The method according to claim 4 or 5, characterized in that said step of grouping the rows of said numerical matrix into categories by means of an intelligent algorithm is specifically:

Converting said numerical matrix into a fuzzy matrix;

Obtaining the similarity between the rows of said fuzzy matrix;

Grouping the rows of said fuzzy matrix according to said similarity.

8. A device for screening links on web pages, characterized by comprising:

9. The device according to claim 8, characterized by further comprising:

A filtering module, used to filter out links whose domain name differs from that of said search task website.

10. The device according to claim 8 or 9, characterized by further comprising:

A website type rule base, used to store the rules of the various website types summarized in advance;

A download module, used to download the home page and some page content of said search task website;

A website type judgment module, connected to the website type rule base and the download module, and used to determine the type of said search task website by querying the website type rule base according to the home page and page content of the search task website downloaded by the download module;

Said link extraction module is also used to extract links according to the website type of said search task website.

11. The device according to claim 8 or 9, characterized in that said link feature screening module comprises:

A combination module, used to take the relative path of each link whose domain name is the same as that of the search task website as one row to form a link matrix;

A conversion module, used to convert said link matrix to a numerical matrix;

A grouping module, used to group the rows of said numerical matrix into categories by means of an intelligent algorithm;

A screening submodule, used to judge, according to the number of rows in each said group, whether the links corresponding to the group are links relevant to said search task website, and to extract the feature expressions of links relevant to the search task website and/or filter out the feature expressions of links irrelevant to the search task website.