Embodiment
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. Embodiments of the invention extract features from web page links and screen the links on a page: links carrying valuable information are extracted and useless links are filtered out, which improves the work efficiency and search accuracy of a web crawler.
One embodiment of the present invention provides a method for screening links on web pages which, as shown in Figure 1, comprises the following steps:
Step 101: extract links from the home page of the search task website and from some related pages.
In this step, links can be extracted by a link extraction algorithm. The related pages here are the pages pointed to by the links contained in the home page. The rules of the various website types may also be summarized in advance; the home page and some related page content of the search task website are then downloaded, and the website type to which the search task website belongs is determined from that content and the rules of the various website types. The rules of a website type reflect the characteristics of the layout of links in its web pages and the form in which the features of valuable information links are expressed. Links are then extracted according to the website type of the search task website, which improves the efficiency of link extraction.
Step 102: split each link into a domain name part and a relative path part.
In this step, links whose domain name differs from that of the search task website may first be filtered out; such links can be regarded as directly unrelated to the search task website.
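As an illustration of steps 101 and 102, the split and the domain filter could be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the function names `split_link` and `same_domain_paths` and the example URLs are hypothetical, and it assumes that the query string counts as part of the relative path.

```python
from urllib.parse import urljoin, urlsplit


def split_link(link, base_url):
    """Resolve a link against the page URL and split it into
    (domain name, relative path). The query string is kept with the path
    (an assumption made for this sketch)."""
    parts = urlsplit(urljoin(base_url, link))
    relative = parts.path
    if parts.query:
        relative += "?" + parts.query
    return parts.netloc.lower(), relative


def same_domain_paths(links, base_url):
    """Keep only the relative paths of links on the task website's domain."""
    site_domain = urlsplit(base_url).netloc.lower()
    paths = []
    for link in links:
        domain, relative = split_link(link, base_url)
        if domain == site_domain:
            paths.append(relative)
    return paths
```

For example, given a page at http://bbs.example.com/index.html, a link to a different host would be dropped while relative and same-host links would be reduced to their relative paths.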
Step 103: perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website.
In this step, the intelligent analysis may use various statistical or clustering algorithms. Its purpose is to classify or group the relative paths of multiple links so that, in later steps, the groups that satisfy a screening rule can be extracted or filtered out. The specific screening rules and process are described in detail in a later example.
An example of the intelligent analysis method is as follows:
First, the relative path of each link whose domain name is the same as that of the search task website is taken as one row, forming a link matrix.
Second, the link matrix is converted into a numerical matrix; the length of the longest link is used as the row width of the matrix, and shorter rows are padded with 0.
Then, by an intelligent algorithm, the numerical matrix is converted into a fuzzy matrix, and the similarity between the rows of the fuzzy matrix is calculated. The rows of the fuzzy matrix are grouped according to this similarity. Because the rows of the fuzzy matrix, the numerical matrix, and the link matrix correspond one to one, this indirectly groups the rows of the numerical matrix and of the link matrix into different classes.
Step 104: according to the result of the intelligent analysis, extract the feature expressions of links related to the search task website, and/or filter out the feature expressions of links unrelated to the search task website.
In this step, the clustering result has divided the matrix rows into groups; whether the links corresponding to a group are related to the search task website can be judged from the number of rows in the group.
When a web crawler uses the method, it may perform only the function of extracting the feature expressions of links related to the search task website; these feature expressions correspond to useful links, and the web crawler continues its work using these links.
To make the technical scheme disclosed in this embodiment clearer, another embodiment of the invention describes by way of example, in detail, the extraction of web page features and the method of filtering links unrelated to the search task. As shown in Figure 2, it mainly comprises the following steps:
Step 201: summarize the rules of the various website types.
Because the link features of different types of websites take different forms, different strategies are needed for different types of websites when link features are extracted. Table 1 shows the guidance strategies for extracting the link features of different types of websites.
Table 1
Website type | Number of features of valuable information links | Form of the feature expression
BBS forum website | 2 to 3 (list page feature and post content feature) | Generally an identical character string
Blog website | 2 to 3 (list page feature and blog content feature) | Generally an identical character string, or a string that matches a regular expression rule of a certain type
News website | 3 or more | Generally a string that matches a regular expression rule of a certain type
As Table 1 shows, a link feature may take the form of a continuous character string or may follow a certain rule, so each feature can be represented as a feature expression. Following the guidance strategies in Table 1, Table 2 shows examples of link feature rules for different types of websites.
Table 2
Website name | Website type | Features of valuable information links | Form of the feature expression
* net forum | BBS forum website | tableforum/App/index.php?bbsid ; App/view.php?bbsid= | Identical character string
* net blog | Blog website | [0-9]{1,10}.html | Numeric rule
* News Network | News website | Content[0-9]{1,10}.htm ; www.inetcop.com.cn | (1) Rule combining characters and digits; (2) continuous character string
Therefore, before extracting the link features of a website, we can first determine the type of the website in order to obtain the guidance strategy for link feature extraction.
Each website type, such as BBS forum websites, blog websites, news portal websites, enterprise websites, and other types of websites, has its own specific composition rules, for example containing certain distinctive characters or certain specific components. In this step, algorithms such as data mining and machine learning are used to summarize the rules of each type of website. In subsequent steps, the web crawler uses the website type rules to determine the type of the task website and thereby obtains the guidance strategy for link feature extraction.
The detailed process of this step is as follows:
a) Collect a number of web pages from each type of website as training material for machine learning.
b) By analyzing a large number of web pages of various types, extract the content attribute features of the web pages and represent each with an attribute identifier; Table 3 lists the attribute identifiers and the corresponding attribute features. In this way, the web pages of a given kind of website can be represented by a combination of one or more attribute identifiers.
Table 3
Attribute identifier | Description of the attribute feature
A1 | Whether the URL contains the characters "bbs"
A2 | Whether the URL contains the characters "blog"
A3 | Whether the URL contains date feature characters
A4 | Whether a Meta tag in the page content contains the characters "bbs" or "community"
A5 | Whether a Meta tag in the page content contains the characters "blog" (or the equivalent word for "blog")
A6 | Length of the navigation bar information
A7 | Length of the body text
A8 | Whether time features are included
A9 | Whether replies or comments are included
A10 | Whether related links are included
c) Use a machine learning algorithm or a data mining classification algorithm, such as a neural network algorithm or a decision tree classification algorithm, to learn from the collected training material and obtain guidance rules that describe a given website type. Such a rule can be expressed in terms of content attribute features, specifically as a combination of the attribute identifiers that represent those features, for example:
if (A6 and A7 and A10) then (the website belongs to the news website type);
This expression means that if a website has the attribute features corresponding to the three attribute identifiers A6, A7, and A10 at the same time, that is, according to Table 3, the website has the navigation bar information length, body text length, and related link features simultaneously, then the website is a news website.
As an example, the list of rules obtained by a decision tree is given here, as shown in Figure 4.
C1 denotes the news website type;
C2 denotes the forum website type;
C3 denotes the blog website type;
C4 denotes other types;
The extracted rules are as follows:
C1 rules:
if (A6=1 and A7=1 and A10=1) then C1
if (A6=1 and A3=1) then C1
C2 rules:
if (A8=1 and A1=1) then C2
if (A8=1 and A4=1) then C2
C3 rules:
if (A5=1) then C3
if (A2=1) then C3
C4 rules:
if (A6=1 and A7=1 and A10=0) then C4
if (A6=1 and A7=0 and A3=0) then C4
if (A6=0 and A2=0 and A5=0 and A8=0) then C4
if (A6=0 and A2=0 and A5=0 and A8=1 and A1=0 and A4=0) then C4
Here, 1 means that the feature is present and 0 means that the feature is absent.
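The rule list above can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name `classify_website` is hypothetical, each attribute is assumed to have already been reduced to 0 or 1 (including A6 and A7, which Table 3 describes as lengths), and the rules are assumed to be tried in the order listed, with the first match deciding the type.

```python
# Each rule pairs a website type with the attribute values it requires.
# Keys are attribute indices from Table 3 (6 stands for A6, and so on).
RULES = [
    ("C1", {6: 1, 7: 1, 10: 1}),
    ("C1", {6: 1, 3: 1}),
    ("C2", {8: 1, 1: 1}),
    ("C2", {8: 1, 4: 1}),
    ("C3", {5: 1}),
    ("C3", {2: 1}),
    ("C4", {6: 1, 7: 1, 10: 0}),
    ("C4", {6: 1, 7: 0, 3: 0}),
    ("C4", {6: 0, 2: 0, 5: 0, 8: 0}),
    ("C4", {6: 0, 2: 0, 5: 0, 8: 1, 1: 0, 4: 0}),
]


def classify_website(attributes):
    """Return the type label of the first rule whose conditions all hold.

    `attributes` maps an attribute index to 0 or 1; missing attributes are
    treated as 0. Returns None if no rule matches (a defensive default
    chosen for this sketch).
    """
    for label, conditions in RULES:
        if all(attributes.get(i, 0) == v for i, v in conditions.items()):
            return label
    return None
```

Trying the rules in listed order is a design choice of this sketch; it matters because some C1 and C4 conditions can hold at the same time (for example A6=1, A7=1, A10=0, A3=1).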
Step 202: download the home page and some page content of the search task website, and use the website type judgment rules from step 201 to determine the type to which the search task website belongs.
In this step, the attribute features of the web pages are first extracted with reference to Table 3, and the website type rules obtained in step 201 are then used to determine the type of the task website.
Step 203: according to the type of the search task website, extract all links from the home page and some pages of the search task website by a link extraction algorithm.
In this step, the link extraction algorithm may use regular expressions or character feature matching. Links may appear in a web page mainly in the following places:
1. <BASE href=URL> specifies the base URL. In HTML, path information is often specified by URLs, and a relative URL is resolved against the base URL;
2. <A href=URL> is the most common form of link, used to link to another web page or to another place in the same page;
3. <LINK href=URL> is used to link to the address of a CSS style sheet or JavaScript page used by the web page;
4. <FRAME src=URL> is used in a frame page to link to the address of the page that the frame points to;
5. <IFRAME src=URL>: IFRAME is a form of frame; unlike FRAME, an IFRAME can be embedded in any part of a web page.
If the character feature matching algorithm is used, the links following the character strings "href" and "src" are located.
If a regular expression is used, we can construct the regular expression <.*?(src|href)\s*=\s*["|']?(?<uri>[^'"\s]+). With this regular expression, all links in the web page can be obtained.
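The regular expression approach can be sketched in Python as follows. This is an adaptation rather than the exact expression of the embodiment: Python writes the named group as `(?P<uri>...)`, and this sketch additionally excludes `>` from the captured value so that an unquoted attribute does not swallow the end of the tag. The name `extract_links` is hypothetical.

```python
import re

# Matches the value of an href or src attribute inside a tag.
# Adapted from the expression in the text; excluding ">" is an
# adjustment made for this sketch.
LINK_RE = re.compile(
    r"""<.*?(?:src|href)\s*=\s*["']?(?P<uri>[^'"\s>]+)""",
    re.IGNORECASE,
)


def extract_links(html):
    """Return every href/src value found in the HTML text, in order."""
    return [match.group("uri") for match in LINK_RE.finditer(html)]
```

A pattern of this kind is a rough filter: it can also match text that merely looks like an attribute, so a production crawler might prefer a real HTML parser.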
Step 204: split each link into a domain name part and a relative path part.
Step 205: filter out the links whose domain name differs from that of the search task website.
Step 206: perform statistical clustering on the relative paths, extract the feature expressions of links related to the search task website, and filter out links unrelated to the search task website.
Here, links unrelated to the search task website are generally junk information links that have a very low degree of correlation with the useful links.
In this step, the relative path parts of the links whose domain name is the same as that of the search task website form a link matrix, which is converted into a numerical matrix using a character encoding such as ASCII or Unicode. ASCII encoding is used here as an example to describe the conversion process.
A web page contains multiple links; the character-form relative paths of these links form a matrix, as shown in Table 4:
Table 4
According to the corresponding decimal ASCII values, the link matrix is converted into the numerical matrix shown in Table 5; the length of the longest link is used as the row width of the matrix, and shorter rows are padded with 0. Here, X1 to X7 denote the first to seventh rows of the matrix, respectively.
Table 5
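The conversion from link matrix to numerical matrix can be sketched as below. The function name `to_numeric_matrix` is hypothetical, and the sketch assumes the relative paths contain only ASCII characters, so that Python's `ord` returns the decimal ASCII value.

```python
def to_numeric_matrix(paths):
    """Convert relative paths to rows of decimal character codes.

    The longest path sets the row width; shorter rows are padded with 0.
    """
    if not paths:
        return []
    width = max(len(path) for path in paths)
    return [
        [ord(ch) for ch in path] + [0] * (width - len(path))
        for path in paths
    ]
```

Each row of the result corresponds to the path at the same position in the input, which is the row correspondence that the later grouping relies on.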
For this numerical matrix, an intelligent algorithm is used to generalize X1, X2, ..., X7 into several classes; these classes are the website link feature classes that we want to extract. The intelligent algorithm used to analyze the matrix may be an inductive algorithm from statistics, an artificial intelligence algorithm such as a genetic algorithm, or a clustering algorithm from data mining.
Here a fuzzy clustering algorithm from data mining is used for illustration; the main steps are as follows:
a) "Normalize" the matrix to convert it into a fuzzy matrix so that clustering can be performed. There are many normalization methods, such as the weighting method and the extreme value method. Here the extreme value method is used as an example; the formula is as follows:
where x_ij denotes the j-th element of row i.
The matrix after conversion is shown in Table 6 below:
Table 6
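The normalization formula itself is not reproduced in this text, so the following is only one plausible reading of the extreme value method, stated as an assumption: every element is divided by the largest element of the matrix, which maps all values into the interval [0, 1]. The function name `normalize` is hypothetical.

```python
def normalize(matrix):
    """Scale all elements into [0, 1] by dividing by the largest element.

    Assumption of this sketch: the extreme value method is taken to mean
    division by the global maximum. An all-zero matrix is returned as zeros.
    """
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[value / peak for value in row] for row in matrix]
```

Because the padded matrix already has 0 as its smallest value whenever any row is padded, dividing by the maximum gives the same result as a min-max rescaling in that case.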
b) Cluster the above fuzzy matrix using the similarity coefficient method. There are also many formulas for the similarity coefficient, such as the scalar product method, the included angle cosine method, the correlation coefficient method, and the closeness method. Here the included angle cosine formula is used as an example; the formula is as follows:
where r_ij denotes the similarity between row i and row j;
x_ik denotes the k-th element of row i;
x_jk denotes the k-th element of row j.
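The included angle cosine is a standard measure: with the symbols defined above, r_ij is the sum over k of x_ik times x_jk, divided by the square root of (the sum over k of x_ik squared) times (the sum over k of x_jk squared). A minimal sketch of that calculation follows; the function names are hypothetical, and returning 0.0 for an all-zero row is a convention chosen for this sketch.

```python
import math


def cosine_similarity(row_i, row_j):
    """Included angle cosine between two rows of equal length."""
    dot = sum(a * b for a, b in zip(row_i, row_j))
    norm_i = math.sqrt(sum(a * a for a in row_i))
    norm_j = math.sqrt(sum(b * b for b in row_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention for this sketch: an empty row matches nothing
    return dot / (norm_i * norm_j)


def similarity_matrix(rows):
    """Pairwise similarity r_ij for every pair of rows."""
    return [[cosine_similarity(a, b) for b in rows] for a in rows]
```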
c) After the calculation in step b), we obtain the similarity between each pair of rows; the results are shown in Table 7.
Table 7
 | X1 | X2 | X3 | X4 | X5 | X6 | X7
X1 | 1 | 0.734 | 0.735 | 0.732 | 0.717 | 0.720 | 0.719
X2 | 0.734 | 1 | 0.999 | 0.999 | 0.729 | 0.732 | 0.731
X3 | 0.735 | 0.999 | 1 | 0.999 | 0.730 | 0.733 | 0.731
X4 | 0.732 | 0.999 | 0.999 | 1 | 0.729 | 0.732 | 0.730
X5 | 0.717 | 0.729 | 0.730 | 0.729 | 1 | 0.999 | 0.999
X6 | 0.720 | 0.732 | 0.733 | 0.732 | 0.999 | 1 | 0.999
X7 | 0.719 | 0.731 | 0.731 | 0.730 | 0.999 | 0.999 | 1
According to the similarity results in Table 7 and the equivalence closure relation, we obtain a clustering result of three classes:
S1: X1
S2: X2 X3 X4
S3: X5 X6 X7
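The grouping above can be reproduced with a short sketch. Cutting the transitive closure of a similarity relation at a level lambda gives the same classes as linking every pair of rows whose similarity is at least lambda and taking the connected groups, which is what the code below does with a union-find structure. The function name `cluster_by_threshold` and the cut level 0.9 are assumptions of this sketch; the text does not state which level was used.

```python
# Similarity values copied from Table 7 (rows and columns X1..X7).
TABLE_7 = [
    [1, 0.734, 0.735, 0.732, 0.717, 0.720, 0.719],
    [0.734, 1, 0.999, 0.999, 0.729, 0.732, 0.731],
    [0.735, 0.999, 1, 0.999, 0.730, 0.733, 0.731],
    [0.732, 0.999, 0.999, 1, 0.729, 0.732, 0.730],
    [0.717, 0.729, 0.730, 0.729, 1, 0.999, 0.999],
    [0.720, 0.732, 0.733, 0.732, 0.999, 1, 0.999],
    [0.719, 0.731, 0.731, 0.730, 0.999, 0.999, 1],
]


def cluster_by_threshold(sim, lam):
    """Group row indices that are connected by similarities >= lam."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= lam:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Row indices are zero-based: 0 stands for X1, 1 for X2, and so on.
classes = cluster_by_threshold(TABLE_7, 0.9)
```

With the assumed cut level of 0.9, `classes` contains the three groups listed above: X1 alone, X2 to X4, and X5 to X7.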
d) According to the scale of the matrix, set a threshold on the number of elements in a class to distinguish which classes are feature classes of valuable information links and which are feature classes of links unrelated to the search task website.
Here, the threshold may be set as an interval, for example (m × 0.4, m × 0.6), where m is the number of rows of the matrix. In this example the threshold is (2.8, 4.2). It follows that S1, which has 1 element, is outside the threshold interval and is judged to be a feature class of links unrelated to the search task website; S2 and S3 each have 3 elements, which is within the threshold interval, and are judged to be classes of valuable information links.
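The threshold test in step d) can be sketched as follows. The function name `split_by_size` and its parameter names are hypothetical; the interval is treated as open, following the parentheses used in the text.

```python
def split_by_size(groups, m, low=0.4, high=0.6):
    """Separate groups by whether their size lies in (m * low, m * high).

    Returns (valuable, unrelated): groups inside the open interval are
    treated as valuable information link classes, the rest as unrelated.
    """
    lower, upper = m * low, m * high
    valuable = [g for g in groups if lower < len(g) < upper]
    unrelated = [g for g in groups if not (lower < len(g) < upper)]
    return valuable, unrelated
```

For the example in the text, m = 7 gives the interval (2.8, 4.2), so the two groups of three rows are kept and the single-row group is set aside.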
In addition, the judgment in this step may also be made according to the type of the search task website that has already been determined. For example, if the number of features of valuable information links for a given type of website is 2 to 3, then X1 in this example, which forms a class by itself, can be judged to be a feature of valueless links.
e) Convert the elements of each class back to their original characters according to the corresponding ASCII values. Then, according to the type of the website, apply the guidance strategy for link feature extraction of different website types proposed in step 201 and combine the results of the preceding steps. The following conclusions can be drawn for the search task website:
1. Website type: BBS forum website;
2. Features of valuable information links: /forum- and /thread-;
3. Feature of links unrelated to the search task website: /ads.php?
In this embodiment, the web crawler starts the crawling task for the whole website using the filtered website link features obtained in step 206. Using link features to filter out useless links in this way improves work efficiency and search accuracy.
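How a crawler might apply the extracted features can be sketched as below. This is an illustration under assumptions, not the crawler of the embodiment: the function name `keep_link` is hypothetical, features are matched as plain substrings of the relative path, and links that match no feature at all are dropped.

```python
def keep_link(relative_path, valuable_features, unrelated_features):
    """Decide whether the crawler should follow a link.

    A link is dropped if it contains any unrelated feature, and kept only
    if it contains at least one valuable feature (substring matching is an
    assumption of this sketch).
    """
    if any(feature in relative_path for feature in unrelated_features):
        return False
    return any(feature in relative_path for feature in valuable_features)


# Features taken from the example conclusions in step e).
VALUABLE = ["/forum-", "/thread-"]
UNRELATED = ["/ads.php?"]
```

For feature expressions that are rules rather than fixed strings, such as the numeric patterns in Table 2, the substring test would be replaced by a regular expression match.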
Another embodiment of the present invention discloses a device for screening links on web pages which, as shown in Figure 3, comprises link extraction module 301, link splitting module 302, intelligent analysis module 303, and link feature screening module 304, wherein:
Link extraction module 301 is used to extract links from the home page and some pages of the search task website.
Link splitting module 302 is used to split each link into a domain name part and a relative path part.
Intelligent analysis module 303 is used to perform intelligent analysis on the relative paths of the links whose domain name is the same as that of the search task website. Here, the intelligent analysis may use a variety of algorithms, such as artificial intelligence algorithms, statistical algorithms, or clustering algorithms.
Link feature screening module 304 is connected to intelligent analysis module 303 and is used, according to the result of the intelligent analysis of the relative paths by intelligent analysis module 303, to extract the feature expressions of links related to the search task website and/or to filter out the feature expressions of links unrelated to the search task website. Once the feature expressions of related links extracted by link feature screening module 304 are restored to links, the web crawler can continue its work using these links. Filtering out the feature expressions of links unrelated to the search task website eliminates useless links before the web crawler continues, which improves the work efficiency of the crawler.
Preferably, the device may further comprise filtering module 305, used to filter out links whose domain name differs from that of the search task website. This module plays an auxiliary role by removing obviously unrelated link information when the extracted links are processed.
Preferably, the device may further comprise website type rule base 306, download module 307, and website type judgment module 308, wherein:
Website type rule base 306 is used to store the rules of the various website types summarized in advance. These rules are summarized by analyzing a large number of websites and web pages with algorithms such as data mining and machine learning. Because the link features of different types of websites take different forms, as shown in Table 1, the type of a website must be determined before its link features are extracted in order to obtain the guidance strategy shown in Table 1.
Download module 307 is used to download the home page and some page content of the search task website.
Website type judgment module 308 is connected to website type rule base 306 and download module 307 and is used to determine the type to which the search task website belongs by querying website type rule base 306 according to the home page and partial page content downloaded by download module 307.
In this way, link extraction module 301 is also used to extract links according to the website type of the search task website. Once the type of the search task website is known, link extraction module 301 can obtain the rules of that website type from website type rule base 306; these rules reflect information such as the position and form of links on the web pages of the search task website. Using this information, link extraction module 301 can improve the efficiency of link extraction.
Preferably, link feature screening module 304 of the device may comprise combination module 3041, conversion module 3042, grouping module 3043, and screening submodule 3044, wherein:
Combination module 3041 is used to take the relative path of each link whose domain name is the same as that of the search task website as one row, forming a link matrix.
Conversion module 3042 is used to convert the link matrix into a numerical matrix; the length of the longest link is used as the row width of the matrix, and shorter rows are padded with 0.
Grouping module 3043 is used to group the rows of the numerical matrix into different classes by an intelligent algorithm.
Screening submodule 3044 is used to judge, according to the number of rows in each group, whether the links corresponding to the group are related to the search task website, to extract the feature expressions of links related to the search task website, and/or to filter out the feature expressions of links unrelated to the search task website.
Here, each row of the numerical matrix corresponds to a row of the link matrix, so the rows of each group also correspond to links in the link matrix. When a group is judged to be related to the search task website, screening submodule 3044 can extract the feature expressions of the links related to the search task website through this simple correspondence.
In this way, web page links are split into a domain name part and a relative path part; links whose domain name differs from that of the search task website are filtered out first, and the relative paths are then analyzed intelligently to obtain the features of the website's links. This filters out links to information unrelated to the search website, such as junk information links, and improves the work efficiency and search accuracy of the web crawler.
In summary, the technical scheme proposed in the embodiments of the invention extracts features from web page links and filters out links that are useless for the search. This considerably improves the work efficiency and search accuracy of web crawlers in the search engine field, particularly in vertical search, reduces manual workload, and improves system flexibility. The scheme has potential for wide application across various types of websites, such as BBS websites, blog websites, and news websites.
The above are only embodiments of the invention, given as concrete illustrations of the inventive concept, and do not limit the invention. For those skilled in the art, any modification, variation, or equivalent replacement made within the spirit and principles of the invention falls within the protection scope of the invention.