CN103838823A

CN103838823A - Website content accessibility detection method based on web page templates

Info

Publication number: CN103838823A
Application number: CN201410028740.6A
Authority: CN
Inventors: 王灿; 李凯; 周宇; 卜佳俊; 陈纯
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-01-22
Filing date: 2014-01-22
Publication date: 2014-06-04
Anticipated expiration: 2034-01-22
Also published as: CN103838823B

Abstract

A website content accessibility detection method based on web page templates is provided. The method comprises the following steps: all related web pages and resources of a website are crawled according to the web address to be detected; main-text filtering is applied to the web pages according to a main-text extraction algorithm, and the main-text nodes of each page's DOM tree are removed; a distance matrix between the web pages is computed for the page set according to a page-structure metric based on HTML tags; a hierarchical clustering threshold θ is set, hierarchical clustering is performed on all web pages according to the distance matrix, and several web pages are selected from each cluster as the templates of that cluster, forming a web page template set; the template-related detection rules are applied to the resulting template page set; the template-independent rules are applied to all other web page sets, and the detection results are aggregated, so that a detection result is obtained quickly.

Description

A website content accessibility detection method based on web page templates

Technical field

The present invention relates to the field of web page accessibility detection and remediation methods, and in particular to a website content accessibility detection method based on web page templates.

Background art

With the Internet flourishing, people with disabilities still face barriers when using the web because of their physical limitations. To alleviate this problem, in 2012 the Ministry of Industry and Information Technology promulgated the latest version of the communication industry standard YD/T 1761-2012, "Information accessibility: People with physical function differences: Technical requirements for website design accessibility", which sets accessibility requirements for website design; websites therefore need to be checked for accessibility. A website contains a very large number of web pages, so checking every page directly is problematic in both efficiency and accuracy and is hard to carry out in practice.

Detection rules can be divided into template-related and template-independent rules, according to whether a rule can be checked directly on a web page template. If all template pages of a website can be identified accurately, the efficiency of accessibility detection for template-related rules improves greatly. Traditional template extraction algorithms do not take into account the negative effect of main-text content on template extraction.

Summary of the invention

The present invention overcomes the above shortcomings of the prior art. It proposes a template extraction algorithm based on main-text filtering and page-structure clustering and, on the basis of this algorithm, a website content accessibility detection method based on web page templates. Main-text filtering extracts the main text and builds the DOM tree of each target page. The page set with the main text removed is then clustered and templates are found in it; detection is performed on the templates, which avoids checking a massive number of pages one by one. The invention provides a website content accessibility detection method based on web page templates, comprising the following steps:

(1) crawl all related web pages and resources of the website according to the web address to be detected; render all web pages and save the rendering results;

(2) apply main-text filtering to the web pages from step (1) according to a main-text extraction algorithm, removing the main-text nodes from each page's DOM tree;

(3) for the page set obtained in step (2), compute the distance matrix M between web pages according to a page-structure metric based on HTML tags;

(4) set a threshold θ on M and perform hierarchical clustering on all web pages; in each cluster, select several web pages as the templates of that cluster, forming the web page template set;

(5) apply the template-related detection rules to the template page set obtained in step (4);

(6) apply the template-independent rules to the page set obtained in step (2) and aggregate the results with those of step (5), thereby obtaining the detection result quickly.
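The split between steps (5) and (6) can be sketched as follows. This is a minimal illustration only: the two rules, their classification as template-related or template-independent, and the names `detect`, `missing_lang_rule` and `img_alt_rule` are hypothetical and do not come from the patent, which does not list concrete rules here.

```python
import re
from typing import Callable, Dict, Iterable, List

# A rule takes a page's HTML and returns a list of issue descriptions.
Rule = Callable[[str], List[str]]


def missing_lang_rule(html: str) -> List[str]:
    """Hypothetical template-related rule: <html> should declare a language."""
    if re.search(r"<html\b[^>]*\blang\s*=", html, re.I):
        return []
    return ["html element has no lang attribute"]


def img_alt_rule(html: str) -> List[str]:
    """Hypothetical template-independent rule: every <img> needs alt text."""
    issues = []
    for tag in re.findall(r"<img\b[^>]*>", html, re.I):
        if not re.search(r"\balt\s*=", tag, re.I):
            issues.append("img element has no alt attribute")
    return issues


def detect(pages: Dict[str, str], template_urls: Iterable[str],
           template_rules: List[Rule], page_rules: List[Rule]) -> Dict[str, List[str]]:
    """Run template-related rules on templates only, other rules on every page."""
    report: Dict[str, List[str]] = {url: [] for url in pages}
    for url in template_urls:              # step (5): template pages only
        for rule in template_rules:
            report[url].extend(rule(pages[url]))
    for url, html in pages.items():        # step (6): all filtered pages
        for rule in page_rules:
            report[url].extend(rule(html))
    return report
```

The saving comes from the first loop: a template-related rule runs once per template rather than once per page, so its cost grows with the number of clusters instead of the number of pages.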

The main-text filtering in step (2) comprises the following steps:

(2.1) build a DOM tree for every web page, filtering out of the page's HTML text the HTML tags that are unrelated to content;

(2.2) for the DOM tree built in step (2.1), compute the text density of each node; the node with the largest text density is the main-text block, where the text density is measured as:

DS_c = \sum_{i \in \mathrm{children}(c)} \mathrm{TextDensity}_i

Formula (1)

where i is a child node of node c and TextDensity_i is the text density of node i. The invention adopts a composite text density, which accounts for the fact that some hyperlink blocks have a high text density and interfere with the main-text block. The composite text density of node i is defined as follows:

\mathrm{TextDensity}_i = \frac{C_i}{T_i} \log_{\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)} \left(\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}\right)

Formula (2)

where C_i is the number of characters in the sub-DOM tree rooted at i, T_i is the number of HTML tags in that subtree, LC_i is the number of characters of hyperlink display text in the subtree, ¬LC_i is the number of characters of non-hyperlink display text, LT_i is the number of hyperlink tags, LC_b is the number of characters of hyperlink display text under the <body> tag, C_b is the number of characters under the <body> tag, and e is the natural constant.
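Formulas (1) and (2) can be sketched in Python as below. The `Node` container and all function names are hypothetical. The sketch assumes that ¬LC_i equals C_i minus LC_i, applies the rule stated in the embodiment that a zero denominator is set to 1, and adds two guards that are not in the patent: an empty subtree gets density 0, and a logarithm base of exactly 1 (a page without any hyperlink text) falls back to the plain density C_i / T_i.

```python
import math
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Per-subtree counts for one DOM node (hypothetical container)."""
    tag: str
    chars: int        # C_i: characters in the subtree
    tags: int         # T_i: HTML tags in the subtree
    link_chars: int   # LC_i: characters of hyperlink text
    link_tags: int    # LT_i: hyperlink tags
    children: List["Node"] = field(default_factory=list)


def _nz(x: float) -> float:
    """A denominator equal to 0 is replaced by 1."""
    return x if x != 0 else 1


def text_density(n: Node, body_chars: int, body_link_chars: int) -> float:
    """Composite text density of node n, following formula (2)."""
    non_link_chars = n.chars - n.link_chars  # assumed meaning of "not LC_i"
    base = math.log(
        n.chars / _nz(non_link_chars) * n.link_chars
        + body_link_chars / _nz(body_chars) * n.chars
        + math.e
    )
    arg = (n.chars / _nz(n.link_chars)) * (n.tags / _nz(n.link_tags))
    plain = n.chars / _nz(n.tags)
    if arg <= 0:
        return 0.0      # empty subtree (guard added here, an assumption)
    if base <= 1.0:
        return plain    # log base 1 is undefined (guard added here, an assumption)
    return plain * math.log(arg, base)


def density_sum(c: Node, body: Node) -> float:
    """DS_c of formula (1): sum of the children's text densities."""
    return sum(text_density(ch, body.chars, body.link_chars) for ch in c.children)


def walk(n: Node) -> Iterator[Node]:
    """Yield n and all of its descendants in document order."""
    yield n
    for ch in n.children:
        yield from walk(ch)


def main_text_node(body: Node) -> Node:
    """Node with the largest DS_c, taken as the main-text block."""
    return max(walk(body), key=lambda n: density_sum(n, body))
```

For a block that consists entirely of links, such as a navigation bar with one link per tag, the argument of the logarithm is 1 and the density is 0, which is how the composite density keeps link-heavy blocks from being mistaken for main text.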

The web page distance matrix in step (3) is computed as follows:

(3.1) count the number of occurrences of each HTML tag in web page D and build the feature vector v(D); the vector has N dimensions, where N is the total number of tag types that the W3C standard allows in an HTML document;

(3.2) compute the Euclidean distance between the feature vectors of the converted web pages to obtain the page-structure distance matrix.
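Steps (3.1) and (3.2) can be sketched with the Python standard library as follows. The tag vocabulary `TAGS` is a short illustrative list chosen for this example, not the full set of N tag types allowed by the W3C standard, and the function names are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser
from typing import List

# Illustrative vocabulary only; a real run would list every allowed tag type.
TAGS = ["html", "head", "title", "body", "div", "p", "a", "ul", "li",
        "img", "table", "tr", "td", "h1", "span"]


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def feature_vector(page: str) -> List[int]:
    """v(D): occurrence count of each tag in TAGS, in a fixed order."""
    counter = TagCounter()
    counter.feed(page)
    counter.close()
    return [counter.counts[t] for t in TAGS]


def distance_matrix(pages: List[str]) -> List[List[float]]:
    """Pairwise Euclidean distances between the pages' tag-count vectors."""
    vectors = [feature_vector(p) for p in pages]
    return [[math.dist(u, v) for v in vectors] for u in vectors]
```

Two pages built from the same template differ mostly in their text, so after main-text filtering their tag counts are close and their distance is small, while pages with different layouts end up far apart.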

The hierarchical clustering in step (4) proceeds as follows: all web pages are clustered bottom-up, that is, each web page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the threshold θ is reached.

The advantages of the present invention are as follows: the template extraction algorithm takes the negative effect of main text into account and is therefore more accurate; the method divides detection rules into template-related and template-independent rules and checks only the templates against template-related rules, which greatly improves the efficiency of accessibility detection.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Detailed description of the embodiments

Specific embodiments of the invention are described in detail below with reference to the accompanying drawing and the illustrated process.

The invention provides a website content accessibility detection method based on web page templates, comprising the following steps:

(1) use a distributed crawler to crawl all related web pages and resources of the website according to the web address to be detected; use a multithreaded rendering engine to render all web pages and save the rendering results;

(2) use the main-text extraction algorithm to apply main-text filtering to the web pages from step (1) according to formula (1) and formula (2), removing the main-text nodes from all page DOM trees;

(3) for the page set obtained in step (2), compute the distance matrix M between web pages according to the tag-vector distance based on HTML tags;

The main-text filtering in step (2) comprises the following steps:

(2.1) build a DOM tree for every web page, filtering out of the page's HTML text the comments, <script> tags, <noscript> tags, <style> tags, and tags whose CSS style is "display:none";

DS_c = \sum_{i \in \mathrm{children}(c)} \mathrm{TextDensity}_i

Formula (1)

\mathrm{TextDensity}_i = \frac{C_i}{T_i} \log_{\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)} \left(\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}\right)

Formula (2)

where ¬LC_i is the number of characters of non-hyperlink display text, LT_i is the number of hyperlink tags, LC_b is the number of characters of hyperlink display text under the <body> tag, C_b is the number of characters under the <body> tag, and e is the natural constant; when a denominator in the formula is 0, it is set to 1.
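The pre-filtering of step (2.1) can be sketched with the standard-library HTML parser as below. The class and function names are hypothetical. The sketch detects "display:none" only in inline `style` attributes and assumes reasonably well-nested markup; hidden elements styled through external or embedded stylesheets would need the computed styles from the rendering step, which this example does not model. It also drops the doctype declaration.

```python
import re
from html import escape
from html.parser import HTMLParser

SKIP_TAGS = {"script", "noscript", "style"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN = re.compile(r"display\s*:\s*none", re.I)


class ContentFilter(HTMLParser):
    """Re-emits a page without comments, scripts, styles and hidden subtrees."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a removed subtree

    @staticmethod
    def _hidden(attrs) -> bool:
        style = dict(attrs).get("style") or ""
        return bool(HIDDEN.search(style))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void elements never get an end tag
            if self.skip_depth == 0 and not self._hidden(attrs):
                self.out.append(self.get_starttag_text())
            return
        if self.skip_depth or tag in SKIP_TAGS or self._hidden(attrs):
            self.skip_depth += 1  # enter, or go deeper into, a removed subtree
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def filter_markup(page: str) -> str:
    """Return the page's HTML with content-unrelated markup removed."""
    f = ContentFilter()
    f.feed(page)
    f.close()
    return "".join(f.out)
```

A depth counter is used instead of a tag stack because only the question "are we inside a removed subtree" matters here; every nested start tag raises the counter and every end tag lowers it.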

(3.1) count the number of occurrences of each HTML tag in web page D and build the feature vector v(D); the vector has N dimensions, where N is the total number of tag types that the W3C standard allows in an HTML document;

The hierarchical clustering in step (4) is implemented as follows: the web pages remaining after main-text filtering are clustered bottom-up, that is, each web page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the largest clusters satisfying the condition are obtained; the reference distance for merging clusters is the single-linkage distance (Single Linkage) between the clusters to be merged.
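The bottom-up clustering with single linkage can be sketched as follows. Two points are assumptions of this example rather than statements of the patent: clusters are merged while their single-linkage distance is at most θ (the patent does not say whether the comparison is strict), and templates are picked as the pages with the smallest total distance to the other members of their cluster (the patent only says that several pages are chosen from each cluster). The function names are hypothetical.

```python
from typing import List, Sequence


def single_linkage_clusters(dist: Sequence[Sequence[float]],
                            theta: float) -> List[List[int]]:
    """Agglomerative clustering: merge the closest pair until it exceeds theta."""
    clusters = [[i] for i in range(len(dist))]  # every page starts alone
    while len(clusters) > 1:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > theta:
            break  # no remaining pair is close enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pick_templates(cluster: List[int], dist: Sequence[Sequence[float]],
                   k: int = 1) -> List[int]:
    """Pick the k most central pages of a cluster (selection rule assumed)."""
    ranked = sorted(cluster, key=lambda i: (sum(dist[i][j] for j in cluster), i))
    return ranked[:k]
```

This direct version recomputes all pairwise cluster distances in every round, which is fine for illustration; for a large site a library implementation of single-linkage clustering would be the more practical choice.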

Finally, it should be pointed out that the above embodiment is only a representative example of the invention. Obviously, the technical solution of the invention is not limited to the above embodiment, and the steps in the claims can be implemented in different ways. Those of ordinary skill in the art can make various modifications or variations to the above embodiment without departing from the inventive spirit of the invention; the scope of protection of the invention is therefore not limited by the above embodiment and should be the broadest scope consistent with the inventive features recited in the claims.

Claims

1. A website content accessibility detection method based on web page templates, characterized in that it comprises the following steps:

(1) crawl all related web pages and resources of the website according to the web address to be detected; render all web pages and save the rendering results;

2. The method according to claim 1, characterized in that the main-text filtering according to the main-text extraction algorithm described in step (2) comprises the following steps:

(2.1) build a DOM tree for each rendered web page, filtering out of the page's HTML text the tags that are unrelated to content;

DS_c = \sum_{i \in \mathrm{children}(c)} \mathrm{TextDensity}_i

Formula (1)

where i is a child node of a node c of the page's DOM tree and TextDensity_i is the text density of node i; the text density of node i is defined as the following composite text density:

\mathrm{TextDensity}_i = \frac{C_i}{T_i} \log_{\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)} \left(\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}\right)

Formula (2)

where C_i is the number of characters in the sub-DOM tree rooted at i, T_i is the number of HTML tags in that subtree, LC_i is the number of characters of hyperlink display text in the subtree, ¬LC_i is the number of characters of non-hyperlink display text, LT_i is the number of hyperlink tags, LC_b is the number of characters of hyperlink display text under the <body> tag, C_b is the number of characters under the <body> tag, and e is the natural constant; when a denominator in the formula is 0, it is set to 1.

3. The method according to claim 1, characterized in that the computation of the page-structure distance matrix described in step (3) comprises the following steps:

(3.2) compute the Euclidean distance between the feature vectors of the converted web pages to obtain the page-structure distance matrix M.

4. The method according to claim 1, characterized in that the hierarchical clustering described in step (4) comprises the following steps:

All web pages are clustered bottom-up, that is, each web page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the threshold θ is reached.