CN103838823A - Website content accessible detection method based on web page templates - Google Patents

Website content accessible detection method based on web page templates

Info

Publication number
CN103838823A
Authority
CN
China
Prior art keywords
text
web page
webpage
carried out
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410028740.6A
Other languages
Chinese (zh)
Other versions
CN103838823B (en)
Inventor
王灿
李凯
周宇
卜佳俊
陈纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201410028740.6A
Publication of CN103838823A
Application granted
Publication of CN103838823B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/337 Profile generation, learning or modification

Abstract

Provided is a website content accessibility detection method based on web page templates. The method comprises the following steps: all related web pages and resources of a website are crawled starting from the URL to be detected; main-text filtering is applied to the pages with a main-text extraction algorithm, removing the main-text nodes from each page's DOM tree; a distance matrix between pages is computed for the page set using a page-structure metric based on HTML tags; a hierarchical clustering threshold θ is set, all pages are hierarchically clustered according to the distance matrix, and several pages are selected from each cluster as the templates of that cluster, forming a web page template set; the template-related detection rules are run on the resulting template page set; and the template-unrelated detection rules are run on all the other web pages, after which the results are aggregated so that a detection result is obtained quickly.

Description

A website content accessibility detection method based on web page templates
Technical field
The present invention relates to the field of web page accessibility detection and remediation methods, and in particular to a website content accessibility detection method based on web page templates.
Background art
As the Internet grows ever more developed, people with disabilities still face obstacles in using the web because of their own impairments. To alleviate this problem, in 2012 the Ministry of Industry and Information Technology promulgated the latest edition of the communications industry standard YD/T 1761-2012, "Information accessibility - People with physical function differences - Technical requirements for website design accessibility", which sets accessibility requirements for website design; websites in use therefore need to undergo accessibility detection. A website contains a very large number of pages, so checking every page directly is problematic in both efficiency and accuracy and is hard to carry out in practice.
Detection rules can be divided into template-related and template-unrelated rules, according to whether a rule can be checked directly on a web page template. If all template pages of a website can be identified accurately, the efficiency of accessibility detection for template-related rules improves greatly. Traditional template extraction algorithms do not consider the negative effect that main-text content has on template extraction.
Summary of the invention
The present invention overcomes the above shortcoming of the prior art. It proposes a template extraction algorithm based on main-text filtering and page-structure clustering and, on the basis of this algorithm, a website content accessibility detection method based on web page templates. Main-text filtering extracts the main text and produces the target page DOM tree. The page set with the main text removed is then clustered and templates are identified from the clusters; detection is run on the templates, which avoids checking a massive number of pages one by one. The invention provides a website content accessibility detection method based on web page templates, comprising the following steps:
(1) crawl all related web pages and resources of the website starting from the URL to be detected; render all pages and save the rendering results;
(2) apply main-text filtering to the pages from step (1) using a main-text extraction algorithm, removing the main-text nodes from each page's DOM tree;
(3) for the page set obtained in step (2), compute the distance matrix M between pages using a page-structure metric based on HTML tags (M is shown in the original as image BDA0000460085170000021);
(4) set a threshold θ on M and perform hierarchical clustering on all pages; in each cluster, select several pages as the templates of that cluster, forming the web page template set;
(5) run the template-related detection rules on the template page set obtained in step (4);
(6) run the template-unrelated detection rules on the page set obtained in step (2) and aggregate the results with those of step (5), so that the detection result is obtained quickly.
The main-text filtering in step (2) comprises the following steps:
(2.1) build a DOM tree for every page, filtering out of the page's HTML text the HTML tags that are unrelated to content;
(2.2) compute the text density of every node of the DOM tree built in step (2.1); the node with the largest text density is the main-text block, where text density is measured as:
$$DS_c = \sum_{i \in \mathrm{children}(c)} TextDensity_i$$ Formula (1)
where $i$ is a child node of node $c$ and $TextDensity_i$ is the text density of node $i$. The invention uses a composite text density, which accounts for the fact that some hyperlink blocks have a high text density and would interfere with the main-text block. The composite text density of node $i$ is defined as:
$$TextDensity_i = \frac{C_i}{T_i} \, \log_{\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)} \left(\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}\right)$$ Formula (2)
where $C_i$ is the number of characters in the sub-DOM tree rooted at $i$, $T_i$ is the number of HTML tags in that subtree, $LC_i$ is the number of hyperlink display characters in the subtree, $\neg LC_i$ (shown in the original as image BDA0000460085170000032) is the number of non-hyperlink display characters, $LT_i$ is the number of hyperlink tags, $LC_b$ is the number of hyperlink display characters under the <body> tag, $C_b$ is the total number of characters under the <body> tag, and $e$ is the natural constant.
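For illustration only, formula (2) can be evaluated as in the following Python sketch. The function and argument names are hypothetical, and the two fallbacks (a zero result when the logarithm argument is not positive, and the plain density C_i / T_i when the logarithm base equals 1) are assumptions of this sketch that the specification does not state. Replacing a zero denominator with 1 follows the rule given in the embodiment.

```python
import math


def _div(num: float, den: float) -> float:
    """Divide, replacing a zero denominator with 1 (rule stated in the embodiment)."""
    return num / (den if den != 0 else 1)


def text_density(c_i, t_i, lc_i, non_lc_i, lt_i, lc_b, c_b) -> float:
    """Composite text density of node i, following formula (2).

    c_i: characters in the subtree, t_i: tags in the subtree,
    lc_i / non_lc_i: hyperlink / non-hyperlink display characters,
    lt_i: hyperlink tags, lc_b / c_b: hyperlink / total characters under <body>.
    """
    # Base of the logarithm: ln(C_i / notLC_i * LC_i + LC_b / C_b * C_i + e)
    base = math.log(_div(c_i, non_lc_i) * lc_i + _div(lc_b, c_b) * c_i + math.e)
    # Argument of the logarithm: (C_i / LC_i) * (T_i / LT_i)
    arg = _div(c_i, lc_i) * _div(t_i, lt_i)
    plain = _div(c_i, t_i)
    if arg <= 0:
        return 0.0  # assumption: a node with no characters or tags has zero density
    if math.isclose(base, 1.0):
        return plain  # assumption: a logarithm to base 1 is undefined, use C_i / T_i
    return plain * math.log(arg, base)
```

With the same character and tag counts, a node whose characters are mostly hyperlink text receives a much lower score than a node whose characters are mostly plain text, which is the behaviour the composite density is meant to provide.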
The page distance matrix computation in step (3) comprises the following steps:
(3.1) count the number of times each HTML tag occurs in page D and build the feature vector v(D); the vector has N dimensions, where N is the total number of distinct tags that the W3C standard allows in an HTML document;
(3.2) compute the Euclidean distance between the feature vectors of the converted pages to obtain the page-structure distance matrix.
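Steps (3.1) and (3.2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `TAGS`, `TagCounter`, `tag_vector` and `distance_matrix` are hypothetical, and `TAGS` is a short sample vocabulary, whereas the method uses all N tags permitted by the W3C standard.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Assumption: a short sample vocabulary; the method uses every tag the W3C standard allows.
TAGS = ["html", "head", "body", "div", "p", "a", "ul", "li", "table", "tr", "td", "img"]


class TagCounter(HTMLParser):
    """Counts how many times each opening tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_vector(html: str) -> list:
    """Step (3.1): feature vector v(D) of tag occurrence counts."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[tag] for tag in TAGS]


def distance_matrix(pages: list) -> list:
    """Step (3.2): pairwise Euclidean distances between the tag vectors."""
    vectors = [tag_vector(page) for page in pages]
    return [[math.dist(u, v) for v in vectors] for u in vectors]
```

Two pages that share the same tag counts have distance 0 regardless of their text, so pages generated from one template tend to fall close together once their main text has been removed.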
The hierarchical clustering in step (4) proceeds as follows: all pages are clustered bottom-up, that is, each page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the threshold θ is reached.
The advantages of the present invention are as follows: the template extraction algorithm takes the negative effect of the main text into account and is therefore more accurate; and the website content accessibility detection method based on web page templates divides detection rules into template-related and template-unrelated rules and checks only the templates for template-related rules, which greatly improves the efficiency of accessibility detection.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in detail below with reference to the accompanying drawing and the illustrated flow.
The invention provides a website content accessibility detection method based on web page templates, comprising the following steps:
(1) use a distributed crawler to crawl all related web pages and resources of the website starting from the URL to be detected; use a multithreaded rendering engine to render all pages and save the rendering results;
(2) apply main-text filtering to the pages from step (1) using the main-text extraction algorithm of formula (1) and formula (2), removing the main-text nodes from every page's DOM tree;
(3) for the page set obtained in step (2), compute the distance matrix M between pages using the tag-vector distance based on HTML tags (M is shown in the original as image BDA0000460085170000041);
(4) set a threshold θ on M and perform hierarchical clustering on all pages; in each cluster, select several pages as the templates of that cluster, forming the web page template set;
(5) run the template-related detection rules on the template page set obtained in step (4);
(6) run the template-unrelated detection rules on the page set obtained in step (2) and aggregate the results with those of step (5), so that the detection result is obtained quickly.
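The division of labour in steps (5) and (6) can be sketched as follows. Everything named here (`Rule`, `run_detection`, and the two sample rules in the usage note) is hypothetical; the specification does not list concrete rules or say how a rule is marked as template-related, so this only illustrates the idea that template-related rules run on the template pages while the remaining rules run on the whole page set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    template_related: bool          # True: checking the template pages is sufficient
    check: Callable[[str], bool]    # returns True when the page passes the rule


def run_detection(pages: Dict[str, str],
                  template_urls: List[str],
                  rules: List[Rule]) -> Dict[str, Dict[str, bool]]:
    """Run each rule on its target pages and gather the results per rule."""
    results = {}
    for rule in rules:
        # Step (5): template-related rules only visit the template pages.
        # Step (6): template-unrelated rules visit every page.
        targets = template_urls if rule.template_related else list(pages)
        results[rule.name] = {url: rule.check(pages[url]) for url in targets}
    return results
```

For example, a hypothetical rule such as "the page declares a language" could be marked template-related and evaluated once per template, while a hypothetical rule such as "every image has alternative text" would still be evaluated on each page.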
The main-text filtering in step (2) comprises the following steps:
(2.1) build a DOM tree for every page, filtering out of the page's HTML text the comments, <script> tags, <noscript> tags, <style> tags, and tags whose CSS style is "display:none";
(2.2) compute the text density of every node of the DOM tree built in step (2.1); the node with the largest text density is the main-text block, where text density is measured as:
$$DS_c = \sum_{i \in \mathrm{children}(c)} TextDensity_i$$ Formula (1)
where $i$ is a child node of node $c$ and $TextDensity_i$ is the text density of node $i$. The invention uses a composite text density, which accounts for the fact that some hyperlink blocks have a high text density and would interfere with the main-text block. The composite text density of node $i$ is defined as:
$$TextDensity_i = \frac{C_i}{T_i} \, \log_{\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)} \left(\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}\right)$$ Formula (2)
where $C_i$ is the number of characters in the sub-DOM tree rooted at $i$, $T_i$ is the number of HTML tags in that subtree, $LC_i$ is the number of hyperlink display characters in the subtree, $\neg LC_i$ (shown in the original as image BDA0000460085170000052) is the number of non-hyperlink display characters, $LT_i$ is the number of hyperlink tags, $LC_b$ is the number of hyperlink display characters under the <body> tag, $C_b$ is the total number of characters under the <body> tag, and $e$ is the natural constant. When a denominator in the formula is 0, it is set to 1.
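The filtering in step (2.1) can be approximated with Python's standard `html.parser`, as in the sketch below. The names `ContentFilter` and `filter_page` are hypothetical. The sketch assumes well-formed markup (every non-void element is closed) and only recognises "display:none" in inline `style` attributes; resolving styles from stylesheets would need the rendering results from step (1), which this sketch does not model.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "noscript", "style"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentFilter(HTMLParser):
    """Collects the tags and text that survive the step (2.1) filter."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a filtered element
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hidden = tag in SKIP_TAGS or "display:none" in style
        if tag in VOID_TAGS:
            # Void elements have no end tag, so they never change the depth.
            if not self.skip_depth and not hidden:
                self.tags.append(tag)
        elif self.skip_depth or hidden:
            self.skip_depth += 1
        else:
            self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.text.append(data.strip())

    # Comments are dropped because handle_comment is left as the default no-op.


def filter_page(html: str):
    """Return (kept tags, kept text fragments) for one page."""
    parser = ContentFilter()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text
```

The counts needed by formula (2), such as the number of tags and characters in a subtree, would then be taken from the filtered tree rather than from the raw HTML.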
The page distance matrix computation in step (3) comprises the following steps:
(3.1) count the number of times each HTML tag occurs in page D and build the feature vector v(D); the vector has N dimensions, where N is the total number of distinct tags that the W3C standard allows in an HTML document;
(3.2) compute the Euclidean distance between the feature vectors of the converted pages to obtain the page-structure distance matrix.
The hierarchical clustering in step (4) is implemented as follows: the pages remaining after main-text filtering are clustered bottom-up, that is, each page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the largest clustering that satisfies the condition is reached; the reference distance for merging clusters is the single-linkage distance between the clusters to be merged.
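The bottom-up clustering described above can be sketched in plain Python as follows. The stopping rule used here (stop as soon as the closest pair of clusters is farther apart than θ) is one reading of "until the threshold is reached" and is an assumption, as are the names `single_linkage_clusters` and `pick_templates`. The specification does not say how the template pages are chosen within a cluster, so taking the first k pages is purely illustrative.

```python
def single_linkage_clusters(dist, theta):
    """Agglomerative clustering over a distance matrix with single linkage.

    dist: square matrix of pairwise page distances.
    theta: merging stops once the closest clusters are farther apart than theta.
    Returns a list of clusters, each a list of page indices.
    """
    clusters = [[i] for i in range(len(dist))]   # every page starts as its own cluster
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > theta:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pick_templates(clusters, k=1):
    """Assumption: use the first k pages of each cluster as its templates."""
    return [page for cluster in clusters for page in cluster[:k]]
```

A smaller θ yields more, tighter clusters and therefore more template pages to check; a larger θ yields fewer templates at the risk of grouping structurally different pages.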
Finally, it should be noted that the above embodiment is only a representative example of the present invention. The technical solution of the invention is not limited to the above embodiment, and the steps in the claims may be implemented in other ways. A person of ordinary skill in the art may make various modifications or changes to the above embodiment without departing from the inventive spirit of the invention; the scope of protection is therefore not limited by the above embodiment but is the broadest scope consistent with the inventive features recited in the claims.

Claims (4)

1. A website content accessibility detection method based on web page templates, characterized in that the method comprises the following steps:
(1) crawl all related web pages and resources of the website starting from the URL to be detected; render all pages and save the rendering results;
(2) apply main-text filtering to the pages from step (1) using a main-text extraction algorithm, removing the main-text nodes from each page's DOM tree;
(3) for the page set obtained in step (2), compute the distance matrix M between pages using a page-structure metric based on HTML tags (M is shown in the original as image FDA0000460085160000011);
(4) set a threshold θ on M and perform hierarchical clustering on all pages; in each cluster, select several pages as the templates of that cluster, forming the web page template set;
(5) run the template-related detection rules on the template page set obtained in step (4);
(6) run the template-unrelated detection rules on the page set obtained in step (2) and aggregate the results with those of step (5), so that the detection result is obtained quickly.
2. The method according to claim 1, characterized in that the main-text filtering according to the main-text extraction algorithm described in step (2) comprises the following specific steps:
(2.1) build a DOM tree for each rendered page, filtering out of the page's HTML text the tags that are unrelated to content;
(2.2) compute the text density of every node of the DOM tree built in step (2.1); the node with the largest text density is the main-text block, where text density is measured as:
$$DS_c = \sum_{i \in \mathrm{children}(c)} TextDensity_i$$ Formula (1)
where $i$ is a child node of a node $c$ of the page's DOM tree and $TextDensity_i$ is the text density of node $i$, which is defined as the following composite text density:
$$TextDensity_i = \frac{C_i}{T_i} \, \log_{\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)} \left(\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}\right)$$ Formula (2)
where $C_i$ is the number of characters in the sub-DOM tree rooted at $i$, $T_i$ is the number of HTML tags in that subtree, $LC_i$ is the number of hyperlink display characters in the subtree, $\neg LC_i$ is the number of non-hyperlink display characters, $LT_i$ is the number of hyperlink tags, $LC_b$ is the number of hyperlink display characters under the <body> tag, $C_b$ is the total number of characters under the <body> tag, and $e$ is the natural constant; when a denominator in the formula is 0, it is set to 1.
3. The method according to claim 1, characterized in that computing the page-structure distance matrix described in step (3) comprises the following specific steps:
(3.1) count the number of times each HTML tag occurs in page D and build the feature vector v(D); the vector has N dimensions, where N is the total number of distinct tags that the W3C standard allows in an HTML document;
(3.2) compute the Euclidean distance between the feature vectors of the converted pages to obtain the page-structure distance matrix M.
4. The method according to claim 1, characterized in that the hierarchical clustering described in step (4) comprises the following specific steps:
all pages are clustered bottom-up, that is, each page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the threshold θ is reached.
CN201410028740.6A 2014-01-22 2014-01-22 Website content accessible detection method based on web page templates Active CN103838823B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410028740.6A CN103838823B (en) 2014-01-22 2014-01-22 Website content accessible detection method based on web page templates

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410028740.6A CN103838823B (en) 2014-01-22 2014-01-22 Website content accessible detection method based on web page templates

Publications (2)

Publication Number Publication Date
CN103838823A true CN103838823A (en) 2014-06-04
CN103838823B CN103838823B (en) 2017-02-22

Family

ID=50802320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410028740.6A Active CN103838823B (en) 2014-01-22 2014-01-22 Website content accessible detection method based on web page templates

Country Status (1)

Country Link
CN (1) CN103838823B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776645A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 Data processing method and device
CN107229669A (en) * 2016-03-23 2017-10-03 塔塔咨询服务公司 Method and system for selecting the sample set on assessing website Barrien-free
CN110020296A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 A kind of method and device for extracting news web page text
CN110457579A (en) * 2019-07-30 2019-11-15 四川大学 The Web de-noising method and system to be cooperated based on template and classifier
CN110851606A (en) * 2019-11-18 2020-02-28 杭州安恒信息技术股份有限公司 Website clustering method and system based on webpage structure similarity
CN111314109A (en) * 2020-01-15 2020-06-19 太原理工大学 Weak key-based large-scale Internet of things equipment firmware identification method
CN111625749A (en) * 2020-06-01 2020-09-04 深圳市小满科技有限公司 Method, device, equipment and medium for extracting detail page information of participating company website
CN113780260A (en) * 2021-07-27 2021-12-10 浙江大学 Computer vision-based intelligent barrier-free character detection method
CN113806661A (en) * 2021-09-18 2021-12-17 中国电子技术标准化研究院 Website information barrier-free detection tool
CN115373649A (en) * 2022-07-26 2022-11-22 哈尔滨亿时代数码科技开发有限公司 Dynamic internet content barrier-free transformation method and device and website content barrier-free transformation method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218286A (en) * 2012-01-20 2013-07-24 阿里巴巴集团控股有限公司 Method and system for detecting accessibility of webpage
CN102662972A (en) * 2012-03-09 2012-09-12 浙江大学 A visually disabled person-oriented automatic picture description method for web content barrier-free access
CN102799638B (en) * 2012-06-25 2015-07-15 浙江大学 In-page navigation generation method facing barrier-free access to webpage contents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李浚: "图书馆网站无障碍问题调查及解决策略", 《现代图书情报技术》 *
陈威刚等: "网络无障碍技术研究成果", 《现代电信科技》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776645A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 Data processing method and device
CN107229669B (en) * 2016-03-23 2021-02-05 塔塔咨询服务公司 Method and system for selecting a sample set for assessing website non-obstruction
CN107229669A (en) * 2016-03-23 2017-10-03 塔塔咨询服务公司 Method and system for selecting the sample set on assessing website Barrien-free
CN110020296A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 A kind of method and device for extracting news web page text
CN110457579A (en) * 2019-07-30 2019-11-15 四川大学 The Web de-noising method and system to be cooperated based on template and classifier
CN110851606A (en) * 2019-11-18 2020-02-28 杭州安恒信息技术股份有限公司 Website clustering method and system based on webpage structure similarity
CN111314109A (en) * 2020-01-15 2020-06-19 太原理工大学 Weak key-based large-scale Internet of things equipment firmware identification method
CN111625749A (en) * 2020-06-01 2020-09-04 深圳市小满科技有限公司 Method, device, equipment and medium for extracting detail page information of participating company website
CN111625749B (en) * 2020-06-01 2023-08-11 深圳市小满科技有限公司 Method, device, equipment and medium for extracting website detail page information of participant company
CN113780260A (en) * 2021-07-27 2021-12-10 浙江大学 Computer vision-based intelligent barrier-free character detection method
CN113780260B (en) * 2021-07-27 2023-09-19 浙江大学 Barrier-free character intelligent detection method based on computer vision
CN113806661A (en) * 2021-09-18 2021-12-17 中国电子技术标准化研究院 Website information barrier-free detection tool
CN113806661B (en) * 2021-09-18 2023-08-25 中国电子技术标准化研究院 Barrier-free detection tool for website information
CN115373649A (en) * 2022-07-26 2022-11-22 哈尔滨亿时代数码科技开发有限公司 Dynamic internet content barrier-free transformation method and device and website content barrier-free transformation method
CN115373649B (en) * 2022-07-26 2023-03-31 哈尔滨亿时代数码科技开发有限公司 Dynamic internet content barrier-free transformation method and device and website content barrier-free transformation method

Also Published As

Publication number Publication date
CN103838823B (en) 2017-02-22

Similar Documents

Publication Publication Date Title
CN103838823A (en) Website content accessible detection method based on web page templates
CN102541874B (en) Webpage text content extracting method and device
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
US8560940B2 (en) Detecting repeat patterns on a web page using signals
CN103810425B (en) The detection method of malice network address and device
WO2019041521A1 (en) Apparatus and method for extracting user keyword, and computer-readable storage medium
US8819028B2 (en) System and method for web content extraction
CN102750390B (en) Automatic news webpage element extracting method
CN106599181A (en) Hot news detecting method based on topic model
CN110991171B (en) Sensitive word detection method and device
CN102207946B (en) Knowledge network semi-automatic generation method
CN107992542A (en) A kind of similar article based on topic model recommends method
CN106708952B (en) A kind of Webpage clustering method and device
CN110413787B (en) Text clustering method, device, terminal and storage medium
CN105447081A (en) Cloud platform-oriented government affair and public opinion monitoring method
CN104598577A (en) Extraction method for webpage text
CN103679012A (en) Clustering method and device of portable execute (PE) files
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN108021667A (en) A kind of file classification method and device
CN104199838B (en) A kind of user model constructing method based on label disambiguation
CN105528357A (en) Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN104572787B (en) The recognition methods of pseudo- original website and device
CN106528509B (en) Webpage information extraction method and device
CN108694192B (en) Webpage type judging method and device
CN104217025B (en) For the entry extraction system and method for more record webpages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant