CN103838823A - Website content accessible detection method based on web page templates - Google Patents
- Publication number: CN103838823A
- Application number: CN201410028740.6A
- Authority: CN (China)
- Prior art keywords: text, web page, webpage, carried out, template
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
- G06F16/337—Profile generation, learning or modification
Abstract
Provided is a method for detecting the accessibility of website content based on web page templates. The method comprises the following steps: all related web pages and resources of the website are captured starting from the web address to be detected; main-text filtering is applied to the web pages using a main-text extraction algorithm, and the main-text nodes are removed from each page's DOM tree; a distance matrix between the web pages is computed using a page-structure metric based on HTML tags; a hierarchical clustering threshold θ is set, all web pages are hierarchically clustered according to the distance matrix, and several web pages are selected from each cluster as that cluster's templates, forming a web page template set; the template-related detection rules are checked on the resulting template set; the template-unrelated detection rules are checked on the other web pages, and the results are aggregated so that an overall detection result is obtained quickly.
Description
Technical field
The present invention relates to the field of web page accessibility detection and remediation, and in particular to a web-page-template-based method for detecting the accessibility of website content.
Background
As the internet grows ever more developed, people with disabilities face barriers in using the network because of their impairments. To alleviate this problem, in 2012 the Ministry of Industry and Information Technology issued the latest version of the communications industry standard YD/T 1761-2012, "Information accessibility: people with physical functional differences: technical requirements for accessible website design", which sets accessibility requirements for website design. Websites in use therefore need to be checked for accessibility. A website contains a very large number of pages, so checking every page directly is difficult in terms of both efficiency and accuracy and is hard to carry out in practice.
Depending on whether a detection rule can be checked directly on a web page template, detection rules can be divided into template-related and template-unrelated rules. If all template pages of a website can be found accurately, the efficiency of accessibility detection for template-related rules improves greatly. Traditional template extraction algorithms do not consider the negative effect of main-text content on template extraction.
Summary of the invention
The present invention overcomes the above shortcoming of the prior art. It proposes a template extraction algorithm based on main-text filtering and page-structure clustering and, on the basis of this algorithm, a web-page-template-based method for detecting the accessibility of website content. Main-text filtering extracts the main text and yields the target DOM tree of each page. The set of pages with the main text removed is then clustered, and templates are selected from the clusters; detection is run on the templates, which avoids checking a massive number of pages one by one. The invention provides a web-page-template-based method for detecting the accessibility of website content, comprising the following steps:
(1) Starting from the web address to be detected, obtain all related web pages and resources of the website; render all web pages and save the rendering results;
(2) Apply main-text filtering to the web pages from step (1) using a main-text extraction algorithm, removing the main-text nodes from each page's DOM tree;
(3) For the set of web pages obtained in step (2), compute the distance matrix M between pages using a page-structure metric based on HTML tags;
(4) Set a threshold θ on M and hierarchically cluster all web pages; from each cluster, select several web pages as the templates of that cluster, forming the web page template set;
(5) Check the template-related detection rules on the template set obtained in step (4);
(6) Check the template-unrelated rules on the set of web pages obtained in step (2) and combine the results with those of step (5), so that the detection result is obtained quickly.
The main-text filtering in step (2) comprises the following steps:
(2.1) Build a DOM tree for every web page, filtering out of the page's HTML text the HTML tags that are unrelated to content;
(2.2) For the DOM tree built in step (2.1), compute the text density of each node; the node with the largest text density is the main-text block, where the text density is measured as:
$$DS_c = \sum_{i \in \mathrm{children}(c)} \mathrm{TextDensity}_i \qquad \text{formula (1)}$$
where i is a child node of node c and $\mathrm{TextDensity}_i$ is the text density of node i. The invention uses the composite text density, which accounts for the fact that some hyperlink blocks have a high text density and interfere with identifying the main-text block. The composite text density of node i is defined by formula (2), in which $C_i$ is the number of characters in the DOM subtree rooted at i, $T_i$ is the number of HTML tags in that subtree, $LC_i$ is the number of characters of hyperlink display text in the subtree, $\neg LC_i$ is the number of characters of non-hyperlink display text, $LT_i$ is the number of hyperlink tags, $LC_b$ is the number of characters of hyperlink display text under the <body> tag, $C_b$ is the number of characters under the <body> tag, and e is the natural constant.
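The density sum of formula (1) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Node` class and all function names are hypothetical, and `text_density` uses a simplified characters-per-tag ratio as a stand-in for the composite text density of formula (2). Counting only descendant tags in `tag_count` is likewise an assumption, chosen so that the embodiment's rule of replacing a zero denominator with 1 is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def char_count(node):
    """C_i: number of characters in the subtree rooted at node."""
    return len(node.text) + sum(char_count(c) for c in node.children)


def tag_count(node):
    """T_i: number of descendant tags in the subtree (assumption: root excluded)."""
    return sum(1 + tag_count(c) for c in node.children)


def text_density(node):
    """Simplified stand-in for formula (2): characters per tag.

    A zero denominator is replaced by 1, as stated in the embodiment.
    """
    return char_count(node) / (tag_count(node) or 1)


def density_sum(node):
    """Formula (1): DS_c is the sum of the text densities of the children of c."""
    return sum(text_density(child) for child in node.children)


def iter_nodes(node):
    """Yield every node of the subtree in depth-first order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def main_text_block(root):
    """Return the node with the largest density sum as the main-text block."""
    return max(iter_nodes(root), key=density_sum)
```

With this stand-in density, a container holding a few long paragraphs scores far higher than a navigation list of short links, which is the behaviour step (2.2) relies on.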
Computing the web page distance matrix in step (3) comprises the following steps:
(3.1) Count the number of times each HTML tag occurs in web page D and construct the feature vector v(D); the vector has N dimensions in total, where N is the total number of tag types that the W3C standard allows in an HTML document;
(3.2) Compute the Euclidean distance between the feature vectors of the converted web pages to obtain the page-structure distance matrix.
For the hierarchical clustering in step (4), the specific steps are: bottom-up hierarchical clustering is applied to all web pages; each web page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the threshold θ is reached.
The advantages of the invention are as follows: the template extraction algorithm takes the negative effect of the main text into account and is therefore more accurate; the method divides detection rules into template-related and template-unrelated rules and checks template-related rules only on the templates, which greatly improves the efficiency of accessibility detection.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention.
Detailed description
The specific embodiment of the invention is described in detail below with reference to the accompanying drawing and the process it illustrates.
The invention provides a web-page-template-based method for detecting the accessibility of website content, comprising the following steps:
(1) Use a distributed crawler to capture all related web pages and resources of the website starting from the web address to be detected; use a multithreaded rendering engine to render all web pages and save the rendering results;
(2) Use the main-text extraction algorithm to apply main-text filtering to the web pages from step (1) according to formula (1) and formula (2), removing the main-text nodes from the DOM trees of all web pages;
(3) For the set of web pages obtained in step (2), compute the distance matrix M between pages using the tag-vector distance based on HTML tags;
(4) Set a threshold θ on M and hierarchically cluster all web pages; from each cluster, select several web pages as the templates of that cluster, forming the web page template set;
(5) Check the template-related detection rules on the template set obtained in step (4);
(6) Check the template-unrelated rules on the set of web pages obtained in step (2) and combine the results with those of step (5), so that the detection result is obtained quickly.
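The split between steps (5) and (6) can be sketched as follows. The `Rule` type, the `run_detection` function, and the idea of a rule as a pass/fail callable are all assumptions made for illustration; the source does not specify how rules are represented or how results are aggregated.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    """Hypothetical detection rule: a name, a category flag, and a check."""
    name: str
    template_related: bool
    check: Callable[[str], bool]  # returns True when the page passes


def run_detection(pages, template_indices, rules):
    """Check template-related rules on templates only, other rules on all pages.

    Returns {rule name: {page index: passed}} as the aggregated result.
    """
    results = {}
    for rule in rules:
        if rule.template_related:
            targets = template_indices      # step (5): templates only
        else:
            targets = range(len(pages))     # step (6): every page
        results[rule.name] = {i: rule.check(pages[i]) for i in targets}
    return results
```

The saving comes from the first branch: a template-related rule runs once per template instead of once per page.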
The main-text filtering in step (2) comprises the following steps:
(2.1) Build a DOM tree for every web page, filtering out of the page's HTML text the comments, <script> tags, <noscript> tags, <style> tags, and tags whose CSS style is "display:none";
(2.2) For the DOM tree built in step (2.1), compute the text density of each node; the node with the largest text density is the main-text block, where the text density is measured as:
$$DS_c = \sum_{i \in \mathrm{children}(c)} \mathrm{TextDensity}_i \qquad \text{formula (1)}$$
where i is a child node of node c and $\mathrm{TextDensity}_i$ is the text density of node i. The invention uses the composite text density, which accounts for the fact that some hyperlink blocks have a high text density and interfere with identifying the main-text block. The composite text density of node i is defined by formula (2), in which $C_i$ is the number of characters in the DOM subtree rooted at i, $T_i$ is the number of HTML tags in that subtree, $LC_i$ is the number of characters of hyperlink display text in the subtree, $\neg LC_i$ is the number of characters of non-hyperlink display text, $LT_i$ is the number of hyperlink tags, $LC_b$ is the number of characters of hyperlink display text under the <body> tag, $C_b$ is the number of characters under the <body> tag, and e is the natural constant; when a denominator in the formula is 0, it is set to 1.
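The tag filtering of step (2.1) can be sketched with Python's standard `html.parser` module. The class and function names are hypothetical, the sketch assumes reasonably well-formed markup (every non-void start tag has a matching end tag), and it only inspects inline `style` attributes; hidden elements styled through external CSS would need the rendered page.

```python
import html
import re
from html.parser import HTMLParser

# Void elements have no end tag, so they must not change the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
DROPPED_TAGS = {"script", "noscript", "style"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class ContentFilter(HTMLParser):
    """Re-emit HTML without comments, script/noscript/style, or hidden subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def _dropped(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        return tag in DROPPED_TAGS or HIDDEN_STYLE.search(style) is not None

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif self._dropped(tag, attrs):
            if tag not in VOID:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._dropped(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))

    # Comments are discarded because handle_comment is left as the default no-op.


def filter_html(source):
    """Return the markup with content-unrelated tags removed."""
    parser = ContentFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

The filtered markup can then be parsed into a DOM tree for the density computation of step (2.2).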
Computing the web page distance matrix in step (3) comprises the following steps:
(3.1) Count the number of times each HTML tag occurs in web page D and construct the feature vector v(D); the vector has N dimensions in total, where N is the total number of tag types that the W3C standard allows in an HTML document;
(3.2) Compute the Euclidean distance between the feature vectors of the converted web pages to obtain the page-structure distance matrix.
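Steps (3.1) and (3.2) can be sketched as follows. The `TAGS` list is a short illustrative subset, not the full set of N tag types the patent refers to, and the function names are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Illustrative subset of tag names; the patent uses every tag type allowed by W3C.
TAGS = ["html", "head", "body", "div", "p", "a", "ul", "li",
        "table", "tr", "td", "img", "span", "h1", "h2"]


class TagCounter(HTMLParser):
    """Count how many times each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_vector(page_html, tags=TAGS):
    """Step (3.1): feature vector v(D) of tag occurrence counts."""
    parser = TagCounter()
    parser.feed(page_html)
    parser.close()
    return [parser.counts[t] for t in tags]


def distance_matrix(pages, tags=TAGS):
    """Step (3.2): pairwise Euclidean distances between the tag vectors."""
    vectors = [tag_vector(p, tags) for p in pages]
    return [[math.dist(u, v) for v in vectors] for u in vectors]
```

Two pages generated from the same template differ mostly in their main text, so once that text has been filtered out their tag vectors are close and the distance between them is small.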
The hierarchical clustering in step (4) is implemented as follows: bottom-up hierarchical clustering is applied to all web pages after main-text filtering. Each web page is initialized as its own cluster; clusters are then merged according to the set distance threshold until the largest clusters satisfying the condition are reached. The reference distance for merging clusters is the single-linkage distance (Single Linkage) between the clusters to be merged.
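A minimal sketch of this clustering, written as a plain quadratic loop for readability. The function names are hypothetical. Two details are assumptions because the source does not state them: clusters whose single-linkage distance equals θ exactly are merged, and the templates of a cluster are simply its first `k` pages.

```python
def single_linkage_clusters(dist, theta):
    """Bottom-up clustering over a distance matrix with a merge threshold.

    Starts with one cluster per page and repeatedly merges the two closest
    clusters until the smallest single-linkage distance exceeds theta.
    """
    clusters = [[i] for i in range(len(dist))]

    def link(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = link(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > theta:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def pick_templates(clusters, k=1):
    """Choose the first k pages of each cluster as its templates (assumption)."""
    return [cluster[:k] for cluster in clusters]
```

A smaller θ yields more clusters and therefore more templates to check, so the threshold trades detection cost against how faithfully each template represents its cluster.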
Finally, it should be noted that the above embodiment is only a representative example of the present invention. Clearly, the technical solution of the invention is not limited to the above embodiment, and the steps in the claims may be implemented in other ways. Those of ordinary skill in the art may make various modifications or changes to the above embodiment without departing from the spirit of the invention; the scope of protection is therefore not limited by the above embodiment and should be the broadest scope consistent with the inventive features recited in the claims.
Claims (4)
1. A web-page-template-based method for detecting the accessibility of website content, characterized in that it comprises the following steps:
(1) Starting from the web address to be detected, capture all related web pages and resources of the website; render all web pages and save the rendering results;
(2) Apply main-text filtering to the web pages from step (1) using a main-text extraction algorithm, removing the main-text nodes from each page's DOM tree;
(3) For the set of web pages obtained in step (2), compute the distance matrix M between pages using a page-structure metric based on HTML tags;
(4) Set a threshold θ on M and hierarchically cluster all web pages; from each cluster, select several web pages as the templates of that cluster, forming the web page template set;
(5) Check the template-related detection rules on the template set obtained in step (4);
(6) Check the template-unrelated rules on the set of web pages obtained in step (2) and combine the results with those of step (5), so that the detection result is obtained quickly.
2. The method according to claim 1, characterized in that the main-text filtering using a main-text extraction algorithm described in step (2) comprises the following steps:
(2.1) Build a DOM tree for each rendered web page, filtering out of the page's HTML text the tags unrelated to content;
(2.2) For the DOM tree built in step (2.1), compute the text density of each node; the node with the largest text density is the main-text block, where the text density is measured as:
$$DS_c = \sum_{i \in \mathrm{children}(c)} \mathrm{TextDensity}_i \qquad \text{formula (1)}$$
where i is a child node of a node c of the web page DOM tree and $\mathrm{TextDensity}_i$ is the text density of node i. The text density of node i is defined as the composite text density, in which $C_i$ is the number of characters in the DOM subtree rooted at i, $T_i$ is the number of HTML tags in that subtree, $LC_i$ is the number of characters of hyperlink display text in the subtree, $\neg LC_i$ is the number of characters of non-hyperlink display text, $LT_i$ is the number of hyperlink tags, $LC_b$ is the number of characters of hyperlink display text under the <body> tag, $C_b$ is the number of characters under the <body> tag, and e is the natural constant; when a denominator in the formula is 0, it is set to 1.
3. The method according to claim 1, characterized in that computing the page-structure distance matrix described in step (3) comprises the following steps:
(3.1) Count the number of times each HTML tag occurs in web page D and construct the feature vector v(D); the vector has N dimensions in total, where N is the total number of tag types that the W3C standard allows in an HTML document;
(3.2) Compute the Euclidean distance between the feature vectors of the converted web pages to obtain the page-structure distance matrix M.
4. The method according to claim 1, characterized in that the hierarchical clustering described in step (4) comprises the following steps:
Bottom-up hierarchical clustering is applied to all web pages: each web page is initialized as its own cluster, and clusters are then merged according to the set distance threshold until the threshold θ is reached.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410028740.6A CN103838823B (en) | 2014-01-22 | 2014-01-22 | Website content accessible detection method based on web page templates |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410028740.6A CN103838823B (en) | 2014-01-22 | 2014-01-22 | Website content accessible detection method based on web page templates |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103838823A (en) | 2014-06-04 |
CN103838823B (en) | 2017-02-22 |
Family
ID=50802320
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410028740.6A Active CN103838823B (en) | 2014-01-22 | 2014-01-22 | Website content accessible detection method based on web page templates |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103838823B (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103218286A (en) * | 2012-01-20 | 2013-07-24 | 阿里巴巴集团控股有限公司 | Method and system for detecting accessibility of webpage |
CN102662972A (en) * | 2012-03-09 | 2012-09-12 | 浙江大学 | A visually disabled person-oriented automatic picture description method for web content barrier-free access |
CN102799638B (en) * | 2012-06-25 | 2015-07-15 | 浙江大学 | In-page navigation generation method facing barrier-free access to webpage contents |
- 2014-01-22: application CN201410028740.6A filed in CN; patent CN103838823B, status Active
Non-Patent Citations (2)
Title |
---|
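Li Jun (李浚): "Survey of accessibility problems of library websites and strategies for solving them", New Technology of Library and Information Service (《现代图书情报技术》) * |
Chen Weigang (陈威刚) et al.: "Research results on web accessibility technology", Modern Science & Technology of Telecommunications (《现代电信科技》) * |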
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776645A (en) * | 2015-11-24 | 2017-05-31 | 北京国双科技有限公司 | Data processing method and device |
CN107229669B (en) * | 2016-03-23 | 2021-02-05 | 塔塔咨询服务公司 | Method and system for selecting a sample set for assessing website non-obstruction |
CN107229669A (en) * | 2016-03-23 | 2017-10-03 | 塔塔咨询服务公司 | Method and system for selecting the sample set on assessing website Barrien-free |
CN110020296A (en) * | 2017-10-31 | 2019-07-16 | 北京国双科技有限公司 | A kind of method and device for extracting news web page text |
CN110457579A (en) * | 2019-07-30 | 2019-11-15 | 四川大学 | The Web de-noising method and system to be cooperated based on template and classifier |
CN110851606A (en) * | 2019-11-18 | 2020-02-28 | 杭州安恒信息技术股份有限公司 | Website clustering method and system based on webpage structure similarity |
CN111314109A (en) * | 2020-01-15 | 2020-06-19 | 太原理工大学 | Weak key-based large-scale Internet of things equipment firmware identification method |
CN111625749A (en) * | 2020-06-01 | 2020-09-04 | 深圳市小满科技有限公司 | Method, device, equipment and medium for extracting detail page information of participating company website |
CN111625749B (en) * | 2020-06-01 | 2023-08-11 | 深圳市小满科技有限公司 | Method, device, equipment and medium for extracting website detail page information of participant company |
CN113780260A (en) * | 2021-07-27 | 2021-12-10 | 浙江大学 | Computer vision-based intelligent barrier-free character detection method |
CN113780260B (en) * | 2021-07-27 | 2023-09-19 | 浙江大学 | Barrier-free character intelligent detection method based on computer vision |
CN113806661A (en) * | 2021-09-18 | 2021-12-17 | 中国电子技术标准化研究院 | Website information barrier-free detection tool |
CN113806661B (en) * | 2021-09-18 | 2023-08-25 | 中国电子技术标准化研究院 | Barrier-free detection tool for website information |
CN115373649A (en) * | 2022-07-26 | 2022-11-22 | 哈尔滨亿时代数码科技开发有限公司 | Dynamic internet content barrier-free transformation method and device and website content barrier-free transformation method |
CN115373649B (en) * | 2022-07-26 | 2023-03-31 | 哈尔滨亿时代数码科技开发有限公司 | Dynamic internet content barrier-free transformation method and device and website content barrier-free transformation method |
Also Published As
Publication number | Publication date |
---|---|
CN103838823B (en) | 2017-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103838823A (en) | Website content accessible detection method based on web page templates | |
CN102541874B (en) | Webpage text content extracting method and device | |
CN104199972B (en) | A kind of name entity relation extraction and construction method based on deep learning | |
US8560940B2 (en) | Detecting repeat patterns on a web page using signals | |
CN103810425B (en) | The detection method of malice network address and device | |
WO2019041521A1 (en) | Apparatus and method for extracting user keyword, and computer-readable storage medium | |
US8819028B2 (en) | System and method for web content extraction | |
CN102750390B (en) | Automatic news webpage element extracting method | |
CN106599181A (en) | Hot news detecting method based on topic model | |
CN110991171B (en) | Sensitive word detection method and device | |
CN102207946B (en) | Knowledge network semi-automatic generation method | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN106708952B (en) | A kind of Webpage clustering method and device | |
CN110413787B (en) | Text clustering method, device, terminal and storage medium | |
CN105447081A (en) | Cloud platform-oriented government affair and public opinion monitoring method | |
CN104598577A (en) | Extraction method for webpage text | |
CN103679012A (en) | Clustering method and device of portable execute (PE) files | |
CN110457579B (en) | Webpage denoising method and system based on cooperative work of template and classifier | |
CN108021667A (en) | A kind of file classification method and device | |
CN104199838B (en) | A kind of user model constructing method based on label disambiguation | |
CN105528357A (en) | Webpage content extraction method based on similarity of URLs and similarity of webpage document structures | |
CN104572787B (en) | The recognition methods of pseudo- original website and device | |
CN106528509B (en) | Webpage information extraction method and device | |
CN108694192B (en) | Webpage type judging method and device | |
CN104217025B (en) | For the entry extraction system and method for more record webpages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |