CN113779377B

CN113779377B - Crawler searching method based on barrier-free detection result deduplication

Info

Publication number: CN113779377B
Application number: CN202110849849.6A
Authority: CN
Inventors: 卜佳俊; 杨文武; 周晟; 王炜; 于智
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-07-27
Filing date: 2021-07-27
Publication date: 2024-03-22
Anticipated expiration: 2041-07-27
Also published as: CN113779377A

Abstract

A crawler searching method based on deduplication of barrier-free (accessibility) detection results. The total number of pages to be crawled is preset; links are repeatedly taken from a URL queue and accessed to obtain the webpage source code. A selected subset of rules is checked against the webpage source code, and the detection results are combined into a feature matrix. After all links extracted from a webpage have been accessed, the matrices obtained from all linked pages are clustered with the DBSCAN algorithm. Pages are randomly sampled from each cluster to serve as representative pages of that cluster; links are extracted from the representative pages and added to the URL (uniform resource locator) queue. The other webpages in a cluster are marked as 'skipped' because their barrier-free detection results are similar to those of the representative pages, and the count of crawled pages is increased directly without actually crawling them. The invention is used in the webpage link crawling stage of automatic detection of the user-friendliness of website pages; by controlling the number of crawled pages it speeds up crawling and thereby improves detection efficiency.

Description

Crawler searching method based on barrier-free detection result deduplication

Technical field:

The invention belongs to the field of information accessibility (barrier-free information), and in particular relates to a crawler module used in the barrier-free detection step.

Background:

In the era of big data, people increasingly obtain the information they need from the mass of data on the Internet. As the concept of "Internet Plus" has gradually taken hold, information has taken on a broader meaning in the context of social development. How to use information technology to achieve information equality in the Internet environment, so that everyone, including people with disabilities and other vulnerable groups, can conveniently obtain and use information, is currently a topic of wide public concern.

Web accessibility (barrier-free web pages) means that people with disabilities, as well as non-disabled people with special needs, can obtain any information on the network. Achieving this requires both barrier-free web content and barrier-free assistive software for use on the web. With the rapid development of Internet technology, the ways in which information is presented in web pages are becoming increasingly diverse. To display more information on a page, or to draw the user's attention to a certain type of information, developers often present it in floating windows, sidebar advertisements and similar forms. However, this approach often degrades the user experience and, at the same time, raises the barriers that vulnerable groups (visually impaired people, elderly people) face in obtaining information, so that they cannot obtain the regular content of a web page through assistive means. Therefore, to lower the threshold for vulnerable groups to obtain information, barrier-free construction of web pages is necessary.

Barrier-free detection of web pages is an important part of building barrier-free web pages. It finds the designs in a web page that hinder target users from obtaining information, and so provides an effective basis for subsequent barrier-free optimization of the website. A prerequisite for barrier-free detection of a whole website is obtaining the site's pages with a crawler. Most current crawlers use breadth-first search with content-based deduplication. As barrier-free detection of web pages becomes more intelligent, detecting a single web page takes longer and longer. Running barrier-free detection on every page of a website therefore greatly increases the detection cost and hinders a rapid, objective evaluation of the site's level of accessibility, so other crawler search methods are urgently needed to improve detection efficiency.

Summary of the invention:

In view of the above problems and difficulties, the invention provides a crawler searching method based on barrier-free detection result deduplication. Compared with traditional breadth-first search, the method reduces the number of webpages that need to be crawled by judging the similarity of barrier-free detection results, and also reduces the number of calls to complex barrier-free detection methods. Compared with methods that judge webpage similarity from webpage content and structure, the method takes less time and fewer resources, improves the diversity of the website's barrier-free detection results, and increases the speed of barrier-free detection of the website.

The crawler searching method based on the barrier-free detection result deduplication comprises the following specific steps:

S1, obtaining from user input the link of the website home page and the total number totalcount of webpages to be acquired, and adding the link of the home page to the URL queue.

S2, taking a link from the head of the URL queue to be crawled, and accessing the link to obtain the webpage source code. The number of accessed links, finishcount, is increased by 1.

S3, if the number of accessed links finishcount and the number of links marked as skipped skipcount satisfy the condition finishcount + skipcount ≥ totalcount, the flow ends; otherwise execution continues with the following steps.

S4, extracting a barrier-free detection item matrix from the webpage source code.

S41, selecting a rule subset from GB/T 37668-2019, "Information technology: technical requirements and test methods for accessibility of Internet content". The selected rules meet the following criteria: they are simple to implement, depend only on the webpage source code, and do not involve image, video or audio information; and they are fast to check, so that the total time for all rules on a single webpage does not exceed 1 second. According to the above criteria, 7 accessibility rules are selected from the national standard, with the following rule names: non-text links, non-text controls, non-text content, user contact feedback, real-time user contact feedback, consistent navigation, in-site search, and sitemaps.

S42, applying the rules selected in step S41 to the webpage source code. For each rule, the detection result has the form $r = [N_s, N_p, N_f, N_i]$, where $N_s$ is the number of detection points, $N_p$ is the number of detection points whose result is pass, $N_f$ is the number of detection points whose result is fail, and $N_i$ is the number of detection points whose result is unknown.

S43, concatenating the vectors of the detection rules obtained in step S42 into a matrix in a fixed order; the order of the rules is fixed and may follow the rule numbers. The resulting matrix has the form $M = [r_1, r_2, r_3, \ldots, r_n]$, where $r_i$ is the vector corresponding to the $i$-th rule.

S5, when a link A is extracted from the webpage source code corresponding to a link B, link B is called the parent link of link A, and link A is a child link of link B. The parent link of the currently accessed link is found; if not all child links of that parent link have been accessed, the flow returns to step S2 and continues. Otherwise, the set of matrices obtained through step S4 is $C = \{M_1, M_2, M_3, \ldots, M_n\}$, where $M_i$ is the barrier-free detection item matrix of the webpage source code corresponding to the $i$-th child link. Cluster analysis is performed on this set using DBSCAN, a density-based clustering method, dividing the set $C$ into a plurality of clusters.

S6, for each cluster obtained in step S5, sampling at ratio $\lambda$; the sampled results are put into the set $R = \{H_1, H_2, H_3, \ldots, H_m\}$ and the remaining results into the set $T = \{H_{m+1}, \ldots, H_n\}$, where $H_i$ is the original webpage source code corresponding to the $i$-th barrier-free detection item matrix.

S7, for the set $R$ obtained in step S6, extracting links from each element and adding them to the URL queue. For the set $T$ obtained in step S6, extracting links from each element and adding all of them to the set $P = \{U_1, U_2, U_3, \ldots, U_n\}$, where $U_i$ is the $i$-th link. The links in set $P$ are the links deduplicated according to the barrier-free detection results; they are marked as skipped, and the number of skipped links skipcount is increased by the number of elements $\mathrm{Card}(P)$ of set $P$.

S8, if finishcount + skipcount ≥ totalcount, enough webpages have been acquired and the flow ends; otherwise step S2 is executed again.
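Steps S5 to S7 can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the patented implementation: each page's matrix $M$ is flattened into one vector, distance is Euclidean, the DBSCAN routine is a simplified pure-Python version, and the names `flatten`, `dbscan`, `split_clusters`, `eps`, `min_samples` and `ratio` (standing in for $\lambda$) are hypothetical. The description does not say how DBSCAN noise points are treated; the sketch keeps all of them as representatives.

```python
import random
from collections import deque


def flatten(matrix):
    """Flatten a page matrix M = [r_1, ..., r_n] into a single feature vector."""
    return [value for rule_vector in matrix for value in rule_vector]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dbscan(points, eps, min_samples):
    """Simplified DBSCAN: returns one cluster label per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


def split_clusters(pages, labels, ratio, rng):
    """Return (R, T): sampled representative pages and the remaining pages."""
    clusters = {}
    for page, label in zip(pages, labels):
        clusters.setdefault(label, []).append(page)
    representatives, remaining = [], []
    for label, members in clusters.items():
        if label == -1:
            representatives.extend(members)  # assumption: noise pages are all kept
            continue
        k = min(len(members), max(1, round(ratio * len(members))))
        chosen = rng.sample(members, k)
        representatives.extend(chosen)
        remaining.extend(p for p in members if p not in chosen)
    return representatives, remaining
```

Here `pages` is assumed to be a list of unique page identifiers (for example URLs) in the same order as the flattened matrices passed to `dbscan`. Links extracted from the pages in the first returned list would go to the URL queue, and links from the second list would be counted in skipcount.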

In summary, the invention provides a crawler searching method based on barrier-free detection result deduplication, which has the following beneficial effects:

(1) By reducing the number of web pages to be crawled, the overall speed of barrier-free detection of the website is improved.

(2) Compared with similarity judgment methods based on webpage content and webpage structure, the method integrates the barrier-free detection step with the crawling step; because the barrier-free detection results themselves serve as the features, no extra computation is needed.

(3) The method is general: it uses the barrier-free detection results as features, does not depend on the content or structure of the website, and can be applied to different types of websites.
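The effect described in (1) follows from the termination condition finishcount + skipcount ≥ totalcount: skipped links consume the page budget without being fetched. The short Python sketch below illustrates the bookkeeping only. The function names and the per-round numbers are made up for illustration, and the check is applied once per clustering round rather than after every page, which is a simplification of the flow.

```python
def budget_reached(finishcount, skipcount, totalcount):
    """Termination test: fetched plus skipped links have reached the page budget."""
    return finishcount + skipcount >= totalcount


def simulate(totalcount, rounds):
    """rounds: list of (pages_fetched, links_skipped) pairs, one per clustering round."""
    finishcount = skipcount = 0
    for fetched, skipped in rounds:
        finishcount += fetched
        skipcount += skipped
        if budget_reached(finishcount, skipcount, totalcount):
            break
    return finishcount, skipcount
```

With a budget of 100 pages and hypothetical rounds of 10 fetched pages and 30 skipped links each, `simulate(100, [(10, 30)] * 5)` stops in the third round having fetched only 30 pages.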

Description of the drawings:

To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 shows the overall flowchart of the crawler searching method based on barrier-free detection result deduplication provided by the invention.

Fig. 2 shows the flowchart for obtaining the webpage barrier-free detection result matrix within the overall flow of the crawler searching method based on barrier-free detection result deduplication provided by the invention.

Detailed description:

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Taking a website as an example, the method comprises the following specific steps:

S41, selecting a rule subset from GB/T 37668-2019, "Information technology: technical requirements and test methods for accessibility of Internet content". The selected rules meet the following criteria:

1. They are simple to implement, depend only on the webpage source code, and do not involve image, video or audio information.

2. They are fast to check: the total time for all rules on a single webpage does not exceed 1 second.

According to the above criteria, 7 accessibility rules are selected from the national standard, with the following rule names: non-text links, non-text controls, non-text content, user contact feedback, real-time user contact feedback, consistent navigation, in-site search, and sitemaps.

S42, applying the rules selected in step S41 to the webpage source code. For each rule, the detection result has the form $r = [N_s, N_p, N_f, N_i]$, where $N_s$ is the number of detection points, $N_p$ is the number of detection points whose result is pass, $N_f$ is the number of detection points whose result is fail, and $N_i$ is the number of detection points whose result is unknown.


Fig. 2 shows the flowchart for obtaining the webpage barrier-free detection result matrix within the overall flow of the method. It consists of steps S41 and S42 described above, followed by step S43:

S43, concatenating the vectors of the detection rules obtained in step S42 into a matrix in a fixed order; the order of the rules is fixed and may follow the rule numbers. The resulting matrix has the form $M = [r_1, r_2, r_3, \ldots, r_n]$, where $r_i$ is the vector corresponding to the $i$-th rule.
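Steps S42 and S43 can be illustrated with a small Python sketch that uses only the standard library. The two checkers below are toy stand-ins written for this example and are not the rule definitions of GB/T 37668-2019: `check_img_alt` loosely imitates a "non-text content" rule by counting `<img>` tags by `alt` status, and `check_site_search` loosely imitates an "in-site search" rule by looking for an `<input type="search">` element. All function names and the rule numbers used for ordering are assumptions.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects (tag, attributes) for every start tag in the page source."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def start_tags(html):
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def check_img_alt(html):
    """Toy rule: each <img> is one detection point, judged by its alt attribute."""
    passed = failed = unknown = 0
    for tag, attrs in start_tags(html):
        if tag != "img":
            continue
        alt = attrs.get("alt")
        if alt is None:
            failed += 1       # no usable alt attribute
        elif alt.strip() == "":
            unknown += 1      # empty alt: possibly decorative, cannot decide
        else:
            passed += 1
    return [passed + failed + unknown, passed, failed, unknown]  # [N_s, N_p, N_f, N_i]


def check_site_search(html):
    """Toy rule: a single detection point that passes if a search input exists."""
    found = any(tag == "input" and attrs.get("type") == "search"
                for tag, attrs in start_tags(html))
    return [1, int(found), int(not found), 0]


def build_matrix(html, rules):
    """rules: list of (rule_number, checker); sorting by number fixes the row order."""
    return [checker(html) for _, checker in sorted(rules, key=lambda item: item[0])]
```

For example, `build_matrix(page_source, [(7, check_site_search), (3, check_img_alt)])` always returns the `check_img_alt` row first, so matrices from different pages stay comparable row by row.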

The embodiments described in this specification are merely examples of how the inventive concept may be implemented. The scope of protection of the present invention should not be construed as limited to the specific forms set forth in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive based on the inventive concept.

Claims

1. A crawler searching method based on barrier-free detection result deduplication, comprising the following steps:

S1, obtaining from user input the link of the website home page and the total number totalcount of webpages to be acquired; adding the link of the website home page to a URL queue;

S2, taking a link from the head of the URL queue to be crawled, and accessing the link to obtain the webpage source code; increasing the number of accessed links finishcount by 1;

S3, if the number of accessed links finishcount and the number of links marked as skipped skipcount satisfy the condition finishcount + skipcount ≥ totalcount, ending the flow; otherwise, continuing with the following steps;

S4, extracting a barrier-free detection item matrix from the webpage source code;

S41, selecting a rule subset from GB/T 37668-2019, "Information technology: technical requirements and test methods for accessibility of Internet content"; the selected rules meet the following criteria:

1. they are simple to implement, depend only on the webpage source code, and do not involve image, video or audio information;

2. they are fast to check, the total time for all rules on a single webpage not exceeding 1 second;

according to the above criteria, 7 accessibility rules are selected from the national standard, with the following rule names: non-text links, non-text controls, non-text content, user contact feedback, real-time user contact feedback, consistent navigation, in-site search, and sitemaps;

S42, applying the rules selected in step S41 to the webpage source code, wherein for each rule the detection result has the form $r = [N_s, N_p, N_f, N_i]$, where $N_s$ is the number of detection points, $N_p$ is the number of detection points whose result is pass, $N_f$ is the number of detection points whose result is fail, and $N_i$ is the number of detection points whose result is unknown;

S43, concatenating the vectors of the detection rules obtained in step S42 into a matrix in a fixed order, wherein the order of the rules is fixed and may follow the rule numbers; the resulting matrix has the form $M = [r_1, r_2, r_3, \ldots, r_n]$, where $r_i$ is the vector corresponding to the $i$-th rule;

S5, when a link A is extracted from the webpage source code corresponding to a link B, link B is called the parent link of link A, and link A is a child link of link B; finding the parent link of the currently accessed link, and if not all child links of that parent link have been accessed, returning to step S2 and continuing; otherwise, obtaining the set of matrices produced by step S4, $C = \{M_1, M_2, M_3, \ldots, M_n\}$, where $M_i$ is the barrier-free detection item matrix of the webpage source code corresponding to the $i$-th child link; performing cluster analysis on the set using DBSCAN, a density-based clustering method, and dividing the set $C$ into a plurality of clusters;

S6, for each cluster obtained in step S5, sampling at ratio $\lambda$, putting the sampled results into the set $R = \{H_1, H_2, H_3, \ldots, H_m\}$ and the remaining results into the set $T = \{H_{m+1}, \ldots, H_n\}$, where $H_i$ is the original webpage source code corresponding to the $i$-th barrier-free detection item matrix;

S7, for the set $R$ obtained in step S6, extracting links from each element and adding them to the URL queue; for the set $T$ obtained in step S6, extracting links from each element and adding all of them to the set $P = \{U_1, U_2, U_3, \ldots, U_n\}$, where $U_i$ is the $i$-th link; the links in set $P$ are the links deduplicated according to the barrier-free detection results and are marked as skipped, and the number of skipped links skipcount is increased by the number of elements $\mathrm{Card}(P)$ of set $P$;