CN104598458A - Page detection method and device - Google Patents

Info

Publication number
CN104598458A
Authority
CN
China
Prior art keywords
network address
page
dead chain
blacklist
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310528389.2A
Other languages
Chinese (zh)
Other versions
CN104598458B (en)
Inventor
陆中振
黄达文
卓居超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201310528389.2A priority Critical patent/CN104598458B/en
Publication of CN104598458A publication Critical patent/CN104598458A/en
Application granted granted Critical
Publication of CN104598458B publication Critical patent/CN104598458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a page detection method and device. The method of the embodiment comprises the following steps: collecting the URLs of a preset number of web pages that have been displayed on a user interface, and deduplicating the collected URLs; running dead-link/invalid-page detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link or invalid pages; comparing the preliminarily detected URLs with a blacklist established in advance; and, if a preliminarily detected URL hits a site in the blacklist, determining that the URL is a dead-link or invalid page. The method and device improve the accuracy of dead-link and invalid-page detection and lower its misjudgment rate.

Description

Page detection method and device
Technical field
The present invention relates to Internet technology, and in particular to a page detection method and device for dead-link and invalid pages.
Background
Because web pages are highly time-sensitive, some of the many pages indexed by a search engine will inevitably be dead-link or invalid pages. At present, such pages are mainly detected by running testing tools against the crawled links of a given site; the detection system combines the information returned for a link with an analysis of the page content to decide whether the link is a dead-link or invalid page.
This approach is fairly accurate for a single link or a very small number of links. For large batches of pages, however, the detection system faces high site pressure and the risk of being blocked by the site, and the misjudgment rate of the dead-link and invalid pages it reports is very high.
Summary of the invention
In view of this, it is necessary to provide a page detection method and device for dead-link and invalid pages that reduce the misjudgment rate of such detection.
An embodiment of the invention discloses a page detection method comprising the following steps:
collecting the URLs of a preset number of web pages that have been displayed on a user interface, and deduplicating the collected URLs;
running dead-link/invalid-page detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link or invalid pages;
comparing the URLs preliminarily detected as dead-link or invalid pages with a blacklist established in advance;
if a URL preliminarily detected as a dead-link or invalid page hits a site in the blacklist, determining that the URL is a dead-link or invalid page.
An embodiment of the invention also discloses a page detection device comprising:
a data collection module, for collecting the URLs of a preset number of web pages that have been displayed on a user interface, and deduplicating the collected URLs;
a page preliminary-detection module, for running dead-link/invalid-page detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link or invalid pages;
a page determination module, for comparing the URLs preliminarily detected as dead-link or invalid pages with a blacklist established in advance and, if such a URL hits a site in the blacklist, determining that the URL is a dead-link or invalid page.
The embodiment of the invention collects the URLs of a preset number of web pages that have been displayed on a user interface and deduplicates them; runs dead-link/invalid-page detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link or invalid pages; compares those URLs with a blacklist established in advance; and, if such a URL hits a site in the blacklist, determines that it is a dead-link or invalid page. Compared with the prior-art approach of directly treating every page flagged by the system as a true dead-link or invalid page, the embodiment improves detection accuracy and reduces the misjudgment rate.
Brief description of the drawings
Fig. 1 is a flowchart of a first embodiment of the page detection method of the present invention;
Fig. 2 is a flowchart of one embodiment of collecting the URLs to be checked in the page detection method of the present invention;
Fig. 3 is a flowchart of a second embodiment of the page detection method of the present invention;
Fig. 4 is a flowchart of one embodiment of establishing the blacklist and the blocked list in the page detection method of the present invention;
Fig. 5 is a functional block diagram of a first embodiment of the page detection device of the present invention;
Fig. 6 is a functional block diagram of a second embodiment of the page detection device of the present invention;
Fig. 7 is a functional block diagram of a third embodiment of the page detection device of the present invention.
The realization of the objects, functional features, and advantages of the embodiments of the present invention is further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the embodiments
The technical solution of the present invention is further described below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
In the following embodiments of the page detection method and device, "dead-link or invalid page" covers both dead-link pages and invalid pages. A dead-link page is a page whose download or content extraction fails; for example, the server's address has changed and the browser cannot find the current address when the page is requested. An invalid page is a special page such as pornographic or gambling content, or a meaningless page such as a blank page.
Fig. 1 is a flowchart of a first embodiment of the page detection method of the present invention. As shown in Fig. 1, the page detection method comprises the following steps:
Step S01: collect the URLs of a preset number of web pages that have been displayed on a user interface, and deduplicate the collected URLs;
In this embodiment, the detection system collects the URLs of web pages that have been displayed on the user interface. Because different users may request the same page, the detection system deduplicates the collected URLs to avoid repeated, meaningless detection and analysis work; of several identical URLs it keeps only one.
Further, in this embodiment, taking user habits and click-through rate into account, the detection system normally collects the first 10 results displayed on the front end. For example, the browser displays the search results for a network request sent by the user, and the detection system collects the first 10 results displayed by the browser for detection.
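The collection and deduplication of step S01 can be sketched as follows. This is only an illustration under stated assumptions: the function name collect_urls, the parameter top_n, and the representation of each displayed result page as a list of URL strings are hypothetical and do not come from the patent text.

```python
def collect_urls(displayed_results, top_n=10):
    """Collect the first top_n URLs of every displayed result page.

    displayed_results is assumed to be an iterable of lists of URL strings,
    one list per result page shown to a user (hypothetical input format).
    Identical URLs are kept only once, in first-seen order.
    """
    seen = set()
    unique = []
    for results in displayed_results:
        for url in results[:top_n]:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique


if __name__ == "__main__":
    pages = [
        ["http://www.qq.com/abc.html", "http://example.org/a"],
        ["http://example.org/a", "http://example.net/b"],
    ]
    print(collect_urls(pages))
```

Keeping first-seen order is a design choice of this sketch; the text only requires that one copy of each URL is retained.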
Step S02: run dead-link/invalid-page detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link or invalid pages;
Following a preset detection program, the detection system runs dead-link/invalid-page detection on the collected, deduplicated URLs; for example, it performs detection steps such as page download, extraction, and content analysis on each URL and obtains a detection result for it. From these results the detection system obtains the URLs of the pages preliminarily detected as dead-link or invalid pages.
Step S03: compare the URLs preliminarily detected as dead-link or invalid pages with a blacklist established in advance;
Step S04: if a URL preliminarily detected as a dead-link or invalid page hits a site in the blacklist, determine that the URL is a dead-link or invalid page.
After the detection system has preliminarily detected dead-link or invalid pages among the collected pages, it compares them with the stored blacklist. If any of the preliminarily detected URLs hits a site in the stored blacklist, the detection system determines that this URL is a dead-link or invalid page. The detection system then processes the URL further; for example, the URL can be pushed for masking immediately, or even deleted directly. For example, if www.qq.com is in the blacklist, the URL http://www.qq.com/abc.html, preliminarily detected as a dead-link or invalid page, hits the site www.qq.com in the blacklist, so the detection system masks and deletes it.
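The site-level comparison of steps S03 and S04 can be illustrated with a short sketch. It assumes that "site" means the host name of the URL, as in the www.qq.com example; the helper names site_of and confirm_dead are hypothetical.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Return the lower-cased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def confirm_dead(candidates, blacklist):
    """Keep only the preliminarily detected URLs whose site is blacklisted.

    candidates: URLs preliminarily detected as dead-link or invalid pages.
    blacklist:  set of site host names (assumed representation).
    """
    return [url for url in candidates if site_of(url) in blacklist]


if __name__ == "__main__":
    blacklist = {"www.qq.com"}
    candidates = ["http://www.qq.com/abc.html", "http://example.org/x.html"]
    print(confirm_dead(candidates, blacklist))
```

Because the match is on the site rather than the full URL, http://www.qq.com/abc.html itself never needs to be stored in the blacklist.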
Further, in this embodiment, if some URLs do not hit any site in the blacklist, the detection system compares them with a blocked list established in advance. A URL that hits a site in the blocked list is neither masked, deleted, nor suppressed; that is, the detection system takes no action on it. A URL that hits neither a site in the blacklist nor a site in the blocked list is suppressed. In this embodiment, suppressing a page means moving its result further back in the ordering displayed on the front end, that is, the user interface. The benefit is as follows: the detection system will inevitably misjudge certain sites, and such misjudgments prevent "page masking" and "index deletion" from being applied effectively. Both strategies have very strict conditions of use, and a large number of mistaken deletions would have serious consequences, so this embodiment proposes the milder strategy of "page suppression". In practice a suppression is valid for only 3 days; once the suppression period has passed, the suppressed URL can be displayed again, so no serious consequence results.
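The three outcomes for a preliminarily detected URL (mask and delete, no action, or a 3-day suppression) can be summarized in a small decision function. This is a sketch under assumptions: the function name decide, the action labels, and the use of host names as list entries are hypothetical; only the three-way logic and the 3-day validity come from the text.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit

# The text states that a suppression is valid for 3 days.
SUPPRESSION_PERIOD = timedelta(days=3)


def decide(url, blacklist, blocked_list, now):
    """Return (action, expiry) for a URL preliminarily detected as dead.

    blacklist and blocked_list are assumed to be sets of site host names.
    expiry is only set for the 'suppress' action.
    """
    site = (urlsplit(url).hostname or "").lower()
    if site in blacklist:
        return ("mask_and_delete", None)
    if site in blocked_list:
        return ("no_action", None)
    # Hits neither list: demote the result until the period expires.
    return ("suppress", now + SUPPRESSION_PERIOD)


if __name__ == "__main__":
    now = datetime(2013, 10, 30, 12, 0)
    print(decide("http://example.net/y.html", {"www.qq.com"}, {"example.org"}, now))
```

The blacklist is checked first, matching the order described above; a URL only reaches the blocked-list comparison when it misses the blacklist.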
The embodiment of the invention collects the URLs of a preset number of web pages that have been displayed on a user interface and deduplicates them; runs dead-link/invalid-page detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link or invalid pages; compares those URLs with a blacklist established in advance; and, if such a URL hits a site in the blacklist, determines that it is a dead-link or invalid page. Compared with the prior-art approach of directly treating every page flagged by the system as a true dead-link or invalid page, the embodiment improves detection accuracy and reduces the misjudgment rate. Further, URLs that appear in neither the blacklist nor the blocked list are handled by page suppression, which avoids the serious consequences of masking such a URL or deleting it from the index because of a misjudgment.
Fig. 2 is a flowchart of one embodiment of collecting the URLs to be checked in the page detection method of the present invention. Building on the embodiment of Fig. 1, as shown in Fig. 2, collecting the URLs of a preset number of web pages that have been displayed on the user interface comprises:
Step S11: look up the download time of each page;
Step S12: according to the pressure upper limit of each site, select the URLs of a preset number of pages in order of download time.
The detection system looks up the download time of each page offline and sorts the pages by download time. Subject to the pressure upper limit of each site, it then selects the URLs of a preset number of pages in order of download time. The reason is a result the detection system obtained by offline analysis: a page that was downloaded recently is relatively unlikely to be a dead-link or invalid page, whereas a page downloaded long ago is very likely to be one. For example, in one experiment on the site blog.sina.com.cn, the detection system checked 133,923 pages in total and found 8,758 dead-link or invalid pages, all of which were among the 8,789 pages with the smallest download time (crawl time), that is, the pages crawled earliest. In other words, checking only those 8,789 pages gives essentially the same result as checking all pages: both find every dead-link or invalid page.
This approach is especially beneficial for sites that set a pressure upper limit. Crawl pressure is the frequency and total number of accesses a search engine makes to a site's server per unit time; the pressure upper limit is the maximum number of pages the site allows to be crawled in one day. In practice each site can set its own pressure upper limit, and crawls beyond it are discarded. Therefore, by looking up the download time of each page and selecting the URLs of a preset number of pages in order of download time within each site's pressure upper limit, this embodiment avoids the missed detections that occur when the number of pages to check exceeds the site's pressure limit and some pages are discarded without being checked. This further improves the detection accuracy for dead-link and invalid pages.
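Steps S11 and S12 can be sketched as a per-site selection of the earliest-crawled pages. The text does not spell out how the preset number and the pressure upper limit interact, so this sketch simply assumes the limit caps the number of pages chosen per site; the function name select_for_detection, the tuple layout, and default_limit are hypothetical.

```python
from collections import defaultdict


def select_for_detection(pages, pressure_limit, default_limit=100):
    """Pick, for each site, the earliest-crawled pages up to its limit.

    pages:          iterable of (url, site, crawl_time) tuples (assumed format;
                    a smaller crawl_time means the page was downloaded earlier).
    pressure_limit: dict mapping site -> maximum number of pages to check.
    default_limit:  cap used for sites without an explicit limit (assumption).
    """
    by_site = defaultdict(list)
    for url, site, crawl_time in pages:
        by_site[site].append((crawl_time, url))

    selected = []
    for site, entries in by_site.items():
        entries.sort()  # oldest crawl first
        limit = pressure_limit.get(site, default_limit)
        selected.extend(url for _, url in entries[:limit])
    return selected


if __name__ == "__main__":
    pages = [("u1", "a", 30), ("u2", "a", 10), ("u3", "a", 20), ("u4", "b", 5)]
    print(select_for_detection(pages, {"a": 2, "b": 1}))
```

Checking the oldest pages first follows the offline observation above that pages downloaded long ago are the most likely to be dead-link or invalid pages.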
Fig. 3 is a flowchart of a second embodiment of the page detection method of the present invention. This embodiment differs from the embodiment of Fig. 1 in that step S10, establishing the blacklist and the blocked list, is added. Only step S10 is described here; for the other steps of the page detection method, refer to the description of the foregoing embodiment, which is not repeated.
In this embodiment, step S10 serves as one of the preconditions for detecting dead-link and invalid pages. The detection system can establish the blacklist and the blocked list independently, so it suffices that both lists exist before they are used. Step S10 can therefore be performed at any point before step S03. The embodiment of Fig. 3 is described with step S10 performed before step S01 merely as an example.
As shown in Fig. 3, the page detection method of the present invention further comprises:
Step S10: establish the blacklist and the blocked list.
In this embodiment, the detection system can establish the blacklist and the blocked list from empirical values and historical detection records. Alternatively, it can use other strategies to find data that can be masked and deleted and build the lists from them, for example pages whose URL has been invalid for three consecutive days, or HTTP (Hypertext Transfer Protocol) requests that return a 404 page.
In this embodiment, the blacklist and the blocked list are both site-level. For example, the detection system stores www.qq.com in the blacklist as one of its sites. If the URL http://www.qq.com/abc.html is then detected as a dead link, it hits the site www.qq.com when the detection system compares it with the blacklist, and the detection system processes it further, for example by masking and deleting it. The URL http://www.qq.com/abc.html itself does not need to be in the blacklist. The blocked list can be established in the same site-level way.
In another preferred embodiment, the detection system establishes the blacklist and the blocked list with the help of a browser, because a browser makes it easy to determine whether a URL is a dead-link or invalid page. For the browser-based determination, refer to Fig. 4, which is a flowchart of one embodiment of establishing the blacklist and the blocked list in the page detection method of the present invention.
As shown in Fig. 4, establishing the blacklist and the blocked list based on a browser comprises:
Step S21: obtain the URLs that need to be checked for being dead-link or invalid pages;
The detection system obtains the URLs that need to be checked. In this embodiment, to save detection effort, the detection system can directly take the URLs of each site that were preliminarily detected as dead-link or invalid pages. Further, to avoid exceeding a site's pressure upper limit, the detection system selects only a preset number of these preliminarily detected URLs for further checking and determines whether the page of each is a dead-link or invalid page.
Step S22: call a browser plug-in;
Step S23: using the called browser plug-in, obtain the page state of each URL that needs to be checked;
The detection system calls a pre-written browser plug-in. The plug-in starts running, sends HTTP requests to the server, and fetches the pages of the URLs that need to be checked. Further, to make the state of these pages easier to observe, a foreground interface can be prepared in advance to display the page state of each URL under test. The browser plug-in obtains from the server the data for each URL to be checked and the state of the corresponding page, and transmits that state to the server.
Step S24: analyze the page state obtained for each URL and, according to the result, add the URL to the blacklist or the blocked list.
The detection system retrieves from the server the stored URLs that were checked and the page state of each, and analyzes them. If the page state of a URL is invalid, the URL is added to the blacklist; if the page state is valid, the URL is added to the blocked list.
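Step S24 can be sketched as a simple partition of the checked URLs by page state. The sketch assumes the states arrive as the strings "invalid" and "valid" and that list entries are site host names, consistent with the site-level lists described earlier; the function name build_lists is hypothetical. The text does not say what happens when pages of the same site disagree, so the sketch makes no attempt to resolve that case and such a site simply ends up in both sets.

```python
from urllib.parse import urlsplit


def build_lists(page_states):
    """Split checked URLs into a blacklist and a blocked list by page state.

    page_states: dict mapping URL -> "invalid" or "valid" (assumed labels,
                 as reported by the browser-based check).
    Returns (blacklist, blocked_list) as sets of site host names.
    """
    blacklist, blocked_list = set(), set()
    for url, state in page_states.items():
        site = (urlsplit(url).hostname or "").lower()
        if state == "invalid":
            blacklist.add(site)       # page really is dead or invalid
        elif state == "valid":
            blocked_list.add(site)    # page loads fine in the browser
    return blacklist, blocked_list


if __name__ == "__main__":
    states = {
        "http://www.qq.com/abc.html": "invalid",
        "http://example.org/x.html": "valid",
    }
    print(build_lists(states))
```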
In one concrete application scenario, the detection system calls a pre-written Chrome plug-in; one advantage of using a Chrome plug-in is that it easily solves the problem of cross-domain access.
Further, in this embodiment, when the detection system fetches the page state of the URLs under test, it again risks exceeding a site's pressure upper limit and being blocked. To avoid this, the detection system must not check the same site too frequently. For example, suppose there are 200,000 sites. From each site the detection system randomly selects 100 pages that were preliminarily detected as dead-link or invalid pages, and then checks them in 100 rounds, each round checking one page from each of the 200,000 sites. At a detection rate of 20 pages per second, one round takes about 3 hours, so each site is checked only once every 3 hours and there is essentially no risk of being blocked. Once the detection system has collected the required state data, it can build the blacklist and the blocked list as described above.
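The figure of about 3 hours per round follows directly from the numbers in the example. The short calculation below reproduces it; the function name round_duration_hours is hypothetical, and the total duration of all 100 rounds is derived here for illustration only and is not stated in the text.

```python
def round_duration_hours(num_sites, pages_per_second):
    """Hours needed to check one page from every site at a fixed rate."""
    seconds = num_sites / pages_per_second
    return seconds / 3600


if __name__ == "__main__":
    sites = 200_000   # sites in the example
    rate = 20         # pages checked per second
    rounds = 100      # one preliminarily detected page per site per round

    per_round = round_duration_hours(sites, rate)
    # 200,000 / 20 = 10,000 seconds, i.e. roughly 2.8 hours per round.
    print(f"one round: {per_round:.2f} hours")
    # Each site is visited once per round, so once every ~3 hours.
    print(f"all rounds: {rounds * per_round / 24:.1f} days")
```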
In addition, because URLs and their pages are time-sensitive, the detection system updates the blacklist and the blocked list at a predetermined period. The detection system can adjust the update period as circumstances require, for example according to the detection accuracy for dead-link and invalid pages; this embodiment does not limit the specific length of the update period.
By establishing the blacklist and the blocked list, the system of this embodiment provides the necessary precondition for improving the detection accuracy for dead-link and invalid pages.
Fig. 5 is a functional block diagram of a first embodiment of the page detection device of the present invention. As shown in Fig. 5, the page detection device comprises a data collection module 01, a page preliminary-detection module 02, and a page determination module 03.
The data collection module 01 collects the URLs of a preset number of web pages that have been displayed on a user interface, and deduplicates the collected URLs;
In this embodiment, the data collection module 01 collects the URLs of web pages that have been displayed on the user interface. Because different users may request the same page, the data collection module 01 deduplicates the collected URLs to avoid repeated, meaningless detection and analysis work; of several identical URLs it keeps only one.
Further, in this embodiment, taking user habits and click-through rate into account, the data collection module 01 normally collects the first 10 results displayed on the front end. For example, the browser displays the search results for a network request sent by the user, and the data collection module 01 collects the first 10 results displayed by the browser for detection.
The page preliminary-detection module 02 runs dead-link/invalid-page detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link or invalid pages;
Following a preset detection program, the page preliminary-detection module 02 runs dead-link/invalid-page detection on the collected, deduplicated URLs; for example, it performs detection steps such as page download, extraction, and content analysis on each URL and obtains a detection result for it. From these results the page preliminary-detection module 02 obtains the URLs of the pages preliminarily detected as dead-link or invalid pages.
The page determination module 03 compares the URLs preliminarily detected as dead-link or invalid pages with a blacklist established in advance and, if such a URL hits a site in the blacklist, determines that the URL is a dead-link or invalid page.
After the page preliminary-detection module 02 has preliminarily detected dead-link or invalid pages among the collected pages, the page determination module 03 compares them with the stored blacklist. If any of the preliminarily detected URLs hits a site in the stored blacklist, the page determination module 03 determines that this URL is a dead-link or invalid page. The page determination module 03 then processes the URL further; for example, the URL can be pushed for masking immediately, or even deleted directly. For example, if www.qq.com is in the blacklist, the URL http://www.qq.com/abc.html, preliminarily detected as a dead-link or invalid page, hits the site www.qq.com in the blacklist, so it is masked and deleted.
Further, in this embodiment, if some URLs do not hit any site in the blacklist, the page determination module 03 compares them with a blocked list established in advance. A URL that hits a site in the blocked list is neither masked, deleted, nor suppressed; that is, no action is taken on it. A URL that hits neither a site in the blacklist nor a site in the blocked list is suppressed by the page determination module 03. In this embodiment, suppressing a page means moving its result further back in the ordering displayed on the front end, that is, the user interface. The benefit is as follows: the page detection device will inevitably misjudge certain sites, and such misjudgments prevent "page masking" and "index deletion" from being applied effectively. Both strategies have very strict conditions of use, and a large number of mistaken deletions would have serious consequences, so this embodiment proposes the milder strategy of "page suppression". In practice a suppression is valid for only 3 days; once the suppression period has passed, the suppressed URL can be displayed again, so no serious consequence results.
The embodiment of the present invention collects the URLs corresponding to a predetermined number of web pages already displayed on the user interface and deduplicates the collected URLs; performs dead-link detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link pages; and compares those URLs with the blacklist set up in advance. If a URL preliminarily detected as a dead-link page hits a site in the blacklist, the URL is determined to be a dead-link page. Compared with the prior-art method of directly treating every page flagged by the system as a real dead-link page, the embodiment improves the accuracy of dead-link detection and reduces the misjudgment rate. Further, for URLs that appear in neither the blacklist nor the sealed list, the embodiment applies web page suppression, which avoids the serious consequences of shielding a page and deleting its index because of a misjudgment.
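The deduplication step that precedes detection can be illustrated with a minimal sketch. The description does not say how duplicates are identified; exact string equality is assumed here, and the function name `deduplicate` is hypothetical.

```python
def deduplicate(urls):
    """Drop repeated URLs while keeping the order of first appearance."""
    # dict preserves insertion order, so fromkeys() keeps the first occurrence of each URL.
    return list(dict.fromkeys(urls))


collected = [
    "http://www.qq.com/abc.html",
    "http://example.org/a",
    "http://www.qq.com/abc.html",  # duplicate of the first entry
]
print(deduplicate(collected))
```

Deduplicating before detection means each URL is requested at most once per run, which also helps keep the load on each site down.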
Referring again to Fig. 5, data set collection module 01 collecting the URLs corresponding to the predetermined number of web pages already displayed on the user interface comprises:
Querying the download time of each web page; and, according to the crawl-pressure upper limit of each site, selecting the URLs corresponding to the predetermined number of web pages in order of download time.
Data set collection module 01 queries the download time of each web page offline and sorts the web pages by download time. According to the crawl-pressure upper limit of each site, it then selects the URLs corresponding to the predetermined number of web pages in order of download time. The reason is a result that data set collection module 01 obtains from offline analysis: a page that was downloaded recently has a relatively low probability of being a dead-link page, whereas a page that was downloaded long ago has a very high probability of being one. For example, in one concrete scenario, in a set of experimental data for the site blog.sina.com.cn, page initial survey module 02 checked 133923 web pages in total and detected 8758 dead-link pages, all of which fell within the 8789 pages with the smallest download time (crawltime), that is, the pages downloaded earliest. In other words, checking only those 8789 pages is essentially as effective as checking all the pages: both find all the dead-link pages.
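The figures in this example can be put in proportion with a few lines of arithmetic. The numbers are those quoted in the experiment; the variable names are illustrative only.

```python
total_pages = 133_923    # pages checked for blog.sina.com.cn
dead_pages = 8_758       # dead-link pages found
oldest_checked = 8_789   # pages with the smallest crawltime that contain all of them

# Share of the whole page set that has to be checked to find every dead link.
share_checked = oldest_checked / total_pages

# Share of the checked slice that actually turned out to be dead links.
dead_share_in_slice = dead_pages / oldest_checked

print(f"checked share of all pages: {share_checked:.1%}")
print(f"dead links within the checked slice: {dead_share_in_slice:.1%}")
```

Checking well under a tenth of the pages is therefore enough to find every dead link in this data set, and almost every page in that slice is in fact dead.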
Selecting pages by download time is particularly beneficial for sites that set a crawl-pressure upper limit. Crawl pressure is the frequency and total number of a search engine's accesses to a site's server per unit of time, and the crawl-pressure upper limit of a site is the maximum volume of pages that the site allows to be crawled in one day. In practice, each site can set its own upper limit, and crawl requests beyond that limit are discarded. Therefore, by querying the download time of each page and selecting the URLs of the predetermined number of pages in order of download time under each site's upper limit, this embodiment avoids the missed detections that occur when the number of pages to be checked exceeds a site's limit and some pages are discarded without being checked, which further improves the accuracy of dead-link detection.
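One way to combine the download-time ordering with per-site limits is sketched below. The function name `select_for_detection`, the shape of its inputs, and the `default_limit` fallback are assumptions made for illustration; the description only states that pages are taken in order of download time within each site's upper limit.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def select_for_detection(pages, pressure_limit, default_limit=100):
    """Pick URLs oldest-first without exceeding any site's crawl-pressure limit.

    pages: iterable of (url, crawl_time) pairs; a smaller crawl_time means an older download.
    pressure_limit: dict mapping site -> maximum number of pages to check for that site.
    default_limit: hypothetical fallback for sites that have no entry in pressure_limit.
    """
    taken = defaultdict(int)
    selected = []
    for url, _crawl_time in sorted(pages, key=lambda page: page[1]):
        site = (urlsplit(url).hostname or "").lower()
        if taken[site] < pressure_limit.get(site, default_limit):
            taken[site] += 1
            selected.append(url)
    return selected


pages = [
    ("http://a.example.com/1", 30),
    ("http://a.example.com/2", 10),
    ("http://a.example.com/3", 20),
    ("http://b.example.com/1", 5),
]
print(select_for_detection(pages, {"a.example.com": 2}, default_limit=1))
```

Since the oldest pages are the most likely to be dead, spending each site's limited budget on them first loses the least when the budget runs out.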
Fig. 6 is a functional block diagram of a second embodiment of the page detection device of the present invention. This embodiment differs from the embodiment of Fig. 5 in that list creation module 04 is added. Only list creation module 04 is described here; for the other modules of the page detection device, refer to the descriptions of the related embodiments, which are not repeated.
As shown in Fig. 6, the page detection device of the present invention further comprises:
List creation module 04, configured to set up the blacklist and the sealed list.
In this embodiment, list creation module 04 can set up the blacklist and the sealed list based on empirical values and historical detection records. Alternatively, list creation module 04 can use other strategies to find data that can be shielded and deleted, and then build the lists from it; for example, the page corresponding to a URL has been invalid for three consecutive days, or HTTP requests for it return a 404 page.
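The "invalid for three consecutive days" example can be expressed as a simple streak check over a detection history. This is a hedged sketch: the function name is hypothetical, the history is assumed to hold one HTTP status code per day, and treating any status of 400 or above as a failure is an assumption (the description mentions only the 404 case).

```python
def invalid_for_consecutive_days(daily_status, days=3):
    """True when the page failed on `days` consecutive days.

    daily_status: one HTTP status code per day, oldest first (assumed record format).
    """
    streak = 0
    for status in daily_status:
        # Any 4xx/5xx status counts as a failed day here; a success resets the streak.
        streak = streak + 1 if status >= 400 else 0
        if streak >= days:
            return True
    return False


print(invalid_for_consecutive_days([200, 404, 404, 404]))
print(invalid_for_consecutive_days([404, 404, 200, 404]))
```

Requiring an unbroken run of failures, rather than any three failures, keeps a site with occasional outages from being treated as a source of dead links.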
In this embodiment, the blacklist and the sealed list set up by list creation module 04 are site-level. For example, list creation module 04 stores www.qq.com in the blacklist as one of its sites. If page initial survey module 02 then detects the URL http://www.qq.com/abc.html as a dead link, and page determination module 03 compares this URL with the blacklist, the URL hits the site www.qq.com, so page determination module 03 can process it further, for example by shielding and deleting it. The URL http://www.qq.com/abc.html itself does not need to be in the blacklist. For the same reason, the sealed list can be set up in the same way.
In another preferred embodiment, list creation module 04 can set up the blacklist and the sealed list using a browser, because a browser makes it easy to judge whether a URL is a dead-link page. Setting up the blacklist and the sealed list using a browser comprises the following:
List creation module 04 obtains the URLs that need to be checked for being dead-link pages. In this embodiment, to save detection work, list creation module 04 can directly obtain the URLs, for each site, that were preliminarily detected as dead-link pages. Further, to avoid exceeding any site's crawl-pressure upper limit, list creation module 04 selects only a predetermined number of the preliminarily detected URLs for further checking and determines whether the page corresponding to each of them is a dead-link page.
List creation module 04 calls a pre-written browser plug-in. The plug-in starts running, sends HTTP requests to the server, and fetches the pages corresponding to the URLs to be checked. Further, to make the state of each page easier to see, list creation module 04 can provide a preset front-end interface that displays the page state for each checked URL. The browser plug-in obtains from the server the data for each URL to be checked and the state of the corresponding page, and transmits that state to the server.
List creation module 04 obtains the checked URLs and the page state of each URL stored on the server, and analyzes them. If the page state for a URL is invalid, list creation module 04 adds the URL to the blacklist; if the page state for a URL is valid, list creation module 04 adds the URL to the sealed list.
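The analysis step can be sketched as a single pass over the collected states. The function name `build_lists` and the state labels `"invalid"` and `"valid"` are hypothetical. Storing the site of each URL, rather than the URL itself, is an assumption based on the site-level lists described earlier. The description does not say what happens when one site yields both valid and invalid results, so this sketch simply records the site in both sets.

```python
from urllib.parse import urlsplit


def build_lists(page_states):
    """Split checked URLs into a site-level blacklist and a site-level sealed list.

    page_states: dict mapping url -> "invalid" or "valid" (hypothetical labels for
    the page state reported by the browser-based check).
    """
    blacklist, sealed_list = set(), set()
    for url, state in page_states.items():
        site = (urlsplit(url).hostname or "").lower()
        if state == "invalid":
            blacklist.add(site)      # page really is dead: trust later detections for this site
        elif state == "valid":
            sealed_list.add(site)    # page is alive: the preliminary detection was a false alarm
    return blacklist, sealed_list


states = {
    "http://www.qq.com/abc.html": "invalid",
    "http://blog.example.com/post/1": "valid",
}
print(build_lists(states))
```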
In one concrete application scenario, list creation module 04 calls a pre-written Chrome plug-in. The advantage of using a Chrome plug-in is that it easily solves the problem of cross-domain access.
Further, in this embodiment, when list creation module 04 obtains the page states of the URLs being checked, there is also a risk of being blocked for exceeding a site's crawl-pressure upper limit. To avoid this, list creation module 04 must not check the same site too frequently. For example, in the following concrete scenario, suppose there are 200,000 sites. From each site, list creation module 04 randomly selects 100 URLs that page initial survey module 02 preliminarily detected as dead-link pages. Page initial survey module 02 then checks them in 100 rounds, each round covering one page from each of the 200,000 sites. At a detection rate of 20 pages per second, one round takes about 3 hours, so each site is accessed only once every 3 hours and there is essentially no risk of being blocked. After list creation module 04 has collected the required state data, it can build the blacklist and the sealed list in the manner described above.
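The timing in this scenario follows from simple arithmetic, shown here as a short worked calculation. The site count is taken as 200,000; the variable names are illustrative, and the total duration is derived from the stated figures rather than given in the description.

```python
sites = 200_000          # number of sites in the scenario
pages_per_site = 100     # URLs sampled per site, one per round
rate_per_second = 20     # pages checked per second

# One round visits each site exactly once.
seconds_per_round = sites / rate_per_second
hours_per_round = seconds_per_round / 3600

# All rounds together: one round per sampled page.
total_hours = hours_per_round * pages_per_site

print(f"one round: {seconds_per_round:.0f} s, about {hours_per_round:.1f} h")
print(f"all {pages_per_site} rounds: about {total_hours / 24:.1f} days")
```

Each round takes a little under 3 hours, which matches the "about 3 hours" interval between two visits to the same site.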
Fig. 7 is a functional block diagram of a third embodiment of the page detection device of the present invention. The page detection device shown in Fig. 7 further comprises list update module 05, which is configured to update the blacklist and the sealed list according to a predetermined period.
Because URLs and their corresponding pages are time-sensitive, list update module 05 updates, according to a predetermined period, the blacklist and the sealed list built by list creation module 04. List update module 05 can keep adjusting the update period as circumstances require, for example according to the accuracy of dead-link detection. This embodiment does not limit the specific length of the update period.
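A periodic update can be driven by a check as simple as the one below. The function name `update_due` and the 7-day period are purely hypothetical, since the description deliberately leaves the period length open.

```python
from datetime import datetime, timedelta


def update_due(last_update, now, period):
    """True when at least one full update period has elapsed since the last update."""
    return now - last_update >= period


period = timedelta(days=7)        # hypothetical period; the description fixes no length
last_update = datetime(2013, 10, 1)

print(update_due(last_update, datetime(2013, 10, 5), period))
print(update_due(last_update, datetime(2013, 10, 8), period))
```

Keeping the period as a parameter reflects the description's point that it can be adjusted over time, for example when detection accuracy changes.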
By setting up the blacklist and the sealed list, the embodiment of the present invention provides the necessary precondition for improving the detection system's accuracy in detecting dead-link pages.
It should be noted that, in this document, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that comprises the element.
The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a computer software product. The software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) of the page detection device described in Figs. 5 to 7, and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the page detection device described in Figs. 5 to 7) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the scope of its patent. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims (14)

1. A page detection method, characterized by comprising the following steps:
Collecting the URLs corresponding to a predetermined number of web pages already displayed on a user interface, and deduplicating the collected URLs;
Performing dead-link detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link pages;
Comparing the URLs preliminarily detected as dead-link pages with a blacklist set up in advance;
If a URL preliminarily detected as a dead-link page hits a site in the blacklist, determining that the URL hitting the site in the blacklist is a dead-link page.
2. The method of claim 1, characterized in that, after comparing the URLs preliminarily detected as dead-link pages with the blacklist set up in advance, the method further comprises:
If a URL preliminarily detected as a dead-link page does not hit any site in the blacklist, comparing the URL that does not hit a site in the blacklist with a sealed list set up in advance;
If the URL hits a site in the sealed list, suppressing the URL.
3. The method of claim 1 or 2, characterized in that collecting the URLs corresponding to the predetermined number of web pages already displayed on the user interface comprises:
Querying the download time of each web page;
According to the crawl-pressure upper limit of each site, selecting the URLs corresponding to the predetermined number of web pages in order of download time.
4. The method of claim 2, characterized in that, before comparing the URLs preliminarily detected as dead-link pages with the blacklist set up in advance, the method further comprises:
Setting up the blacklist and the sealed list.
5. The method of claim 4, characterized in that setting up the blacklist and the sealed list comprises:
Obtaining the URLs that need to be checked for being dead-link pages;
Calling a browser plug-in;
Obtaining, by means of the called browser plug-in, the page state corresponding to each URL that needs to be checked for being a dead-link page;
Analyzing the obtained page state corresponding to each URL and, according to the analysis result, setting up the blacklist and the sealed list from the corresponding URLs.
6. The method of claim 5, characterized in that obtaining the URLs that need to be checked for being dead-link pages comprises:
Obtaining, for each site, the URLs corresponding to a predetermined number of the preliminarily detected dead-link pages.
7. The method of claim 4, characterized in that, after setting up the blacklist and the sealed list, the method further comprises:
Updating the blacklist and the sealed list according to a predetermined period.
8. A page detection device, characterized by comprising:
A data collection module, configured to collect the URLs corresponding to a predetermined number of web pages already displayed on a user interface, and to deduplicate the collected URLs;
A page initial survey module, configured to perform dead-link detection on the deduplicated URLs to obtain the URLs preliminarily detected as dead-link pages;
A page determination module, configured to compare the URLs preliminarily detected as dead-link pages with a blacklist set up in advance and, if a URL preliminarily detected as a dead-link page hits a site in the blacklist, to determine that the URL hitting the site in the blacklist is a dead-link page.
9. The device of claim 8, characterized in that the page determination module is further configured to:
If a URL preliminarily detected as a dead-link page does not hit any site in the blacklist, compare the URL that does not hit a site in the blacklist with a sealed list set up in advance;
If the URL hits a site in the sealed list, suppress the URL.
10. The device of claim 8 or 9, characterized in that the data collection module is further configured to:
Query the download time of each web page;
According to the crawl-pressure upper limit of each site, select the URLs corresponding to the predetermined number of web pages in order of download time.
11. The device of claim 9, characterized by further comprising:
A list creation module, configured to set up the blacklist and the sealed list.
12. The device of claim 11, characterized in that the list creation module is further configured to:
Obtain the URLs that need to be checked for being dead-link pages;
Call a browser plug-in;
Obtain, by means of the called browser plug-in, the page state corresponding to each URL that needs to be checked for being a dead-link page;
Analyze the obtained page state corresponding to each URL and, according to the analysis result, set up the blacklist and the sealed list from the corresponding URLs.
13. The device of claim 12, characterized in that the list creation module is further configured to:
Obtain, for each site, the URLs corresponding to a predetermined number of the preliminarily detected dead-link pages.
14. The device of claim 11, characterized by further comprising:
A list update module, configured to update the blacklist and the sealed list according to a predetermined period.
CN201310528389.2A 2013-10-30 2013-10-30 Page detection method and device Active CN104598458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310528389.2A CN104598458B (en) 2013-10-30 2013-10-30 Page detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310528389.2A CN104598458B (en) 2013-10-30 2013-10-30 Page detection method and device

Publications (2)

Publication Number Publication Date
CN104598458A true CN104598458A (en) 2015-05-06
CN104598458B CN104598458B (en) 2019-07-16

Family

ID=53124257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310528389.2A Active CN104598458B (en) 2013-10-30 2013-10-30 Page detection method and device

Country Status (1)

Country Link
CN (1) CN104598458B (en)

Also Published As

Publication number Publication date
CN104598458B (en) 2019-07-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant