CN106874299A - Page detection method and device - Google Patents

Publication number
CN106874299A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510923931.3A
Other languages
Chinese (zh)
Inventor
李新国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510923931.3A
Publication of CN106874299A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

This application discloses a page detection method and device. The method includes: parsing the access log of a target website within a preset time period to obtain multiple web pages accessed within the preset time period; determining target web pages from the accessed web pages, a target web page being one that was not accessed before the preset time period; crawling the page content of each target web page and parsing its publication time from the page content; judging whether the publication time is within the preset time period; and, when it is, determining that the target web page is a web page updated within the preset time period. The application solves the technical problem that detecting updated web pages is inefficient because of the large number of web pages that must be examined.

Description

Page detection method and device
Technical field
The present application relates to the Internet field, and in particular to a page detection method and device.
Background
In the Internet field, the web pages of a website are updated continually, and the website update volume is an important indicator for evaluating a website's performance. Here, the website update volume is the number of web pages that the website updates within a given time. When counting it, determining which web pages were updated within that time is a difficult problem. At present, a crawler typically crawls the pages of the website, and each page is then analysed one by one to decide whether it is an updated page. However, the larger the website whose update volume is to be counted, the more pages are crawled each time, and most of those pages are not updated pages. The number of pages to examine is therefore large, which makes the detection of updated pages inefficient.
No effective solution to the above problem has yet been proposed.
Summary of the invention
The embodiments of the present application provide a page detection method and device, to at least solve the technical problem that detecting updated web pages is inefficient because of the large number of web pages that must be examined.
According to one aspect of the embodiments of the present application, a page detection method is provided, including: parsing the access log of a target website within a preset time period to obtain multiple accessed web pages within the preset time period; determining a target web page from the multiple accessed web pages, the target web page being a web page that was not accessed before the preset time period; crawling the page content of the target web page and parsing the publication time of the target web page from the page content; judging whether the publication time is within the preset time period; and, when the publication time is judged to be within the preset time period, determining that the target web page is a web page updated within the preset time period.
Further, determining the target web page from the multiple accessed web pages includes: matching, one by one, the uniform resource locator (URL) of each of the multiple accessed web pages against the URLs of web pages recorded before the preset time period; and, when the URL of an accessed web page matches none of the URLs of the target website's web pages recorded before the preset time period, taking that unmatched accessed web page as the target web page.
Further, matching the URLs one by one and taking the unmatched accessed web page as the target web page includes: hash-encoding the URL of each of the multiple accessed web pages to obtain a hash value of the URL of each accessed web page; querying a preconfigured Bloom filter for the hash value of the URL of each accessed web page, where the Bloom filter stores the hash values of the URLs of web pages published on the target website before the preset time period; and taking the web page corresponding to each hash value that is not found as the target web page.
Further, after crawling the page content of the target web page, the method also includes: judging, from the page content, whether the target web page is a list page; and discarding the target web page when it is judged to be a list page.
Further, parsing the publication time of the target web page from the page content includes: parsing the publication time from the page content according to parsing rules configured for the target website; or parsing the publication time from the page content according to preset parsing rules.
According to another aspect of the embodiments of the present application, a page detection device is also provided, including: a first parsing unit, configured to parse the access log of a target website within a preset time period to obtain multiple accessed web pages within the preset time period; a first determining unit, configured to determine a target web page from the multiple accessed web pages, the target web page being a web page that was not accessed before the preset time period; a second parsing unit, configured to crawl the page content of the target web page and parse the publication time of the target web page from the page content; a first judging unit, configured to judge whether the publication time is within the preset time period; and a second determining unit, configured to determine, when the publication time is judged to be within the preset time period, that the target web page is a web page updated within the preset time period.
Further, the first determining unit is specifically configured to match, one by one, the URL of each of the multiple accessed web pages against the URLs of web pages recorded before the preset time period, and, when the URL of an accessed web page matches none of the URLs of the target website's web pages recorded before the preset time period, to take that unmatched accessed web page as the target web page.
Further, the first determining unit includes: an encoding module, configured to hash-encode the URL of each of the multiple accessed web pages to obtain a hash value of the URL of each accessed web page; a query module, configured to query a preconfigured Bloom filter for the hash value of the URL of each accessed web page, where the Bloom filter stores the hash values of the URLs of web pages published on the target website before the preset time period; and a determining module, configured to take the web page corresponding to each hash value that is not found as the target web page.
Further, the device also includes: a second judging unit, configured to judge, from the page content, whether the target web page is a list page after the page content of the target web page has been crawled; and a discarding unit, configured to discard the target web page when it is judged to be a list page.
Further, the second parsing unit includes: a first parsing module, configured to parse the publication time of the target web page from the page content according to parsing rules configured for the target website; or a second parsing module, configured to parse the publication time of the target web page from the page content according to preset parsing rules.
According to the embodiments of the present application, the access log of the target website within the preset time period is parsed to obtain multiple accessed web pages within that period; a target web page, being a web page that was not accessed before the preset time period, is determined from the accessed web pages; the page content of the target web page is crawled and its publication time is parsed from the page content; whether the publication time is within the preset time period is judged; and, when it is, the target web page is determined to be a web page updated within the preset time period. Because only the web pages accessed within the preset time period are examined, the number of web pages is greatly reduced compared with crawling all web pages of the website as in the prior art. This solves the technical problem that detecting updated web pages is inefficient because of the large number of web pages to examine, and improves the efficiency of detecting updated web pages.
Brief description of the drawings
The drawings described here provide further understanding of the present application and form part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the page detection method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the page detection device according to an embodiment of the present application.
Detailed description
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects, not to describe a particular order or sequence. Data used in this way may be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "including" and "having", and any variants of them, are intended to cover non-exclusive inclusion: a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
According to an embodiment of the present application, a method embodiment of a page detection method is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
Fig. 1 is a flowchart of the page detection method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step S102: parse the access log of the target website within a preset time period to obtain multiple accessed web pages within the preset time period.
The preset time period is the period to be examined. For example, to detect the web pages that the target website updated on December 1, 2015, the access log for that day is selected from the target website's access logs, and the web pages accessed on that day are parsed from it. The access log can be obtained from the target website's server, or collected by monitoring code deployed on the target website. Because updated web pages usually attract the attention and visits of network users, this embodiment, when detecting the web pages updated within a preset time period, first determines the web pages accessed within that period, so that the pages accessed for the first time can be identified among them.
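Step S102 can be sketched as follows. This is a minimal illustration only, assuming a hypothetical log format in which each line holds an ISO 8601 timestamp and the requested URL separated by whitespace; the function name accessed_urls and the example URLs are likewise assumptions, and real server logs or tracker records would need their own parser.

```python
from datetime import datetime

def accessed_urls(log_lines, start, end):
    """Return the distinct URLs requested in [start, end), in first-seen order.

    Assumed (hypothetical) line format: "<ISO timestamp> <URL>".
    """
    seen = {}
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        try:
            ts = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # skip lines whose first field is not a timestamp
        if start <= ts < end:
            seen.setdefault(parts[1], ts)  # keep only the first visit
    return list(seen)

log = [
    "2015-11-30T23:59:10 http://example.com/a",
    "2015-12-01T08:00:00 http://example.com/a",
    "2015-12-01T09:30:00 http://example.com/b",
    "2015-12-01T10:00:00 http://example.com/a",
    "2015-12-02T00:00:01 http://example.com/c",
]
# Preset time period: the whole of December 1, 2015.
day = accessed_urls(log, datetime(2015, 12, 1), datetime(2015, 12, 2))
```

Here day holds the two URLs visited on December 1, each listed once, regardless of how many times it was requested.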
Step S104: determine a target web page from the multiple accessed web pages, the target web page being a web page that was not accessed before the preset time period.
The web pages accessed within the preset time period include both web pages updated within the period and web pages that were already updated before it. This embodiment therefore picks out, from the accessed web pages, those accessed for the first time, that is, those not accessed before the preset time period; these are the target web pages.
Optionally, the web pages accessed before the preset time period can be counted and recorded in advance, and each of the accessed web pages is then matched against the recorded web pages. A match shows that the web page was also accessed before the preset time period, so it is not an updated web page. Conversely, if there is no match, the web page may have been updated within the preset time period, and it is taken as a target web page for further judgment.
Step S106: crawl the page content of the target web page, and parse the publication time of the target web page from the page content.
There may be one or more target web pages. If there are several, the page content of each must be crawled, and the publication time of the corresponding target web page parsed from the crawled content. When a website updates a web page, the publication time is usually recorded in the page content; this publication time is the time at which the page was updated. Parsing it makes it possible to determine accurately whether the target web page was updated within the preset time period.
Step S108: judge whether the publication time is within the preset time period.
Step S110: when the publication time is judged to be within the preset time period, determine that the target web page is a web page updated within the preset time period. When the publication time is judged to be outside the preset time period, determine that the target web page is not a web page updated within the preset time period.
According to the embodiments of the present application, the access log of the target website within the preset time period is parsed to obtain multiple accessed web pages within that period; a target web page, being a web page that was not accessed before the preset time period, is determined from the accessed web pages; the page content of the target web page is crawled and its publication time is parsed from the page content; whether the publication time is within the preset time period is judged; and, when it is, the target web page is determined to be a web page updated within the preset time period. Because only the web pages accessed within the preset time period are examined, the number of web pages is greatly reduced compared with crawling all web pages of the website as in the prior art. This solves the technical problem that detecting updated web pages is inefficient because of the large number of web pages to examine, and improves the efficiency of detecting updated web pages.
Further, in the embodiments of the present application, the web pages accessed for the first time within the preset time period, that is, the target web pages, are subjected to a further check: the page content of each target web page is crawled, its publication time is parsed, and the publication time is used to decide whether the page was updated within the preset time period. Web pages that were updated long ago but were first accessed within the preset time period are thereby rejected, which improves the accuracy of detecting updated web pages.
Preferably, determining the target web page from the multiple accessed web pages includes: matching, one by one, the uniform resource locator (URL) of each of the multiple accessed web pages against the URLs of web pages recorded before the preset time period; and, when the URL of an accessed web page matches none of the URLs of the target website's web pages recorded before the preset time period, taking that unmatched accessed web page as the target web page.
In this embodiment, the URLs of the target website's web pages that were accessed before the preset time period are recorded in advance. When determining the target web page, whether an accessed web page is a target web page can be judged by matching its URL against the URLs of the web pages recorded before the preset time period.
Specifically, the URLs of all accessed web pages are parsed from the target website's access log for the preset time period, and each URL is matched against the pre-recorded URLs. If an identical URL is found, the web page is considered not to be accessed for the first time within the preset time period, that is, it is not a target web page. Conversely, if no identical URL is found, the web page corresponding to the URL is a target web page.
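The exact-match variant described above amounts to a set-membership test. The following is a minimal sketch, assuming the pre-recorded URLs fit in memory as a Python set; the function name find_target_urls and the example URLs are illustrative assumptions, not part of the application.

```python
def find_target_urls(accessed, recorded_before):
    """Return the accessed URLs that match no URL recorded before the period."""
    known = set(recorded_before)  # URLs already seen before the preset period
    return [url for url in accessed if url not in known]

recorded = ["http://example.com/a", "http://example.com/old"]
accessed = ["http://example.com/a", "http://example.com/b"]
targets = find_target_urls(accessed, recorded)
```

Only http://example.com/b is left as a target web page, because http://example.com/a was already recorded before the period.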
More preferably, matching the URLs one by one and taking the unmatched accessed web page as the target web page includes: hash-encoding the URL of each of the multiple accessed web pages to obtain a hash value of the URL of each accessed web page; querying a preconfigured Bloom filter for the hash value of the URL of each accessed web page, where the Bloom filter stores the hash values of the URLs of web pages published on the target website before the preset time period; and taking the web page corresponding to each hash value that is not found as the target web page.
Specifically, a preconfigured Bloom filter can be used for URL matching. After the Bloom filter is built, the hash values of the URLs of all web pages published on the target website before the preset time period are calculated according to preset rules and stored in the Bloom filter. During detection of target web pages, the hash value of the URL of each web page accessed within the preset time period is calculated by the same rules and looked up in the Bloom filter. If an identical hash value is found, the corresponding web page was published before the preset time period; conversely, if it is not found, the web page is a target web page.
In this embodiment, calculating the hash value of the URL of each web page accessed within the preset time period and looking that hash value up in the Bloom filter reduces the complexity of matching and improves query efficiency, compared with matching the URLs directly.
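The Bloom filter lookup can be sketched as below. This is an illustrative toy, not the application's implementation: the class name, the use of SHA-256 with double hashing to derive the bit positions, and the values of m and k are all assumptions chosen for the example.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter with m bits and k bit positions per item (illustrative)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# URLs published before the preset time period go into the filter first.
history = BloomFilter(m=1 << 16, k=7)
for url in ["http://example.com/a", "http://example.com/old"]:
    history.add(url)

# URLs accessed within the period that the filter does not report are targets.
accessed = ["http://example.com/a", "http://example.com/b"]
targets = [u for u in accessed if u not in history]
```

A Bloom filter never reports a stored item as absent, but it can occasionally report an absent item as present. In this setting such a false positive would cause a genuinely new page to be skipped, which is why the false-positive rate is a design parameter.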
Further, the Bloom filter must be built before target web pages are detected, as follows:
First, estimate the scale of the target website, that is, the total number x of URLs of its web pages. Then set the number n of elements that the Bloom filter is to hold. The value of n can be determined from x; for example, x multiplied by 10 is used as the estimated number n of elements held in the Bloom filter. The tolerable error rate p, that is, the false-positive rate, is entered according to actual conditions, for example 0.001%.
Then the required memory size m, in bits, is calculated:
The number k of hash functions is then obtained from m and n:
Finally, the Bloom filter is initialized according to the above parameters (m, p, k). The URLs that have already been accessed are extracted from the system and hash-encoded, and the resulting hash values are stored in the Bloom filter.
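The formulas for m and k do not appear in this text. The sketch below assumes the standard Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, which are consistent with the description (m derived from n and p, k derived from m and n) but are not confirmed by the source. The function name bloom_parameters and the example site size are hypothetical.

```python
import math

def bloom_parameters(n, p):
    """Return (m, k) for n expected elements and false-positive rate p.

    Assumes the standard formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

x = 100_000      # estimated number of page URLs on the site (example value)
n = 10 * x       # capacity, following the "x multiplied by 10" example
p = 0.00001      # 0.001% expressed as a probability
m, k = bloom_parameters(n, p)
```

With these example inputs the filter needs roughly 24 million bits, about 3 MB, and 17 hash functions.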
Preferably, after a query finds that the hash value of the URL of an accessed web page is not present, the method also includes: storing the hash value of the URL of that accessed web page in the Bloom filter.
In this embodiment, once a target web page has been determined, the hash value of its URL can be stored in the Bloom filter, so that the web pages updated within this preset time period are excluded when updated web pages are detected subsequently.
Preferably, after crawling the page content of the target web page, the method also includes: judging, from the page content, whether the target web page is a list page; and discarding the target web page when it is judged to be a list page.
A target website may contain list pages (also called navigation pages). A list page generally contains only hyperlinks to other web pages and has no substantive page content. To prevent list pages from affecting the result, in the embodiments of the present application, after the page content of a target web page is crawled, whether the page is a list page is judged. If it is, the list page is discarded and its page content is not parsed, which reduces the amount of data to be parsed.
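The application does not say how a list page is recognised. One possible heuristic, offered purely as an assumption, is to treat a page as a list page when most of its visible text sits inside hyperlinks. The sketch below uses Python's standard html.parser; the name is_list_page and the threshold of 0.8 are hypothetical.

```python
from html.parser import HTMLParser

class _TextStats(HTMLParser):
    """Count text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n

def is_list_page(html, threshold=0.8):
    """Hypothetical heuristic: a page whose text is mostly link text is a list page."""
    stats = _TextStats()
    stats.feed(html)
    stats.close()
    if stats.total_chars == 0:
        return False  # no text at all: nothing to classify
    return stats.link_chars / stats.total_chars >= threshold

nav = '<ul><li><a href="/1">First story</a></li><li><a href="/2">Second story</a></li></ul>'
article = '<h1>Title</h1><p>' + 'Body text. ' * 20 + '</p><a href="/">Home</a>'
```

Under this heuristic nav is classified as a list page and article is not. A production rule would likely need site-specific tuning, for example ignoring navigation bars and footers.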
Preferably, parsing the publication time of the target web page from the page content includes: parsing the publication time from the page content according to parsing rules configured for the target website; or parsing the publication time from the page content according to preset parsing rules.
If parsing rules are configured for the target website, the publication time can be parsed according to those rules when the page content is parsed. If no parsing rules are configured for the target website, parsing can be performed according to general rules.
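The choice between a site-specific rule and a general rule, followed by the check of steps S108 and S110, can be sketched as follows. The rule format (a regular expression with five groups for year, month, day, hour and minute), the general pattern and the pubtime markup are all assumptions made for illustration; the application does not specify what a parsing rule looks like.

```python
import re
from datetime import datetime

# Hypothetical general rule: first date of the form YYYY-MM-DD, optionally with HH:MM.
GENERAL_RULE = re.compile(r"(\d{4})-(\d{2})-(\d{2})(?:[ T](\d{2}):(\d{2}))?")

def parse_publish_time(html, site_rule=None):
    """Return the publication time found in html, or None.

    site_rule, if given, is a compiled regex with the same five groups
    as GENERAL_RULE; otherwise the general rule is used.
    """
    rule = site_rule or GENERAL_RULE
    match = rule.search(html)
    if not match:
        return None
    year, month, day, hour, minute = match.groups()
    try:
        return datetime(int(year), int(month), int(day),
                        int(hour or 0), int(minute or 0))
    except ValueError:
        return None  # digits matched but do not form a valid date

def updated_in_period(publish_time, start, end):
    """Steps S108/S110: True when the publication time lies in [start, end)."""
    return publish_time is not None and start <= publish_time < end

site_rule = re.compile(
    r'<span class="pubtime">(\d{4})/(\d{2})/(\d{2}) (\d{2}):(\d{2})</span>')
page = '<span class="pubtime">2015/12/01 09:15</span><p>2014-01-01 archive</p>'
start, end = datetime(2015, 12, 1), datetime(2015, 12, 2)

with_site_rule = parse_publish_time(page, site_rule)
with_general_rule = parse_publish_time(page)
```

On this example page the site-specific rule finds the real publication time, while the general rule picks up an unrelated date in the body text. That difference illustrates why a rule configured for the target website is preferred when one exists.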
A preferred implementation of the embodiments of the present application is described below. It includes:
Step 1: Deploy monitoring code (a Tracker) on the target website. The Tracker can be a JS script embedded in the source code of the target website, which sends the log of users' visits to the website to a designated server;
Step 2: Parse, one by one, the access logs of the target website within the preset time period that the server has collected;
Step 3: Extract the URLs from the access logs, that is, the URLs of the web pages that users accessed within the preset time period;
Step 4: Hash-encode each URL obtained in Step 3 to obtain the corresponding hash value, then query the Bloom filter for that hash value to detect whether the URL is already present. If it is present, the URL was accessed before the preset time period, so the web page is not a newly published, that is, updated, web page. If the URL was not accessed before the preset time period, the web page corresponding to the URL is considered a target web page;
Step 5: Finish parsing all access logs collected within the preset time period;
Step 6: For the target web pages obtained in Step 5, use a crawler to crawl the page content corresponding to each URL. Compared with the prior art, which crawls almost all URLs of the whole website, the present application obtains only a small number of target web pages after the preceding steps, so less content is crawled;
Step 7: If parsing rules are detected to be configured for the target website, parse the publication date from the crawled page content according to those rules; if not, parse it according to general rules. Then compare the parsed publication date: if the publication date equals the date on which the web page was accessed, that is, it is within the preset time period, the web page can be determined to be a web page updated within the preset time period, and its URL is marked accordingly; otherwise the URL is considered not to have been updated within the preset time period;
Step 8: For each web page obtained in Step 7, judge whether it is a list page, and if so discard it;
Step 9: Record the URLs obtained in Step 8 and the corresponding dates;
Step 10: Write the hash values of the URLs of the target web pages determined in Step 4 into the Bloom filter.
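Steps 3 to 10 above can be tied together in a short sketch. It is an outline under stated assumptions rather than the application's implementation: the function detect_updated_pages and its parameters are hypothetical, a plain Python set stands in for the Bloom filter, and the crawler, the date parser and the list-page test are passed in as caller-supplied functions.

```python
from datetime import datetime

def detect_updated_pages(accessed_urls, history, fetch, parse_time, is_list,
                         start, end):
    """Return [(url, publish_time)] for pages judged updated in [start, end).

    history: container supporting `in` and `.add` (a set stands in for
             the Bloom filter here)
    fetch, parse_time, is_list: caller-supplied callables (hypothetical)
    """
    # Steps 3-5: keep each URL not seen before the period, once, in order.
    targets = list(dict.fromkeys(u for u in accessed_urls if u not in history))
    updated = []
    for url in targets:
        html = fetch(url)                        # Step 6: crawl the page
        published = parse_time(html)             # Step 7: parse the date
        in_period = published is not None and start <= published < end
        if in_period and not is_list(html):      # Step 8: drop list pages
            updated.append((url, published))     # Step 9: record URL and date
        history.add(url)                         # Step 10: remember the URL
    return updated

# Toy data: URL -> (publication date, is it a list page?)
pages = {
    "http://example.com/new": ("2015-12-01", False),
    "http://example.com/stale": ("2015-06-01", False),
    "http://example.com/list": ("2015-12-01", True),
}
history = {"http://example.com/old"}
result = detect_updated_pages(
    ["http://example.com/old", "http://example.com/new",
     "http://example.com/stale", "http://example.com/list"],
    history,
    fetch=lambda url: url,  # stand-in: the "content" is just the URL key
    parse_time=lambda html: datetime.strptime(pages[html][0], "%Y-%m-%d"),
    is_list=lambda html: pages[html][1],
    start=datetime(2015, 12, 1),
    end=datetime(2015, 12, 2),
)
```

In this toy run only http://example.com/new is reported: the old page is filtered out by the history, the stale page by its publication date, and the list page by the list-page test. All three newly seen URLs are added to the history so that they are not examined again in a later period.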
To sum up, the embodiments of the present invention can achieve the following technical effects:
1. The server cost and bandwidth cost of crawling website updates are greatly reduced;
2. Isolated pages (pages that have not been accessed) can be identified effectively, which improves the accuracy of the update volume statistics;
3. The added list-page judgment further improves accuracy;
4. Using a Bloom filter greatly increases the speed of judging whether a page is a historical page.
The embodiments of the present application also provide a page detection device, which can be used to perform the page detection method of the embodiments. As shown in Fig. 2, the device includes: a first parsing unit 10, a first determining unit 20, a second parsing unit 30, a first judging unit 40 and a second determining unit 50.
The first parsing unit 10 is configured to parse the access log of the target website within a preset time period to obtain multiple accessed web pages within the preset time period.
The preset time period is the period to be examined. For example, to detect the web pages that the target website updated on December 1, 2015, the access log for that day is selected from the target website's access logs, and the web pages accessed on that day are parsed from it. The access log can be obtained from the target website's server, or collected by monitoring code deployed on the target website. Because updated web pages usually attract the attention and visits of network users, this embodiment, when detecting the web pages updated within a preset time period, first determines the web pages accessed within that period, so that the pages accessed for the first time can be identified among them.
The first determining unit 20 is configured to determine a target web page from the multiple accessed web pages, the target web page being a web page that was not accessed before the preset time period.
The web pages accessed within the preset time period include both web pages updated within the period and web pages that were already updated before it. This embodiment therefore picks out, from the accessed web pages, those accessed for the first time, that is, those not accessed before the preset time period; these are the target web pages.
Optionally, the web pages accessed before the preset time period can be counted and recorded in advance, and each of the accessed web pages is then matched against the recorded web pages. A match shows that the web page was also accessed before the preset time period, so it is not an updated web page. Conversely, if there is no match, the web page may have been updated within the preset time period, and it is taken as a target web page for further judgment.
The second parsing unit 30 is configured to crawl the page content of the target web page and parse the publication time of the target web page from the page content.
There may be one or more target web pages. If there are several, the page content of each must be crawled, and the publication time of the corresponding target web page parsed from the crawled content. When a website updates a web page, the publication time is usually recorded in the page content; this publication time is the time at which the page was updated. Parsing it makes it possible to determine accurately whether the target web page was updated within the preset time period.
The first judging unit 40 is configured to judge whether the publication time is within the preset time period.
The second determining unit 50 is configured to determine, when the publication time is judged to be within the preset time period, that the target web page is a web page updated within the preset time period.
According to this embodiment of the application, the access log of the target website in the preset time period is parsed to obtain the multiple webpages accessed in the preset time period; a target webpage is determined from the multiple accessed webpages, the target webpage being a webpage that was not accessed before the preset time period; the page content of the target webpage is crawled, and the issuing time of the target webpage is parsed from the page content; whether the issuing time is within the preset time period is judged; and when the issuing time is judged to be within the preset time period, the target webpage is determined to be a webpage updated in the preset time period. Because only the webpages accessed in the preset time period are examined, far fewer webpages need to be processed than in the prior art, which crawls all webpages of the website. This solves the technical problem that detecting updated webpages is inefficient because of the large number of webpages to be examined, and improves the efficiency of detecting updated webpages.
Further, in this embodiment of the application, a further judgment is made on the webpages accessed for the first time in the preset time period, that is, the target webpages: the page content of each target webpage is crawled, its issuing time is parsed, and the webpages updated in the preset time period are determined from the issuing time. This rejects webpages that were updated long ago but were accessed for the first time only in the preset time period, which improves the accuracy of detecting updated webpages.
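The first step summarized above, obtaining the webpages accessed in the preset time period from the access log, might look like the sketch below. The log layout is an assumption: each line is taken to be `timestamp method url status` with an ISO-style timestamp, which is a simplified hypothetical format rather than anything specified in the application. The name `accessed_pages` is likewise illustrative.

```python
from datetime import datetime

def accessed_pages(log_lines, start, end):
    """Collect the distinct URLs requested within [start, end).

    Assumed (hypothetical) line layout:
        2015-12-14T08:30:00 GET /news/1.html 200
    """
    pages = set()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        try:
            stamp = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip lines whose timestamp cannot be parsed
        if start <= stamp < end:
            pages.add(fields[2])
    return pages

logs = [
    "2015-12-14T08:30:00 GET /news/1.html 200",
    "2015-12-13T23:59:59 GET /news/0.html 200",  # before the period
    "garbage",                                   # malformed, ignored
    "2015-12-14T09:00:00 GET /news/1.html 200",  # repeat visit, deduplicated
]
pages = accessed_pages(logs, datetime(2015, 12, 14), datetime(2015, 12, 15))
```

Using a set keeps each URL once, which matches the later steps: they operate on distinct accessed webpages rather than on individual requests.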
Preferably, the first determining unit is specifically configured to match the uniform resource locators (URLs) of the multiple accessed webpages one by one against the URLs of the webpages recorded before the preset time period, and, when the URL of an accessed webpage among the multiple accessed webpages does not match the URL of any webpage on the target website recorded before the preset time period, to take the unmatched accessed webpage as the target webpage.
In this embodiment, the URLs of the webpages on the target website that were accessed before the preset time period are recorded in advance. When determining the target webpage, whether an accessed webpage is a target webpage can be judged by matching its uniform resource locator (URL) against the URLs of the webpages recorded before the preset time period.
Specifically, the URLs of all accessed webpages are parsed from the access log of the target website in the preset time period, and each URL is matched against the pre-recorded URLs. If an identical URL is matched, the webpage corresponding to the URL is not being accessed for the first time in the preset time period, that is, it is a non-target webpage; conversely, if no identical URL is matched, the webpage corresponding to the URL is a target webpage.
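The URL matching described above amounts to a membership test against the pre-recorded URLs. A minimal sketch, assuming exact string comparison of URLs (the text says "identical URL" but does not describe any normalization) and using the hypothetical name `find_target_pages`:

```python
def find_target_pages(accessed_urls, recorded_urls):
    """Return the accessed URLs that do not appear among the recorded URLs.

    accessed_urls: URLs parsed from the access log of the preset period.
    recorded_urls: URLs recorded as accessed before the preset period.
    """
    recorded = set(recorded_urls)  # set lookup keeps each test O(1) on average
    return [url for url in accessed_urls if url not in recorded]

targets = find_target_pages(
    ["/news/1.html", "/news/2.html", "/index.html"],
    ["/index.html", "/news/1.html"],
)
```

Here only `/news/2.html` was never seen before the period, so it alone becomes a target webpage and goes on to the crawling step.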
Preferably, the first determining unit includes: a coding module, configured to hash the URL of each accessed webpage among the multiple accessed webpages to obtain the hash value of the URL of each accessed webpage; a query module, configured to query, in a preset Bloom filter, the hash value of the URL of each accessed webpage among the multiple accessed webpages, wherein the Bloom filter stores the hash values of the URLs of the webpages issued on the target website before the preset time period; and a determining module, configured to take the webpages corresponding to the hash values that are not found as the target webpages.
Specifically, a preset Bloom filter can be used for URL matching. After the Bloom filter is built, the hash values of the URLs of all webpages issued on the target website before the preset time period are calculated according to a preset rule and stored in the Bloom filter. Then, when detecting target webpages, the hash values of the URLs of the webpages accessed in the preset time period are calculated according to the same rule and queried in the Bloom filter. If an identical hash value is found, the webpage corresponding to that hash value was issued before the preset time period; conversely, if it is not found, the webpage is a target webpage.
In this embodiment, the hash values of the URLs of the webpages accessed in the preset time period are calculated and queried in the Bloom filter. Compared with matching the URLs directly, this reduces the complexity of the matching query and improves query efficiency.
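The Bloom filter lookup can be illustrated with a small self-contained filter. This is a sketch under stated assumptions: the application does not specify the hash rule, the number of hash functions, or the bit array size, so the salted SHA-256 scheme, `num_bits`, `num_hashes`, and the names `BloomFilter` and `find_targets` are all hypothetical choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over URL strings (illustrative parameters)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )

def find_targets(accessed_urls, bloom):
    """Accessed URLs not found in the filter are treated as target webpages."""
    return [url for url in accessed_urls if url not in bloom]

# Pre-load the URLs issued before the preset period, using the same hash rule.
bloom = BloomFilter()
for old_url in ["/index.html", "/news/1.html"]:
    bloom.add(old_url)
targets = find_targets(["/news/1.html", "/news/2.html"], bloom)
```

A Bloom filter never reports a stored URL as absent, but it can occasionally report an unseen URL as present. In this scheme that means a genuinely new webpage could, with small probability, be skipped as a target; sizing the bit array generously relative to the number of stored URLs keeps that probability low.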
Preferably, the device further includes: a second judging unit, configured to judge, after the page content of the target webpage is crawled, whether the target webpage is a list page according to the page content; and a discarding unit, configured to discard the target webpage when it is judged to be a list page.
The target website may contain some list pages (also called navigation pages), and a list page generally contains only hyperlinks to other webpages, with no actual page content. To keep list pages from affecting the judgment result, in this embodiment of the application, after the page content of a target webpage is crawled, it is judged whether the webpage is a list page. If it is, the list page is discarded and its page content is not parsed, which reduces the amount of data to be parsed.
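The application does not say how a list page is recognized. One plausible heuristic, offered purely as an assumption, is to measure how much of the visible text sits inside hyperlinks: a page made up almost entirely of link text is treated as a list page. The names `is_list_page` and `threshold` and the value 0.8 are hypothetical.

```python
from html.parser import HTMLParser

class _LinkTextCounter(HTMLParser):
    """Counts characters of text inside <a> elements versus all text."""

    def __init__(self):
        super().__init__()
        self._anchor_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        count = len(data.strip())
        self.total_chars += count
        if self._anchor_depth > 0:
            self.link_chars += count

def is_list_page(page_content, threshold=0.8):
    """Heuristic (assumed): list page if most text characters are link text."""
    counter = _LinkTextCounter()
    counter.feed(page_content)
    counter.close()
    if counter.total_chars == 0:
        return False  # empty page: nothing to classify
    return counter.link_chars / counter.total_chars >= threshold

list_html = '<ul><li><a href="/1">First story</a></li><li><a href="/2">Second story</a></li></ul>'
article_html = "<h1>Title</h1><p>" + "Body text of the article. " * 20 + '</p><a href="/">Home</a>'
```

Target webpages for which `is_list_page` returns true would be discarded before the issuing time is parsed. A production rule would likely also need to ignore script and style text, which this sketch counts as ordinary text.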
Preferably, the second resolution unit includes: a first parsing module, configured to parse the issuing time of the target webpage from the page content according to parsing rules configured for the target website; or a second parsing module, configured to parse the issuing time of the target webpage from the page content according to preset parsing rules.
If parsing rules are configured for the target website, the issuing time can be parsed according to those rules when the page content is parsed. If no parsing rules are configured for the target website, the page content can be parsed according to general rules.
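The choice between site-specific and general parsing rules can be sketched as a lookup with a fallback. Everything concrete here is an assumption made for illustration: rules are modeled as pairs of a regular expression and a `strptime` format, and the host name `news.example.com`, the `data-pubtime` attribute, and the names `SITE_RULES`, `GENERAL_RULES`, and `parse_issuing_time` do not come from the application.

```python
import re
from datetime import datetime

# General rules (assumed): common date layouts found in page text.
GENERAL_RULES = [
    (r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", "%Y-%m-%d %H:%M:%S"),
    (r"(\d{4}-\d{2}-\d{2})", "%Y-%m-%d"),
]

# Site-specific rules (hypothetical site and attribute name).
SITE_RULES = {
    "news.example.com": [(r'data-pubtime="(\d{14})"', "%Y%m%d%H%M%S")],
}

def parse_issuing_time(site, page_content):
    """Parse the issuing time using site rules if configured, else general rules."""
    rules = SITE_RULES.get(site, GENERAL_RULES)
    for pattern, time_format in rules:
        match = re.search(pattern, page_content)
        if match is None:
            continue
        try:
            return datetime.strptime(match.group(1), time_format)
        except ValueError:
            continue  # matched text is not a valid date; try the next rule
    return None

configured = parse_issuing_time("news.example.com", '<div data-pubtime="20151214083000">')
fallback = parse_issuing_time("other.example.org", "<p>Published 2015-12-14 08:30:00</p>")
```

Ordering the general rules from most to least specific keeps a full timestamp from being truncated to a bare date when both patterns would match.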
The webpage detection device includes a processor and a memory. The first resolution unit 10, the first determining unit 20, the second resolution unit 30, the first judging unit 40, the second determining unit 50, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory.
The processor includes a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the webpages updated on the target website in the preset time period are detected by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: parsing the access log of a target website in a preset time period to obtain multiple webpages accessed in the preset time period; determining a target webpage from the multiple accessed webpages, the target webpage being a webpage that was not accessed before the preset time period; crawling the page content of the target webpage, and parsing the issuing time of the target webpage from the page content; judging whether the issuing time is within the preset time period; and when the issuing time is judged to be within the preset time period, determining that the target webpage is a webpage updated in the preset time period.
The serial numbers of the above embodiments of the application are for description only and do not represent the merits of the embodiments.
In the above embodiments of the application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the application. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the application, and these improvements and refinements shall also be regarded as falling within the scope of protection of the application.

Claims (10)

1. A page detection method, characterized by comprising:
parsing the access log of a target website in a preset time period to obtain multiple webpages accessed in the preset time period;
determining a target webpage from the multiple accessed webpages, the target webpage being a webpage that was not accessed before the preset time period;
crawling the page content of the target webpage, and parsing the issuing time of the target webpage from the page content;
judging whether the issuing time is within the preset time period; and
when the issuing time is judged to be within the preset time period, determining that the target webpage is a webpage updated in the preset time period.
2. The method according to claim 1, characterized in that determining a target webpage from the multiple accessed webpages comprises:
matching the uniform resource locators (URLs) of the multiple accessed webpages one by one against the URLs of the webpages recorded before the preset time period, and, when the URL of an accessed webpage among the multiple accessed webpages does not match the URL of any webpage on the target website recorded before the preset time period, taking the unmatched accessed webpage as the target webpage.
3. The method according to claim 2, characterized in that matching the URLs of the multiple accessed webpages one by one against the URLs of the webpages recorded before the preset time period, and, when the URL of an accessed webpage among the multiple accessed webpages does not match the URL of any webpage on the target website recorded before the preset time period, taking the unmatched accessed webpage as the target webpage comprises:
hashing the URL of each accessed webpage among the multiple accessed webpages to obtain the hash value of the URL of each accessed webpage;
querying, in a preset Bloom filter, the hash value of the URL of each accessed webpage among the multiple accessed webpages, wherein the Bloom filter stores the hash values of the URLs of the webpages issued on the target website before the preset time period;
taking the webpages corresponding to the hash values that are not found as the target webpages.
4. The method according to claim 1, characterized in that, after the page content of the target webpage is crawled, the method further comprises:
judging whether the target webpage is a list page according to the page content;
when the target webpage is judged to be a list page, discarding the target webpage.
5. The method according to claim 1, characterized in that parsing the issuing time of the target webpage from the page content comprises:
parsing the issuing time of the target webpage from the page content according to parsing rules configured for the target website; or
parsing the issuing time of the target webpage from the page content according to preset parsing rules.
6. A webpage detection device, characterized by comprising:
a first resolution unit, configured to parse the access log of a target website in a preset time period to obtain multiple webpages accessed in the preset time period;
a first determining unit, configured to determine a target webpage from the multiple accessed webpages, the target webpage being a webpage that was not accessed before the preset time period;
a second resolution unit, configured to crawl the page content of the target webpage and parse the issuing time of the target webpage from the page content;
a first judging unit, configured to judge whether the issuing time is within the preset time period; and
a second determining unit, configured to determine, when the issuing time is judged to be within the preset time period, that the target webpage is a webpage updated in the preset time period.
7. The device according to claim 6, characterized in that the first determining unit is specifically configured to match the uniform resource locators (URLs) of the multiple accessed webpages one by one against the URLs of the webpages recorded before the preset time period, and, when the URL of an accessed webpage among the multiple accessed webpages does not match the URL of any webpage on the target website recorded before the preset time period, to take the unmatched accessed webpage as the target webpage.
8. The device according to claim 7, characterized in that the first determining unit comprises:
a coding module, configured to hash the URL of each accessed webpage among the multiple accessed webpages to obtain the hash value of the URL of each accessed webpage;
a query module, configured to query, in a preset Bloom filter, the hash value of the URL of each accessed webpage among the multiple accessed webpages, wherein the Bloom filter stores the hash values of the URLs of the webpages issued on the target website before the preset time period;
a determining module, configured to take the webpages corresponding to the hash values that are not found as the target webpages.
9. The device according to claim 6, characterized in that the device further comprises:
a second judging unit, configured to judge, after the page content of the target webpage is crawled, whether the target webpage is a list page according to the page content;
a discarding unit, configured to discard the target webpage when the target webpage is judged to be a list page.
10. The device according to claim 6, characterized in that the second resolution unit comprises:
a first parsing module, configured to parse the issuing time of the target webpage from the page content according to parsing rules configured for the target website; or
a second parsing module, configured to parse the issuing time of the target webpage from the page content according to preset parsing rules.
CN201510923931.3A 2015-12-14 2015-12-14 Page detection method and device Pending CN106874299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510923931.3A CN106874299A (en) 2015-12-14 2015-12-14 Page detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510923931.3A CN106874299A (en) 2015-12-14 2015-12-14 Page detection method and device

Publications (1)

Publication Number Publication Date
CN106874299A true CN106874299A (en) 2017-06-20

Family

ID=59178259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510923931.3A Pending CN106874299A (en) 2015-12-14 2015-12-14 Page detection method and device

Country Status (1)

Country Link
CN (1) CN106874299A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984121A (en) * 2011-06-02 2013-03-20 富士通株式会社 Access monitoring method and information processing apparatus
US20130144928A1 (en) * 2011-12-05 2013-06-06 Microsoft Corporation Minimal download and simulated page navigation features
CN103177090A (en) * 2013-03-08 2013-06-26 亿赞普(北京)科技有限公司 Topic detection method and device based on big data
CN104182548A (en) * 2014-09-10 2014-12-03 北京国双科技有限公司 Webpage updating and processing method and device
CN104391953A (en) * 2014-11-27 2015-03-04 北京国双科技有限公司 Method and device for detecting web page updating
CN104794193A (en) * 2015-04-17 2015-07-22 南京大学 Webpage increment capture method for valid link acquisition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710834A (en) * 2018-11-16 2019-05-03 北京字节跳动网络技术有限公司 Similar web page detection method, device, storage medium and electronic equipment
CN109710834B (en) * 2018-11-16 2020-01-10 北京字节跳动网络技术有限公司 Similar webpage detection method and device, storage medium and electronic equipment
CN110287393A (en) * 2019-06-26 2019-09-27 深信服科技股份有限公司 A kind of webpage acquisition methods, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170620