CN106874299A - Page detection method and device - Google Patents
Page detection method and device
- Publication number
- CN106874299A (application number CN201510923931.3A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- target web
- accessed
- time period
- preset time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The present application discloses a page detection method and device. The method includes: parsing the access log of a target website within a preset time period to obtain multiple web pages visited within the preset time period; determining a target web page from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period; crawling the page content of the target web page and parsing the publication time of the target web page from the page content; judging whether the publication time falls within the preset time period; and, when the publication time is judged to fall within the preset time period, determining that the target web page is a web page updated within the preset time period. The present application solves the technical problem that detecting updated web pages is inefficient because of the large number of web pages that have to be checked.
Description
Technical field
The present application relates to the Internet field, and in particular to a page detection method and device.
Background
In the Internet field, the web pages on a website are constantly updated, and a website's update volume is an important indicator for evaluating the website's performance. Here, the update volume refers to the number of web pages that the website updates within a given period of time. When counting the update volume, determining which web pages the website updated within that period is a difficult problem. At present, a crawler program typically crawls the web pages on the website, and the pages are then analyzed one by one to decide whether each is an updated page. However, the larger the website whose update volume is to be counted, the more web pages are crawled each time, and most of these pages are not updated pages. As a result, the number of web pages to be checked is large, and the detection of updated web pages is inefficient.
No effective solution to the above problem has yet been proposed.
Summary of the invention
The embodiments of the present application provide a page detection method and device, to at least solve the technical problem that detecting updated web pages is inefficient because of the large number of web pages to be checked.
According to one aspect of the embodiments of the present application, a page detection method is provided, including: parsing the access log of a target website within a preset time period to obtain multiple visited web pages within the preset time period; determining a target web page from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period; crawling the page content of the target web page and parsing the publication time of the target web page from the page content; judging whether the publication time falls within the preset time period; and, when the publication time is judged to fall within the preset time period, determining that the target web page is a web page updated within the preset time period.
Further, determining the target web page from the multiple visited web pages includes: matching, one by one, the uniform resource locators (URLs) of the multiple visited web pages against the URLs of web pages recorded before the preset time period; and, when the URL of a visited web page among the multiple visited web pages matches none of the URLs of web pages on the target website recorded before the preset time period, taking the unmatched visited web page as the target web page.
Further, matching the URLs of the multiple visited web pages one by one against the URLs of web pages recorded before the preset time period, and taking as the target web page a visited web page whose URL matches none of the recorded URLs, includes: hash-encoding the URL of each of the multiple visited web pages to obtain a hash value of the URL of each visited web page; querying a preset Bloom filter for the hash value of the URL of each visited web page, wherein the Bloom filter stores the hash values of the URLs of web pages published on the target website before the preset time period; and taking the web page corresponding to a hash value that is not found as the target web page.
Further, after the page content of the target web page is crawled, the method also includes: judging, according to the page content, whether the target web page is a list page; and discarding the target web page when it is judged to be a list page.
Further, parsing the publication time of the target web page from the page content includes: parsing the publication time of the target web page from the page content according to parsing rules configured for the target website; or parsing the publication time of the target web page from the page content according to preset parsing rules.
According to another aspect of the embodiments of the present application, a web page detection device is also provided, including: a first parsing unit, configured to parse the access log of a target website within a preset time period to obtain multiple visited web pages within the preset time period; a first determining unit, configured to determine a target web page from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period; a second parsing unit, configured to crawl the page content of the target web page and parse the publication time of the target web page from the page content; a first judging unit, configured to judge whether the publication time falls within the preset time period; and a second determining unit, configured to determine, when the publication time is judged to fall within the preset time period, that the target web page is a web page updated within the preset time period.
Further, the first determining unit is specifically configured to match, one by one, the URLs of the multiple visited web pages against the URLs of web pages recorded before the preset time period and, when the URL of a visited web page among the multiple visited web pages matches none of the URLs of web pages on the target website recorded before the preset time period, to take the unmatched visited web page as the target web page.
Further, the first determining unit includes: an encoding module, configured to hash-encode the URL of each of the multiple visited web pages to obtain a hash value of the URL of each visited web page; a query module, configured to query a preset Bloom filter for the hash value of the URL of each visited web page, wherein the Bloom filter stores the hash values of the URLs of web pages published on the target website before the preset time period; and a determining module, configured to take the web page corresponding to a hash value that is not found as the target web page.
Further, the device also includes: a second judging unit, configured to judge, after the page content of the target web page is crawled, whether the target web page is a list page according to the page content; and a discarding unit, configured to discard the target web page when it is judged to be a list page.
Further, the second parsing unit includes: a first parsing module, configured to parse the publication time of the target web page from the page content according to parsing rules configured for the target website; or a second parsing module, configured to parse the publication time of the target web page from the page content according to preset parsing rules.
According to the embodiments of the present application, the access log of the target website within the preset time period is parsed to obtain multiple visited web pages within the preset time period; a target web page is determined from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period; the page content of the target web page is crawled, and the publication time of the target web page is parsed from the page content; whether the publication time falls within the preset time period is judged; and, when it does, the target web page is determined to be a web page updated within the preset time period. Because only the web pages visited within the preset time period are checked, the number of web pages to check is far smaller than when all web pages of the website are crawled as in the prior art. This solves the technical problem that detecting updated web pages is inefficient because of the large number of web pages to be checked, and improves the efficiency of detecting updated web pages.
Brief description of the drawings
The drawings described here are provided for a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the page detection method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the web page detection device according to an embodiment of the present application.
Detailed description
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
According to the embodiments of the present application, a method embodiment of a page detection method is provided. It should be noted that the steps shown in the flow chart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flow chart, in some cases the steps shown or described can be executed in an order different from the one given here.
Fig. 1 is a flow chart of the page detection method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step S102: parse the access log of the target website within the preset time period to obtain multiple visited web pages within the preset time period.
The preset time period is the time period to be checked. For example, to detect the web pages that the target website updated on December 1, 2015, the access log for that day can be picked out of the target website's access log, and the web pages visited on that day can then be parsed from it. The access log of the target website can be obtained from the server of the target website, or collected by monitoring code deployed on the target website. Because updated web pages usually attract the attention and visits of network users, this embodiment, when detecting the web pages updated within the preset time period, first determines the web pages visited within that period so that the web pages visited for the first time can be picked out of them.
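A minimal sketch of step S102 follows. It assumes, hypothetically, that each access log line holds an ISO 8601 timestamp and a URL separated by a tab; the actual log format of the target website or of the monitoring code is not specified in this description, and the name `visited_urls` is illustrative only.

```python
from datetime import datetime

def visited_urls(log_lines, start, end):
    """Return the set of URLs visited at a time t with start <= t < end.

    Assumed (hypothetical) line format: "<ISO timestamp>\t<URL>".
    """
    urls = set()
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        try:
            timestamp = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # skip lines whose timestamp cannot be parsed
        if start <= timestamp < end:
            urls.add(parts[1])
    return urls

log = [
    "2015-11-30T23:59:00\thttp://example.com/a",
    "2015-12-01T08:00:00\thttp://example.com/a",
    "2015-12-01T09:30:00\thttp://example.com/b",
    "2015-12-02T00:00:00\thttp://example.com/c",
    "garbage line",
]
# Preset time period: the whole of December 1, 2015.
day = visited_urls(log, datetime(2015, 12, 1), datetime(2015, 12, 2))
```

Returning a set removes repeat visits to the same URL, so each page is considered only once in the later steps.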
Step S104: determine a target web page from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period.
The web pages visited within the preset time period include both web pages updated within the preset time period and web pages that were already updated before it. In this embodiment, the web pages visited for the first time, that is, the web pages not visited before the preset time period, are therefore picked out of the multiple visited web pages as the target web pages.
Optionally, in this embodiment the web pages visited before the preset time period can be counted and recorded in advance, and each of the multiple visited web pages is then matched against the recorded web pages. If a match is found, the web page was also visited before the preset time period and is therefore not an updated web page. Conversely, if no match is found, the web page may have been updated within the preset time period and is taken as a target web page for further judgment.
Step S106: crawl the page content of the target web page and parse the publication time of the target web page from the page content.
In this embodiment there can be one or more target web pages. If there are several, the page content of each target web page needs to be crawled, and the publication time of the corresponding target web page is parsed from the crawled content. When a website updates a web page, the publication time is usually recorded in the page content, and this publication time is the time at which the web page was updated. Parsing the publication time therefore makes it possible to determine accurately whether the target web page was updated within the preset time period.
Step S108: judge whether the publication time falls within the preset time period.
Step S110: when the publication time is judged to fall within the preset time period, determine that the target web page is a web page updated within the preset time period. When the publication time is judged to fall outside the preset time period, determine that the target web page is not a web page updated within the preset time period.
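Steps S108 and S110 reduce to an interval check. The sketch below assumes the preset time period is the half-open interval [start, end) and treats a page whose publication time could not be parsed as not updated; both choices are assumptions, as is the name `updated_in_period`.

```python
from datetime import datetime

def updated_in_period(publication_time, start, end):
    """Return True if the page was published within [start, end)."""
    if publication_time is None:
        return False  # assumption: pages without a parsable time are not counted
    return start <= publication_time < end

# Preset time period: December 1, 2015.
start, end = datetime(2015, 12, 1), datetime(2015, 12, 2)
# Published on the day itself: counts as updated in the period.
fresh = updated_in_period(datetime(2015, 12, 1, 10, 15), start, end)
# Published months earlier but first visited on December 1: rejected.
stale = updated_in_period(datetime(2015, 6, 1, 9, 0), start, end)
```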
According to the embodiments of the present application, the access log of the target website within the preset time period is parsed to obtain multiple visited web pages within the preset time period; a target web page is determined from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period; the page content of the target web page is crawled, and the publication time of the target web page is parsed from the page content; whether the publication time falls within the preset time period is judged; and, when it does, the target web page is determined to be a web page updated within the preset time period. Because only the web pages visited within the preset time period are checked, the number of web pages to check is far smaller than when all web pages of the website are crawled as in the prior art. This solves the technical problem that detecting updated web pages is inefficient because of the large number of web pages to be checked, and improves the efficiency of detecting updated web pages.
Further, in the embodiments of the present application, the web pages visited for the first time within the preset time period, that is, the target web pages, are judged further: the page content of each target web page is crawled, its publication time is parsed, and the publication time is used to determine the web pages updated within the preset time period. This rejects web pages that were updated long ago but were visited for the first time only within the preset time period, and improves the accuracy of detecting updated web pages.
Preferably, determining the target web page from the multiple visited web pages includes: matching, one by one, the URLs of the multiple visited web pages against the URLs of web pages recorded before the preset time period; and, when the URL of a visited web page among the multiple visited web pages matches none of the URLs of web pages on the target website recorded before the preset time period, taking the unmatched visited web page as the target web page.
In this embodiment, the URLs of the web pages on the target website that were visited before the preset time period are recorded in advance. When determining the target web page, whether a visited web page is a target web page can be judged by matching the uniform resource locator (URL) of the visited web page against the URLs of the web pages recorded before the preset time period. Specifically, the URLs of all visited web pages are parsed from the access log of the target website within the preset time period, and each URL is matched against the pre-recorded URLs. If an identical URL is found, the URL is considered not to belong to a web page visited for the first time within the preset time period, that is, it is not a target web page. Conversely, if no identical URL is found, the web page corresponding to the URL is a target web page.
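The direct URL matching described above can be sketched as a set lookup. The function name `find_target_urls` and the example URLs are illustrative assumptions; the sketch compares URLs as exact strings and does no normalization.

```python
def find_target_urls(visited_urls, recorded_urls):
    """Return the visited URLs that match none of the pre-recorded URLs."""
    recorded = set(recorded_urls)  # constant-time membership tests
    targets = []
    for url in visited_urls:
        if url not in recorded:
            targets.append(url)  # no identical URL recorded: target web page
    return targets

recorded = ["http://example.com/a", "http://example.com/b"]
visited = ["http://example.com/b", "http://example.com/c"]
targets = find_target_urls(visited, recorded)
```

An exact set like this grows with the number of recorded URLs, which is the cost a Bloom filter trades against a small false positive rate.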
More preferably, matching the URLs of the multiple visited web pages one by one against the URLs of web pages recorded before the preset time period, and taking as the target web page a visited web page whose URL matches none of the recorded URLs, includes: hash-encoding the URL of each of the multiple visited web pages to obtain a hash value of the URL of each visited web page; querying a preset Bloom filter for the hash value of the URL of each visited web page, wherein the Bloom filter stores the hash values of the URLs of web pages published on the target website before the preset time period; and taking the web page corresponding to a hash value that is not found as the target web page.
Specifically, a preset Bloom filter can be used for the URL matching. After the Bloom filter has been built, the hash values of the URLs of all web pages published on the target website before the preset time period are calculated according to a preset rule and stored in the Bloom filter. Then, when detecting target web pages, the hash value of the URL of each web page visited within the preset time period is calculated according to the same rule and queried in the Bloom filter. If an identical hash value is found, the web page corresponding to the hash value was published before the preset time period. Conversely, if it is not found, the web page is a target web page.
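A minimal Bloom filter sketch illustrating the query described above. The description only speaks of hashing by a "preset rule"; deriving the bit positions from a SHA-256 digest by double hashing, the fixed sizes, and the class name `BloomFilter` are all assumptions made for illustration. The sketch also hashes the URL inside the filter rather than storing a separately computed hash value.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a single SHA-256 digest (an assumed hashing rule).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )

# URLs known before the preset time period go into the filter.
seen = BloomFilter(num_bits=1 << 16, num_hashes=7)
for url in ["http://example.com/a", "http://example.com/old"]:
    seen.add(url)

# URLs visited within the period that the filter does not contain are targets.
visited = ["http://example.com/a", "http://example.com/new"]
targets = [url for url in visited if url not in seen]
```

A Bloom filter never reports a stored URL as absent, but it can, with small probability, report an unseen URL as present, in which case a genuinely new page would be skipped.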
In this embodiment, the hash values of the URLs of the web pages visited within the preset time period are calculated and queried in the Bloom filter. Compared with matching the URLs directly, this reduces the complexity of the matching query and improves query efficiency.
Further, the Bloom filter has to be built before target web pages are detected, as follows:
First, the scale of the target website, i.e. the total number of URLs of its web pages, is estimated as x. Then the number of elements n that the Bloom filter is to hold is set; n can be determined from x, for example by taking x multiplied by 10 as the estimated number of elements n held in the Bloom filter. An error tolerance (false positive rate) p, such as 0.001%, is entered according to the actual situation.
Then the required storage size m, in bits, is calculated from n and p, and the number of hash functions k is obtained from m and n.
Finally, the Bloom filter is initialized with the above parameters (m, p, k), the URLs already visited are extracted from the system, each URL is hash-encoded, and the resulting hash values are stored in the Bloom filter.
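The sizing formulas themselves do not appear in the text above. Assuming the standard Bloom filter relations are meant, m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, the parameters can be computed as in the following sketch; the function name `bloom_parameters` and the example values are illustrative.

```python
import math

def bloom_parameters(n, p):
    """Return (m, k): bit count and hash-function count for n elements
    at false positive rate p, using the standard Bloom filter formulas."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

# Example: room for one million URLs at an error tolerance of 0.001%.
m, k = bloom_parameters(n=1_000_000, p=0.00001)
```

For these example values the formulas give roughly 24 bits per element and 17 hash functions.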
Preferably, after a query finds that the hash value of the URL of a visited web page is not present, the method also includes: storing the hash value of the URL of the visited web page in the Bloom filter.
In this embodiment, after a target web page is determined, the hash value of its URL can be stored in the Bloom filter, so that the web pages updated within this preset time period are weeded out when updated web pages are detected later.
Preferably, after the page content of the target web page is crawled, the method also includes: judging, according to the page content, whether the target web page is a list page; and discarding the target web page when it is judged to be a list page.
A target website may contain some list pages (also called navigation pages), and a list page usually contains only hyperlinks to other web pages, without actual page content. To keep list pages from affecting the result, in the embodiments of the present application, after the page content of a target web page is crawled, whether the web page is a list page is judged. If it is, the list page is discarded and its page content is not parsed, which reduces the amount of data to be parsed.
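The description does not say how a list page is recognized. One possible heuristic, sketched below purely as an assumption, is to treat a page as a list page when most of its visible text sits inside hyperlinks; the 0.8 threshold and the name `is_list_page` are illustrative, and the sketch does not exclude script or style text.

```python
from html.parser import HTMLParser

class _LinkTextCounter(HTMLParser):
    """Count text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        count = len(data.strip())
        self.total_chars += count
        if self.in_link:
            self.link_chars += count

def is_list_page(html, threshold=0.8):
    """Heuristic: a page whose text is mostly link text is a list page."""
    counter = _LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return False
    return counter.link_chars / counter.total_chars >= threshold

list_html = (
    '<ul><li><a href="/1">First article title</a></li>'
    '<li><a href="/2">Second article title</a></li></ul>'
)
article_html = (
    "<h1>A headline</h1><p>" + "Body text of the article. " * 10
    + '</p><a href="/next">Next</a>'
)
```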
Preferably, parsing the publication time of the target web page from the page content includes: parsing the publication time of the target web page from the page content according to parsing rules configured for the target website; or parsing the publication time of the target web page from the page content according to preset parsing rules.
If the target website is configured with parsing rules, the publication time can be parsed according to those rules when the page content is parsed. If the target website is not configured with parsing rules, the page content can be parsed according to general rules.
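The choice between site-specific and general parsing rules can be sketched as a lookup with a fallback. The rule format (a regular expression plus a date format), the host name `news.example.com`, and the HTML snippets are hypothetical; the description does not define what a parsing rule looks like.

```python
import re
from datetime import datetime

# General rules, tried when the site has no rules of its own (assumed patterns).
GENERAL_RULES = [
    (re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"(\d{4}-\d{2}-\d{2})"), "%Y-%m-%d"),
]

# Rules configured per target website (hypothetical example entry).
SITE_RULES = {
    "news.example.com": [
        (re.compile(r'<span class="pubtime">([^<]+)</span>'), "%Y/%m/%d %H:%M"),
    ],
}

def parse_publication_time(site, html):
    """Return the publication time parsed from html, or None if no rule fits."""
    rules = SITE_RULES.get(site, GENERAL_RULES)
    for pattern, date_format in rules:
        match = pattern.search(html)
        if match:
            try:
                return datetime.strptime(match.group(1).strip(), date_format)
            except ValueError:
                continue  # text matched the pattern but is not a valid date
    return None

configured = parse_publication_time(
    "news.example.com", '<span class="pubtime">2015/12/01 10:15</span>'
)
fallback = parse_publication_time(
    "other.example.com", "<p>Posted 2015-12-01 08:00:00</p>"
)
```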
A preferred implementation of the embodiments of the present application is described below. It specifically includes:
Step 1: Deploy the monitoring code Tracker on the target website. The monitoring code Tracker can be a JS script embedded in the source code of the target website; it sends the access log of users on the website to a designated server;
Step 2: Parse, one by one, the access log entries of the target website within the preset time period that the server has collected;
Step 3: Extract the URLs in the access log, i.e. the URLs of the web pages that users visited within the preset time period;
Step 4: Hash-encode each URL obtained in step 3 to obtain the corresponding hash value, then query the Bloom filter for the hash value to detect whether the URL is present. If it is present, the URL was visited before the preset time period, and the web page is not a newly published (updated) web page. If the URL was not visited before the preset time period, the web page corresponding to the URL is considered a target web page;
Step 5: Finish parsing all the access log entries collected within the preset time period;
Step 6: For the target web pages obtained in step 5, crawl the page content corresponding to each URL with a crawler program. Whereas the prior art crawls almost all the URLs of the whole website, the present application obtains only a small number of target web pages after the preceding steps, so less content is crawled;
Step 7: If the target website is found to be configured with parsing rules, parse the publication date in the crawled page content according to those rules; otherwise parse it according to general rules. Then compare the parsed publication date: if the publication date equals the date on which the web page was visited, that is, it falls within the preset time period, the web page is determined to be a web page updated within the preset time period and its URL is marked accordingly; otherwise the URL is considered not to have been updated within the preset time period;
Step 8: For each web page obtained in step 7, judge whether it is a list page, and if so discard it;
Step 9: Record the URLs counted in step 8 and the corresponding dates;
Step 10: Write the hash values of the URLs of the target web pages determined in step 4 into the Bloom filter.
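The steps above can be sketched as one loop. This is a simplified illustration under stated assumptions: a plain Python set stands in for the Bloom filter, the crawler, the time parser and the list page check are passed in as hypothetical callables (`fetch`, `parse_time`, `is_list_page`), and step 10 is performed eagerly inside the loop so that repeat visits to the same URL are skipped.

```python
from datetime import datetime

def detect_updated_pages(visited_urls, seen, fetch, parse_time, is_list_page,
                         start, end):
    """Return (url, publication_time) pairs for pages updated in [start, end)."""
    updated = []
    for url in visited_urls:
        if url in seen:                 # step 4: visited before, or already handled
            continue
        seen.add(url)                   # step 10, done eagerly
        html = fetch(url)               # step 6: crawl the target page
        published = parse_time(html)    # step 7: parse the publication time
        if published is None or not (start <= published < end):
            continue
        if is_list_page(html):          # step 8: drop list pages
            continue
        updated.append((url, published))  # step 9: record URL and date
    return updated

# Toy stand-ins: page content is "<publication date>|<kind>".
pages = {
    "u/new": "2015-12-01|article",
    "u/old": "2015-06-01|article",
    "u/list": "2015-12-01|list",
}
seen = {"u/known"}
result = detect_updated_pages(
    visited_urls=["u/known", "u/new", "u/old", "u/list", "u/new"],
    seen=seen,
    fetch=pages.__getitem__,
    parse_time=lambda html: datetime.fromisoformat(html.split("|")[0]),
    is_list_page=lambda html: html.endswith("|list"),
    start=datetime(2015, 12, 1),
    end=datetime(2015, 12, 2),
)
```

Only `u/new` survives: `u/known` was visited before the period, `u/old` was published months earlier, and `u/list` is a list page.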
In summary, the embodiments of the present invention can achieve the following technical effects:
1. The server cost and bandwidth cost of crawling the website's updates with a crawler are greatly reduced;
2. Isolated pages (pages that have not been visited) can be effectively identified, which improves the accuracy of the update volume statistics;
3. The added list page judgment further improves accuracy;
4. Using a Bloom filter greatly speeds up the judgment of history pages.
The embodiments of the present application also provide a web page detection device, which can be used to perform the page detection method of the embodiments of the present application. As shown in Fig. 2, the device includes: a first parsing unit 10, a first determining unit 20, a second parsing unit 30, a first judging unit 40 and a second determining unit 50.
The first parsing unit 10 is configured to parse the access log of the target website within the preset time period to obtain multiple visited web pages within the preset time period.
The preset time period is the time period to be checked. For example, to detect the web pages that the target website updated on December 1, 2015, the access log for that day can be picked out of the target website's access log, and the web pages visited on that day can then be parsed from it. The access log of the target website can be obtained from the server of the target website, or collected by monitoring code deployed on the target website. Because updated web pages usually attract the attention and visits of network users, this embodiment, when detecting the web pages updated within the preset time period, first determines the web pages visited within that period so that the web pages visited for the first time can be picked out of them.
The first determining unit 20 is configured to determine a target web page from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period.
The web pages visited within the preset time period include both web pages updated within the preset time period and web pages that were already updated before it. In this embodiment, the web pages visited for the first time, that is, the web pages not visited before the preset time period, are therefore picked out of the multiple visited web pages as the target web pages.
Optionally, in this embodiment the web pages visited before the preset time period can be counted and recorded in advance, and each of the multiple visited web pages is then matched against the recorded web pages. If a match is found, the web page was also visited before the preset time period and is therefore not an updated web page. Conversely, if no match is found, the web page may have been updated within the preset time period and is taken as a target web page for further judgment.
The second parsing unit 30 is configured to crawl the page content of the target web page and parse the publication time of the target web page from the page content.
In this embodiment there can be one or more target web pages. If there are several, the page content of each target web page needs to be crawled, and the publication time of the corresponding target web page is parsed from the crawled content. When a website updates a web page, the publication time is usually recorded in the page content, and this publication time is the time at which the web page was updated. Parsing the publication time therefore makes it possible to determine accurately whether the target web page was updated within the preset time period.
The first judging unit 40 is configured to judge whether the publication time falls within the preset time period.
The second determining unit 50 is configured to determine, when the publication time is judged to fall within the preset time period, that the target web page is a web page updated within the preset time period.
According to the embodiments of the present application, the access log of the target website within the preset time period is parsed to obtain multiple visited web pages within the preset time period; a target web page is determined from the multiple visited web pages, the target web page being a web page that was not visited before the preset time period; the page content of the target web page is crawled, and the publication time of the target web page is parsed from the page content; whether the publication time falls within the preset time period is judged; and, when it does, the target web page is determined to be a web page updated within the preset time period. Because only the web pages visited within the preset time period are checked, the number of web pages to check is far smaller than when all web pages of the website are crawled as in the prior art. This solves the technical problem that detecting updated web pages is inefficient because of the large number of web pages to be checked, and improves the efficiency of detecting updated web pages.
Further, in the embodiment of the present application, a further determination is made for the web
pages accessed for the first time within the preset time period, that is, the target web pages: the
page content of each target web page is crawled, its publication time is parsed, and the publication
time is used to determine whether the page was updated within the preset time period. Web pages that
were updated long ago but were accessed for the first time only within the preset time period are
thereby rejected, which improves the accuracy of detecting updated web pages.
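The rejection of pages that were published earlier but first accessed within the preset time period
can be illustrated with a short sketch. The function name `updated_pages`, the dictionary input and
the half-open interval [start, end) are assumptions made for this example; the application only
requires that the publication time be "in" the preset time period.

```python
from datetime import datetime

def updated_pages(publication_times, start, end):
    """Keep target pages whose publication time lies in [start, end).

    publication_times maps each target page URL to its parsed publication
    time, or to None when no time could be parsed.
    """
    return [
        url
        for url, published in publication_times.items()
        if published is not None and start <= published < end
    ]

# Hypothetical data: the second page was published years before the period.
candidates = {
    "/news/new.html": datetime(2015, 12, 14, 9, 30),
    "/news/old-but-first-visited.html": datetime(2012, 3, 1, 8, 0),
    "/news/no-date.html": None,
}
result = updated_pages(candidates, datetime(2015, 12, 14), datetime(2015, 12, 15))
```

Pages without a parsable publication time are dropped here; whether to drop or keep them is a
design choice the application does not address.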
Preferably, the first determining unit is specifically configured to match the uniform resource
locators (URLs) of the plurality of accessed web pages, one by one, against the URLs of the web
pages recorded before the preset time period, and, when the URL of an accessed web page among the
plurality of accessed web pages does not match any URL of a web page on the target website recorded
before the preset time period, to take the unmatched accessed web page as the target web page.
In the present embodiment, the URLs of the web pages accessed on the target website are recorded in
advance, before the preset time period. When determining target web pages, whether an accessed web
page is a target web page can be judged by matching its uniform resource locator (URL) against the
URLs of the web pages recorded before the preset time period.
Specifically, the URLs of all accessed web pages are parsed from the access log of the target
website within the preset time period, and each URL is matched against the pre-recorded URLs. If an
identical URL is found, the URL is considered not to belong to a web page accessed for the first
time within the preset time period, that is, the page is not a target web page. Conversely, if no
identical URL is found, the web page corresponding to the URL is a target web page.
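The direct URL matching described above can be sketched with a set lookup. The function name
`find_target_pages` and the sample URLs are hypothetical; an in-memory set stands in for whatever
store holds the pre-recorded URLs.

```python
def find_target_pages(accessed, recorded):
    """Return accessed URLs that match no URL recorded before the period."""
    recorded_set = set(recorded)
    # An accessed URL with no identical recorded URL is a target web page.
    return [url for url in accessed if url not in recorded_set]

# Hypothetical data: one URL was already recorded, one was not.
recorded_before = ["/index.html", "/news/1.html"]
accessed_in_period = ["/news/1.html", "/news/2.html"]
targets = find_target_pages(accessed_in_period, recorded_before)
```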
Preferably, the first determining unit includes: a coding module, configured to hash the URL of each
accessed web page among the plurality of accessed web pages to obtain the hash value of the URL of
each accessed web page; a query module, configured to query a preset Bloom filter for the hash value
of the URL of each accessed web page, wherein the Bloom filter stores the hash values of the URLs of
the web pages on the target website that were published before the preset time period; and a
determining module, configured to take the web pages whose hash values are not found as the target
web pages.
Specifically, a preset Bloom filter may be used for URL matching. After the Bloom filter has been
built, the hash values of the URLs of all web pages published on the target website before the
preset time period are calculated according to a preset rule and stored in the Bloom filter. During
detection of target web pages, the hash value of the URL of each web page accessed within the preset
time period is calculated according to the same rule and then queried in the Bloom filter. If an
identical hash value is found, the web page corresponding to the hash value was published before the
preset time period; conversely, if it is not found, the web page is a target web page.
In the present embodiment, the hash values of the URLs of the web pages accessed within the preset
time period are calculated and queried in the Bloom filter. Compared with matching the URLs
directly, this reduces the complexity of the matching query and improves query efficiency.
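A Bloom filter lookup of this kind can be sketched as below. The application does not state the
hash rule or the filter parameters, so the salted SHA-256 hashing, the bit-array size, the number
of hash functions and the names `BloomFilter` and `new_urls` are all assumptions. Note that a Bloom
filter never misses an item that was added but can occasionally report an item that was never added
(a false positive), in which case a new page would be treated as already published.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def new_urls(accessed, published_before):
    """Return accessed URLs whose hashes are not found in the filter."""
    return [url for url in accessed if url not in published_before]

# Hypothetical usage: store URLs published before the period, then query.
seen = BloomFilter()
seen.add("/news/old.html")
targets = new_urls(["/news/old.html", "/news/new.html"], seen)
```

The filter stores only bits, not the URLs themselves, so its memory use stays fixed no matter how
long the URLs are.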
Preferably, the device further includes: a second judging unit, configured to judge, after the page
content of the target web page has been crawled, whether the target web page is a list page
according to the page content; and a discarding unit, configured to discard the target web page when
it is judged to be a list page.
A target website may contain some list pages (also called navigation pages), and a list page usually
contains only hyperlinks to other web pages, without actual page content. To prevent list pages from
affecting the judgment result, in the embodiment of the present application, after the page content
of a target web page has been crawled, it is judged whether the web page is a list page. If it is,
the list page is discarded and its page content is not parsed, which reduces the amount of data to
be parsed.
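The application does not say how a list page is recognized. One possible heuristic, shown here only
as an assumption, is to treat a page as a list page when most of its visible text sits inside
hyperlinks. The name `is_list_page` and the 0.8 threshold are hypothetical.

```python
from html.parser import HTMLParser

class _LinkTextCounter(HTMLParser):
    """Counts characters of text inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n

def is_list_page(html, threshold=0.8):
    """Assumed heuristic: a page is a list page if most text is link text."""
    counter = _LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return False
    return counter.link_chars / counter.total_chars >= threshold
```

A page judged to be a list page would then be dropped before any attempt to parse a publication
time from it.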
Preferably, the second resolution unit includes: a first parsing module, configured to parse the
publication time of the target web page from the page content according to parsing rules configured
for the target website; or a second parsing module, configured to parse the publication time of the
target web page from the page content according to preset parsing rules.
If parsing rules are configured for the target website, the publication time can be parsed according
to those rules when the page content is parsed. If no parsing rules are configured for the target
website, the publication time can be parsed according to general rules.
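The choice between a site-specific rule and a general rule can be sketched as follows. The form of
the rules is not given in the application, so this example assumes that a site rule is a regular
expression whose first group isolates the timestamp text, and that the general rule is the first
date of the form YYYY-MM-DD (optionally followed by HH:MM) found in the page. The name
`parse_publication_time` is hypothetical.

```python
import re
from datetime import datetime

# Assumed general rule: first "YYYY-MM-DD" or "YYYY-MM-DD HH:MM" in the text.
GENERAL_RULE = re.compile(r"(\d{4})-(\d{2})-(\d{2})(?:[ T](\d{2}):(\d{2}))?")

def parse_publication_time(page_content, site_rule=None):
    """Use the site's rule if one is configured, else the general rule."""
    text = page_content
    if site_rule is not None:
        m = re.search(site_rule, page_content)
        if m is None:
            return None
        text = m.group(1)  # the site rule narrows the search to one element
    m = GENERAL_RULE.search(text)
    if m is None:
        return None
    year, month, day, hour, minute = (int(g) if g else 0 for g in m.groups())
    try:
        return datetime(year, month, day, hour, minute)
    except ValueError:
        return None  # digits matched but do not form a valid date

# Hypothetical page and site rule.
page = '<p>Founded 2010-01-01</p><span class="pub">2015-12-14 09:30</span>'
rule = r'class="pub">([^<]+)<'
```

In this example the general rule alone would pick up the unrelated founding date, which is why a
rule configured for the site is preferred when one exists.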
The web page detection device includes a processor and a memory. The first resolution unit 10, the
first determining unit 20, the second resolution unit 30, the first judging unit 40, the second
determining unit 50 and so on are stored in the memory as program units, and the processor executes
the program units stored in the memory.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One
or more kernels may be provided, and the web pages that the target website updated within the preset
time period are detected by adjusting kernel parameters.
The memory may include computer-readable media in the form of volatile memory, such as random access
memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash
RAM). The memory includes at least one memory chip.
The present application also provides an embodiment of a computer program product which, when
executed on a data processing device, is adapted to execute program code initialized with the
following method steps: parsing the access log of a target website within a preset time period to
obtain a plurality of web pages accessed within the preset time period; determining a target web
page from the plurality of accessed web pages, the target web page being a web page that was not
accessed before the preset time period; crawling the page content of the target web page and parsing
the publication time of the target web page from the page content; judging whether the publication
time falls within the preset time period; and, when the publication time is judged to fall within
the preset time period, determining that the target web page is a web page updated within the preset
time period.
The serial numbers of the above embodiments of the present application are for description only and
do not indicate the relative merits of the embodiments.
In the above embodiments of the present application, the description of each embodiment has its own
emphasis. For parts not described in detail in one embodiment, reference may be made to the related
descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the
disclosed technical content may be implemented in other ways. The device embodiments described above
are merely illustrative. For example, the division into units may be a division by logical function,
and other divisions are possible in an actual implementation: multiple units or components may be
combined or integrated into another system, or some features may be omitted or not performed. In
addition, the mutual coupling, direct coupling or communication connection shown or discussed may be
an indirect coupling or communication connection through some interfaces, units or modules, and may
be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components
shown as units may or may not be physical units; that is, they may be located in one place or
distributed over multiple units. Some or all of the units may be selected according to actual needs
to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated
into one processing unit, each unit may exist physically on its own, or two or more units may be
integrated into one unit. The integrated unit may be implemented in the form of hardware or in the
form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as
an independent product, it may be stored in a computer-readable storage medium. Based on this
understanding, the technical solution of the present application in essence, or the part that
contributes to the prior art, or all or part of the technical solution, may be embodied in the form
of a software product. The computer software product is stored in a storage medium and includes
several instructions for causing a computer device (which may be a personal computer, a server, a
network device or the like) to perform all or some of the steps of the methods described in the
embodiments of the present application. The aforementioned storage medium includes various media
that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access
memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that a
person of ordinary skill in the art may make several improvements and refinements without departing
from the principle of the present application, and these improvements and refinements shall also be
regarded as falling within the protection scope of the present application.
Claims (10)
1. A page detection method, characterized by comprising:
parsing an access log of a target website within a preset time period to obtain a plurality of web
pages accessed within the preset time period;
determining a target web page from the plurality of accessed web pages, the target web page being a
web page that was not accessed before the preset time period;
crawling page content of the target web page, and parsing a publication time of the target web page
from the page content;
judging whether the publication time falls within the preset time period; and
when the publication time is judged to fall within the preset time period, determining that the
target web page is a web page updated within the preset time period.
2. The method according to claim 1, characterized in that determining a target web page from the
plurality of accessed web pages comprises:
matching the uniform resource locators of the plurality of accessed web pages, one by one, against
the uniform resource locators of the web pages recorded before the preset time period, and, when the
uniform resource locator of an accessed web page among the plurality of accessed web pages does not
match any uniform resource locator of a web page on the target website recorded before the preset
time period, taking the unmatched accessed web page as the target web page.
3. The method according to claim 2, characterized in that matching the uniform resource locators of
the plurality of accessed web pages, one by one, against the uniform resource locators of the web
pages recorded before the preset time period, and, when the uniform resource locator of an accessed
web page among the plurality of accessed web pages does not match any uniform resource locator of a
web page on the target website recorded before the preset time period, taking the unmatched accessed
web page as the target web page comprises:
hashing the uniform resource locator of each accessed web page among the plurality of accessed web
pages to obtain a hash value of the uniform resource locator of each accessed web page;
querying a preset Bloom filter for the hash value of the uniform resource locator of each accessed
web page, wherein the Bloom filter stores the hash values of the uniform resource locators of the
web pages on the target website that were published before the preset time period; and
taking the web page corresponding to a hash value that is not found as the target web page.
4. The method according to claim 1, characterized in that, after crawling the page content of the
target web page, the method further comprises:
judging whether the target web page is a list page according to the page content; and
when the target web page is judged to be a list page, discarding the target web page.
5. The method according to claim 1, characterized in that parsing the publication time of the target
web page from the page content comprises:
parsing the publication time of the target web page from the page content according to parsing rules
configured for the target website; or
parsing the publication time of the target web page from the page content according to preset
parsing rules.
6. A web page detection device, characterized by comprising:
a first resolution unit, configured to parse an access log of a target website within a preset time
period to obtain a plurality of web pages accessed within the preset time period;
a first determining unit, configured to determine a target web page from the plurality of accessed
web pages, the target web page being a web page that was not accessed before the preset time period;
a second resolution unit, configured to crawl page content of the target web page and to parse a
publication time of the target web page from the page content;
a first judging unit, configured to judge whether the publication time falls within the preset time
period; and
a second determining unit, configured to determine, when the publication time is judged to fall
within the preset time period, that the target web page is a web page updated within the preset time
period.
7. The device according to claim 6, characterized in that the first determining unit is specifically
configured to match the uniform resource locators of the plurality of accessed web pages, one by
one, against the uniform resource locators of the web pages recorded before the preset time period,
and, when the uniform resource locator of an accessed web page among the plurality of accessed web
pages does not match any uniform resource locator of a web page on the target website recorded
before the preset time period, to take the unmatched accessed web page as the target web page.
8. The device according to claim 7, characterized in that the first determining unit comprises:
a coding module, configured to hash the uniform resource locator of each accessed web page among the
plurality of accessed web pages to obtain a hash value of the uniform resource locator of each
accessed web page;
a query module, configured to query a preset Bloom filter for the hash value of the uniform resource
locator of each accessed web page, wherein the Bloom filter stores the hash values of the uniform
resource locators of the web pages on the target website that were published before the preset time
period; and
a determining module, configured to take the web page corresponding to a hash value that is not
found as the target web page.
9. The device according to claim 6, characterized in that the device further comprises:
a second judging unit, configured to judge, after the page content of the target web page has been
crawled, whether the target web page is a list page according to the page content; and
a discarding unit, configured to discard the target web page when the target web page is judged to
be a list page.
10. The device according to claim 6, characterized in that the second resolution unit comprises:
a first parsing module, configured to parse the publication time of the target web page from the
page content according to parsing rules configured for the target website; or
a second parsing module, configured to parse the publication time of the target web page from the
page content according to preset parsing rules.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510923931.3A CN106874299A (en) | 2015-12-14 | 2015-12-14 | Page detection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106874299A true CN106874299A (en) | 2017-06-20 |
Family
ID=59178259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510923931.3A Pending CN106874299A (en) | 2015-12-14 | 2015-12-14 | Page detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106874299A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102984121A (en) * | 2011-06-02 | 2013-03-20 | 富士通株式会社 | Access monitoring method and information processing apparatus |
US20130144928A1 (en) * | 2011-12-05 | 2013-06-06 | Microsoft Corporation | Minimal download and simulated page navigation features |
CN103177090A (en) * | 2013-03-08 | 2013-06-26 | 亿赞普(北京)科技有限公司 | Topic detection method and device based on big data |
CN104182548A (en) * | 2014-09-10 | 2014-12-03 | 北京国双科技有限公司 | Webpage updating and processing method and device |
CN104391953A (en) * | 2014-11-27 | 2015-03-04 | 北京国双科技有限公司 | Method and device for detecting web page updating |
CN104794193A (en) * | 2015-04-17 | 2015-07-22 | 南京大学 | Webpage increment capture method for valid link acquisition |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710834A (en) * | 2018-11-16 | 2019-05-03 | 北京字节跳动网络技术有限公司 | Similar web page detection method, device, storage medium and electronic equipment |
CN109710834B (en) * | 2018-11-16 | 2020-01-10 | 北京字节跳动网络技术有限公司 | Similar webpage detection method and device, storage medium and electronic equipment |
CN110287393A (en) * | 2019-06-26 | 2019-09-27 | 深信服科技股份有限公司 | A kind of webpage acquisition methods, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170620 |