CN103383665B - Method and device for caching data in URL data crawling - Google Patents

Method and device for caching data in URL data crawling

Info

Publication number
CN103383665B
Authority
CN
China
Prior art keywords
bloomfilter
storage container
data
storage
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310293574.8A
Other languages
Chinese (zh)
Other versions
CN103383665A (en)
Inventor
韩孟岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201610237936.5A (CN105930405B)
Priority to CN201310293574.8A (CN103383665B)
Publication of CN103383665A
Application granted
Publication of CN103383665B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a method for caching data in URL data crawling, comprising: periodically crawling URL data; caching the URL data from each crawl, in order, in both a first Bloom filter (Bloomfilter) storage container and a second Bloomfilter storage container, wherein the first Bloomfilter storage container and the second Bloomfilter storage container have the same storage capacity; during storage of the URL data, monitoring the amount of URL data stored in the first Bloomfilter storage container and the second Bloomfilter storage container; and, according to the monitored data storage state, emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn. The invention trades space for time, improves data stability, avoids business fluctuations, effectively narrows the range of system fluctuation, and reduces the impact on other modules of the system.

Description

Method and device for caching data in URL data crawling
Technical field
The present invention relates to the field of the Internet, and in particular to a method and device for caching data in URL data crawling.
Background
In a web crawling system, the crawling of most web pages is governed by a period parameter: for example, a page is considered for a refresh crawl only after at least some interval has elapsed. Crawling too frequently wastes crawling resources and puts unnecessary pressure on the target website. Because memory is generally limited, the straightforward way to handle this endless data stream is to set a time window, clear out the data that precedes the window, and free space for the new data that is about to arrive. However, clearing all data before the time window in one pass makes the data itself fluctuate sharply, which can easily have a significant impact on the business.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method, and a corresponding device, for caching data in URL data crawling that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for caching data in URL data crawling is provided, comprising:
Periodically crawling URL data;
Caching the URL data from each crawl, in order, in both a first Bloom filter (Bloomfilter) storage container and a second Bloomfilter storage container, wherein the first Bloomfilter storage container and the second Bloomfilter storage container have the same storage capacity;
During storage of the URL data, monitoring the amount of URL data stored in the first Bloomfilter storage container and the second Bloomfilter storage container;
According to the monitored data storage state, emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn.
Optionally, emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage state comprises:
When the amount of data stored in the first Bloomfilter storage container reaches a preset threshold for the first time, emptying the second Bloomfilter storage container.
Optionally, emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage state further comprises:
After the second Bloomfilter storage container has been emptied for the first time,
When the amount of data stored in the second Bloomfilter storage container reaches the preset threshold again, emptying the first Bloomfilter storage container; and
When the amount of data stored in the first Bloomfilter storage container reaches the preset threshold again, emptying the second Bloomfilter storage container.
Optionally, the preset threshold is 1/2 of the storage capacity.
Optionally, the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container is adjusted according to changes in the period at which URL data is crawled.
According to another aspect of the present invention, a device for caching data in URL data crawling is provided, comprising:
A data crawler, configured to periodically crawl URL data;
A first Bloom filter (Bloomfilter) storage container, configured to cache, in order, the URL data crawled each time by the data crawler;
A second Bloomfilter storage container, having the same capacity as the first Bloomfilter storage container and configured to cache, in order and in sync with the first Bloomfilter storage container, the URL data crawled each time by the data crawler;
A monitor, configured to monitor, during storage of the URL data, the amount of URL data stored in the first Bloomfilter storage container and the second Bloomfilter storage container;
A data clearer, configured to empty the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the data storage state monitored by the monitor.
Optionally, the data clearer is further configured to:
When the monitor detects that the amount of data stored in the first Bloomfilter storage container reaches a preset threshold for the first time, empty the second Bloomfilter storage container.
Optionally, the data clearer is further configured to:
After the second Bloomfilter storage container has been emptied for the first time,
When the monitor detects that the amount of data stored in the second Bloomfilter storage container reaches the preset threshold again, empty the first Bloomfilter storage container; and
When the monitor detects that the amount of data stored in the first Bloomfilter storage container reaches the preset threshold again, empty the second Bloomfilter storage container.
Optionally, the preset threshold is 1/2 of the storage capacity.
Optionally, the device further comprises:
A capacity regulator, configured to adjust the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container according to changes in the period at which the data crawler crawls URL data.
The method and device provided by the embodiments of the present invention achieve the following beneficial effects:
In the embodiments of the present invention, URL data is crawled periodically, so it arrives as a continuous data stream and its total amount grows in a streaming fashion. The URL data from each crawl is cached, in order, in both the first Bloomfilter storage container and the second Bloomfilter storage container, so the data in the two containers is synchronized and the containers are redundant copies of each other. During storage, the amount of URL data stored in the first and second Bloomfilter storage containers is monitored, and the second and first Bloomfilter storage containers are emptied in turn according to the monitoring result. As this analysis shows, the embodiments use two Bloomfilter storage containers for URL data storage rather than only one. Accordingly, when data is deleted, the second and first Bloomfilter storage containers are emptied in turn; that is, each emptying removes only part of the URL data and retains the rest. The time-order attribute of the data is thereby converted into a spatial-order attribute, and the cleanup procedure is simple. Because the embodiments never clear all data at once, they improve data stability, avoid business fluctuations, effectively narrow the range of system fluctuation, and reduce the impact on other modules of the system.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a processing flowchart of a method for caching data in URL data crawling according to an embodiment of the present invention;
Fig. 2 shows a first structural diagram of a device for caching data in URL data crawling according to an embodiment of the present invention; and
Fig. 3 shows a second structural diagram of a device for caching data in URL data crawling according to an embodiment of the present invention.
Detailed description of the embodiments
The algorithms and displays provided herein are not inherently tied to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Nor is the present invention directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the descriptions of specific languages above are given to disclose the best mode of the invention.
To solve the above technical problem, the present invention uses a Bloom filter to provide an inventive concept for caching data in URL data crawling. The advantage of a Bloom filter is that its space efficiency and query time far exceed those of general algorithms; its drawbacks are a certain false positive rate and the difficulty of deleting elements. A Bloom filter is therefore unsuited to clearing part of its contents by means of a time window, and the usual cleanup approach is to empty the whole Bloomfilter. That approach, however, makes the data fluctuate sharply and can easily have a significant impact on the business.
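As background for the false positive rate mentioned above, the standard textbook estimate for a Bloom filter is sketched below. The symbols $m$ (number of bits), $k$ (number of hash functions) and $N$ (number of inserted URLs) are notation assumed here for illustration and do not appear in the patent text.

```latex
p \approx \left(1 - e^{-kN/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{N}\,\ln 2
```

For fixed $m$ and $k$, $p$ rises as $N$ grows, so a filter that is never emptied would eventually report almost every URL as already seen. Because several URLs can set the same bit, resetting the bits of one URL could also erase others, which is why cleanup operates on whole filters rather than on individual entries.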
On this basis, the present invention further refines the inventive concept so that not all data is deleted in each cleanup, which smooths the fluctuation caused by the cleanup operation.
Based on the above inventive concept, an embodiment of the present invention provides a method for caching data in URL (Uniform Resource Locator) data crawling. Fig. 1 shows a processing flowchart of a method for caching data in URL data crawling according to an embodiment of the present invention. Referring to Fig. 1, the method comprises at least steps S102 to S108.
Step S102: periodically crawl URL data.
The URL data crawling here applies mainly to the spider crawling mode.
Step S104: cache the URL data from each crawl, in order, in both the first Bloomfilter storage container and the second Bloomfilter storage container.
The first Bloomfilter storage container and the second Bloomfilter storage container have the same storage capacity. Note that "first" and "second" here serve only to identify and distinguish the different Bloomfilter storage containers; the two are essentially identical, and "first" and "second" do not imply any ordering.
In practice, the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container is not fixed. The crawl period mentioned in step S102 may change: the crawl interval may become shorter or longer. For example, if the initially set crawl period is once every 10 ms, the period after a change might be once every 5 ms or once every 15 ms; the specific change depends on actual conditions and is not limited to the numbers in this example.
When the crawl interval shortens, URL data is crawled more often and the total amount of crawled URL data grows; accordingly, the storage capacity of the first and second Bloomfilter storage containers is increased moderately, in proportion. Likewise, when the crawl interval lengthens, URL data is crawled less often and the total amount of crawled URL data shrinks; accordingly, the storage capacity of the first and second Bloomfilter storage containers is reduced moderately, in proportion.
A concrete example illustrates this. Suppose the initially set crawl period is once every 10 ms and the period after the change is once every 5 ms. Within the same length of time, the total amount of URL data crawled after the change is twice the total amount crawled before the change. To store the crawled URL data in time, the storage capacity of the first and second Bloomfilter storage containers can be increased to twice the original capacity. Note that twice is only a moderate increase ratio, not a fixed one; the capacity of both containers may also be set higher, and no limitation is imposed here.
A further example illustrates the case in which the crawl period lengthens. Suppose the initially set crawl period is once every 10 ms and the period after the change is once every 20 ms. Within the same length of time, the total amount of URL data crawled after the change is 1/2 of the total amount crawled before the change. To save storage space and avoid wasting resources, the storage capacity of the first and second Bloomfilter storage containers can be reduced to 1/2 of the original capacity. Note that 1/2 is only a moderate reduction ratio, not a fixed one; the capacity of both containers may also be set lower, and no limitation is imposed here.
As the above analysis shows, the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container is not fixed and can be adjusted according to changes in the period at which URL data is crawled.
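The proportional adjustment described above can be sketched as a small helper. This is a minimal illustration, assuming strictly inverse-proportional scaling; the function name `scaled_capacity` and its parameters are hypothetical, and the text itself allows ratios other than the exact proportional one.

```python
def scaled_capacity(capacity: int, old_period_ms: float, new_period_ms: float) -> int:
    """Scale a container capacity in inverse proportion to the crawl period.

    Assumption for illustration: halving the period doubles the capacity and
    doubling the period halves it. The patent calls these ratios "moderate"
    and not fixed, so a real system may pick a larger or smaller factor.
    """
    if old_period_ms <= 0 or new_period_ms <= 0:
        raise ValueError("crawl periods must be positive")
    return max(1, round(capacity * old_period_ms / new_period_ms))


# The two examples from the text: 10 ms -> 5 ms and 10 ms -> 20 ms.
grown = scaled_capacity(1000, old_period_ms=10, new_period_ms=5)    # 2000
shrunk = scaled_capacity(1000, old_period_ms=10, new_period_ms=20)  # 500
```

Both containers would receive the same new capacity, since the method requires their capacities to stay equal.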
Step S106: during storage of the URL data, monitor the amount of URL data stored in the first Bloomfilter storage container and the second Bloomfilter storage container.
Step S108: according to the monitored data storage state, empty the second Bloomfilter storage container and the first Bloomfilter storage container in turn.
In the embodiments of the present invention, URL data is crawled periodically, so it arrives as a continuous data stream and its total amount grows in a streaming fashion. The URL data from each crawl is cached, in order, in both the first Bloomfilter storage container and the second Bloomfilter storage container, so the data in the two containers is synchronized and the containers are redundant copies of each other. During storage, the amount of URL data stored in the first and second Bloomfilter storage containers is monitored, and the second and first Bloomfilter storage containers are emptied in turn according to the monitoring result. As this analysis shows, the embodiments use two Bloomfilter storage containers for URL data storage rather than only one. Accordingly, when data is deleted, the second and first Bloomfilter storage containers are emptied in turn; that is, each emptying removes only part of the URL data and retains the rest. The time-order attribute of the data is thereby converted into a spatial-order attribute, and the cleanup procedure is simple. Because the embodiments never clear all data at once, they improve data stability, avoid business fluctuations, effectively narrow the range of system fluctuation, and reduce the impact on other modules of the system.
Furthermore, compared with the time-window cleanup mentioned in the prior art, the embodiments of the present invention need no time window and need not record a time attribute for the data, which saves storage overhead.
In implementation, the first Bloomfilter storage container and the second Bloomfilter storage container are both initialized as blank containers that store no URL data. The two containers store the crawled URL data synchronously. Specifically, when crawled URL data is written to the storage window, it must be written to the first and the second Bloomfilter storage container at the same time (a double write), so the data in the two containers grows in step. For example, to save URL1, the following operations are performed:
1. Write URL1 to the first Bloomfilter storage container;
2. Write URL1 to the second Bloomfilter storage container;
3. The write of URL1 is complete.
After the write of URL1 completes, URL1 can be found by a query on either the first or the second Bloomfilter storage container. In other words, URL1 is written redundantly.
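The double write above can be sketched with a deliberately small Bloom filter. This is an illustrative sketch, not the patented implementation: the class `BloomFilter`, the helper `save_url`, the bit-array size, the number of hash functions and the SHA-256 based hashing are all assumptions made here.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter (hypothetical sizing, for illustration only)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def save_url(url: str, first: BloomFilter, second: BloomFilter) -> None:
    """Steps 1 to 3: write the URL to both containers, then the write is done."""
    first.add(url)
    second.add(url)


first, second = BloomFilter(), BloomFilter()
save_url("http://example.com/URL1", first, second)
found_in_both = ("http://example.com/URL1" in first
                 and "http://example.com/URL1" in second)  # True
```

A Bloom filter never reports a stored item as absent, so after the double write the URL is guaranteed to be found in both containers.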
Consequently, the first Bloomfilter storage container and the second Bloomfilter storage container reach the preset threshold for the first time simultaneously. The preset threshold may be 1/2, 1/3 or another fraction of the total capacity of either Bloomfilter storage container. At initialization, the second Bloomfilter storage container may be designated the older container; then, when the amount of data stored in the first Bloomfilter storage container reaches the preset threshold for the first time, the second Bloomfilter storage container is emptied. Of course, if the first Bloomfilter storage container is designated the older container at initialization, the first Bloomfilter storage container is the one emptied first.
As stated in step S108, the second Bloomfilter storage container and the first Bloomfilter storage container are emptied in turn. Therefore, after the second Bloomfilter storage container is emptied for the first time, the first Bloomfilter storage container is emptied next, then the second again, and this order repeats, so that the two containers are emptied in turn. The trigger for every emptying is the amount of stored URL data reaching the preset threshold again. Specifically, after the second Bloomfilter storage container has been emptied for the first time, the first Bloomfilter storage container is emptied when the amount of data stored in the second Bloomfilter storage container reaches the preset threshold again. Correspondingly, the second Bloomfilter storage container is emptied when the amount of data stored in the first Bloomfilter storage container reaches the preset threshold again.
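The alternating emptying rule can be sketched as follows. This is a sketch under stated assumptions rather than the patented implementation: the class names `CountingBloomContainer` and `DualBloomCache`, the per-container counter used as the monitored "amount of data stored", the filter sizing, and the choice to skip a URL that the fuller container already reports as present are all hypothetical.

```python
import hashlib


class CountingBloomContainer:
    """Small Bloom filter that also counts the items written to it."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
        self.count = 0  # the monitored amount of stored data

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def clear(self) -> None:
        self.bits = bytearray(len(self.bits))
        self.count = 0


class DualBloomCache:
    """Two equally sized containers, written together and emptied in turn."""

    def __init__(self, capacity: int) -> None:
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.threshold = capacity // 2  # preset threshold: 1/2 of capacity
        self.first = CountingBloomContainer()
        self.second = CountingBloomContainer()
        # Watch the first container; the second is designated the older one.
        self._watched, self._older = self.first, self.second

    def seen(self, url: str) -> bool:
        # The container that is emptied next holds everything still retained.
        return url in self._older

    def add(self, url: str) -> bool:
        """Store url in both containers; return False for an ignored duplicate."""
        if self.seen(url):
            return False
        self.first.add(url)
        self.second.add(url)
        if self._watched.count >= self.threshold:
            self._older.clear()  # empty the other container
            self._watched, self._older = self._older, self._watched
        return True


cache = DualBloomCache(capacity=8)
for i in range(4):
    cache.add(f"http://example.com/page{i}")
# The first container has reached 4 = capacity / 2, so the second was emptied.
```

After each emptying the roles swap: the freshly emptied container becomes the one being watched, and the container still holding the older data becomes the next one to be emptied.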
Note that the method for caching data in URL data crawling provided by the embodiments of the present invention has its own scope of application; it mainly suits scenarios in which data is evicted in time order:
1. First, the data is ordered, having a time attribute or a before/after position attribute;
2. Second, the scope observed or acted upon is limited; that is, the size of the usable window is bounded;
3. Within the window (that is, the data storage period), a single data item is allowed to appear at most once; repeated occurrences beyond the first are ignored.
To set out the method provided by the embodiments of the present invention more clearly, a specific example is now described.
This example uses two Bloomfilter storage containers, each of capacity C, labeled container A and container B. The total amount of data stored in the whole space is denoted n, the amount of data stored in container A is denoted na, and the amount of data stored in container B is denoted nb.
In this example, the two containers provide the data storage service simultaneously. In the initial stage, the total amount of data stored in the whole space is n = 0, the amount of data in container A is na = 0, the amount of data in container B is nb = 0, and container B is designated the older container.
The cleanup of the two containers is now described as the total amount n of data stored in the whole space changes; during storage, n changes as data is stored.
When n < C/2, the amount of data in container A is na < C/2 and the amount of data in container B is nb < C/2. Neither container has reached the preset threshold, so data storage continues.
When n = C/2, the amounts of data stored in containers A and B both reach the preset threshold C/2. Because container B was initially designated the older container, the data in container B is emptied. After container B is emptied, n = C/2, na = C/2 and nb = 0.
Data storage continues. While C/2 <= n < C, the amount of data in container A is C/2 <= na < C and the amount of data in container B is nb < C/2. Neither container newly reaches the preset threshold at this stage (container A is already past it and container B has not yet reached it again), so data storage continues.
When n = C, na = C and nb = C/2. The amount of data stored in container B has now reached the preset threshold C/2, so the data in container A is emptied. After container A is emptied, n = C/2, na = 0 and nb = C/2, and data storage continues.
While C/2 <= n < C, the amount of data in container A is na < C/2 and the amount of data in container B is C/2 <= nb < C. Again, neither container newly reaches the preset threshold at this stage, so data storage continues.
When n = C, na = C/2 and nb = C. The amount of data stored in container A has now reached the preset threshold C/2, so the data in container B is emptied. After container B is emptied, n = C/2, na = C/2 and nb = 0.
Data storage continues, and the emptying operations above repeat as the value of n changes.
To make the emptying process, the time point of each operation, and the underlying principle clearer, they are summarized in Table 1.
Table 1 (n is the value that triggers each stage; na and nb are the amounts after any emptying at that stage)
Stage   n               na (container A)   nb (container B)
1       n < C/2         na < C/2           nb < C/2
2       n = C/2         na = C/2           nb = 0
3       C/2 <= n < C    C/2 <= na < C      nb < C/2
4       n = C           na = 0             nb = C/2
5       C/2 <= n < C    na < C/2           C/2 <= nb < C
6       n = C           na = C/2           nb = 0
As Table 1 shows, when container A and container B reach the preset threshold for the first time, container B is emptied; thereafter, whenever container A or container B reaches the preset threshold again, the other container is emptied.
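The stages of the worked example can be reproduced with plain counters. The sketch below tracks only na and nb, not real Bloom filters, and assumes the threshold C/2 and container B as the older container; the function name `simulate` and the concrete value C = 8 are hypothetical choices for illustration.

```python
def simulate(capacity: int, inserts: int):
    """Trace the emptying events for two counters written in lockstep.

    Returns one tuple per emptying:
    (insert number, n before emptying, container emptied, na after, nb after).
    """
    threshold = capacity // 2
    counts = {"A": 0, "B": 0}
    watched, other = "A", "B"  # B is the older container, so it is emptied first
    events = []
    for i in range(1, inserts + 1):
        counts["A"] += 1  # every item is written to both containers
        counts["B"] += 1
        if counts[watched] >= threshold:
            n_before = max(counts.values())  # data retained in the whole space
            counts[other] = 0
            events.append((i, n_before, other, counts["A"], counts["B"]))
            watched, other = other, watched
    return events


events = simulate(capacity=8, inserts=12)
# [(4, 4, 'B', 4, 0), (8, 8, 'A', 0, 4), (12, 8, 'B', 4, 0)]
```

With C = 8 the three events correspond to stages 2, 4 and 6 of the worked example: emptying is triggered at n = C/2, then at n = C, then at n = C again, and each emptying leaves C/2 items in the surviving container.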
Based on the same inventive concept, an embodiment of the present invention provides a device for caching data in URL data crawling. Fig. 2 shows a first structural diagram of a device for caching data in URL data crawling according to an embodiment of the present invention. Referring to Fig. 2, the device comprises at least:
Data crawler 210, configured to periodically crawl URL data;
First Bloomfilter storage container 220, coupled to the data crawler 210 and configured to cache, in order, the URL data crawled each time by the data crawler 210;
Second Bloomfilter storage container 230, having the same capacity as the first Bloomfilter storage container 220, also coupled to the data crawler 210, and configured to cache, in order and in sync with the first Bloomfilter storage container 220, the URL data crawled each time by the data crawler 210;
Monitor 240, coupled to the first Bloomfilter storage container 220 and the second Bloomfilter storage container 230 respectively, and configured to monitor, during storage of the URL data, the amount of URL data stored in the first Bloomfilter storage container 220 and the second Bloomfilter storage container 230;
Data clearer 250, configured to empty the second Bloomfilter storage container 230 and the first Bloomfilter storage container 220 in turn according to the data storage state monitored by the monitor 240.
In the embodiments of the present invention, each of the above components can be implemented with actual hardware. The prior art provides various memories (such as RAM, ROM, EPROM and flash memory), monitors (such as heartbeat devices), data clearers (such as data erasure devices), data crawlers, and so on. What the present invention protects is the composition and structure of the parts of the device for caching data in URL data crawling.
In a preferred embodiment, the data clearer 250 may further be configured to:
When the monitor 240 detects that the amount of data stored in the first Bloomfilter storage container 220 reaches the preset threshold for the first time, empty the second Bloomfilter storage container 230.
In a preferred embodiment, the data clearer 250 may further be configured to:
After the second Bloomfilter storage container 230 has been emptied for the first time,
When the monitor 240 detects that the amount of data stored in the second Bloomfilter storage container 230 reaches the preset threshold again, empty the first Bloomfilter storage container 220; and
When the monitor 240 detects that the amount of data stored in the first Bloomfilter storage container 220 reaches the preset threshold again, empty the second Bloomfilter storage container 230.
In a preferred embodiment, the preset threshold is 1/2 of the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container.
Fig. 3 shows a second structural diagram of a device for caching data in URL data crawling according to an embodiment of the present invention. In a preferred embodiment, referring to Fig. 3, the device for caching data in URL data crawling may, in addition to the components shown in Fig. 2, further comprise:
Capacity regulator 260, coupled to the data crawler 210, the first Bloomfilter storage container 220 and the second Bloomfilter storage container 230 respectively, and configured to adjust the storage capacity of the first Bloomfilter storage container 220 and the second Bloomfilter storage container 230 according to changes in the period at which the data crawler 210 crawls URL data.
In instructions provided herein, describe a large amount of detail.But can understand, embodiments of the invention can be put into practice when not having these details.In some instances, be not shown specifically known method, structure and technology, so that not fuzzy understanding of this description.
Similarly, be to be understood that, in order to simplify the disclosure and to help to understand in each inventive aspect one or more, in the description above to exemplary embodiment of the present invention, each feature of the present invention is grouped together in single embodiment, figure or the description to it sometimes.But, the method for the disclosure should be construed to the following intention of reflection: namely the present invention for required protection requires feature more more than the feature clearly recorded in each claim.Or rather, as claims below reflect, all features of disclosed single embodiment before inventive aspect is to be less than.Therefore, the claims following embodiment are incorporated to this embodiment thus clearly, and wherein each claim itself is as independent embodiment of the present invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments can be combined into one module, unit, or component, and can furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for caching data in url data crawling according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (6)

1. A method for caching data in url data crawling, comprising:
periodically crawling url data;
caching the url data captured in each crawl, in order, in both a first Bloom filter (Bloomfilter) storage container and a second Bloomfilter storage container, wherein the first Bloomfilter storage container and the second Bloomfilter storage container have the same storage capacity;
monitoring, while the url data is being stored, the amount of url data stored in the first Bloomfilter storage container and the second Bloomfilter storage container; and
emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition,
wherein emptying the storage containers in turn comprises:
when the amount of data stored in the first Bloomfilter storage container reaches a preset threshold for the first time, emptying the second Bloomfilter storage container;
after the second Bloomfilter storage container has been emptied for the first time, when the amount of data stored in the second Bloomfilter storage container again reaches the preset threshold, emptying the first Bloomfilter storage container; and
when the amount of data stored in the first Bloomfilter storage container again reaches the preset threshold, emptying the second Bloomfilter storage container.
2. The method according to claim 1, wherein the preset threshold is 1/2 of the storage capacity.
3. The method according to claim 1 or 2, wherein the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container is adjusted according to changes in the period of crawling url data.
4. A device for caching data in url data crawling, comprising:
a data crawler, configured to periodically crawl url data;
a first Bloom filter (Bloomfilter) storage container, configured to cache, in order, the url data captured by the data crawler in each crawl;
a second Bloomfilter storage container, having the same capacity as the first Bloomfilter storage container and configured to cache, in order and in synchronization with the first Bloomfilter storage container, the url data captured by the data crawler in each crawl;
a monitor, configured to monitor, while the url data is being stored, the amount of url data stored in the first Bloomfilter storage container and the second Bloomfilter storage container; and
a data clearer, configured to empty the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the data storage condition monitored by the monitor,
wherein the data clearer is further configured to:
when the monitor detects that the amount of data stored in the first Bloomfilter storage container reaches a preset threshold for the first time, empty the second Bloomfilter storage container;
after the second Bloomfilter storage container has been emptied for the first time, when the monitor detects that the amount of data stored in the second Bloomfilter storage container again reaches the preset threshold, empty the first Bloomfilter storage container; and
when the monitor detects that the amount of data stored in the first Bloomfilter storage container again reaches the preset threshold, empty the second Bloomfilter storage container.
5. The device according to claim 4, wherein the preset threshold is 1/2 of the storage capacity.
6. The device according to claim 4 or 5, further comprising:
a capacity regulator, configured to adjust the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container according to changes in the period at which the data crawler crawls url data.
CN201310293574.8A 2013-07-12 2013-07-12 Be suitable in url data crawl the method for data buffer storage and device Expired - Fee Related CN103383665B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610237936.5A CN105930405B (en) 2013-07-12 2013-07-12 Suitable in url data crawl to the method and device of data buffer storage
CN201310293574.8A CN103383665B (en) 2013-07-12 2013-07-12 Be suitable in url data crawl the method for data buffer storage and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310293574.8A CN103383665B (en) 2013-07-12 2013-07-12 Be suitable in url data crawl the method for data buffer storage and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201610237936.5A Division CN105930405B (en) 2013-07-12 2013-07-12 Suitable in url data crawl to the method and device of data buffer storage

Publications (2)

Publication Number Publication Date
CN103383665A CN103383665A (en) 2013-11-06
CN103383665B true CN103383665B (en) 2016-04-27

Family

ID=49491462

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201310293574.8A Expired - Fee Related CN103383665B (en) 2013-07-12 2013-07-12 Be suitable in url data crawl the method for data buffer storage and device
CN201610237936.5A Active CN105930405B (en) 2013-07-12 2013-07-12 Suitable in url data crawl to the method and device of data buffer storage

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201610237936.5A Active CN105930405B (en) 2013-07-12 2013-07-12 Suitable in url data crawl to the method and device of data buffer storage

Country Status (1)

Country Link
CN (2) CN103383665B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10003574B1 (en) 2015-02-28 2018-06-19 Palo Alto Networks, Inc. Probabilistic duplicate detection

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933054B (en) * 2014-03-18 2018-07-06 上海帝联信息科技股份有限公司 The URL storage methods and device of cache resource file, cache server
CN106487759A (en) * 2015-08-28 2017-03-08 北京奇虎科技有限公司 The method and apparatus that URL effectiveness and safety are promoted in a kind of detection
FR3042624A1 (en) * 2015-10-19 2017-04-21 Orange METHOD FOR AIDING THE DETECTION OF INFECTION OF A TERMINAL BY MALWARE SOFTWARE
CN110020272B (en) * 2017-08-14 2021-11-05 中国电信股份有限公司 Caching method and device and computer storage medium
CN111159436B (en) * 2018-11-07 2023-12-12 腾讯科技(深圳)有限公司 Method, device and computing equipment for recommending multimedia content
CN111931028A (en) * 2020-08-18 2020-11-13 北京微步在线科技有限公司 Monitoring system and monitoring method based on k8s

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253991A (en) * 2011-05-25 2011-11-23 北京星网锐捷网络技术有限公司 Uniform resource locator (URL) storage method, web filtering method, device and system
US8250080B1 (en) * 2008-01-11 2012-08-21 Google Inc. Filtering in search engines
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539932A (en) * 2009-01-21 2009-09-23 北京跳网无限科技发展有限公司 Synchronization access technology of transforming web page
CN101742263A (en) * 2009-12-08 2010-06-16 北京互信互通信息技术股份有限公司 Method for storing surveillance video data
CN102214172B (en) * 2010-04-06 2013-05-08 腾讯科技(深圳)有限公司 Caching method and caching equipment
CN102137086B (en) * 2010-09-10 2013-09-11 华为技术有限公司 Method, device and system for processing data transmission
CN102164160B (en) * 2010-12-31 2015-06-17 青岛海信传媒网络技术有限公司 Method, device and system for supporting large quantity of concurrent downloading

Also Published As

Publication number Publication date
CN103383665A (en) 2013-11-06
CN105930405A (en) 2016-09-07
CN105930405B (en) 2019-09-24

Similar Documents

Publication Publication Date Title
CN103383665B (en) Be suitable in url data crawl the method for data buffer storage and device
CN103942210A (en) Processing method, device and system of mass log information
KR102028708B1 (en) Method for parallel mining of temporal relations in large event file
CN104077402A (en) Data processing method and data processing system
CN103488709A (en) Method and system for building indexes and method and system for retrieving indexes
CN102802090B (en) Video copyright protection method and system
CN106294390A (en) A kind of data mining analysis method and system
CN105446893A (en) Data storage method and device
CN101562664A (en) Ticket processing method and system
CN105956068A (en) Webpage URL repetition elimination method based on distributed database
CN103164490A (en) Method and device for achieving high-efficient storage of data with non-fixed lengths
CN107203532A (en) Construction method, the implementation method of search and the device of directory system
CN106897280A (en) Data query method and device
CN104408169A (en) Multi-dimensional expression language based dimension query method and device
CN101777075B (en) Method for searching parallel audio fingerprint
CN102902784B (en) Web page classification storage system and method
CN105224661A (en) Conversational information search method and device
CN103559282B (en) The De-weight method and device of real-time system data
CN100555935C (en) Network monitoring data compression storage and associated detecting method based on similar data set
CN105426407A (en) Web data acquisition method based on content analysis
CN104219271B (en) Based on the asynchronous multiserver synchronous method for downloading the page of multithreading
CN104794158A (en) Domain name data repeated detection and fast index method in boundscript window
CN106354846A (en) Intelligent news manuscript selection method and system based on big data
CN105630983A (en) Resource obtaining and optimizing device and method
CN109271278A (en) A kind of method and apparatus of the reference number of determining disk snapshot data slicer

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160427

Termination date: 20210712