CN103383665A - Method and device suitable for caching data during URL data capture - Google Patents

Publication number
CN103383665A
Authority
CN
China
Prior art keywords
bloomfilter
storage container
data
storage
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102935748A
Other languages
Chinese (zh)
Other versions
CN103383665B (en)
Inventor
韩孟岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201310293574.8A priority Critical patent/CN103383665B/en
Priority to CN201610237936.5A priority patent/CN105930405B/en
Publication of CN103383665A publication Critical patent/CN103383665A/en
Application granted granted Critical
Publication of CN103383665B publication Critical patent/CN103383665B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a method suitable for caching data during URL data capture. The method comprises: capturing URL data periodically; caching the URL data of each capture, in order, into both a first Bloomfilter storage container and a second Bloomfilter storage container, the two containers having the same storage capacity; monitoring the amount of URL data stored in the first and second Bloomfilter storage containers during storage of the URL data; and emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition. With this method, a temporal ordering is converted into a spatial one, data stability is improved, business fluctuation is avoided, the fluctuation range of the system is effectively reduced, and the impact on other modules of the system is lessened.

Description

Method and device suitable for caching data during URL data capture
Technical Field
The present invention relates to the Internet field, and in particular to a method and a device suitable for caching data during URL data capture.
Background
In a web crawling system, most pages are captured under a periodic setting; for example, a page is considered for a refresh capture only after at least a certain interval has elapsed. Capturing too frequently wastes crawling resources and puts unnecessary pressure on the target website. Because memory is generally limited, the straightforward way to handle this endless data stream is to set a time window, clear the data that falls before the window, and free space for incoming data. However, emptying all data before the time window at once makes the data itself fluctuate sharply, which can easily have a considerable impact on the business.
Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method, and a corresponding device, suitable for caching data during URL data capture that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method suitable for caching data during URL data capture is provided, comprising:
capturing URL data periodically;
caching the URL data of each capture, in order, into both a first Bloom filter (Bloomfilter) storage container and a second Bloomfilter storage container, wherein the first Bloomfilter storage container and the second Bloomfilter storage container have the same storage capacity;
during storage of the URL data, monitoring the amount of URL data stored in the first Bloomfilter storage container and in the second Bloomfilter storage container; and
emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition.
Optionally, emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition comprises:
when the amount of data stored in the first Bloomfilter storage container reaches a preset threshold for the first time, emptying the second Bloomfilter storage container.
Optionally, emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition further comprises:
after the second Bloomfilter storage container has been emptied for the first time,
when the amount of data stored in the second Bloomfilter storage container reaches the preset threshold again, emptying the first Bloomfilter storage container; and
when the amount of data stored in the first Bloomfilter storage container reaches the preset threshold again, emptying the second Bloomfilter storage container.
Optionally, the preset threshold is 1/2 of the storage capacity.
Optionally, the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container is adjusted according to changes in the period at which URL data is captured.
According to another aspect of the present invention, a device suitable for caching data during URL data capture is provided, comprising:
a data grabber, configured to capture URL data periodically;
a first Bloom filter (Bloomfilter) storage container, configured to cache, in order, the URL data of each capture by the data grabber;
a second Bloomfilter storage container, having the same capacity as the first Bloomfilter storage container and configured to cache, in order and in synchronization with the first Bloomfilter storage container, the URL data of each capture by the data grabber;
a monitor, configured to monitor, during storage of the URL data, the amount of URL data stored in the first Bloomfilter storage container and in the second Bloomfilter storage container; and
a data clearing device, configured to empty the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the data storage condition monitored by the monitor.
Optionally, the data clearing device is further configured to:
empty the second Bloomfilter storage container when the monitor detects that the amount of data stored in the first Bloomfilter storage container has reached a preset threshold for the first time.
Optionally, the data clearing device is further configured to:
after the second Bloomfilter storage container has been emptied for the first time,
empty the first Bloomfilter storage container when the monitor detects that the amount of data stored in the second Bloomfilter storage container has reached the preset threshold again; and
empty the second Bloomfilter storage container when the monitor detects that the amount of data stored in the first Bloomfilter storage container has reached the preset threshold again.
Optionally, the preset threshold is 1/2 of the storage capacity.
Optionally, the device further comprises:
a capacity regulator, configured to adjust the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container according to changes in the period at which the data grabber captures URL data.
The method and device provided by the embodiments of the present invention can achieve the following beneficial effects:
In the embodiments of the present invention, URL data is captured periodically, so the URL data exists as a continuous stream and its total amount keeps growing. The URL data of each capture is cached, in order, into both the first Bloomfilter storage container and the second Bloomfilter storage container, so the data in the two containers is synchronized and the two containers are redundant with each other. During storage, the amount of URL data stored in each container is monitored, and the second and first Bloomfilter storage containers are emptied in turn according to the monitoring result. In other words, the embodiments use two Bloomfilter storage containers for URL data rather than a single one. Accordingly, when data is deleted, the two containers are emptied in turn, so each emptying removes only part of the URL data and retains the rest. A temporal ordering is thereby converted into a spatial one, and the cleanup procedure is simple. Because the embodiments do not clear all data at once, data stability is improved, business fluctuation is avoided, the fluctuation range of the system is effectively reduced, and the impact on other modules of the system is lessened.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a processing flowchart of a method suitable for caching data during URL data capture according to an embodiment of the invention;
Fig. 2 shows a first structural schematic diagram of a device suitable for caching data during URL data capture according to an embodiment of the invention; and
Fig. 3 shows a second structural schematic diagram of a device suitable for caching data during URL data capture according to an embodiment of the invention.
Detailed Description
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.
To solve the above technical problem, the present invention uses a Bloom filter to provide an inventive concept suitable for caching data during URL data capture. The advantage of a Bloom filter is that its space efficiency and query time far exceed those of general algorithms; its drawbacks are a certain false positive rate and the difficulty of deleting elements. It is therefore ill-suited to clearing part of its contents according to a time window. The usual cleanup approach is to empty the whole Bloomfilter, but doing so makes the data fluctuate sharply and can easily have a considerable impact on the business.
On this basis, the present invention refines the inventive concept so that not all data is deleted at each cleanup, thereby smoothing the fluctuation caused by the cleanup operation.
Based on the above inventive concept, an embodiment of the present invention provides a method suitable for caching data during URL (Uniform Resource Locator) data capture. Fig. 1 shows a processing flowchart of the method suitable for caching data during URL data capture according to an embodiment of the invention. Referring to Fig. 1, the method comprises at least steps S102 to S108.
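As background on the false positive rate mentioned above, the standard textbook estimate for a Bloom filter is sketched below. The symbols m (number of bits in the filter), k (number of hash functions), and n (number of inserted items) are introduced here for illustration and do not appear in the patent; the estimate assumes independent, uniformly distributed hash functions.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
```

Because p grows with n, bounding the number of items a container holds before it is emptied also bounds its false positive rate.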
Step S102: capture URL data periodically.
The URL data capture here is mostly applicable to the spider (crawler) capture mode.
Step S104: cache the URL data of each capture, in order, into both the first Bloomfilter storage container and the second Bloomfilter storage container.
The first Bloomfilter storage container and the second Bloomfilter storage container have the same storage capacity. Note that "first" and "second" are used here only to identify and distinguish the two Bloomfilter storage containers; the two are identical in nature, and the terms do not imply an order.
In practice, the storage capacity of the first and second Bloomfilter storage containers is not fixed. The capture period of the URL data mentioned in step S102 may itself change: the interval between captures may become shorter or longer. For example, if the initially set capture period is once every 10 ms, the period after a change might be once every 5 ms or once every 15 ms. The actual change depends on circumstances and is not limited to the numbers in this example.
When the capture interval becomes shorter, URL data is captured more often and the total amount of captured URL data increases; correspondingly, the storage capacity of the first and second Bloomfilter storage containers is increased moderately and in proportion. Likewise, when the capture interval becomes longer, URL data is captured less often and the total amount of captured URL data decreases; correspondingly, the storage capacity of the first and second Bloomfilter storage containers is decreased moderately and in proportion.
A specific example follows. The initially set capture period is once every 10 ms, and the period after the change is once every 5 ms. Over the same length of time, the total amount of URL data captured after the change is twice that captured before the change. So that the captured URL data can be stored in time, the storage capacity of the first and second Bloomfilter storage containers can be increased to twice the original capacity. Note that twice is only one moderate ratio of increase, not a fixed ratio; the capacity of both containers may also be set higher, and no limitation is intended here.
A further example covers a longer capture period. The initially set capture period is once every 10 ms, and the period after the change is once every 20 ms. Over the same length of time, the total amount of URL data captured after the change is 1/2 of that captured before the change. To save storage space and avoid wasting resources, the storage capacity of the first and second Bloomfilter storage containers can be decreased to 1/2 of the original capacity. Note that 1/2 is only one moderate ratio of decrease, not a fixed ratio; the capacity of both containers may also be set lower, and no limitation is intended here.
As the above analysis shows, the storage capacity of the first and second Bloomfilter storage containers is not fixed and can be adjusted according to changes in the period at which URL data is captured.
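The proportional adjustment described above can be sketched as a small helper. This is a minimal illustration, assuming capacity scales inversely with the capture period; the function name `scaled_capacity` and its parameters are hypothetical and do not come from the patent, which also allows ratios other than the strictly proportional one.

```python
def scaled_capacity(capacity: int, old_period_ms: float, new_period_ms: float) -> int:
    """Return a new container capacity scaled inversely with the capture period.

    Assumption for illustration: halving the period doubles the data volume,
    so the capacity is multiplied by old_period / new_period.
    """
    if old_period_ms <= 0 or new_period_ms <= 0:
        raise ValueError("capture periods must be positive")
    return max(1, round(capacity * old_period_ms / new_period_ms))


# The two worked examples from the text, with a hypothetical capacity of 1000.
doubled = scaled_capacity(1000, old_period_ms=10, new_period_ms=5)
halved = scaled_capacity(1000, old_period_ms=10, new_period_ms=20)
```

The patent does not say how a new capacity is applied to a container that already holds data; one plausible option, stated here only as an assumption, is to apply it the next time that container is emptied.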
Step S106: during storage of the URL data, monitor the amount of URL data stored in the first Bloomfilter storage container and in the second Bloomfilter storage container.
Step S108: according to the monitored data storage condition, empty the second Bloomfilter storage container and the first Bloomfilter storage container in turn.
In the embodiments of the present invention, URL data is captured periodically, so the URL data exists as a continuous stream and its total amount keeps growing. The URL data of each capture is cached, in order, into both the first Bloomfilter storage container and the second Bloomfilter storage container, so the data in the two containers is synchronized and the two containers are redundant with each other. During storage, the amount of URL data stored in each container is monitored, and the second and first Bloomfilter storage containers are emptied in turn according to the monitoring result. In other words, the embodiments use two Bloomfilter storage containers for URL data rather than a single one. Accordingly, when data is deleted, the two containers are emptied in turn, so each emptying removes only part of the URL data and retains the rest. A temporal ordering is thereby converted into a spatial one, and the cleanup procedure is simple. Because the embodiments do not clear all data at once, data stability is improved, business fluctuation is avoided, the fluctuation range of the system is effectively reduced, and the impact on other modules of the system is lessened.
Furthermore, compared with the time-window cleanup approach of the prior art mentioned above, the embodiments of the present invention do not need a time window and do not need to record a time attribute for the data, which saves storage overhead.
In implementation, the first Bloomfilter storage container and the second Bloomfilter storage container are both initially blank and store no URL data. The two containers store the captured URL data synchronously. Specifically, when captured URL data is written into the storage window, it must be written to both the first Bloomfilter storage container and the second Bloomfilter storage container (a double write), so the data in the two Bloomfilter storage containers grows in step. For example, saving URL1 proceeds as follows:
1. Write URL1 into the first Bloomfilter storage container;
2. Write URL1 into the second Bloomfilter storage container;
3. The write of URL1 is complete.
After the write of URL1 completes, URL1 can be found in both the first Bloomfilter storage container and the second Bloomfilter storage container. That is, URL1 is stored redundantly at write time.
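The double write above can be sketched with a very small Bloom filter. This is an illustrative sketch rather than the patent's implementation: the class `BloomFilter`, the function `save_url`, the bit-array size, the number of hash positions, and the SHA-256 double-hashing scheme are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter with m bits and k hash positions per item."""

    def __init__(self, m_bits=1 << 16, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)
        self.count = 0  # number of writes since the filter was last empty

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit slices of SHA-256.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def save_url(url, first, second):
    """Write the URL into both containers, mirroring steps 1 to 3 above."""
    first.add(url)   # step 1
    second.add(url)  # step 2
    # step 3: the write is complete once both containers hold the URL


first, second = BloomFilter(), BloomFilter()
save_url("http://example.com/page1", first, second)
```

After `save_url` returns, a membership query succeeds against either container, which is the redundancy the text describes. The `count` field is one simple way to expose the stored amount that the monitoring step needs.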
Consequently, the first Bloomfilter storage container and the second Bloomfilter storage container reach the preset threshold for the first time simultaneously. The preset threshold may be 1/2, 1/3, or another fraction of the total capacity of either Bloomfilter storage container. At initial setup, the second Bloomfilter storage container can be designated as the older container; then, when the amount of data stored in the first Bloomfilter storage container reaches the preset threshold for the first time, the second Bloomfilter storage container is emptied. Of course, if the first Bloomfilter storage container is designated as the older container at initial setup, the first Bloomfilter storage container is the one emptied first.
As stated in step S108, the second Bloomfilter storage container and the first Bloomfilter storage container are emptied in turn. Therefore, after the second Bloomfilter storage container has been emptied for the first time, the first Bloomfilter storage container is emptied next, then the second again, and this order repeats, so that the two containers are emptied in turn. Each emptying is triggered when the amount of stored URL data reaches the preset threshold again. Specifically, after the second Bloomfilter storage container has been emptied for the first time, the first Bloomfilter storage container is emptied when the amount of data stored in the second Bloomfilter storage container reaches the preset threshold again. Correspondingly, the second Bloomfilter storage container is emptied when the amount of data stored in the first Bloomfilter storage container reaches the preset threshold again.
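The alternating emptying rule can be sketched as follows. To keep the control flow visible, a plain Python set stands in for each Bloomfilter storage container; this substitution, and the names `Container`, `RotatingCache`, `victim`, and `emptied`, are assumptions made for the example and are not taken from the patent.

```python
class Container:
    """Stand-in for one Bloomfilter storage container (a set, for clarity)."""

    def __init__(self, name):
        self.name = name
        self.items = set()

    def add(self, url):
        self.items.add(url)

    def clear(self):
        self.items.clear()

    def __len__(self):
        return len(self.items)

    def __contains__(self, url):
        return url in self.items


class RotatingCache:
    """Two equally sized containers that are emptied in turn."""

    def __init__(self, capacity, threshold=None):
        # Default threshold is half the capacity, as in the preferred embodiment.
        self.threshold = threshold if threshold is not None else capacity // 2
        self.first = Container("first")
        self.second = Container("second")
        self.victim = self.second  # the designated "older" container
        self.emptied = []          # names of emptied containers, in order

    def _other(self, container):
        return self.first if container is self.second else self.second

    def add(self, url):
        # Double write, then check the container that triggers the next emptying.
        self.first.add(url)
        self.second.add(url)
        trigger = self._other(self.victim)
        if len(trigger) >= self.threshold:
            self.victim.clear()
            self.emptied.append(self.victim.name)
            self.victim = trigger  # roles swap for the next round

    def __contains__(self, url):
        return url in self.first or url in self.second


cache = RotatingCache(capacity=8)
for i in range(12):
    cache.add(f"http://example.com/{i}")
```

With a capacity of 8 and twelve distinct URLs, the containers are emptied in the order second, first, second, and only the most recent four URLs remain queryable. Each emptying discards the older half of the retained data rather than everything at once.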
Note that the method suitable for caching data during URL data capture provided by the embodiments of the present invention has its own scope of application. It mainly applies to scenarios in which time-ordered data is evicted:
1. First, the data is ordered, having either a time attribute or a before/after position attribute;
2. Second, the range observed or acted upon is limited; that is, the range of the usable window is bounded;
3. Within the window period (that is, the data storage period), a single data item is allowed to appear at most once, and repeated occurrences are ignored.
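Condition 3 above can be illustrated with a check-before-write helper. This is a sketch under stated assumptions: Python sets stand in for the two Bloomfilter storage containers, the function name `admit` is hypothetical, and querying both containers before writing is one possible reading, since the patent does not spell out the query path.

```python
def admit(url, first, second):
    """Write url to both containers unless it is already inside the window.

    Returns True if the URL was new and has been written, False if ignored.
    """
    if url in first or url in second:
        return False
    first.add(url)
    second.add(url)
    return True


first, second = set(), set()
results = [admit(u, first, second) for u in ["a", "b", "a"]]

# Emptying one container does not make retained URLs look new again.
second.clear()
after_emptying = admit("a", first, second)
```

With real Bloom filters the membership test can return false positives, so a small fraction of genuinely new URLs would also be ignored.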
To set out the method provided by the embodiments of the present invention more clearly, a specific example is now described.
This example provides two Bloomfilter storage containers, each of capacity C, labeled container A and container B. The total amount of data stored in the whole space is denoted n, the amount of data stored in container A is denoted na, and the amount of data stored in container B is denoted nb.
In this example, the two containers provide the data storage service simultaneously. Initially, n = 0, na = 0, and nb = 0, and container B is set as the older container.
The emptying of the two containers is now described in terms of n, which changes as data is stored.
When n < C/2: na < C/2 and nb < C/2. Neither container A nor container B has reached the preset threshold, and data storage continues.
When n = C/2: the amounts of data stored in containers A and B both reach the preset threshold C/2. Because container B was initially set as the older container, container B is emptied. After container B is emptied, n = C/2, na = C/2, and nb = 0.
Data storage continues. When C/2 <= n < C: C/2 <= na < C and nb < C/2. Container B, the most recently emptied container, has not yet reached the preset threshold again, so data storage continues.
When n = C: na = C and nb = C/2. The amount of data stored in container B now reaches the preset threshold C/2, so container A is emptied. After container A is emptied, n = C/2, na = 0, and nb = C/2, and data storage continues.
When C/2 <= n < C: na < C/2 and C/2 <= nb < C. Container A, the most recently emptied container, has not yet reached the preset threshold again, so data storage continues.
When n = C: na = C/2 and nb = C. The amount of data stored in container A now reaches the preset threshold C/2, so container B is emptied. After container B is emptied, n = C/2, na = C/2, and nb = 0.
Data storage continues, and the above emptying operations repeat as n changes.
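The walk-through above can be checked with a counter-only simulation. This is a minimal sketch, assuming every write is a distinct item and a threshold of C/2; the function name `simulate` and the event tuple layout are hypothetical.

```python
def simulate(capacity, writes):
    """Trace na and nb and record (emptied container, na, nb) after each emptying."""
    threshold = capacity // 2
    na = nb = 0
    victim = "B"  # container B starts as the older container
    events = []
    for _ in range(writes):
        # Double write: both counters grow together.
        na += 1
        nb += 1
        # The container that is not the next victim triggers the emptying.
        trigger = na if victim == "B" else nb
        if trigger >= threshold:
            if victim == "B":
                nb, victim = 0, "A"
                events.append(("B", na, nb))
            else:
                na, victim = 0, "B"
                events.append(("A", na, nb))
    return events


events = simulate(capacity=8, writes=12)
```

For C = 8 the recorded states after each emptying are (na, nb) = (4, 0), (0, 4), (4, 0), that is (C/2, 0), (0, C/2), (C/2, 0), matching the values given in the text after container B, container A, and container B are emptied.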
To set out the above emptying process, the points at which it operates, and the corresponding rules more clearly, they are summarized in Table 1. In rows 2, 4, and 6, the A and B columns give the state after the emptying.
Table 1
Stage   n               A               B
1       n < C/2         na < C/2        nb < C/2
2       n = C/2         na = C/2        nb = 0
3       C/2 <= n < C    C/2 <= na < C   nb < C/2
4       n = C           na = 0          nb = C/2
5       C/2 <= n < C    na < C/2        C/2 <= nb < C
6       n = C           na = C/2        nb = 0
As Table 1 shows, when containers A and B reach the preset threshold for the first time, container B is emptied; thereafter, whenever container A or container B reaches the preset threshold again, the other container is emptied.
Based on the same inventive concept, an embodiment of the present invention provides a device suitable for caching data during URL data capture. Fig. 2 shows a first structural schematic diagram of the device suitable for caching data during URL data capture according to an embodiment of the invention. Referring to Fig. 2, the device comprises at least:
a data grabber 210, configured to capture URL data periodically;
a first Bloomfilter storage container 220, coupled to the data grabber 210 and configured to cache, in order, the URL data of each capture by the data grabber 210;
a second Bloomfilter storage container 230, having the same capacity as the first Bloomfilter storage container 220, also coupled to the data grabber 210, and configured to cache, in order and in synchronization with the first Bloomfilter storage container 220, the URL data of each capture by the data grabber 210;
a monitor 240, coupled to the first Bloomfilter storage container 220 and the second Bloomfilter storage container 230 respectively, and configured to monitor, during storage of the URL data, the amount of URL data stored in the first Bloomfilter storage container 220 and in the second Bloomfilter storage container 230; and
a data clearing device 250, configured to empty the second Bloomfilter storage container 230 and the first Bloomfilter storage container 220 in turn according to the data storage condition monitored by the monitor 240.
In the embodiments of the present invention, each of the above components can be realized with actual hardware. The prior art offers various memories (such as RAM, ROM, EPROM, and flash memory), monitors (for example, heartbeat devices), data clearing devices (for example, data erasure devices), data grabbers, and so on. What the present invention provides and protects is the composition and structure of the parts of the device suitable for caching data during URL data capture.
In a preferred embodiment, the data clearing device 250 may be further configured to:
empty the second Bloomfilter storage container 230 when the monitor 240 detects that the amount of data stored in the first Bloomfilter storage container 220 has reached the preset threshold for the first time.
In a preferred embodiment, the data clearing device 250 may be further configured to:
after the second Bloomfilter storage container 230 has been emptied for the first time,
empty the first Bloomfilter storage container 220 when the monitor 240 detects that the amount of data stored in the second Bloomfilter storage container 230 has reached the preset threshold again; and
empty the second Bloomfilter storage container 230 when the monitor 240 detects that the amount of data stored in the first Bloomfilter storage container 220 has reached the preset threshold again.
In a preferred embodiment, the preset threshold is 1/2 of the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container.
Fig. 3 shows a second structural schematic diagram of the device suitable for caching data during URL data capture according to an embodiment of the invention. In a preferred embodiment, referring to Fig. 3, the device, in addition to the components shown in Fig. 2, may further comprise:
a capacity regulator 260, coupled to the data grabber 210, the first Bloomfilter storage container 220, and the second Bloomfilter storage container 230 respectively, and configured to adjust the storage capacity of the first Bloomfilter storage container 220 and the second Bloomfilter storage container 230 according to changes in the period at which the data grabber 210 captures URL data.
The method and device provided by the embodiments of the present invention achieve the following beneficial effects:
In the embodiments of the present invention, URL data is captured periodically, so the URL data arrives as a continuous data stream and its total volume also grows in a streaming manner. The URL data captured each time is cached, in order, in both the first Bloomfilter storage container and the second Bloomfilter storage container; the data in the two storage containers is synchronized, and the two containers are redundant with each other. During storage, the amount of URL data stored in the first and second Bloomfilter storage containers is monitored, and the second and first Bloomfilter storage containers are emptied in turn according to the monitoring result. As the above analysis shows, the embodiments provide a first and a second Bloomfilter storage container for URL data storage, rather than a single Bloomfilter storage container. Accordingly, for data deletion, the second and first Bloomfilter storage containers are emptied in turn; that is, each emptying removes only part of the URL data and retains the rest. The time-order attribute is thereby converted into a spatial-order attribute, and the cleanup is simple. Moreover, because the embodiments do not remove all data at once, data stability is improved, traffic fluctuations are avoided, the fluctuation range of the system is effectively reduced, and the impact on other modules of the system is reduced.
Further, compared with the prior-art approach of cleaning data by time window, the embodiments of the present invention do not use a time window and do not need to record a time attribute for the data, which saves storage overhead.
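The two-container scheme described above can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the patented implementation: the class names `BloomFilter` and `DualBloomCache`, the SHA-256 double-hashing scheme, the 1% false-positive target, and the choice to treat a URL as already seen when either container reports it are all hypothetical. The threshold of half the capacity follows the value given in the claims.

```python
import hashlib
import math


class BloomFilter:
    """Small Bloom filter that also counts how many items were added."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0  # monitored storage amount

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def clear(self) -> None:
        self.bits = bytearray(len(self.bits))
        self.count = 0


class DualBloomCache:
    """Two equal-capacity filters that are emptied in turn, never together."""

    def __init__(self, capacity: int):
        self.first = BloomFilter(capacity)
        self.second = BloomFilter(capacity)
        self.threshold = capacity // 2   # preset threshold: 1/2 of capacity
        self.clear_second_next = True    # the second container is emptied first
        self.cleared = []                # log of emptied containers, for illustration

    def seen(self, url: str) -> bool:
        return url in self.first or url in self.second

    def add(self, url: str) -> bool:
        """Cache a URL in both containers; return False if it was already seen."""
        if self.seen(url):
            return False
        self.first.add(url)
        self.second.add(url)
        self._rotate()
        return True

    def _rotate(self) -> None:
        # Empty one container whenever the other one reaches the threshold.
        if self.clear_second_next and self.first.count >= self.threshold:
            self.second.clear()
            self.clear_second_next = False
            self.cleared.append("second")
        elif not self.clear_second_next and self.second.count >= self.threshold:
            self.first.clear()
            self.clear_second_next = True
            self.cleared.append("first")
```

No timestamp is stored per URL in this sketch: the age of an entry is implied only by which container still holds it, which is the time-to-space conversion the description refers to.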
Numerous specific details are described in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device suitable for caching data during URL data capture according to embodiments of the invention. The invention may also be implemented as an apparatus or device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-described embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A method suitable for caching data during URL data capture, comprising:
periodically capturing URL data;
caching the URL data captured each time, in order, in both a first Bloom filter (Bloomfilter) storage container and a second Bloomfilter storage container, wherein the first Bloomfilter storage container and the second Bloomfilter storage container have the same storage capacity;
monitoring, during the storage of the URL data, the amount of URL data stored in the first Bloomfilter storage container and the second Bloomfilter storage container;
emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition.
2. The method according to claim 1, wherein emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition comprises:
emptying the second Bloomfilter storage container when the amount of data stored in the first Bloomfilter storage container reaches a preset threshold for the first time.
3. The method according to claim 2, wherein emptying the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the monitored data storage condition further comprises:
after the second Bloomfilter storage container has been emptied for the first time,
emptying the first Bloomfilter storage container when the amount of data stored in the second Bloomfilter storage container again reaches the preset threshold; and
emptying the second Bloomfilter storage container when the amount of data stored in the first Bloomfilter storage container again reaches the preset threshold.
4. The method according to claim 2 or 3, wherein the preset threshold is 1/2 of the storage capacity.
5. The method according to any one of claims 1 to 4, wherein the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container is adjusted according to changes in the cycle at which URL data is captured.
6. A device suitable for caching data during URL data capture, comprising:
a data grabber, configured to periodically capture URL data;
a first Bloom filter (Bloomfilter) storage container, configured to cache, in order, the URL data captured each time by the data grabber;
a second Bloomfilter storage container, having the same capacity as the first Bloomfilter storage container and configured to cache, in order and in synchronization with the first Bloomfilter storage container, the URL data captured each time by the data grabber;
a monitor, configured to monitor, during the storage of the URL data, the amount of URL data stored in the first Bloomfilter storage container and the second Bloomfilter storage container;
a data clearing device, configured to empty the second Bloomfilter storage container and the first Bloomfilter storage container in turn according to the data storage condition monitored by the monitor.
7. The device according to claim 6, wherein the data clearing device is further configured to:
empty the second Bloomfilter storage container when the monitor detects that the amount of data stored in the first Bloomfilter storage container reaches a preset threshold for the first time.
8. The device according to claim 7, wherein the data clearing device is further configured to:
after the second Bloomfilter storage container has been emptied for the first time,
empty the first Bloomfilter storage container when the monitor detects that the amount of data stored in the second Bloomfilter storage container again reaches the preset threshold; and
empty the second Bloomfilter storage container when the monitor detects that the amount of data stored in the first Bloomfilter storage container again reaches the preset threshold.
9. The device according to claim 7 or 8, wherein the preset threshold is 1/2 of the storage capacity.
10. The device according to any one of claims 6 to 9, further comprising:
a volume regulator, configured to adjust the storage capacity of the first Bloomfilter storage container and the second Bloomfilter storage container according to changes in the cycle at which the data grabber captures URL data.
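The emptying order recited in claims 2 to 4 (and mirrored in claims 7 to 9) can be traced with plain counters. The following Python sketch is illustrative only: the function name `clearing_schedule` and its parameters are hypothetical, every captured URL is assumed to be distinct, and the Bloom filters themselves are replaced by simple counts so that only the timing of the emptying is shown.

```python
def clearing_schedule(capacity: int, total_urls: int) -> list[tuple[int, str, int]]:
    """Return (urls_captured, container_emptied, urls_still_held) events.

    Assumes distinct URLs and a preset threshold of 1/2 of the capacity.
    """
    threshold = capacity // 2
    first = second = 0
    clear_second_next = True  # the second container is emptied first
    events = []
    for n in range(1, total_urls + 1):
        # Each captured URL is cached in both containers.
        first += 1
        second += 1
        if clear_second_next and first >= threshold:
            second = 0
            clear_second_next = False
            events.append((n, "second", first))
        elif not clear_second_next and second >= threshold:
            first = 0
            clear_second_next = True
            events.append((n, "first", second))
    return events


schedule = clearing_schedule(capacity=1000, total_urls=2000)
```

Under these assumptions, each emptying leaves the other container holding the most recent half-capacity of URLs, and neither counter exceeds the capacity, which is consistent with the claim that only part of the data is removed at a time.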
CN201310293574.8A 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture Expired - Fee Related CN103383665B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310293574.8A CN103383665B (en) 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture
CN201610237936.5A CN105930405B (en) 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310293574.8A CN103383665B (en) 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201610237936.5A Division CN105930405B (en) 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture

Publications (2)

Publication Number Publication Date
CN103383665A true CN103383665A (en) 2013-11-06
CN103383665B CN103383665B (en) 2016-04-27

Family

ID=49491462

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201310293574.8A Expired - Fee Related CN103383665B (en) 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture
CN201610237936.5A Active CN105930405B (en) 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201610237936.5A Active CN105930405B (en) 2013-07-12 2013-07-12 Method and device suitable for caching data during URL data capture

Country Status (1)

Country Link
CN (2) CN103383665B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933054A (en) * 2014-03-18 2015-09-23 上海帝联信息科技股份有限公司 Uniform resource locator (URL) storage method and device of cache resource file, and cache server
CN106487759A (en) * 2015-08-28 2017-03-08 北京奇虎科技有限公司 The method and apparatus that URL effectiveness and safety are promoted in a kind of detection
US20180316697A1 (en) * 2015-10-19 2018-11-01 Orange Method of aiding the detection of infection of a terminal by malware
CN111159436A (en) * 2018-11-07 2020-05-15 腾讯科技(深圳)有限公司 Method and device for recommending multimedia content and computing equipment
CN111931028A (en) * 2020-08-18 2020-11-13 北京微步在线科技有限公司 Monitoring system and monitoring method based on k8s

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9634992B1 (en) 2015-02-28 2017-04-25 Palo Alto Networks, Inc. Probabilistic duplicate detection
CN110020272B (en) * 2017-08-14 2021-11-05 中国电信股份有限公司 Caching method and device and computer storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253991A (en) * 2011-05-25 2011-11-23 北京星网锐捷网络技术有限公司 Uniform resource locator (URL) storage method, web filtering method, device and system
US8250080B1 (en) * 2008-01-11 2012-08-21 Google Inc. Filtering in search engines
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539932A (en) * 2009-01-21 2009-09-23 北京跳网无限科技发展有限公司 Synchronization access technology of transforming web page
CN101742263A (en) * 2009-12-08 2010-06-16 北京互信互通信息技术股份有限公司 Method for storing surveillance video data
CN102214172B (en) * 2010-04-06 2013-05-08 腾讯科技(深圳)有限公司 Caching method and caching equipment
CN102137086B (en) * 2010-09-10 2013-09-11 华为技术有限公司 Method, device and system for processing data transmission
CN102164160B (en) * 2010-12-31 2015-06-17 青岛海信传媒网络技术有限公司 Method, device and system for supporting large quantity of concurrent downloading

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933054A (en) * 2014-03-18 2015-09-23 上海帝联信息科技股份有限公司 Uniform resource locator (URL) storage method and device of cache resource file, and cache server
CN104933054B (en) * 2014-03-18 2018-07-06 上海帝联信息科技股份有限公司 The URL storage methods and device of cache resource file, cache server
CN106487759A (en) * 2015-08-28 2017-03-08 北京奇虎科技有限公司 The method and apparatus that URL effectiveness and safety are promoted in a kind of detection
US20180316697A1 (en) * 2015-10-19 2018-11-01 Orange Method of aiding the detection of infection of a terminal by malware
US10757118B2 (en) * 2015-10-19 2020-08-25 Orange Method of aiding the detection of infection of a terminal by malware
CN111159436A (en) * 2018-11-07 2020-05-15 腾讯科技(深圳)有限公司 Method and device for recommending multimedia content and computing equipment
CN111159436B (en) * 2018-11-07 2023-12-12 腾讯科技(深圳)有限公司 Method, device and computing equipment for recommending multimedia content
CN111931028A (en) * 2020-08-18 2020-11-13 北京微步在线科技有限公司 Monitoring system and monitoring method based on k8s

Also Published As

Publication number Publication date
CN105930405B (en) 2019-09-24
CN105930405A (en) 2016-09-07
CN103383665B (en) 2016-04-27

Similar Documents

Publication Publication Date Title
CN103383665A (en) Method and device suitable for caching data during URL data capture
CN103092999B (en) A kind of webpage capture period modulation method and apparatus
CN102054028B (en) Method for implementing web-rendering function by using web crawler system
CN102567407B (en) Method and system for collecting forum reply increment
CN104077402B (en) Data processing method and data handling system
CN103942210A (en) Processing method, device and system of mass log information
CN103077254B (en) Webpage acquisition methods and device
CN104090976A (en) Method and device for crawling webpages by search engine crawlers
CN102930059A (en) Method for designing focused crawler
CN103678408A (en) Method and device for inquiring data
CN105956068A (en) Webpage URL repetition elimination method based on distributed database
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN104408169A (en) Multi-dimensional expression language based dimension query method and device
CN103984757A (en) Method and system for inserting news information articles in search result page
CN106055697A (en) Unstructured event log data classification and storage method and device
CN102404411A (en) Data synchronization method of cloud storage system
CN103914486B (en) Document search and display system
CN101777075B (en) Method for searching parallel audio fingerprint
CN104967698A (en) Network data crawling method and apparatus
CN105426407A (en) Web data acquisition method based on content analysis
CN103034655A (en) Collection method and system of user behavior information and related equipment
CN104794158A (en) Domain name data repeated detection and fast index method in boundscript window
CN103336671A (en) Method and equipment for acquiring data from network
CN104219271B (en) Based on the asynchronous multiserver synchronous method for downloading the page of multithreading
CN106055691A (en) Storage processing method and storage processing system for distributed data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160427

Termination date: 20210712
