CN102253991A - Uniform resource locator (URL) storage method, web filtering method, device and system - Google Patents
- Publication number
- CN102253991A (application number CN201110187962A)
- Authority
- CN
- China
- Prior art keywords
- url
- bloom filter
- deletion
- gateway equipment
- memory device
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a uniform resource locator (URL) storage method, a web page filtering method, a web page filtering device and a web page filtering system. The URL storage method comprises the following steps: S11, classifying URLs according to a predetermined classification rule; S12, generating a separate Bloom filter for storing each type of URL; and S13, storing each URL in the corresponding Bloom filter according to its type. With the URL storage method, the web page filtering method, the device and the system, efficient URL queries can be provided while web pages are filtered, so that network performance is improved.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a URL storage method, a web page filtering method, a device and a system.
Background art
With the development of network technology and resources, the services and functions carried by networks have become increasingly diverse. To improve security and restrict employees' network access, enterprises usually deploy a uniform resource locator (URL) filtering gateway at the egress of the enterprise network. A URL is an identifier that fully describes the address of a web page or other resource on the Internet. Each web page on the Internet has a unique corresponding URL address, which may refer to a local disk, a computer on a local area network, or a website on the Internet; the URL address is what is commonly called a web address. The URL filtering gateway audits the URLs that users visit, judges whether each one is legal, and blocks the access when it is judged illegal.
To judge whether a URL visited by a user is legal, the URL filtering gateway relies on a URL library. After obtaining the URL visited by the user, the gateway queries the URL field of the URL library to find the record matching this URL, then queries the classification field corresponding to that URL field to learn whether the URL is classified as legal access or illegal access, and handles the request accordingly. Because the URL filtering gateway is an egress gateway, a large number of users produces a large volume of URL queries, which greatly degrades network performance.
Summary of the invention
The invention provides a URL storage method, a web page filtering method, a device and a system that provide efficient URL queries during web page filtering, thereby improving network performance.
The invention provides a URL storage method, comprising:
Step S11: classifying URLs according to a predetermined classification rule;
Step S12: generating a separate Bloom filter for storing each type of URL;
Step S13: storing each URL in the corresponding Bloom filter according to its type.
According to another aspect of the invention, a URL storage device is also provided, comprising:
a classification module, configured to classify URLs according to a predetermined classification rule;
a generation module, configured to generate a separate Bloom filter for storing each type of URL;
a storage module, configured to store each URL in the corresponding Bloom filter according to its type.
According to another aspect of the invention, a web page filtering method is also provided, comprising:
Step S21: a filtering gateway device obtains, from the URL storage device provided by the invention, the Bloom filters in which URLs are stored by class;
Step S22: the filtering gateway device filters web pages according to the URLs stored in the Bloom filters and the types of those URLs.
According to another aspect of the invention, a filtering gateway device is also provided, comprising:
an acquisition module, configured to obtain, from a URL storage device, the Bloom filters in which URLs are stored by class;
a filtering module, configured to filter web pages according to the URLs stored in the Bloom filters and the types of those URLs.
According to yet another aspect of the invention, a web page filtering system is also provided, comprising the URL storage device provided by the invention and the filtering gateway device provided by the invention.
According to the URL storage method, web page filtering method, device and system of the invention, a Bloom filter is created for each type of URL, and each URL is stored by type in the corresponding Bloom filter. After the filtering gateway device obtains the Bloom filters storing the URLs, it no longer needs to traverse a huge URL library for every URL to be audited, which greatly improves query efficiency; even when a large number of users initiate network access at the same time, network performance can be maintained. A large amount of storage space is also saved. In addition, because the URLs are stored in Bloom filters, a high degree of confidentiality for the URLs is achieved without additional encryption.
Brief description of the drawings
Fig. 1 is a system architecture diagram of web page filtering using URLs.
Fig. 2 is a schematic flow chart of the URL storage method of the invention.
Fig. 3 is a schematic structural diagram of the URL storage device of the invention.
Fig. 4 is a schematic flow chart of the web page filtering method of the invention.
Fig. 5 is a schematic structural diagram of the filtering gateway device of the invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below with reference to the accompanying drawings.
Fig. 1 is a system architecture diagram of web page filtering using URLs. As shown in Fig. 1, the system comprises a URL server that generates the URL library and a filtering gateway device that obtains the URL library from the URL server over the Internet in order to filter web pages. The URL storage method of the following embodiments is the operation performed by the URL server.
Fig. 2 is a schematic flow chart of the URL storage method of the invention. As shown in Fig. 2, the URL storage method comprises the following steps:
Step S11: classifying URLs according to a predetermined classification rule;
Specifically, the URL types include, for example, legal access and illegal access. In this step, after the types are set according to user requirements, all type names are stored in a management file, and all URLs in the URL library are divided according to the set types.
Step S12: generating a separate Bloom filter for storing each type of URL;
This step may specifically comprise:
Step S121: counting the number of URLs of each type;
Step S122: requesting memory space according to the number of URLs and the number of bytes occupied by each URL;
Step S123: determining the hash functions of the Bloom filter according to the number of URLs and the set maximum false positive rate.
The Bloom filter was proposed by Burton Bloom in 1970, and its principle is as follows. A Bloom filter consists of k mutually independent hash functions h1, h2, ..., hk and a bit vector of length m, where the range of each hash function is {0, 1, ..., m-1}. Because a byte has 8 bits, the memory actually occupied by the bit vector is m/8 bytes, and all bits of the bit vector are initialized to 0. For a set S = {s1, s2, ..., sn}, the k hash functions are applied to each element s of S to compute a hash sequence (h1(s), h2(s), ..., hk(s)), and the bits of the bit vector corresponding to the hash sequence are set to 1. The Bloom filter is then said to store the data element set S, or in other words to represent S. For example, if h1(s1) = 5, the 6th bit of the bit vector is set to 1; if h2(s1) = 10, the 11th bit is set to 1; and so on through hk(s1): if hk(s1) = j-1, the j-th bit of the bit vector is set to 1. To query whether a data element is in the set S, the same k hash functions are used to compute a hash sequence for the element. If every bit of the bit vector corresponding to the hash sequence is 1, the element is considered to belong to S; otherwise it does not belong to S.
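As a non-limiting illustration, the principle described above might be sketched in Python as follows. The description does not name concrete hash functions, so deriving the i-th hash from SHA-256 over the string "i:item" is purely an assumption of this sketch, as are the class name `BloomFilter`, the parameter values and the example URL.

```python
import hashlib


class BloomFilter:
    """A bit vector of m bits queried through k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        # m bits occupy ceil(m / 8) bytes; every bit starts at 0.
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Assumption: hash function i is SHA-256 over "i:item", reduced modulo m.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        # Set the bit addressed by each of the k hash values to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # The item is reported as present only if all k addressed bits are 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)
bf.add("http://example.com/")
print("http://example.com/" in bf)
```

A stored element is always reported as present, while an element that was never stored may occasionally be reported as present as well; that is the false positive case the description quantifies.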
The memory space requested in step S122 is the space used for the Bloom filter. For example, a type name is obtained from the management file, the number of URL records of this type is counted in the URL library using the type name, and the size of the Bloom filter is calculated according to Formula 1 below:
$T = 2^{n}$ (Formula 1)
where T is the size of the Bloom filter and n is the natural number satisfying Formula 2:
$2^{n-1} < \mathrm{count}(url) \times B < 2^{n}$ (Formula 2)
where count(url) is the number of URL records of this type and B is the number of bytes occupied by each URL.
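The size calculation of Formulas 1 and 2 can be illustrated with a short Python sketch. The function name `bloom_filter_size` and the input figures (1,000,000 URLs of 20 bytes each) are assumptions chosen for illustration only.

```python
def bloom_filter_size(url_count, bytes_per_url):
    """Return (n, T) with T = 2**n, where n is taken from Formula 2."""
    total = url_count * bytes_per_url
    # bit_length() yields the smallest n with total < 2**n, and then 2**(n-1) <= total.
    # Assumption: if total is an exact power of two, Formula 2 has no strict
    # solution, and this sketch accepts 2**(n-1) == total.
    n = total.bit_length()
    return n, 2 ** n


n, T = bloom_filter_size(1_000_000, 20)
print(n, T)
```

In other words, T is the smallest power of two strictly greater than count(url) × B.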
The hash functions of the Bloom filter determined in step S123 may take various forms, as long as the false positive rate does not exceed the set maximum false positive rate when all URLs of this type are stored. The false positive rate is the probability that, during a membership query, the Bloom filter mistakenly reports an element that is not in the set as belonging to it. After n elements have been inserted into a Bloom filter that is m bits long and uses k hash functions, the probability that a given bit of the bit vector is still 0 is $(1 - 1/m)^{kn}$, so the false positive rate p is $p = [1 - (1 - 1/m)^{kn}]^{k}$. Therefore, once the maximum acceptable false positive rate has been set according to user requirements, the number k of hash functions and the bit vector length m can be determined from the number of URL records. For example, when there are 1,000,000 URL records and the user sets a maximum false positive rate of 0.0001, the number k of hash functions and the bit vector length m may be 8 and 20,000,000 respectively; by the formula above these values give p of roughly 1.4 × 10^-4, the same order of magnitude as the target. The hash functions determined in this case may be any group of 8 arbitrarily constructed mathematical functions whose range matches a bit vector length of 20,000,000.
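To make the false positive formula concrete, the following Python sketch evaluates it for the figures of the example (1,000,000 URLs, k = 8, m = 20,000,000). The function name `false_positive_rate` is an assumption of this sketch.

```python
def false_positive_rate(n, k, m):
    """p = [1 - (1 - 1/m)**(k*n)]**k for n elements, k hash functions, m bits."""
    still_zero = (1.0 - 1.0 / m) ** (k * n)   # probability that a given bit is still 0
    return (1.0 - still_zero) ** k


p = false_positive_rate(n=1_000_000, k=8, m=20_000_000)
print(f"{p:.2e}")
```

Evaluated this way, the example parameters yield roughly 1.4 × 10^-4, which is close to but slightly above 0.0001. If the limit is to be met strictly with 8 hash functions, a somewhat longer bit vector (on the order of 21 million bits by this formula) would be needed.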
Step S13: storing each URL in the corresponding Bloom filter according to its type.
This step may specifically comprise:
Step S131: calculating the hash values of the URL with the hash functions;
Step S132: setting, in a predetermined flag setting mode, the flag bits at the positions of the requested memory space that correspond to the hash values.
The flag setting mode may be any marking scheme set by the user. For example, the flag setting mode may be the macro setbit(a, i): ((a)[(i)/NBBY] |= 1 << ((i) % NBBY)), where the array a is the start address of the memory space requested for the Bloom filter, so that a[(i)/NBBY] is the value of byte (i)/NBBY of the Bloom filter; i is the calculated hash value of the URL; and NBBY is the number of bits in a byte, i.e. NBBY = 8. The expression ((a)[(i)/NBBY] |= 1 << ((i) % NBBY)) sets bit ((i) % NBBY) of byte (i)/NBBY of the Bloom filter to 1.
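For readers more familiar with Python than with C macros, the setbit flag setting mode can be mirrored as follows. The helper `isset`, which reads a bit back, does not appear in the description and is added here only for illustration.

```python
NBBY = 8  # number of bits in a byte


def setbit(a, i):
    # Same effect as the macro: a[i / NBBY] |= 1 << (i % NBBY)
    a[i // NBBY] |= 1 << (i % NBBY)


def isset(a, i):
    # Hypothetical companion helper: test whether bit i has been set.
    return bool(a[i // NBBY] & (1 << (i % NBBY)))


buf = bytearray(4)   # a 32-bit vector, all bits 0
setbit(buf, 10)      # hash value 10 falls in byte 1, bit 2
print(buf[1])
```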
Steps S131 and S132 are performed for all URLs of the same type, so that they are stored in the Bloom filter of that type. Steps S12 and S13 are performed for the URLs of every type. The URLs in the URL library are thus stored in the form of classified Bloom filters.
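Steps S11 to S13 taken together might look like the following Python sketch. It is a simplification under stated assumptions: every type uses the same fixed bit vector length M and hash count K instead of sizing each filter from its URL count, the hash functions are derived from SHA-256, and the URLs and type names in `library` are invented for the example.

```python
import hashlib
from collections import defaultdict

K = 4          # illustrative number of hash functions
M = 1 << 12    # illustrative bit vector length (4096 bits) per type


def positions(url):
    # Assumption: hash function i is SHA-256 over "i:url", reduced modulo M.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{url}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def build_filters(records):
    """records: iterable of (url, type_name) pairs taken from the URL library."""
    by_type = defaultdict(list)
    for url, type_name in records:          # S11: divide the URLs by type
        by_type[type_name].append(url)
    filters = {}
    for type_name, urls in by_type.items():
        bits = bytearray(M // 8)            # S12: one Bloom filter per type
        for url in urls:                    # S13: store each URL in its filter
            for pos in positions(url):
                bits[pos // 8] |= 1 << (pos % 8)
        filters[type_name] = bits
    return filters


def contains(bits, url):
    return all(bits[p // 8] & (1 << (p % 8)) for p in positions(url))


library = [
    ("http://intranet.example/", "legal"),
    ("http://docs.example/", "legal"),
    ("http://games.example/", "illegal"),
]
filters = build_filters(library)
print(sorted(filters))
```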
According to the URL storage method of the above embodiment, a Bloom filter is created for each type of URL, and each URL is stored by type in the corresponding Bloom filter. After the filtering gateway device obtains the Bloom filters storing the URLs and audits a URL visited by a user, querying whether the URL is in the Bloom filter of a certain type, for example the Bloom filter storing legal-access URLs, only requires calculating the hash values of the URL with the hash functions of that Bloom filter. If the bits of the bit vector corresponding to these hash values are all 1, the URL is known to be in this Bloom filter, that is, the URL is a legal-access address. Because it is no longer necessary to traverse a huge URL library for every URL to be audited, query efficiency is greatly improved, and network performance can be maintained even when a large number of users initiate network access at the same time. Storing the URLs in Bloom filters also saves a large amount of storage space. For example, for a URL library storing 4,800,000 URL records (assuming each URL occupies 20 bytes), when all Bloom filters are fully used, the space required to store these URLs is only about 91.5 MB (4,800,000 × 20 B / 1024 / 1024). In addition, when URLs are stored with the URL storage method of the above embodiment, even if the URL library is exported by others it cannot be restored to individual URL entries, so a high degree of confidentiality for the URL library is achieved without additional encryption.
Further, the URL storage method of the above embodiment also comprises:
Step S14: when URLs are added to the Bloom filter, generating a delta package that contains the hash values of the added URLs and the types of the added URLs; the delta package is sent to the filtering gateway device so that the filtering gateway device performs an incremental update according to the delta package.
Specifically, adding URLs to a Bloom filter raises its false positive rate. It is therefore necessary to check, based on the number of URLs to be added, whether the false positive rate of the Bloom filter still stays within the set maximum after these URLs are added to the original Bloom filter. If it does, only the delta package is generated: the hash values of each URL to be added are calculated with the hash functions of the Bloom filter and stored in the delta package together with the type of the URL. If it does not, the Bloom filter of this type is regenerated so that it contains all URLs of the original Bloom filter plus the newly added URLs. The memory space T' occupied by the regenerated Bloom filter and the memory space T occupied by the original Bloom filter should satisfy T' = T × 2.
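The decision between shipping a delta package and regenerating the Bloom filter can be sketched in Python as below. The function names, the return convention and the numbers in the usage lines are assumptions made for illustration; the sketch also assumes that a filter of T bytes provides a bit vector of 8 × T bits.

```python
def false_positive_rate(n, k, m):
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def plan_update(current_urls, added_urls, k, size_bytes, max_rate):
    """Return ("delta", T) if the filter can absorb the new URLs, else ("rebuild", 2 * T)."""
    m = size_bytes * 8                          # bit vector length in bits
    p = false_positive_rate(current_urls + added_urls, k, m)
    if p <= max_rate:
        return ("delta", size_bytes)            # keep the filter, send only a delta package
    return ("rebuild", size_bytes * 2)          # regenerate with T' = T * 2


print(plan_update(1000, 10, 8, 5000, 1e-4))
print(plan_update(1000, 5000, 8, 5000, 1e-4))
```

With these illustrative figures, adding 10 URLs keeps the filter within the limit, whereas adding 5,000 URLs pushes the rate far above it and triggers a rebuild at twice the size.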
Table 1 illustrates the storage format of the delta package. As shown in Table 1, when the delta package is generated in the above manner, each added URL requires 4 × N + 1 bytes of storage, where N is the number of hash functions, 4 is the number of bytes occupied by each calculated hash value (each of the fields Hash1 to Hash8 is 4 bytes), and 1 is the byte occupied by the class id of the URL (Class_id is 1 byte). In addition, when multiple URLs are added, URLs of the same class can be merged to save storage space. For example, when a new hash value is inserted into the database, the class is first checked for an identical value; if one exists, the new hash value is set to empty. An empty hash value occupies only one byte, so with this storage mode each hash value that duplicates one already recorded for the same class saves 3 bytes. Moreover, if some of the hash values of several added URLs overlap, the overlapping hash values need not be recorded again; that is, the records need not be stored in the fixed form of 8 hash values per URL, which saves further storage space.
Table 1
Field | Hash1 | Hash2 | Hash3 | Hash4 | Hash5 | Hash6 | Hash7 | Hash8 | Class_id
Value | a | b | c | d | e | f | g | h | 1
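One possible byte layout for a Table 1 record is sketched below in Python using the standard struct module. The description fixes only the field widths (4 bytes per hash value, 1 byte for Class_id); the little-endian byte order, the unsigned field types, the function names and the sample hash values are assumptions of this sketch.

```python
import struct

N = 8  # number of hash functions, matching Hash1..Hash8 in Table 1


def pack_record(hashes, class_id):
    # Assumption: N little-endian unsigned 32-bit hash values, then one unsigned byte.
    return struct.pack(f"<{N}IB", *hashes, class_id)


def unpack_record(data):
    *hashes, class_id = struct.unpack(f"<{N}IB", data)
    return hashes, class_id


sample_hashes = [5, 10, 77, 1024, 4095, 65536, 123456, 2000000]
record = pack_record(sample_hashes, 1)
print(len(record))
```

The packed record is 4 × N + 1 bytes long, matching the per-URL storage figure given for the delta package.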
When the URL server sends the generated delta package to the filtering gateway device, the filtering gateway device uses the hash values and types recorded in the delta package and, in the same way as the URL server performs step S13, sets the flag bits at the corresponding positions in the Bloom filter of the corresponding type. The URLs recorded in the delta package are thereby added to the Bloom filter, which completes the incremental update of the filtering gateway device.
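On the gateway side, applying a delta package amounts to setting the recorded bit positions in the filter of the recorded class. A minimal Python sketch follows; the in-memory representation (a dict from class id to bytearray, and records given as pairs of hash values and class id) is an assumption made for illustration.

```python
def apply_delta(filters, records):
    """filters: dict mapping class id to bytearray; records: (hash_values, class_id) pairs."""
    for hash_values, class_id in records:
        bits = filters[class_id]
        for h in hash_values:
            bits[h // 8] |= 1 << (h % 8)   # same flag setting as step S13


local_filters = {1: bytearray(4)}          # a 32-bit filter for class id 1
apply_delta(local_filters, [((0, 9, 31), 1)])
print(list(local_filters[1]))
```

Because the delta package carries hash values rather than URLs, the gateway does not need to know the added URLs themselves in order to update its filter.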
With the URL storage method of the above embodiment, for a remote incremental update the URL server only needs to send the filtering gateway device a delta package containing a small amount of information rather than resending the whole Bloom filter, so remote updating is easy to carry out.
Further, the URL storage method of the above embodiment also comprises:
Step S15: when URLs are deleted from the Bloom filter, generating a new Bloom filter for the state after deletion, together with a deletion package; the deletion package contains the positions at which the Bloom filter before the deletion update and the new Bloom filter hold different values, and the types of the deleted URLs; the deletion package is sent to the filtering gateway device so that the filtering gateway device performs a deletion update according to it.
Specifically, the URL server generates the new Bloom filter in the same way as in step S12 of the URL storage method of the above embodiment, and stores the URLs remaining after deletion in the new Bloom filter in the same way as in step S13. The original Bloom filter and the new Bloom filter are then compared; wherever the flag values at the same position differ between the two filters, the position and the flag value at that position in the new Bloom filter are recorded in a separate file. After all positions have been compared, the file is saved, which yields the deletion package.
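The comparison of the original and the new Bloom filter might be sketched in Python as follows. The sketch assumes both filters have the same length and represents the deletion package simply as a list of (position, new value) pairs; the function name and the sample bit patterns are invented for illustration.

```python
def diff_positions(old_bits, new_bits):
    """Return (position, value in new filter) for every bit that differs."""
    changed = []
    for byte_index, (old, new) in enumerate(zip(old_bits, new_bits)):
        delta = old ^ new                       # bits that differ within this byte
        for bit in range(8):
            if delta & (1 << bit):
                changed.append((byte_index * 8 + bit, (new >> bit) & 1))
    return changed


old_filter = bytearray([0b00000101, 0b10000000])
new_filter = bytearray([0b00000001, 0b00000000])
print(diff_positions(old_filter, new_filter))
```

Here bits 2 and 15 were set in the original filter and are clear in the new one, so only these two positions would be recorded.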
Table 2 illustrates a storage format of the deletion package. As shown in Table 2, when the deletion package is generated in the above manner, each deleted URL requires N × (4 + 1) bytes of storage, where N is the number of hash functions, 4 is the number of bytes occupied by each recorded position, and 1 is the byte occupied by the class id of the URL. The deletion package may also use a storage format similar to Table 1, in which case each deleted URL requires 4 × N + 1 bytes. The deletion package may further use the storage format illustrated in Table 3, which records not only the positions holding different values and the types of the deleted URLs, but also the value at each such position in the newly generated Bloom filter; in this case each deleted URL requires N × (4 + 1 + 1) bytes. These per-URL storage figures are theoretical values. In practice a modified position may be shared by several URLs, so the space actually required is smaller than the theoretical value.
Table 2
No. | Position (4 bytes) | URL class id (1 byte)
Modification 1 | a | 1
Modification 2 | b | 1
Modification 3 | c | 1
Modification 4 | d | 1
Modification 5 | e | 1
Modification 6 | f | 1
Modification 7 | g | 1
Modification 8 | h | 1
Table 3
No. | Position (4 bytes) | Value at this position (1 byte) | URL class id (1 byte)
Modification 1 | a | 0 | 1
For a remote update, the URL server sends the deletion package to the filtering gateway device. The filtering gateway device updates its local Bloom filter according to the positions recorded in the deletion package. Specifically, for a position a in the deletion package, the corresponding byte is given by (a/8 + 1) (counting bytes from 1, with integer division) and the corresponding bit within that byte by (a % 8); the value of that bit is changed from 1 to 0, which completes the deletion update of the filtering gateway device.
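The deletion update on the gateway can be illustrated with the following Python sketch. Because Python sequences are indexed from 0, the byte is addressed as a // 8 rather than a/8 + 1; the function name and the sample filter contents are assumptions of the sketch.

```python
def apply_deletion(bits, positions):
    """Clear every bit position listed in the deletion package."""
    for a in positions:
        byte_index = a // 8      # 0-based equivalent of byte number a/8 + 1
        bit_index = a % 8
        bits[byte_index] &= ~(1 << bit_index) & 0xFF   # change the bit from 1 to 0


local_filter = bytearray([0b00000101, 0b10000000])
apply_deletion(local_filter, [2, 15])
print(list(local_filter))
```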
With the URL storage method of the above embodiment, for a remote deletion update the URL server only needs to send the filtering gateway device a deletion package containing a small amount of information rather than resending the whole Bloom filter, so remote updating is easy to carry out.
Fig. 3 is a schematic structural diagram of the URL storage device of the invention. As shown in Fig. 3, the URL storage device comprises:
a classification module, configured to classify URLs according to a predetermined classification rule;
a generation module, configured to generate a separate Bloom filter for storing each type of URL;
a storage module, configured to store each URL in the corresponding Bloom filter according to its type.
The flow by which the URL storage device stores URLs is the same as the URL storage method of the above embodiment and is not repeated here.
According to the URL storage device of the above embodiment, a Bloom filter is created for each type of URL, and each URL is stored by type in the corresponding Bloom filter. After the filtering gateway device obtains the Bloom filters storing the URLs and audits a URL visited by a user, querying whether the URL is in the Bloom filter of a certain type, for example the Bloom filter storing legal-access URLs, only requires calculating the hash values of the URL with the hash functions of that Bloom filter. If the bits of the bit vector corresponding to these hash values are all 1, the URL is known to be in this Bloom filter, that is, the URL is a legal-access address. Because it is no longer necessary to traverse a huge URL library for every URL to be audited, query efficiency is greatly improved, and network performance can be maintained even when a large number of users initiate network access at the same time. Storing the URLs in Bloom filters also saves a large amount of storage space. In addition, when URLs are stored with the URL storage device of the above embodiment, even if a Bloom filter is exported by others it cannot be restored to individual URL entries, so a high degree of confidentiality for the URLs is achieved without additional encryption.
Further, the URL storage device of the above embodiment also comprises:
an incremental update module, configured to generate a delta package that contains the hash values of the added URLs and the types of those URLs; the delta package is sent to the filtering gateway device so that the filtering gateway device performs an update according to it.
With the URL storage device of the above embodiment, for a remote incremental update only a delta package containing a small amount of information needs to be sent to the filtering gateway device rather than the whole Bloom filter, so remote updating is easy to carry out.
Further, in the URL storage device of the above embodiment, the incremental update module is also configured to check, based on the number of URLs in the Bloom filter after the URLs are added, whether the false positive rate of the Bloom filter stays within the preset maximum false positive rate after the addition; if so, the delta package is generated; if not, the Bloom filter is regenerated, and the memory space occupied by the regenerated Bloom filter is twice that occupied by the Bloom filter before the URLs were added.
The URL storage device of the above embodiment thus prevents added URLs from pushing the false positive rate of the Bloom filter above the preset maximum.
Further, the URL storage device of the above embodiment also comprises:
a deletion update module, configured to generate a new Bloom filter and a deletion package; the deletion package contains the positions at which the Bloom filter before the deletion update and the new Bloom filter hold different values, and the types of the deleted URLs; the deletion package is sent to the filtering gateway device so that the filtering gateway device performs an update according to it.
With the URL storage device of the above embodiment, for a remote deletion update only a deletion package containing a small amount of information needs to be sent to the filtering gateway device rather than the whole Bloom filter, so remote updating is easy to carry out.
Fig. 4 is a schematic flow chart of the web page filtering method of the invention. As shown in Fig. 4, the web page filtering method comprises:
Step S21: a filtering gateway device obtains, from a URL storage device, the Bloom filters in which URLs are stored by class;
Step S22: the filtering gateway device filters web pages according to the URLs stored in the Bloom filters and the types of those URLs.
Here, the URL storage device is the URL storage device of any of the above embodiments. Specifically, after the filtering gateway device has obtained the Bloom filters storing the URLs and needs to audit a URL, it queries whether the URL is in the Bloom filter of a particular type, for example the Bloom filter storing legal-access URLs. This only requires calculating the hash values of the URL with the hash functions of that Bloom filter. If the bits of the bit vector corresponding to these hash values are all 1, the URL is known to be in this Bloom filter, that is, the URL is a legal-access address, and the network access is allowed; otherwise, the network access is blocked.
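The audit decision described above might be sketched in Python as follows. This is an illustration under assumptions rather than the gateway's actual implementation: the hash functions are derived from SHA-256, the values of K and M are arbitrary, and the URLs and the "allow"/"block" return values are invented for the example.

```python
import hashlib

K = 4          # illustrative number of hash functions
M = 1 << 12    # illustrative bit vector length in bits


def positions(url):
    # Assumption: hash function i is SHA-256 over "i:url", reduced modulo M.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{url}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def add(bits, url):
    for p in positions(url):
        bits[p // 8] |= 1 << (p % 8)


def audit(url, legal_bits):
    # Allow the access only if every bit addressed by the URL's hash values is 1.
    hit = all(legal_bits[p // 8] & (1 << (p % 8)) for p in positions(url))
    return "allow" if hit else "block"


legal_bits = bytearray(M // 8)             # Bloom filter for legal-access URLs
add(legal_bits, "http://intranet.example/")
print(audit("http://intranet.example/", legal_bits))
```

Each audit costs K hash computations and K bit tests regardless of how many URLs the filter holds, which is the source of the query efficiency claimed for the method.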
With the web page filtering method of the above embodiment, a URL audit does not need to traverse a huge URL library for every URL to be audited, which greatly improves query efficiency; network performance can therefore be maintained even when a large number of users initiate network access at the same time.
Further, the web page filtering method of the above embodiment also comprises:
Step S23: the filtering gateway device obtains a delta package from the URL storage device and performs an incremental update according to the delta package, the delta package containing the hash values of the added URLs and the types of the added URLs.
Specifically, the filtering gateway device uses the hash values and types recorded in the delta package and, in the same way as step S13 of the URL storage method of the above embodiment, sets the flag bits at the corresponding positions in the Bloom filter of the corresponding type. The URLs recorded in the delta package are thereby added to the Bloom filter, which completes the incremental update of the filtering gateway device.
With the web page filtering method of the above embodiment, for a remote incremental update the filtering gateway device only needs to obtain a delta package containing a small amount of information from the URL storage device rather than the whole Bloom filter, so remote updating is easy to carry out.
Further, the web page filtering method of the above embodiment also comprises:
Step S24: the filtering gateway device obtains a deletion package from the URL storage device and performs a deletion update according to the deletion package, the deletion package containing the positions at which the Bloom filter of the URL storage device before the URLs were deleted and the new Bloom filter after the deletion hold different values, and the types of the deleted URLs.
Specifically, the filtering gateway device modifies its local Bloom filter according to the positions recorded in the deletion package and the flag values corresponding to those positions, which completes the deletion update of the filtering gateway device.
With the web page filtering method of the above embodiment, for a remote deletion update the filtering gateway device only needs to obtain a deletion package containing a small amount of information from the URL storage device rather than the whole Bloom filter, so remote updating is easy to carry out.
Fig. 5 is a schematic structural diagram of the filtering gateway device of the invention. As shown in Fig. 5, the filtering gateway device comprises:
an acquisition module, configured to obtain, from a URL storage device, the Bloom filters in which URLs are stored by class;
a filtering module, configured to filter web pages according to the URLs stored in the Bloom filters and the types of those URLs.
The flow by which the filtering gateway device of the above embodiment filters web pages is the same as the web page filtering method of the above embodiment and is not repeated here.
With the filtering gateway device of the above embodiment, when a URL audit needs to query whether a URL is in the Bloom filter of a particular type, it only requires calculating the hash values of the URL with the hash functions of that Bloom filter; if the bits of the bit vector corresponding to these hash values are all 1, the URL is known to be in this Bloom filter. It is not necessary to traverse a huge URL library for every URL to be audited, which greatly improves query efficiency; network performance can therefore be maintained even when a large number of users initiate network access at the same time.
Further, the filtering gateway device of the above embodiment also comprises:
an incremental update module, configured to obtain a delta package from the URL storage device and perform an incremental update according to the delta package, the delta package containing the hash value of the added URL and the type of the added URL.
With the filtering gateway device of the above embodiment, when a remote incremental update is performed, only a delta package containing a small amount of information needs to be obtained from the URL storage device, rather than the entire Bloom filter, so the remote update is easy to carry out.
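The incremental update can be sketched in the same spirit. The names `DeltaPackage` and `apply_delta_package` are hypothetical, and the sketch assumes that the hash values in the package are used directly as bit positions (reduced modulo the filter length); the text does not specify the encoding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DeltaPackage:
    url_type: str           # type of the added URL; selects the filter
    hash_values: List[int]  # hash values of the added URL


def apply_delta_package(filters: Dict[str, List[int]],
                        package: DeltaPackage) -> None:
    """Filtering gateway side: set the bits named by the package."""
    bits = filters[package.url_type]
    for value in package.hash_values:
        # Assumption: a hash value maps to a bit position by modulo.
        bits[value % len(bits)] = 1
```

Because the package carries hash values rather than the URL itself, the gateway does not need to recompute the hashes in order to update its local filter.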
Further, the filtering gateway device of the above embodiment also comprises:
a deletion update module, configured to obtain a deletion package from the URL storage device and perform a deletion update according to the deletion package, the deletion package containing the positions at which the Bloom filter held by the URL storage device before the URL was deleted and the new Bloom filter after the URL was deleted have different values, and the type of the deleted URL.
With the filtering gateway device of the above embodiment, when a remote deletion update is performed, only a deletion package containing a small amount of information needs to be obtained from the URL storage device, rather than the entire Bloom filter, so the remote update is easy to carry out.
The present invention also provides a web page filtering system, comprising the URL storage device of any of the above embodiments and the filtering gateway device of any of the above embodiments.
With the web page filtering system of the above embodiment, network performance can be maintained even when a large number of users initiate network access at the same time.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (15)
1. A uniform resource locator (URL) storage method, characterized by comprising:
Step S11: classifying URLs according to a predetermined classification rule;
Step S12: generating, for each type of URL, a Bloom filter for storing URLs of that type;
Step S13: storing each URL in the corresponding Bloom filter according to the type of the URL.
2. The URL storage method according to claim 1, characterized by further comprising:
Step S14: when a URL is added to the Bloom filter, generating a delta package that contains the hash value of the added URL and the type of the added URL, the delta package being sent to a filtering gateway device so that the filtering gateway device performs an incremental update according to the delta package.
3. The URL storage method according to claim 2, characterized by further comprising, before step S14:
checking, according to the number of URLs in the Bloom filter after the URL is added, whether the false positive rate of the Bloom filter after the URL is added does not exceed a preset maximum false positive rate; if so, performing step S14; if not, regenerating the Bloom filter, the memory space occupied by the regenerated Bloom filter being twice the memory space occupied by the Bloom filter before the URL was added.
4. The URL storage method according to claim 1 or 2, characterized by further comprising:
Step S15: when a URL is deleted from the Bloom filter, generating a new Bloom filter reflecting the deletion and a deletion package, the deletion package containing the positions at which the Bloom filter before the deletion update and the new Bloom filter have different values, and the type of the deleted URL, the deletion package being sent to a filtering gateway device so that the filtering gateway device performs a deletion update according to the deletion package.
5. A URL storage device, characterized by comprising:
a classification module, configured to classify URLs according to a predetermined classification rule;
a generation module, configured to generate, for each type of URL, a Bloom filter for storing URLs of that type;
a storage module, configured to store each URL in the corresponding Bloom filter according to the type of the URL.
6. The URL storage device according to claim 5, characterized by further comprising:
an incremental update module, configured to generate a delta package that contains the hash value of the added URL and the type of the added URL, the delta package being sent to a filtering gateway device so that the filtering gateway device performs an update according to the delta package.
7. The URL storage device according to claim 6, characterized in that the incremental update module is further configured to check, according to the number of URLs in the Bloom filter after the URL is added, whether the false positive rate of the Bloom filter after the URL is added does not exceed a preset maximum false positive rate; if so, to generate the delta package; if not, to regenerate the Bloom filter, the memory space occupied by the regenerated Bloom filter being twice the memory space occupied by the Bloom filter before the URL was added.
8. The URL storage device according to claim 5 or 6, characterized by further comprising:
a deletion update module, configured to generate a new Bloom filter and a deletion package, the deletion package containing the positions at which the Bloom filter before the deletion update and the new Bloom filter have different values, and the type of the deleted URL, the deletion package being sent to a filtering gateway device so that the filtering gateway device performs an update according to the deletion package.
9. A web page filtering method, characterized by comprising:
Step S21: a filtering gateway device obtains, from the URL storage device according to any one of claims 5 to 8, Bloom filters in which URLs are stored by category;
Step S22: the filtering gateway device filters web pages according to the URLs stored in the Bloom filters and the types of those URLs.
10. The web page filtering method according to claim 9, characterized by further comprising:
Step S23: the filtering gateway device obtains a delta package from the URL storage device and performs an incremental update according to the delta package, the delta package containing the hash value of the added URL and the type of the added URL.
11. The web page filtering method according to claim 8 or 9, characterized by further comprising:
Step S24: the filtering gateway device obtains a deletion package from the URL storage device and performs a deletion update according to the deletion package, the deletion package containing the positions at which the Bloom filter held by the URL storage device before the URL was deleted and the new Bloom filter after the URL was deleted have different values, and the type of the deleted URL.
12. A filtering gateway device, characterized by comprising:
an acquisition module, configured to obtain, from a URL storage device, Bloom filters in which URLs are stored by category;
a filtering module, configured to filter web pages according to the URLs stored in the Bloom filters and the types of those URLs.
13. The filtering gateway device according to claim 12, characterized by further comprising:
an incremental update module, configured to obtain a delta package from the URL storage device and perform an incremental update according to the delta package, the delta package containing the hash value of the added URL and the type of the added URL.
14. The filtering gateway device according to claim 12 or 13, characterized by further comprising:
a deletion update module, configured to obtain a deletion package from the URL storage device and perform a deletion update according to the deletion package, the deletion package containing the positions at which the Bloom filter held by the URL storage device before the URL was deleted and the new Bloom filter after the URL was deleted have different values, and the type of the deleted URL.
15. A web page filtering system, characterized by comprising the URL storage device according to any one of claims 5 to 8 and the filtering gateway device according to any one of claims 12 to 14.
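The capacity check recited in claims 3 and 7 can be sketched as follows. The claims do not state how the false positive rate is computed; this sketch assumes the commonly used approximation p ≈ (1 − e^(−kn/m))^k for a filter of m bits, k hash functions and n stored URLs. The function names and parameters are hypothetical.

```python
import math


def estimated_false_positive_rate(num_bits: int, num_hashes: int,
                                  num_urls: int) -> float:
    # Assumption: the standard approximation for a Bloom filter.
    return (1.0 - math.exp(-num_hashes * num_urls / num_bits)) ** num_hashes


def bits_after_adding(num_bits: int, num_hashes: int,
                      num_urls_after_add: int, max_rate: float) -> int:
    """Return the filter size to use once the added URLs are counted in."""
    rate = estimated_false_positive_rate(num_bits, num_hashes,
                                         num_urls_after_add)
    if rate <= max_rate:
        return num_bits      # keep the filter and generate the delta package
    return num_bits * 2      # regenerate the filter with twice the memory
```

When the size doubles, the filter has to be rebuilt from the stored URLs, because bit positions computed for the old size are not valid for the new one.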
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110187962.9A CN102253991B (en) | 2011-05-25 | 2011-07-06 | Uniform resource locator (URL) storage method, web filtering method, device and system |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110137020 | 2011-05-25 | ||
CN201110137020.X | 2011-05-25 | ||
CN201110187962.9A CN102253991B (en) | 2011-05-25 | 2011-07-06 | Uniform resource locator (URL) storage method, web filtering method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102253991A (en) | 2011-11-23 |
CN102253991B (en) | 2014-07-30 |
Family
ID=44981255
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110187962.9A Active CN102253991B (en) | 2011-05-25 | 2011-07-06 | Uniform resource locator (URL) storage method, web filtering method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102253991B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103383665A (en) * | 2013-07-12 | 2013-11-06 | 北京奇虎科技有限公司 | Method and device suitable for caching data during URL data capture |
CN103544316A (en) * | 2013-11-06 | 2014-01-29 | 苏州大拿信息技术有限公司 | Uniform resource locator (URL) filtering system and achieving method thereof |
CN105119916A (en) * | 2015-08-21 | 2015-12-02 | 福建天晴数码有限公司 | http-based authentication method and system |
US20150356196A1 (en) * | 2014-06-04 | 2015-12-10 | International Business Machines Corporation | Classifying uniform resource locators |
CN105320740A (en) * | 2015-09-22 | 2016-02-10 | 清华大学 | WeChat article and official account acquisition method and acquisition system |
CN105653627A (en) * | 2015-12-28 | 2016-06-08 | 湖南蚁坊软件有限公司 | Bloom filter-based data classification method |
CN106970984A (en) * | 2017-03-29 | 2017-07-21 | 杭州迪普科技股份有限公司 | A kind of url filtering storehouse update method and device |
CN107888659A (en) * | 2017-10-12 | 2018-04-06 | 北京京东尚科信息技术有限公司 | The processing method and system of user's request |
CN109977261A (en) * | 2019-04-02 | 2019-07-05 | 北京奇艺世纪科技有限公司 | A kind of processing method of request of data, device and server |
CN112948370A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Data classification method and device and computer equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11167580A (en) * | 1997-12-04 | 1999-06-22 | Nec Corp | Automatic sorting device and method for url of web client |
CN101261644A (en) * | 2008-04-30 | 2008-09-10 | 杭州华三通信技术有限公司 | Method and device for accessing united resource positioning symbol database |
JP2010123000A (en) * | 2008-11-20 | 2010-06-03 | Nippon Telegr & Teleph Corp <Ntt> | Web page group extraction method, device and program |
CN101901248A (en) * | 2010-04-07 | 2010-12-01 | 北京星网锐捷网络技术有限公司 | Method and device for creating and updating Bloom filter and searching elements |
CN101923568A (en) * | 2010-06-23 | 2010-12-22 | 北京星网锐捷网络技术有限公司 | Method for increasing and canceling elements of Bloom filter and Bloom filter |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103383665B (en) * | 2013-07-12 | 2016-04-27 | 北京奇虎科技有限公司 | Be suitable in url data crawl the method for data buffer storage and device |
CN103383665A (en) * | 2013-07-12 | 2013-11-06 | 北京奇虎科技有限公司 | Method and device suitable for caching data during URL data capture |
CN103544316A (en) * | 2013-11-06 | 2014-01-29 | 苏州大拿信息技术有限公司 | Uniform resource locator (URL) filtering system and achieving method thereof |
CN103544316B (en) * | 2013-11-06 | 2017-02-08 | 苏州大拿信息技术有限公司 | Uniform resource locator (URL) filtering system and achieving method thereof |
US9582565B2 (en) * | 2014-06-04 | 2017-02-28 | International Business Machines Corporation | Classifying uniform resource locators |
US20170103138A1 (en) * | 2014-06-04 | 2017-04-13 | International Business Machines Corporation | Classifying uniform resource locators |
US9928292B2 (en) * | 2014-06-04 | 2018-03-27 | International Business Machines Corporation | Classifying uniform resource locators |
US20160179929A1 (en) * | 2014-06-04 | 2016-06-23 | International Business Machines Corporation | Classifying uniform resource locators |
US20150356196A1 (en) * | 2014-06-04 | 2015-12-10 | International Business Machines Corporation | Classifying uniform resource locators |
US9569522B2 (en) * | 2014-06-04 | 2017-02-14 | International Business Machines Corporation | Classifying uniform resource locators |
US9928301B2 (en) * | 2014-06-04 | 2018-03-27 | International Business Machines Corporation | Classifying uniform resource locators |
US20170337258A1 (en) * | 2014-06-04 | 2017-11-23 | International Business Machines Corporation | Classifying uniform resource locators |
US20170109429A1 (en) * | 2014-06-04 | 2017-04-20 | International Business Machines Corporation | Classifying uniform resource locators |
CN105119916B (en) * | 2015-08-21 | 2018-04-10 | 福建天晴数码有限公司 | A kind of authentication method and system based on http |
CN105119916A (en) * | 2015-08-21 | 2015-12-02 | 福建天晴数码有限公司 | http-based authentication method and system |
CN105320740A (en) * | 2015-09-22 | 2016-02-10 | 清华大学 | WeChat article and official account acquisition method and acquisition system |
CN105320740B (en) * | 2015-09-22 | 2018-10-16 | 清华大学 | The acquisition methods and acquisition system of wechat article and public platform |
CN105653627A (en) * | 2015-12-28 | 2016-06-08 | 湖南蚁坊软件有限公司 | Bloom filter-based data classification method |
CN106970984A (en) * | 2017-03-29 | 2017-07-21 | 杭州迪普科技股份有限公司 | A kind of url filtering storehouse update method and device |
CN106970984B (en) * | 2017-03-29 | 2020-11-06 | 杭州迪普科技股份有限公司 | URL filter library updating method and device |
CN107888659A (en) * | 2017-10-12 | 2018-04-06 | 北京京东尚科信息技术有限公司 | The processing method and system of user's request |
CN109977261A (en) * | 2019-04-02 | 2019-07-05 | 北京奇艺世纪科技有限公司 | A kind of processing method of request of data, device and server |
CN112948370A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Data classification method and device and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN102253991B (en) | 2014-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102253991B (en) | Uniform resource locator (URL) storage method, web filtering method, device and system | |
CN101901248B (en) | Method and device for creating and updating Bloom filter and searching elements | |
CN111459985B (en) | Identification information processing method and device | |
CN102918534B (en) | Inquiry pipeline | |
CN104679778A (en) | Search result generating method and device | |
JP6716727B2 (en) | Streaming data distributed processing method and apparatus | |
US20130191523A1 (en) | Real-time analytics for large data sets | |
CN101916261A (en) | Data partitioning method for distributed parallel database system | |
CN101944124A (en) | Distributed file system management method, device and corresponding file system | |
CN102810089A (en) | Short link system based on content and implementation method thereof | |
CN104104717A (en) | Inputting channel data statistical method and device | |
CN103984753A (en) | Method and device for extracting web crawler reduplication-removing characteristic value | |
CN105069111A (en) | Similarity based data-block-grade data duplication removal method for cloud storage | |
CN106407303A (en) | Data storage method and apparatus, and data query method and apparatus | |
CN103067525A (en) | Cloud storage data backup method based on characteristic codes | |
CN102546253A (en) | Webpage tamper-resistant method, system and management server | |
CN104794228A (en) | Search result providing method and device | |
CN105677904B (en) | Small documents storage method and device based on distributed file system | |
CN102663007A (en) | Data storage and query method supporting agile development and lateral spreading | |
CN102591855A (en) | Data identification method and data identification system | |
WO2017000592A1 (en) | Data processing method, apparatus and system | |
CN110874429A (en) | Distributed web crawler performance optimization method oriented to mass data acquisition | |
JP2008102795A (en) | File management device, system, and program | |
CN107798106A (en) | A kind of URL De-weight methods in distributed reptile system | |
CN101667183B (en) | Method, device and system for establishing index based on customization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |