CN111858651A - Data processing method and data processing device - Google Patents
- Publication number
- CN111858651A (application number CN202010998674.0A)
- Authority
- CN
- China
- Prior art keywords
- fingerprint
- data processing
- filter
- storage
- cells
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Collating Specific Patterns (AREA)
Abstract
The embodiment of the application provides a data processing method and a data processing device, wherein the method comprises the following steps: acquiring a target element; calculating to obtain an element fingerprint of the target element; selecting a segment of a filter according to the element fingerprint through a global hash function; performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected section; wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints. By adopting the data processing method, the parallel operation of the target element relative to the filter, such as parallel element insertion, element query, element deletion and the like, can be realized, so that the better data throughput of the filter is ensured. Moreover, by adopting the cuckoo hash algorithm, the query precision of the target element can be ensured, and the query and deletion of the target element can be ensured to be constant time complexity.
Description
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a data processing method and a data processing device.
Background
Database systems may employ summary data structures to represent and store data sets and support approximate membership queries of constant time complexity, with typical summary data structures including bloom filters and variants thereof, and cuckoo filters and variants thereof.
A Bloom filter uses a fixed-length bit vector and represents whether an element belongs to a given set by the values of k bits. It provides element insertion and query with constant time complexity, but its query accuracy is weak and it is prone to a high false positive rate; in addition, the Bloom filter is inefficient in space utilization and does not support the reverse operation, i.e. element deletion. In contrast, the cuckoo filter provides two candidate cells to store the element fingerprint, which enables accurate element representation as well as query and deletion with constant time complexity.
However, the sequential insertion, query and deletion of elements in the cuckoo filter easily causes time-consuming data operation and inefficient processing, and this limitation is particularly obvious in the case of large-volume data sets.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data processing method and a data processing apparatus, which are used to improve the operation efficiency of data operations.
Based on the above purpose, in a first aspect, an embodiment of the present application provides a data processing method, including:
acquiring a target element;
calculating to obtain an element fingerprint of the target element;
selecting a segment of a filter according to the element fingerprint through a global hash function;
performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected section;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
By adopting the data processing method, the data operation in one segment of the filter does not influence the data operation in other segments, so that the parallel operation of the target element relative to the filter can be realized, such as parallel element insertion, element query, element deletion and the like, and the better data throughput of the filter is ensured. Moreover, by adopting the cuckoo hash algorithm, the query precision of the target element can be ensured, and the query and deletion of the target element can be ensured to be constant time complexity.
In a possible implementation, the step of performing data processing on the element fingerprint by using a cuckoo hash algorithm in the selected segment includes:
selecting two storage cells as two candidate cells according to the element fingerprint;
when the element fingerprint can be stored in two candidate cells, storing the element fingerprint;
and when the element fingerprints cannot be stored in the two candidate cells, kicking out the first fingerprint which exists in the two candidate cells at random, and storing the element fingerprints in the storage positions vacated by the first fingerprint.
In a possible embodiment, the first fingerprint kicked out enters a re-allocation step, which comprises:
obtaining another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell where the first fingerprint is kicked out;
when the first fingerprint can be stored in the other storage cell, storing the first fingerprint;
and when the first fingerprint cannot be stored in the other storage cell, randomly kicking out a second fingerprint which is already in the other candidate cell, and storing the first fingerprint in a position where the second fingerprint is vacated.
In a possible implementation, the step of storing the first fingerprint in the location vacated by the second fingerprint further includes:
and updating the first fingerprint into the second fingerprint, and repeating the above reallocation steps until all the element fingerprints are stored or the circulation times exceed a threshold value.
In a possible implementation, the step of performing data processing on the element fingerprint by using a cuckoo hash algorithm in the selected segment includes:
selecting two storage cells as two candidate cells according to the element fingerprint;
judging whether the element fingerprints exist in the two candidate cells or not;
if yes, judging that the target element belongs to the filter;
if not, the target element is judged not to belong to the filter.
In a possible implementation, the step of performing data processing on the element fingerprint by using a cuckoo hash algorithm in the selected segment includes:
determining whether the target element has been inserted into the filter;
if yes, selecting two storage cells as two candidate cells according to the element fingerprint;
finding and deleting copies of the element fingerprint in both of the candidate cells.
In a possible embodiment, the segment is formed by uniformly dividing the storage cells in the filter.
In a possible implementation manner, the selecting two storage cells as two candidate cells according to the element fingerprint is implemented by a first hash function and a second hash function, where the first hash function is:
the second hash function is:
wherein x is the target element, η_x is the element fingerprint of the target element x, m is the number of storage cells in the filter, and s is the number of segments of the filter.
In one possible embodiment, the element fingerprint is obtained by the following formula:
$$\eta_x = h_0(x) \bmod 2^f$$
wherein η_x is the element fingerprint of the target element x, and f is the number of bits of η_x.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including:
an acquisition module configured to acquire a target element;
a fingerprint calculation module configured to calculate an element fingerprint of the target element;
a segment selection module configured to select segments of a filter from the element fingerprints through a global hash function;
the data operation module is used for processing the element fingerprints by adopting a cuckoo hash algorithm in the selected section;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
The apparatus of this embodiment may be configured to implement the technical solution of the first aspect, and the implementation principle and the technical effect are similar, which are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only the embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a first flowchart of a data processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 4a is a graph of the number of insertion failures versus the number of segments s in a test experiment;
FIG. 4b is a graph of the relationship between insertion time and number of segments s in a test experiment;
FIG. 4c is a graph of the number of insertion failures versus the number of stored bits b in a test experiment;
FIG. 4d is a graph of the relationship between insertion time and the number of stored bits b in a test experiment;
FIG. 4e is a graph of the number of insertion failures versus the reallocation coefficient r in a test experiment;
FIG. 4f is a graph of the relationship between insertion time and the reallocation coefficient r in a test experiment;
fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a data processing method, as shown in fig. 1, the data processing method includes:
step S10: acquiring a target element;
the target elements may be elements to be inserted, elements to be queried, and elements to be deleted for a given dataset, that is, this step may obtain the target elements for an insertion process, a query process, or a deletion process.
Step S20: calculating to obtain an element fingerprint of the target element;
An element fingerprint is a digest value, or hash value, representing a target element, and may be calculated by a predetermined hash function. For example, in one possible embodiment, the f-bit element fingerprint η_x of any target element x can be obtained by the following formula (1):
$$\eta_x = h_0(x) \bmod 2^f \tag{1}$$
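Formula (1) can be illustrated with a minimal Python sketch. The patent does not name the hash function h_0, so SHA-256 is used here purely as an assumed stand-in, and the function names `h0` and `fingerprint` are hypothetical.

```python
import hashlib


def h0(x: bytes) -> int:
    """Assumed stand-in for h_0: the SHA-256 digest read as a big integer."""
    return int.from_bytes(hashlib.sha256(x).digest(), "big")


def fingerprint(x: bytes, f: int = 30) -> int:
    """Formula (1): eta_x = h_0(x) mod 2^f, i.e. the low f bits of h_0(x)."""
    return h0(x) % (1 << f)


fp = fingerprint(b"example-element")
# The fingerprint always fits in f bits.
assert 0 <= fp < 2 ** 30
```

Taking the value modulo 2^f is equivalent to masking the low f bits, which is why a storage bit only needs to hold f data bits.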
step S30: selecting a segment of a filter according to the element fingerprint through a global hash function;
The filter in this embodiment is a variant of the cuckoo filter (CF for short), which may be referred to as a parallel cuckoo filter (PCF for short). The PCF physically divides the filter into s independent segments and additionally introduces a global hash function H(x) to map the target element x into a segment. Specifically, the filter comprises m storage cells, and each storage cell comprises b storage bits and therefore stores at most b element fingerprints; that is, each storage bit holds one element fingerprint, so the number of data bits that a storage bit can hold equals f in formula (1) above.
The m storage cells in the filter are uniformly divided into s segments, each segment comprising m/s storage cells. The global hash function H(x) randomly selects a segment for the element fingerprint η_x of the target element x, so each segment is selected with probability 1/s. For load balancing, the number of segments s is therefore usually chosen so that it divides the m storage cells evenly, but the data processing method in the embodiment of the application can also be applied when s does not divide the m storage cells evenly.
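The segment layout described above can be sketched as follows. The global hash function H is not specified in the text, so a SHA-256 based mapping is assumed here, and the names `select_segment` and `segment_cells` are hypothetical. The sketch covers only the case where s divides m evenly.

```python
import hashlib


def select_segment(fp: int, s: int) -> int:
    """Assumed global hash H: map a fingerprint to a segment index in [0, s)."""
    digest = hashlib.sha256(b"segment:" + fp.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % s


def segment_cells(segment: int, m: int, s: int) -> range:
    """Indices of the m/s storage cells that belong to one segment."""
    cells_per_segment = m // s  # assumes s divides m evenly
    start = segment * cells_per_segment
    return range(start, start + cells_per_segment)


# With m = 16,384 cells and s = 64 segments, each segment owns 256 cells.
seg = select_segment(fp=123456789, s=64)
assert len(segment_cells(seg, m=16_384, s=64)) == 256
```

Because every cell index belongs to exactly one range, operations on different segments never touch the same storage cell.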
Embodiments of the present application use PCFs to represent and store a given data set.
Step S40: and performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected section.
After the target element x is mapped to the corresponding segment, all data operations on the element fingerprint η_x are performed in the selected segment using the cuckoo hash algorithm. The data operations may be, for example, element insertion, element query, and element deletion of the target element x.
By adopting the data processing method, the data operation in one segment of the filter does not influence the data operation in other segments, so that the parallel operation of the target element relative to the filter can be realized, such as parallel element insertion, element query, element deletion and the like, and the better data throughput of the filter is ensured. Moreover, by adopting the cuckoo hash algorithm, the query precision of the target element can be ensured, and the query and deletion of the target element can be ensured to be constant time complexity.
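The independence of the segments can be sketched in Python as follows. This is only an illustration under stated assumptions: each segment is represented by a plain `set` instead of a cuckoo table, the hash functions are SHA-256 based stand-ins, and all function names are hypothetical. In CPython, threads mainly illustrate the structure, since the global interpreter lock limits the actual speedup.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def fingerprint(x: bytes, f: int = 30) -> int:
    return _h(x) % (1 << f)


def segment_of(fp: int, s: int) -> int:
    return _h(b"segment:" + fp.to_bytes(8, "big")) % s


def insert_batch(store: set, fps: list) -> int:
    """Stand-in for per-segment cuckoo insertion; touches one segment only."""
    store.update(fps)
    return len(fps)


def parallel_insert(elements: list, stores: list) -> int:
    """Group fingerprints by segment, then process each segment in its own task."""
    batches = defaultdict(list)
    for x in elements:
        fp = fingerprint(x)
        batches[segment_of(fp, len(stores))].append(fp)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(insert_batch, stores[i], fps)
                   for i, fps in batches.items()]
        return sum(future.result() for future in futures)


stores = [set() for _ in range(8)]
items = [f"item-{i}".encode() for i in range(1000)]
assert parallel_insert(items, stores) == 1000
```

No locks are needed in this sketch because each task writes to exactly one segment and no two tasks share a segment.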
In the embodiment of the application, the data operation on the element fingerprint by using the cuckoo hash algorithm may include element insertion, element query and element deletion.
As shown in fig. 2, the inserting step of the target element may include:
step S411: selecting two storage cells as two candidate cells according to the element fingerprint;
The element fingerprint η_x of the target element x is calculated in step S20. After the element fingerprint η_x is obtained, two storage cells may be selected as two candidate cells in the selected segment through a first hash function and a second hash function, wherein the first hash function and the second hash function are respectively as follows:
As can be seen from the above equations (2) and (3), when the element fingerprint η_x and one of its candidate cells are known, the position of the other candidate cell can be deduced by performing an exclusive-OR operation, without knowing the target element x itself. That is, the positions of the two candidate cells corresponding to the same element fingerprint η_x are dual: the position of one candidate cell is the dual position of the other.
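The duality described above can be illustrated with a short sketch. Equations (2) and (3) are not reproduced in this text, so the sketch assumes the partial-key form used by the standard cuckoo filter (first cell from a hash of the element, second cell from the first cell XOR a hash of the fingerprint) and assumes that the number of cells per segment is a power of two. The function names and the SHA-256 based hashes are hypothetical.

```python
import hashlib


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def first_cell(x: bytes, cells: int) -> int:
    """Assumed first hash function: a cell index inside the selected segment."""
    return _h(b"cell:" + x) % cells


def other_cell(cell: int, fp: int, cells: int) -> int:
    """Dual position: XOR the known cell with a hash of the fingerprint.

    `cells` must be a power of two so that the XOR stays within [0, cells).
    """
    return cell ^ (_h(b"fp:" + fp.to_bytes(8, "big")) % cells)


cells = 256
fp = 0x2ABCDEF
i1 = first_cell(b"example-element", cells)
i2 = other_cell(i1, fp, cells)
# Applying the XOR twice returns to the starting cell, so either candidate
# cell can be recovered from the other using only the fingerprint.
assert other_cell(i2, fp, cells) == i1
```

This property is what allows a kicked-out fingerprint to be relocated even though the original element is no longer available.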
Step S412: when the element fingerprint can be stored in two candidate cells, storing the element fingerprint;
If exactly one storage bit is available in the two candidate cells, that storage bit is used directly to store the element fingerprint η_x; when several storage bits are available, one of them is selected to store the element fingerprint η_x. After the element fingerprint η_x has been stored, "True" is returned to indicate that the target element x was inserted successfully.
Step S413: and when the element fingerprints cannot be stored in the two candidate cells, kicking out the first fingerprint which exists in the two candidate cells at random, and storing the element fingerprints in the storage positions vacated by the first fingerprint.
When no storage bit is available in the two candidate cells to store the element fingerprint η_x, one candidate cell B_r is randomly selected from the two candidate cells, and a first fingerprint η_r already present in the selected candidate cell B_r is randomly kicked out. The first fingerprint η_r is the fingerprint of an element previously stored in the filter. After the first fingerprint η_r is kicked out, the storage bit it originally occupied becomes vacant, and the element fingerprint η_x of the target element x is placed in that storage bit; that is, the element fingerprint η_x takes over the storage bit of the first fingerprint η_r.
Further, the kicked-out first fingerprint η_r enters a reallocation step, as shown in fig. 3. The reallocation step includes:
step S414: obtaining another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell where the first fingerprint is kicked out;
Every element in the filter corresponds to two storage cells, so the kicked-out first fingerprint η_r also corresponds to two storage cells, namely the storage cell B_r and a storage cell B_a. After the first fingerprint η_r is kicked out, an attempt should be made to store it in its other corresponding storage cell B_a. Given the storage cell B_r in which the first fingerprint η_r was located before being kicked out, the position of the other storage cell B_a can be calculated according to the above equation (3).
Step S415: when the first fingerprint can be stored in the other candidate cell, storing the first fingerprint;
when a storage bit is available in the other storage cell B_a, the first fingerprint η_r is stored in it, and "True" is returned to indicate that the reallocation is finished;
step S416: when the first fingerprint cannot be stored in the other storage cell, randomly kicking out a second fingerprint which is already in the other candidate cell, and storing the first fingerprint in a position where the second fingerprint is vacated;
If the other storage cell B_a cannot store the first fingerprint η_r, a second fingerprint η_w already present in the other storage cell B_a is randomly kicked out. The second fingerprint η_w is the fingerprint of an element previously stored in the filter. After the second fingerprint η_w is kicked out, the storage bit it originally occupied becomes vacant, and the first fingerprint η_r is placed in that storage bit; that is, the first fingerprint η_r takes over the storage bit of the second fingerprint η_w.
Then the second fingerprint is treated as the new first fingerprint, and steps S414 to S416 are repeated until all fingerprints are stored or the number of loop iterations reaches the threshold MAX_S. When all fingerprints are stored, the reallocation succeeds and ends, and "True" is returned to indicate that the insertion succeeded; when the number of loop iterations reaches the threshold MAX_S, the reallocation fails, and "False" is returned to indicate that the insertion failed.
In the above method, at most MAX_S insertion loops are performed, so the time complexity of element insertion is O(MAX_S). Because the element insertion process is performed only in the selected segment, MAX_S in this embodiment can be set much smaller than the threshold MAX of a standard cuckoo filter (CF); that is, far fewer loop iterations are required at most, so this design also accelerates the insertion operation.
In one possible embodiment, the step of querying the target element may include:
selecting two storage cells as two candidate cells according to the element fingerprint;
judging whether the element fingerprints exist in the two candidate cells or not;
if yes, judging that the target element belongs to the filter;
if not, the target element is judged not to belong to the filter.
In the element query step of the data processing method in this embodiment, first, according to the element fingerprint η_x corresponding to the target element x, two storage cells are selected as two candidate cells through the first hash function and the second hash function, and the two candidate cells are then checked: if either candidate cell contains a fingerprint matching the element fingerprint η_x, "True" is returned to indicate that the target element x belongs to the filter (the given data set); otherwise, "False" is returned to indicate that the target element x does not belong to the filter (the given data set). As can be seen from the above description, the query step provided by the embodiments of the present application needs to check at most the 2b storage bits of the two candidate cells, so the time complexity of the element query remains constant.
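The query step can be sketched as follows, with a segment modeled as a list of cells and each cell as a list of at most b fingerprints. The candidate-cell computation is an assumed XOR form borrowed from the standard cuckoo filter, the number of cells is assumed to be a power of two, and the function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def candidate_cells(x: bytes, fp: int, cells: int) -> tuple:
    """Assumed first and second hash functions; `cells` is a power of two."""
    i1 = _h(b"cell:" + x) % cells
    i2 = i1 ^ (_h(b"fp:" + fp.to_bytes(8, "big")) % cells)
    return i1, i2


def query(segment: list, x: bytes, fp: int) -> bool:
    """Check at most 2b slots: the two candidate cells of the fingerprint."""
    i1, i2 = candidate_cells(x, fp, len(segment))
    return fp in segment[i1] or fp in segment[i2]


segment = [[] for _ in range(8)]
fp = 12345
_, i2 = candidate_cells(b"alice", fp, len(segment))
segment[i2].append(fp)                    # pretend "alice" was inserted earlier
assert query(segment, b"alice", fp) is True
assert query(segment, b"bob", 54321) is False
```

Only two cells are inspected regardless of how many elements the filter holds, which is why the query cost stays constant.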
In one possible embodiment, the deleting step of the target element may include:
determining whether the target element has been inserted into the filter;
if yes, selecting two storage cells as two candidate cells according to the element fingerprint;
finding and deleting copies of the element fingerprint in both of the candidate cells.
In the element deletion step of the data processing method provided by the embodiment of the application, according to the element fingerprint η_x corresponding to the target element x, two storage cells are selected as two candidate cells through the first hash function and the second hash function, and if either candidate cell contains a fingerprint matching the element fingerprint η_x, the element fingerprint η_x is deleted from that candidate cell. Furthermore, the user must ensure that the deleted element has been previously inserted; otherwise, deleting an element that was never inserted may actually delete the fingerprint of an existing element, resulting in false negatives in later queries. As can be seen from the above description, the time complexity of the delete operation is also constant.
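A deletion sketch in the same style is given below. It assumes that one matching copy is removed per call, which is how the standard cuckoo filter behaves; the text does not state how duplicate fingerprints are handled, so this is an assumption, as are the XOR-based candidate cells and the function names.

```python
import hashlib


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def candidate_cells(x: bytes, fp: int, cells: int) -> tuple:
    """Assumed first and second hash functions; `cells` is a power of two."""
    i1 = _h(b"cell:" + x) % cells
    i2 = i1 ^ (_h(b"fp:" + fp.to_bytes(8, "big")) % cells)
    return i1, i2


def delete(segment: list, x: bytes, fp: int) -> bool:
    """Remove one matching copy of the fingerprint from a candidate cell."""
    for i in candidate_cells(x, fp, len(segment)):
        if fp in segment[i]:
            segment[i].remove(fp)  # list.remove drops the first match only
            return True
    return False


segment = [[] for _ in range(8)]
fp = 4242
i1, _ = candidate_cells(b"alice", fp, len(segment))
segment[i1].extend([fp, fp])       # the same fingerprint stored twice
assert delete(segment, b"alice", fp) is True
assert segment[i1].count(fp) == 1  # one copy survives the first deletion
```

Removing a single copy keeps an element that was inserted twice, or another element sharing the same fingerprint and cells, visible to later queries.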
In order to illustrate the working principle and performance of the data processing method provided by the embodiment of the application, the data processing method is tested, and the element insertion performance is quantified by the number of insertion failures and the total time taken to process 1 × 10^5 elements. All experiments were performed on a host with 16 GB RAM and a 12 × 2.6 GHz CPU. The test data sets are synthetic data sets, i.e., strings of different lengths and characters. All results in the experimental figures are the average of 10 executions.
The number of storage bits in each storage cell (b), the number of segments in the PCF (s), and the reallocation coefficient (r) are also varied to determine their effect on the PCF. In the experiments herein, the default settings are n = 1 × 10^5, f = 30, b = 8, r = 0.1, and m = 16,384, and the total filter capacity (i.e. the number of storage bits) is fixed at 131,072. The results of the above experiments are shown in FIGS. 4a, 4b, 4c, 4d, 4e and 4f.
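The default settings can be checked with a few lines of arithmetic. The variable names below are illustrative only; the load figure and the raw fingerprint storage size are derived from the stated defaults and are not reported in the experiments.

```python
n, f, b, m = 100_000, 30, 8, 16_384

capacity = m * b                      # total storage bits (fingerprint slots)
load = n / capacity                   # share of slots needed if every insert succeeds
fingerprint_bytes = capacity * f // 8 # raw fingerprint storage, ignoring overhead
cells_per_segment = {s: m // s for s in (1, 2, 4, 8, 16, 32, 64)}

print(capacity)               # 131072
print(round(load, 3))         # 0.763
print(fingerprint_bytes)      # 491520
print(cells_per_segment[64])  # 256
```

With s = 64, each segment holds only 256 storage cells, so a reallocation inside one segment has far fewer cells to explore than in the unsegmented filter.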
FIG. 4a shows the relationship between the number of insertion failures and the number of segments s in the test experiment, and FIG. 4b shows the relationship between the insertion time and the number of segments s. As can be seen from FIG. 4a and FIG. 4b, when the number of segments s in the PCF increases from 1 to 64, the number of insertion failures increases from 39 to 18,414, while the insertion time decreases from 17,071 s to 77 s. As the number of segments s increases, the number of storage cells in each segment decreases accordingly. For a given reallocation coefficient r, each segment then searches fewer storage cells to accommodate the inserted element, which results in more insertion failures and less time consumption. It should be noted that when s = 1, the filter is a standard CF. The CF has the fewest insertion failures but takes the most time, which is unacceptable.
FIG. 4c shows the relationship between the number of insertion failures and the number of storage bits b in the test experiment, and FIG. 4d shows the relationship between the insertion time and the number of storage bits b. As can be seen from FIG. 4c and FIG. 4d, when the number of storage bits b in each storage cell increases from 1 to 64, both the number of insertion failures and the total insertion time decrease rapidly. With more storage bits in each storage cell, a conflicting element can, with higher probability, find an empty storage bit in its candidate cells through the reallocation process. Further, the reallocation coefficient r is adjusted to influence the maximum number of reallocations in each segment.
FIG. 4e shows the relationship between the number of insertion failures and the reallocation coefficient r in the test experiment, and FIG. 4f shows the relationship between the insertion time and the reallocation coefficient r. As can be seen from FIG. 4e and FIG. 4f, the reallocation strategy improves the element insertion success rate. If reallocation is not allowed, the PCF produces about 1.6 × 10^4 insertion failures, but when r is 0.01 the number of insertion failures drops sharply to 8,470. Thereafter, the number of insertion failures decreases only slowly: even when r is 0.8, it still exceeds 8,000. This phenomenon is partly due to endless loops in the PCF, in which fingerprints are reassigned in a cycle. When such a loop exists, increasing the value of r does not give the PCF more chances to explore additional empty storage bits. Furthermore, the total time consumption increases from 311 s to 434 s as the amount of reallocation increases.
The PCF inserts elements in parallel, which improves the element insertion speed, and the number of failed insertions can be controlled by adjusting the number of storage bits in each storage cell and the maximum number of reallocations.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
In addition, specific embodiments of the present specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus according to the embodiment may include:
an acquisition module 100 configured to acquire a target element;
a fingerprint calculation module 200 configured to calculate an element fingerprint of the target element;
a segment selection module 300 configured to select segments of a filter from the element fingerprints by a global hash function;
the data operation module 400 is used for processing the element fingerprints by adopting a cuckoo hash algorithm in the selected segments;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of the modules may be implemented in the same or multiple software and/or hardware when implementing the embodiments of the present application.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A data processing method, comprising:
acquiring a target element;
calculating to obtain an element fingerprint of the target element;
selecting a segment of a filter according to the element fingerprint through a global hash function;
performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected segment;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
2. The data processing method of claim 1, wherein the step of performing data processing on the element fingerprint using a cuckoo hash algorithm in the selected segment comprises:
selecting two storage cells as two candidate cells according to the element fingerprint;
when the element fingerprint can be stored in the two candidate cells, storing the element fingerprint;
and when the element fingerprint cannot be stored in the two candidate cells, randomly kicking out a first fingerprint that already exists in the two candidate cells, and storing the element fingerprint in the storage position vacated by the first fingerprint.
3. The data processing method of claim 2, wherein the kicked-out first fingerprint enters a re-assigning step, the re-assigning step comprising:
obtaining another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell where the first fingerprint is kicked out;
when the first fingerprint can be stored in the other storage cell, storing the first fingerprint;
and when the first fingerprint cannot be stored in the other storage cell, randomly kicking out a second fingerprint that already exists in the other storage cell, and storing the first fingerprint in the position vacated by the second fingerprint.
4. The data processing method of claim 3, wherein the step of storing the first fingerprint in the location vacated by the second fingerprint is further followed by:
taking the second fingerprint as the first fingerprint, and repeating the re-assigning step until all the element fingerprints are stored or the number of iterations exceeds a threshold value.
5. The data processing method of claim 1, wherein the step of performing data processing on the element fingerprint using a cuckoo hash algorithm in the selected segment comprises:
selecting two storage cells as two candidate cells according to the element fingerprint;
judging whether the element fingerprints exist in the two candidate cells or not;
if yes, judging that the target element belongs to the filter;
if not, the target element is judged not to belong to the filter.
6. The data processing method of claim 1, wherein the step of performing data processing on the element fingerprint using a cuckoo hash algorithm in the selected segment comprises:
determining whether the target element has been inserted into the filter;
if yes, selecting two storage cells as two candidate cells according to the element fingerprint;
finding and deleting copies of the element fingerprint in both of the candidate cells.
7. The data processing method according to any one of claims 2 to 6, wherein the storage cells in the filter are evenly divided among the segments.
8. The data processing method according to claim 7, wherein the selecting two storage cells as two candidate cells according to the element fingerprint is implemented by a first hash function and a second hash function, wherein the first hash function is:
the second hash function is:
wherein $x$ is the target element, $\eta_x$ is the element fingerprint of the target element $x$, $m$ is the number of storage cells in the filter, and $s$ is the number of segments of the filter.
9. The data processing method of claim 1, wherein the element fingerprint is obtained by the following formula:
$\eta_x = h_0(x) \bmod 2^f$
wherein $\eta_x$ is the element fingerprint of the target element $x$, and $f$ is the number of bits of $\eta_x$.
10. A data processing apparatus, comprising:
an acquisition module configured to acquire a target element;
a fingerprint calculation module configured to calculate an element fingerprint of the target element;
a segment selection module configured to select a segment of a filter according to the element fingerprint through a global hash function;
a data operation module configured to perform data processing on the element fingerprint using a cuckoo hash algorithm in the selected segment;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
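The insertion, lookup, and deletion steps of claims 2 to 6 can be sketched as follows. This is a hedged illustration only. The hash functions of claim 8 are not reproduced in this text, so the sketch substitutes the partial-key cuckoo hashing used by standard cuckoo filters (the alternate cell is the current cell XORed with a hash of the fingerprint). The names `SegmentedCuckooFilter`, `_h`, `_locate`, `_alt`, and `MAX_KICKS`, the BLAKE2b hash, and the power-of-two cell count are assumptions, not part of the claims.

```python
import hashlib
import random

MAX_KICKS = 500  # hypothetical threshold on relocation rounds (claim 4)


def _h(data: bytes, salt: bytes) -> int:
    """Hypothetical 64-bit hash standing in for the claimed hash functions."""
    return int.from_bytes(
        hashlib.blake2b(data, digest_size=8, salt=salt).digest(), "big")


class SegmentedCuckooFilter:
    def __init__(self, segments=4, cells_per_segment=16, slots=4, fp_bits=12):
        # The XOR trick below needs a power-of-two number of cells per segment.
        assert cells_per_segment & (cells_per_segment - 1) == 0
        self.s, self.c, self.slots, self.f = segments, cells_per_segment, slots, fp_bits
        self.table = [[[] for _ in range(self.c)] for _ in range(self.s)]

    def _alt(self, cell: int, fp: int) -> int:
        """Other candidate cell, from the fingerprint and the current cell."""
        return cell ^ (_h(fp.to_bytes(8, "big"), b"alt") % self.c)

    def _locate(self, x: bytes):
        fp = _h(x, b"fp") % (1 << self.f)                 # fingerprint (claim 9)
        seg = _h(fp.to_bytes(8, "big"), b"seg") % self.s  # global hash (claim 1)
        i1 = _h(x, b"cell") % self.c                      # assumed first hash
        return fp, seg, i1, self._alt(i1, fp)

    def insert(self, x: bytes) -> bool:
        fp, seg, i1, i2 = self._locate(x)
        cells = self.table[seg]
        for i in (i1, i2):
            if len(cells[i]) < self.slots:
                cells[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(MAX_KICKS):
            pos = random.randrange(self.slots)
            fp, cells[i][pos] = cells[i][pos], fp  # kick out a random victim
            i = self._alt(i, fp)                   # victim's other cell
            if len(cells[i]) < self.slots:
                cells[i].append(fp)
                return True
        return False  # threshold exceeded; the fingerprint in hand is dropped here

    def contains(self, x: bytes) -> bool:
        fp, seg, i1, i2 = self._locate(x)
        return fp in self.table[seg][i1] or fp in self.table[seg][i2]

    def delete(self, x: bytes) -> bool:
        """Remove one copy; only valid for elements that were inserted."""
        fp, seg, i1, i2 = self._locate(x)
        for i in (i1, i2):
            if fp in self.table[seg][i]:
                self.table[seg][i].remove(fp)
                return True
        return False
```

In this sketch every relocation stays inside one segment, because the segment is derived from the fingerprint alone and a kicked-out fingerprint therefore maps back to the segment it already occupies.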
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010998674.0A CN111858651A (en) | 2020-09-22 | 2020-09-22 | Data processing method and data processing device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010998674.0A CN111858651A (en) | 2020-09-22 | 2020-09-22 | Data processing method and data processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111858651A (en) | 2020-10-30 |
Family
ID=72968505
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010998674.0A Pending CN111858651A (en) | 2020-09-22 | 2020-09-22 | Data processing method and data processing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111858651A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113486025A (en) * | 2021-07-28 | 2021-10-08 | 北京腾云天下科技有限公司 | Data storage method, data query method and device |
CN113535706A (en) * | 2021-08-03 | 2021-10-22 | 重庆赛渝深科技有限公司 | Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter |
CN113535705A (en) * | 2021-08-03 | 2021-10-22 | 佛山赛思禅科技有限公司 | SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter |
CN113641681A (en) * | 2021-10-13 | 2021-11-12 | 南京大数据集团有限公司 | Space self-adaptive mass data query method |
CN114268501A (en) * | 2021-12-24 | 2022-04-01 | 深信服科技股份有限公司 | Data processing method, firewall generation method, computing device and storage medium |
CN114527929A (en) * | 2020-11-23 | 2022-05-24 | 洪文圳 | Cloud storage data fusion method based on double-hash fuzzy bloom filter |
CN117891858A (en) * | 2024-03-14 | 2024-04-16 | 苏州大学 | Space-time efficient parallel approximate member query method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107256130A (en) * | 2017-06-06 | 2017-10-17 | 华中科技大学 | Data store optimization method and system based on Cuckoo Hash calculations |
CN109815234A (en) * | 2018-12-29 | 2019-05-28 | 杭州中科先进技术研究院有限公司 | A kind of multiple cuckoo filter under streaming computing model |
CN111552692A (en) * | 2020-04-30 | 2020-08-18 | 南方科技大学 | Plus-minus cuckoo filter |
2020-09-22: CN application CN202010998674.0A, published as CN111858651A, status: active, Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114527929A (en) * | 2020-11-23 | 2022-05-24 | 洪文圳 | Cloud storage data fusion method based on double-hash fuzzy bloom filter |
CN113486025A (en) * | 2021-07-28 | 2021-10-08 | 北京腾云天下科技有限公司 | Data storage method, data query method and device |
CN113486025B (en) * | 2021-07-28 | 2023-07-25 | 北京腾云天下科技有限公司 | Data storage method, data query method and device |
CN113535705B (en) * | 2021-08-03 | 2024-02-02 | 佛山赛思禅科技有限公司 | SFAD cuckoo filter and repeated data deleting method based on SFAD cuckoo filter |
CN113535706B (en) * | 2021-08-03 | 2023-05-23 | 佛山赛思禅科技有限公司 | Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter |
CN113535705A (en) * | 2021-08-03 | 2021-10-22 | 佛山赛思禅科技有限公司 | SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter |
CN113535706A (en) * | 2021-08-03 | 2021-10-22 | 重庆赛渝深科技有限公司 | Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter |
CN113641681B (en) * | 2021-10-13 | 2022-02-22 | 南京大数据集团有限公司 | Space self-adaptive mass data query method |
CN113641681A (en) * | 2021-10-13 | 2021-11-12 | 南京大数据集团有限公司 | Space self-adaptive mass data query method |
CN114268501A (en) * | 2021-12-24 | 2022-04-01 | 深信服科技股份有限公司 | Data processing method, firewall generation method, computing device and storage medium |
CN114268501B (en) * | 2021-12-24 | 2024-02-23 | 深信服科技股份有限公司 | Data processing method, firewall generating method, computing device and storage medium |
CN117891858A (en) * | 2024-03-14 | 2024-04-16 | 苏州大学 | Space-time efficient parallel approximate member query method and system |
CN117891858B (en) * | 2024-03-14 | 2024-07-05 | 苏州大学 | Space-time efficient parallel approximate member query method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111858651A (en) | Data processing method and data processing device | |
KR101367450B1 (en) | Performing concurrent rehashing of a hash table for multithreaded applications | |
CN110489405B (en) | Data processing method, device and server | |
CN108920412B (en) | Algorithm automatic tuning method for heterogeneous computer system structure | |
CN107133228A (en) | A kind of method and device of fast resampling | |
CN112882663B (en) | Random writing method, electronic equipment and storage medium | |
CN115510092A (en) | Approximate member query optimization method based on cuckoo filter | |
CN114556309A (en) | Memory space allocation method and device and storage medium | |
CN106980471B (en) | Method and device for improving hard disk writing performance of intelligent equipment | |
CN110008382B (en) | Method, system and equipment for determining TopN data | |
US9298505B2 (en) | Time and space-deterministic task scheduling apparatus and method using multi-dimensional scheme | |
CN116775695A (en) | Dynamic combination query optimization method and device based on index and storage medium | |
CN108647289B (en) | Hash table building method based on valley Hash and bloom filter | |
CN112269947B (en) | Caching method and device for space text data, electronic equipment and storage medium | |
CN112800057B (en) | Fingerprint table management method and device | |
CN107341113B (en) | Cache compression method and device | |
CN111104435B (en) | Metadata organization method, device and equipment and computer readable storage medium | |
CN112100446B (en) | Search method, readable storage medium, and electronic device | |
CN115221360A (en) | Tree structure configuration method and system | |
CN112506440A (en) | Data searching method and equipment based on dichotomy | |
CN113127694A (en) | Data storage method and device, electronic equipment and storage medium | |
US9824105B2 (en) | Adaptive probabilistic indexing with skip lists | |
CN117891858B (en) | Space-time efficient parallel approximate member query method and system | |
CN113177224B (en) | Block chain based data sealing method, device, equipment and storage medium | |
CN116149573B (en) | Method, system, equipment and medium for processing queue by RAID card cluster |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||