CN111858651A - Data processing method and data processing device - Google Patents

Publication number
CN111858651A
Authority
CN
China
Prior art keywords
fingerprint
data processing
filter
storage
cells
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010998674.0A
Other languages
Chinese (zh)
Inventor
郭得科
罗来龙
廖汉龙
袁昊
武睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202010998674.0A
Publication of CN111858651A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The embodiment of the application provides a data processing method and a data processing device, wherein the method comprises the following steps: acquiring a target element; calculating to obtain an element fingerprint of the target element; selecting a segment of a filter according to the element fingerprint through a global hash function; performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected section; wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints. By adopting the data processing method, the parallel operation of the target element relative to the filter, such as parallel element insertion, element query, element deletion and the like, can be realized, so that the better data throughput of the filter is ensured. Moreover, by adopting the cuckoo hash algorithm, the query precision of the target element can be ensured, and the query and deletion of the target element can be ensured to be constant time complexity.

Description

Data processing method and data processing device
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a data processing method and a data processing device.
Background
Database systems may employ summary data structures to represent and store data sets and support approximate membership queries of constant time complexity, with typical summary data structures including bloom filters and variants thereof, and cuckoo filters and variants thereof.
A Bloom filter uses the values of $k$ bits in a fixed-length bit vector to represent whether an element belongs to a given set, and provides element insertion and query with constant time complexity. However, the query performance of the Bloom filter is weak and easily leads to an extremely high false-positive rate; in addition, the Bloom filter is inefficient in space utilization and does not support reverse operations such as deletion. In contrast, the cuckoo filter provides two candidate cells to store each element fingerprint, which enables accurate element representation as well as query and deletion with constant time complexity.
However, the sequential insertion, query and deletion of elements in the cuckoo filter easily causes time-consuming data operation and inefficient processing, and this limitation is particularly obvious in the case of large-volume data sets.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data processing method and a data processing apparatus, which are used to improve the operation efficiency of data operations.
Based on the above purpose, in a first aspect, an embodiment of the present application provides a data processing method, including:
acquiring a target element;
calculating to obtain an element fingerprint of the target element;
selecting a segment of a filter according to the element fingerprint through a global hash function;
performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected section;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
By adopting the data processing method, the data operation in one segment of the filter does not influence the data operation in other segments, so that the parallel operation of the target element relative to the filter can be realized, such as parallel element insertion, element query, element deletion and the like, and the better data throughput of the filter is ensured. Moreover, by adopting the cuckoo hash algorithm, the query precision of the target element can be ensured, and the query and deletion of the target element can be ensured to be constant time complexity.
In a possible implementation, the step of performing data processing on the element fingerprint by using a cuckoo hash algorithm in the selected segment includes:
selecting two storage cells as two candidate cells according to the element fingerprint;
when the element fingerprint can be stored in two candidate cells, storing the element fingerprint;
and when the element fingerprint cannot be stored in the two candidate cells, randomly kicking out a first fingerprint already stored in the two candidate cells, and storing the element fingerprint in the storage bit vacated by the first fingerprint.
In a possible embodiment, the first fingerprint kicked out enters a re-allocation step, which comprises:
obtaining another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell where the first fingerprint is kicked out;
when the first fingerprint can be stored in the other storage cell, storing the first fingerprint;
and when the first fingerprint cannot be stored in the other storage cell, randomly kicking out a second fingerprint which is already in the other candidate cell, and storing the first fingerprint in a position where the second fingerprint is vacated.
In a possible implementation, the step of storing the first fingerprint in the location vacated by the second fingerprint further includes:
and updating the first fingerprint to be the second fingerprint, and repeating the above reallocation step until all the element fingerprints are stored or the number of loops exceeds a threshold.
In a possible implementation, the step of performing data processing on the element fingerprint by using a cuckoo hash algorithm in the selected segment includes:
selecting two storage cells as two candidate cells according to the element fingerprint;
judging whether the element fingerprints exist in the two candidate cells or not;
if yes, judging that the target element belongs to the filter;
if not, the target element is judged not to belong to the filter.
In a possible implementation, the step of performing data processing on the element fingerprint by using a cuckoo hash algorithm in the selected segment includes:
determining whether the target element has been inserted into the filter;
if yes, selecting two storage cells as two candidate cells according to the element fingerprint;
finding and deleting copies of the element fingerprint in both of the candidate cells.
In a possible embodiment, the segment is formed by uniformly dividing the storage cells in the filter.
In a possible implementation manner, the selecting two storage cells as two candidate cells according to the element fingerprint is implemented by a first hash function and a second hash function, where the first hash function is:
[Formula image not reproduced in this text: first hash function]
the second hash function is:
[Formula image not reproduced in this text: second hash function]
where $x$ is the target element, $\eta_x$ is the element fingerprint of the target element $x$, $m$ is the number of storage cells in the filter, and $s$ is the number of segments of the filter.
In one possible embodiment, the element fingerprint is obtained by the following formula:
$\eta_x = h_0(x) \bmod 2^f$
where $\eta_x$ is the element fingerprint of the target element $x$, and $f$ is the number of bits of $\eta_x$.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including:
an acquisition module configured to acquire a target element;
a fingerprint calculation module configured to calculate an element fingerprint of the target element;
a segment selection module configured to select segments of a filter from the element fingerprints through a global hash function;
the data operation module is used for processing the element fingerprints by adopting a cuckoo hash algorithm in the selected section;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
The apparatus of this embodiment may be configured to implement the technical solution of the first aspect, and the implementation principle and the technical effect are similar, which are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only the embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a first flowchart of a data processing method according to an embodiment of the present application;
Fig. 2 is a second flowchart of a data processing method according to an embodiment of the present application;
Fig. 3 is a third flowchart of a data processing method according to an embodiment of the present application;
FIG. 4a is a graph of the number of insertion failures versus the number of segments s in a test experiment;
FIG. 4b is a graph of the relationship between insertion time and number of segments s in a test experiment;
FIG. 4c is a graph of the number of insertion failures versus the number of stored bits b in a test experiment;
FIG. 4d is a graph of the relationship between insertion time and the number of stored bits b in a test experiment;
FIG. 4e is a diagram showing the relationship between the number of insertion failures and the redistribution coefficient r in the test experiment;
FIG. 4f is a graph of the relationship between insertion time and redistribution coefficient r in a test experiment;
fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a data processing method, as shown in fig. 1, the data processing method includes:
step S10: acquiring a target element;
the target elements may be elements to be inserted, elements to be queried, and elements to be deleted for a given dataset, that is, this step may obtain the target elements for an insertion process, a query process, or a deletion process.
Step S20: calculating to obtain an element fingerprint of the target element;
An element fingerprint is a digest value, or hash value, representing a target element, and may be calculated by a predetermined hash function. For example, in one possible embodiment, the $f$-bit element fingerprint $\eta_x$ of any target element $x$ can be obtained by the following formula (1):
$\eta_x = h_0(x) \bmod 2^f$ (1)
step S30: selecting a segment of a filter according to the element fingerprint through a global hash function;
The filter in this embodiment is a variant of the Cuckoo Filter (CF for short), which may be referred to as a Parallel Cuckoo Filter (PCF for short). The PCF physically divides the filter into $s$ independent segments and additionally introduces a global hash function $H(x)$ to map the target element $x$ to a segment. Specifically, the filter comprises $m$ storage cells, and each storage cell comprises $b$ storage bits to store at most $b$ element fingerprints, i.e., each storage bit (a slot) can store one element fingerprint. The number of data bits that one storage bit can hold is therefore equal to $f$ in formula (1) above.
The $m$ storage cells in the filter are uniformly divided into $s$ segments, each comprising $m/s$ storage cells. The global hash function $H(x)$ randomly selects a segment for the element fingerprint $\eta_x$ of the target element $x$, and each segment is selected with probability $1/s$. Therefore, for load balancing, the number of segments $s$ is usually chosen so that the $m$ storage cells are divided evenly, but the data processing method in the embodiment of the application can also be applied when $s$ does not divide the $m$ storage cells evenly.
Embodiments of the present application use PCFs to represent and store a given data set.
Step S40: and performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected section.
After the target element $x$ is mapped to the corresponding segment, all data operations on the element fingerprint $\eta_x$ are performed in the selected segment using a cuckoo hash algorithm; the data operations may be, for example, element insertion, element query, and element deletion of the target element $x$.
By adopting the data processing method, the data operation in one segment of the filter does not influence the data operation in other segments, so that the parallel operation of the target element relative to the filter can be realized, such as parallel element insertion, element query, element deletion and the like, and the better data throughput of the filter is ensured. Moreover, by adopting the cuckoo hash algorithm, the query precision of the target element can be ensured, and the query and deletion of the target element can be ensured to be constant time complexity.
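Steps S20 and S30 can be illustrated with a minimal Python sketch. The application does not specify concrete hash functions, so the use of BLAKE2b for both $h_0$ and the global hash $H$, the salts, the helper names, and the value of `NUM_SEGMENTS` are assumptions made only for illustration.

```python
import hashlib

F_BITS = 30       # f: fingerprint length in bits (default value of the experiments)
NUM_SEGMENTS = 8  # s: number of segments (illustrative value, an assumption)


def _hash64(data: bytes, salt: bytes) -> int:
    """Return a 64-bit hash; the salt keeps h_0 and H independent (assumption)."""
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def fingerprint(x: str, f: int = F_BITS) -> int:
    """Formula (1): eta_x = h_0(x) mod 2^f."""
    return _hash64(x.encode("utf-8"), b"h0") % (1 << f)


def select_segment(eta_x: int, s: int = NUM_SEGMENTS) -> int:
    """Global hash: map the element fingerprint to one of s segments."""
    return _hash64(eta_x.to_bytes(8, "big"), b"H") % s


eta = fingerprint("alice")
segment_id = select_segment(eta)
```

Because the segment index depends only on the fingerprint, every later operation on the same element is routed to the same segment.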
In the embodiment of the application, the data operation on the element fingerprint by using the cuckoo hash algorithm may include element insertion, element query and element deletion.
As shown in fig. 2, the inserting step of the target element may include:
step S411: selecting two storage cells as two candidate cells according to the element fingerprint;
In step S20, the element fingerprint $\eta_x$ of the target element $x$ is calculated. After the element fingerprint $\eta_x$ is obtained, two storage cells may be selected in the selected segment as two candidate cells through a first hash function and a second hash function, which are given by formulas (2) and (3), respectively:
[Formula images for formulas (2) and (3) not reproduced in this text]
As can be seen from formulas (2) and (3), when the element fingerprint $\eta_x$ and one candidate cell corresponding to it are known, the position of the other candidate cell can be derived by an exclusive-OR operation, without knowing the details of the target element $x$. That is, the positions of the two candidate cells corresponding to the same target element $x$ are dual: the position of one candidate cell is the dual position of the other.
Step S412: when the element fingerprint can be stored in two candidate cells, storing the element fingerprint;
If there is one available storage bit in the two candidate cells, that storage bit is used directly to store the element fingerprint $\eta_x$; when there are multiple available storage bits, one of them is selected to store $\eta_x$. After the element fingerprint $\eta_x$ is stored, "True" is returned to indicate that the target element $x$ was inserted successfully.
Step S413: when the element fingerprint cannot be stored in the two candidate cells, randomly kicking out a first fingerprint already stored in the two candidate cells, and storing the element fingerprint in the storage bit vacated by the first fingerprint.
When no storage bit is available in the two candidate cells to store the element fingerprint $\eta_x$, one candidate cell $B_r$ is randomly selected from the two candidate cells, and a first fingerprint $\eta_r$ already stored in $B_r$ is randomly kicked out; the first fingerprint $\eta_r$ is the fingerprint of an element previously stored in the filter. After $\eta_r$ is kicked out, the storage bit it originally occupied becomes vacant, and the element fingerprint $\eta_x$ of the target element $x$ is placed in that vacated storage bit, i.e., $\eta_x$ takes over the storage bit of $\eta_r$.
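The duality of the two candidate cells noted under step S411 can be sketched as follows. Formulas (2) and (3) are not reproduced in this text, so the sketch assumes the XOR-based partial-key scheme of the standard cuckoo filter; the hash function, the salts, the function names, and the requirement that the number of cells per segment be a power of two are all assumptions of this sketch, not statements of the application's formulas.

```python
import hashlib

CELLS_PER_SEGMENT = 2048  # m/s; assumed to be a power of two so XOR stays in range


def _hash64(value: int, salt: bytes) -> int:
    """64-bit hash of an integer; BLAKE2b is an illustrative choice."""
    digest = hashlib.blake2b(value.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def first_cell(eta_x: int, cells: int = CELLS_PER_SEGMENT) -> int:
    """First candidate cell, derived from the fingerprint alone."""
    return _hash64(eta_x, b"cell") % cells


def other_cell(cell: int, eta_x: int, cells: int = CELLS_PER_SEGMENT) -> int:
    """Dual candidate cell: XOR the known cell with a hash of the fingerprint."""
    return cell ^ (_hash64(eta_x, b"mask") % cells)


i1 = first_cell(12345)
i2 = other_cell(i1, 12345)
```

Applying `other_cell` to either candidate cell yields the other one, which is why a kicked-out fingerprint can be relocated without access to the original element.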
Further, the first fingerprint kicked outη r Entering a redistribution step for distribution, as shown in fig. 3, the redistribution step includes:
step S414: obtaining another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell where the first fingerprint is kicked out;
Each element in the filter corresponds to two storage cells, so the kicked-out first fingerprint $\eta_r$ also corresponds to two storage cells, namely the storage cell $B_r$ and the storage cell $B_a$. After $\eta_r$ is kicked out, an attempt should be made to store it in its other corresponding storage cell $B_a$. Since the storage cell $B_r$ in which $\eta_r$ was located before being kicked out is known, the position of the other storage cell $B_a$ can be calculated according to formula (3) above.
Step S415: when the first fingerprint can be stored in the other storage cell, storing the first fingerprint;
when a storage bit is available in the other storage cell $B_a$, the first fingerprint $\eta_r$ is stored there, and "True" is returned to indicate that the reallocation is finished;
step S416: when the first fingerprint cannot be stored in the other storage cell, randomly kicking out a second fingerprint which is already in the other candidate cell, and storing the first fingerprint in a position where the second fingerprint is vacated;
If the other storage cell $B_a$ cannot store the first fingerprint $\eta_r$, a second fingerprint $\eta_w$ already stored in $B_a$ is randomly kicked out; the second fingerprint $\eta_w$ is the fingerprint of an element previously stored in the filter. After $\eta_w$ is kicked out, the storage bit it originally occupied becomes vacant, and the first fingerprint $\eta_r$ is placed in that vacated storage bit, i.e., $\eta_r$ takes over the storage bit of $\eta_w$.
Then the first fingerprint is updated to be the second fingerprint, and steps S414 to S416 are repeated until all fingerprints are stored or the number of loops reaches the threshold $MAX_S$. When all fingerprints are stored, the reallocation succeeds and ends, and "True" is returned to indicate that the insertion succeeded; when the number of loops reaches the set threshold $MAX_S$, the reallocation fails, and "False" is returned to indicate that the insertion failed.
In the above method, at most $MAX_S$ insertion loops are performed, so the time complexity of element insertion is $O(MAX_S)$. Moreover, because the element insertion process is performed only in the selected segment, $MAX_S$ in this embodiment can be set much smaller than the threshold $MAX$ of a standard cuckoo filter (CF), i.e., far fewer loops are required at most, so this design also accelerates the insertion operation.
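The insertion and reallocation steps S411 to S416 can be sketched for a single segment as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the candidate cells use the XOR scheme of the standard cuckoo filter, a segment is modelled as a list of Python lists, and `CELLS`, `B`, `MAX_S`, and all function names are illustrative.

```python
import hashlib
import random

CELLS = 16  # storage cells in one segment (m/s); power of two is an assumption
B = 4       # b: storage bits (slots) per storage cell
MAX_S = 50  # threshold on the number of reallocation loops (illustrative value)


def _h(value: int, salt: bytes) -> int:
    d = hashlib.blake2b(value.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(d, "big")


def candidate_cells(fp: int) -> tuple:
    """S411: two dual candidate cells for a fingerprint (assumed XOR scheme)."""
    i1 = _h(fp, b"cell") % CELLS
    return i1, i1 ^ (_h(fp, b"mask") % CELLS)


def insert(segment: list, fp: int, rng=random) -> bool:
    i1, i2 = candidate_cells(fp)
    for i in (i1, i2):                        # S412: use a free storage bit if any
        if len(segment[i]) < B:
            segment[i].append(fp)
            return True
    cell = rng.choice((i1, i2))               # S413: pick one full candidate cell
    for _ in range(MAX_S):
        slot = rng.randrange(len(segment[cell]))
        fp, segment[cell][slot] = segment[cell][slot], fp  # kick out a victim
        cell ^= _h(fp, b"mask") % CELLS       # S414: dual cell of the victim
        if len(segment[cell]) < B:            # S415: victim fits in its dual cell
            segment[cell].append(fp)
            return True
        # S416: dual cell is full, so the next iteration kicks out another victim
    return False                              # loop threshold reached; last victim is dropped


segment = [[] for _ in range(CELLS)]
ok = insert(segment, 424242)
```

Because the loop runs at most `MAX_S` times, the cost of one insertion is bounded by $O(MAX_S)$, matching the complexity stated above.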
In one possible embodiment, the step of querying the target element may include:
selecting two storage cells as two candidate cells according to the element fingerprint;
judging whether the element fingerprints exist in the two candidate cells or not;
if yes, judging that the target element belongs to the filter;
if not, the target element is judged not to belong to the filter.
In the element query step of the data processing method in this embodiment, two storage cells are first selected as two candidate cells through the first hash function and the second hash function according to the element fingerprint $\eta_x$ corresponding to the target element $x$, and the two candidate cells are then checked: if the two candidate cells contain fingerprint data matching the element fingerprint $\eta_x$, "True" is returned to determine that the target element $x$ belongs to the filter (the given data set); otherwise, "False" is returned to determine that the target element $x$ does not belong to the filter (the given data set). As can be seen from the above description, the query step provided by the embodiment of the present application only needs to check at most $2b$ storage bits in the candidate cells, so the time complexity of the element query remains constant.
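The query step can be sketched as follows for one segment. The candidate-cell computation again assumes the XOR scheme of the standard cuckoo filter, and the names `CELLS`, `candidate_cells`, and `query` are illustrative rather than taken from the application.

```python
import hashlib

CELLS = 16  # storage cells in one segment; power of two is an assumption


def _h(value: int, salt: bytes) -> int:
    d = hashlib.blake2b(value.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(d, "big")


def candidate_cells(fp: int) -> tuple:
    """Two dual candidate cells for a fingerprint (assumed XOR scheme)."""
    i1 = _h(fp, b"cell") % CELLS
    return i1, i1 ^ (_h(fp, b"mask") % CELLS)


def query(segment: list, fp: int) -> bool:
    """Check at most 2b storage bits: the slots of the two candidate cells."""
    i1, i2 = candidate_cells(fp)
    return fp in segment[i1] or fp in segment[i2]


segment = [[] for _ in range(CELLS)]
segment[candidate_cells(777)[1]].append(777)  # pretend 777 was inserted earlier
found = query(segment, 777)
missing = query(segment, 778)
```

Only the two candidate cells are scanned, so the work per query does not grow with the number of stored elements.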
In one possible embodiment, the deleting step of the target element may include:
determining whether the target element has been inserted into the filter;
if yes, selecting two storage cells as two candidate cells according to the element fingerprint;
finding and deleting copies of the element fingerprint in both of the candidate cells.
In the element deletion step of the data processing method provided by the embodiment of the application, two storage cells are selected as two candidate cells through the first hash function and the second hash function according to the element fingerprint $\eta_x$ corresponding to the target element $x$; if either candidate cell contains a fingerprint matching the element fingerprint $\eta_x$, the element fingerprint $\eta_x$ is deleted from that candidate cell. Furthermore, the user must ensure that the element to be deleted has been previously inserted; otherwise, deleting an element that was never inserted may actually delete the fingerprint of an existing element, resulting in false-negative errors for later queries. As can be seen from the above description, the time complexity of the delete operation is also constant.
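A minimal sketch of the deletion step for one segment is given below. It removes a single matching copy of the fingerprint, which is an assumption of this sketch; the candidate-cell scheme and all names are likewise illustrative.

```python
import hashlib

CELLS = 16  # storage cells in one segment; power of two is an assumption


def _h(value: int, salt: bytes) -> int:
    d = hashlib.blake2b(value.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(d, "big")


def candidate_cells(fp: int) -> tuple:
    """Two dual candidate cells for a fingerprint (assumed XOR scheme)."""
    i1 = _h(fp, b"cell") % CELLS
    return i1, i1 ^ (_h(fp, b"mask") % CELLS)


def delete(segment: list, fp: int) -> bool:
    """Remove one copy of the fingerprint from its candidate cells, if present."""
    for i in candidate_cells(fp):
        if fp in segment[i]:
            segment[i].remove(fp)  # list.remove drops the first matching copy
            return True
    return False


segment = [[] for _ in range(CELLS)]
segment[candidate_cells(555)[0]].append(555)  # pretend 555 was inserted earlier
first = delete(segment, 555)
second = delete(segment, 555)
```

The caller remains responsible for deleting only elements that were inserted, since a matching fingerprint may belong to a different element.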
In order to illustrate the working principle and the performance of the data processing method provided by the embodiment of the application, the data processing method is tested, and the insertion performance is quantified by the number of insertion failures and the total time taken to process $1 \times 10^5$ elements. All experiments were performed on a host with 16 GB RAM and a 12 × 2.6 GHz CPU. The test data sets are combined into cold data sets, i.e., strings of different lengths and characters. All results in the experimental figures are the average of 10 executions.
The number of storage bits in each storage cell $b$, the number of segments in the PCF $s$, and the reallocation coefficient $r$ are also varied herein to determine their effect on the PCF. In the experiments herein, the default settings are $n = 1 \times 10^5$, $f = 30$, $b = 8$, $r = 0.1$, and $m = 16{,}384$; the total filter capacity (i.e., the number of storage bits) is fixed at 131,072. The results of the above experiments are shown in FIGS. 4a, 4b, 4c, 4d, 4e and 4f.
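The default settings can be checked with a short worked calculation. The per-segment figures assume $s = 64$ (the largest value in the experiments) and a global hash that spreads elements uniformly over the segments; the labels in parentheses are descriptive only.

```latex
\begin{align*}
m \cdot b &= 16{,}384 \times 8 = 131{,}072
  && \text{(total storage bits)} \\
\frac{n}{m \cdot b} &= \frac{100{,}000}{131{,}072} \approx 0.763
  && \text{(overall load if every insertion succeeded)} \\
\frac{m}{s} \cdot b &= \frac{16{,}384}{64} \times 8 = 2{,}048
  && \text{(storage bits per segment at } s = 64\text{)} \\
\frac{n}{s} &= \frac{100{,}000}{64} = 1{,}562.5
  && \text{(expected elements per segment at } s = 64\text{)}
\end{align*}
```

The average load per segment is the same as the overall load, but each segment has far fewer storage cells in which a kicked-out fingerprint can be relocated.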
FIG. 4a shows the relationship between the number of insertion failures and the number of segments $s$ in the test experiment, and FIG. 4b shows the relationship between the insertion time and the number of segments $s$. As can be seen from FIG. 4a and FIG. 4b, as the number of segments $s$ in the PCF increases from 1 to 64, the number of insertion failures increases from 39 to 18,414, while the insertion time decreases from 17,071 s to 77 s. As the number of segments $s$ increases, the number of storage cells in each segment decreases accordingly. For a given reallocation coefficient $r$, each segment therefore searches fewer storage cells to accommodate an inserted element, which results in more insertion failures and less time consumption. It should be noted that when $s = 1$, the filter is a standard CF. The CF has the fewest insertion failures, but it takes the most time, which is unacceptable.
FIG. 4c shows the relationship between the number of insertion failures and the number of storage bits $b$ in the test experiment, and FIG. 4d shows the relationship between the insertion time and the number of storage bits $b$. As can be seen from FIG. 4c and FIG. 4d, when the number of storage bits $b$ in each storage cell increases from 1 to 64, both the number of insertion failures and the total insertion time decrease rapidly. With more storage bits in each storage cell, a conflicting element finds an empty storage bit in its candidate cells with higher probability through the reallocation process. Further, the reallocation coefficient $r$ is adjusted herein to influence the maximum number of reallocations in each segment.
FIG. 4e shows the relationship between the number of insertion failures and the reallocation coefficient $r$ in the test experiment, and FIG. 4f shows the relationship between the insertion time and the reallocation coefficient $r$. As can be seen from FIG. 4e and FIG. 4f, the reallocation strategy improves the element insertion success rate. If reallocation is not allowed, the PCF incurs $1.6 \times 10^4$ insertion failures, but when $r$ is 0.01, the number of insertion failures drops sharply to 8,470. Thereafter, even when $r$ is increased to 0.8, the number of insertion failures shows only a slow downward trend and still exceeds 8,000. This phenomenon is partly due to endless loops that arise when fingerprints are reallocated in the PCF. Because of such loops, increasing the value of $r$ does not increase the chance that the PCF explores more empty storage bits. Furthermore, the total time consumption increases from 311 s to 434 s as reallocation increases.
The PCF inserts elements in a parallel mode, so that the element insertion speed is improved, and the number of failed insertion elements can be well controlled by adjusting the storage bit number and the maximum redistribution times in each storage cell.
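The independence of the segments can be illustrated with a small Python sketch that groups a batch of elements by segment and processes each group in its own worker. The per-segment operation here is a plain list append standing in for the cuckoo insertion, and the hash choice, `S`, and the function names are assumptions. In CPython, threads mainly demonstrate that no locking between segments is needed; a real speed-up would typically require processes or a native implementation.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

S = 4  # s: number of segments (illustrative value)


def segment_of(x: str) -> int:
    """Stand-in for the global hash: map an element to one of S segments."""
    d = hashlib.blake2b(x.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(d, "big") % S


def insert_batch(segments: list, elements: list) -> dict:
    """Group elements by segment, then handle each segment in its own worker."""
    groups = defaultdict(list)
    for x in elements:
        groups[segment_of(x)].append(x)

    def work(item):
        seg_id, xs = item
        segments[seg_id].extend(xs)  # stand-in for cuckoo insertion in this segment
        return seg_id, len(xs)

    with ThreadPoolExecutor(max_workers=S) as pool:
        return dict(pool.map(work, groups.items()))


segments = [[] for _ in range(S)]
counts = insert_batch(segments, [f"key-{i}" for i in range(100)])
```

Each worker touches exactly one segment, so two workers never write to the same storage cell.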
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
In addition, specific embodiments of the present specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus according to the embodiment may include:
an acquisition module 100 configured to acquire a target element;
a fingerprint calculation module 200 configured to calculate an element fingerprint of the target element;
a segment selection module 300 configured to select segments of a filter from the element fingerprints by a global hash function;
the data operation module 400 is used for processing the element fingerprints by adopting a cuckoo hash algorithm in the selected segments;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of the modules may be implemented in the same or multiple software and/or hardware when implementing the embodiments of the present application.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A data processing method, comprising:
acquiring a target element;
calculating to obtain an element fingerprint of the target element;
selecting a segment of a filter according to the element fingerprint through a global hash function;
performing data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected segment;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
2. The data processing method of claim 1, wherein the step of performing data processing on the element fingerprint using a cuckoo hash algorithm in the selected segment comprises:
selecting two storage cells as two candidate cells according to the element fingerprint;
when the element fingerprint can be stored in either of the two candidate cells, storing the element fingerprint;
and when the element fingerprint cannot be stored in either of the two candidate cells, randomly kicking out a first fingerprint already stored in one of the two candidate cells, and storing the element fingerprint in the storage position vacated by the first fingerprint.
3. The data processing method of claim 2, wherein the kicked-out first fingerprint enters a re-assigning step, the re-assigning step comprising:
obtaining another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell from which the first fingerprint was kicked out;
when the first fingerprint can be stored in the other storage cell, storing the first fingerprint;
and when the first fingerprint cannot be stored in the other storage cell, randomly kicking out a second fingerprint already stored in the other storage cell, and storing the first fingerprint in the position vacated by the second fingerprint.
4. The data processing method of claim 3, wherein, after the step of storing the first fingerprint in the position vacated by the second fingerprint, the method further comprises:
taking the second fingerprint as the new first fingerprint, and repeating the re-assigning step until all element fingerprints are stored or the number of iterations exceeds a threshold value.
5. The data processing method of claim 1, wherein the step of performing data processing on the element fingerprint using a cuckoo hash algorithm in the selected segment comprises:
selecting two storage cells as two candidate cells according to the element fingerprint;
determining whether the element fingerprint exists in either of the two candidate cells;
if yes, determining that the target element belongs to the filter;
if not, determining that the target element does not belong to the filter.
6. The data processing method of claim 1, wherein the step of performing data processing on the element fingerprint using a cuckoo hash algorithm in the selected segment comprises:
determining whether the target element has been inserted into the filter;
if yes, selecting two storage cells as two candidate cells according to the element fingerprint;
and finding and deleting a copy of the element fingerprint in the two candidate cells.
7. The data processing method according to any one of claims 2 to 6, wherein the segments evenly divide the storage cells in the filter.
8. The data processing method according to claim 7, wherein the selecting two storage cells as two candidate cells according to the element fingerprint is implemented by a first hash function and a second hash function, wherein the first hash function is:
Figure DEST_PATH_IMAGE001
the second hash function is:
Figure DEST_PATH_IMAGE002
wherein $x$ is the target element, $\eta_x$ is the element fingerprint of the target element $x$, $m$ is the number of storage cells in the filter, and $s$ is the number of segments of the filter.
9. The data processing method of claim 1, wherein the element fingerprint is obtained by the following formula:
$\eta_x = h_0(x) \bmod 2^f$
wherein $\eta_x$ is the element fingerprint of the target element $x$, and $f$ is the number of bits of $\eta_x$.
10. A data processing apparatus, comprising:
an acquisition module configured to acquire a target element;
a fingerprint calculation module configured to calculate an element fingerprint of the target element;
a segment selection module configured to select a segment of a filter according to the element fingerprint through a global hash function;
a data operation module configured to perform data processing on the element fingerprint by adopting a cuckoo hash algorithm in the selected segment;
wherein the filter comprises a plurality of segments, the segments comprising a plurality of storage cells, the storage cells comprising a plurality of storage bits for storing the element fingerprints.
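The insertion, query, and deletion steps recited in claims 2 to 6 can be sketched in Python as follows. This is a hedged illustration, not the claimed implementation: the first and second hash functions of claim 8 are given only as images, so the sketch substitutes the XOR-based partial-key scheme commonly used in cuckoo filters. The sizes, the BLAKE2b hash, and the names `locate`, `alt_cell`, `insert`, `contains`, and `delete` are all assumptions.

```python
import hashlib
import random

# Assumed sizes. CELLS is per segment and must be a power of two for the XOR step.
F_BITS, SEGMENTS, CELLS, SLOTS, MAX_KICKS = 12, 4, 8, 4, 50


def _h(value: int, salt: bytes) -> int:
    d = hashlib.blake2b(value.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(d, "big")


def alt_cell(fp: int, cell: int) -> int:
    """Other candidate cell, from the fingerprint and the current cell (assumed XOR scheme)."""
    return cell ^ (_h(fp, b"alt") % (CELLS - 1) + 1)  # offset in 1..CELLS-1, so the cells differ


def locate(x: bytes):
    """Return (fingerprint, segment, first candidate cell, second candidate cell)."""
    fp = int.from_bytes(hashlib.blake2b(x, digest_size=8).digest(), "big") % (1 << F_BITS)
    seg = _h(fp, b"global") % SEGMENTS  # global hash picks the segment
    c1 = _h(fp, b"cell") % CELLS
    return fp, seg, c1, alt_cell(fp, c1)


def new_filter() -> list:
    return [[[] for _ in range(CELLS)] for _ in range(SEGMENTS)]


def insert(filt, x: bytes) -> bool:
    fp, seg, c1, c2 = locate(x)
    cells = filt[seg]
    for c in (c1, c2):  # store directly if either candidate cell has room
        if len(cells[c]) < SLOTS:
            cells[c].append(fp)
            return True
    cur = random.choice((c1, c2))
    for _ in range(MAX_KICKS):  # re-assigning loop, bounded by a threshold
        slot = random.randrange(SLOTS)
        fp, cells[cur][slot] = cells[cur][slot], fp  # kick out a random victim
        cur = alt_cell(fp, cur)  # the victim's other cell, in the same segment
        if len(cells[cur]) < SLOTS:
            cells[cur].append(fp)
            return True
    return False  # threshold exceeded; the last victim is dropped in this sketch


def contains(filt, x: bytes) -> bool:
    fp, seg, c1, c2 = locate(x)
    return fp in filt[seg][c1] or fp in filt[seg][c2]


def delete(filt, x: bytes) -> bool:
    fp, seg, c1, c2 = locate(x)
    for c in (c1, c2):
        if fp in filt[seg][c]:
            filt[seg][c].remove(fp)  # removes one stored copy
            return True
    return False
```

Like other fingerprint-based filters, `contains` can return a false positive when two elements share a fingerprint and a segment. `delete` in this sketch assumes the caller only deletes elements that were actually inserted.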
CN202010998674.0A 2020-09-22 2020-09-22 Data processing method and data processing device Pending CN111858651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010998674.0A CN111858651A (en) 2020-09-22 2020-09-22 Data processing method and data processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010998674.0A CN111858651A (en) 2020-09-22 2020-09-22 Data processing method and data processing device

Publications (1)

Publication Number Publication Date
CN111858651A true CN111858651A (en) 2020-10-30

Family

ID=72968505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010998674.0A Pending CN111858651A (en) 2020-09-22 2020-09-22 Data processing method and data processing device

Country Status (1)

Country Link
CN (1) CN111858651A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486025A (en) * 2021-07-28 2021-10-08 北京腾云天下科技有限公司 Data storage method, data query method and device
CN113535706A (en) * 2021-08-03 2021-10-22 重庆赛渝深科技有限公司 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
CN113535705A (en) * 2021-08-03 2021-10-22 佛山赛思禅科技有限公司 SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter
CN113641681A (en) * 2021-10-13 2021-11-12 南京大数据集团有限公司 Space self-adaptive mass data query method
CN114268501A (en) * 2021-12-24 2022-04-01 深信服科技股份有限公司 Data processing method, firewall generation method, computing device and storage medium
CN114527929A (en) * 2020-11-23 2022-05-24 洪文圳 Cloud storage data fusion method based on double-hash fuzzy bloom filter
CN117891858A (en) * 2024-03-14 2024-04-16 苏州大学 Space-time efficient parallel approximate member query method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256130A (en) * 2017-06-06 2017-10-17 华中科技大学 Data store optimization method and system based on Cuckoo Hash calculations
CN109815234A (en) * 2018-12-29 2019-05-28 杭州中科先进技术研究院有限公司 A kind of multiple cuckoo filter under streaming computing model
CN111552692A (en) * 2020-04-30 2020-08-18 南方科技大学 Plus-minus cuckoo filter

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114527929A (en) * 2020-11-23 2022-05-24 洪文圳 Cloud storage data fusion method based on double-hash fuzzy bloom filter
CN113486025A (en) * 2021-07-28 2021-10-08 北京腾云天下科技有限公司 Data storage method, data query method and device
CN113486025B (en) * 2021-07-28 2023-07-25 北京腾云天下科技有限公司 Data storage method, data query method and device
CN113535705B (en) * 2021-08-03 2024-02-02 佛山赛思禅科技有限公司 SFAD cuckoo filter and repeated data deleting method based on SFAD cuckoo filter
CN113535706B (en) * 2021-08-03 2023-05-23 佛山赛思禅科技有限公司 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
CN113535705A (en) * 2021-08-03 2021-10-22 佛山赛思禅科技有限公司 SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter
CN113535706A (en) * 2021-08-03 2021-10-22 重庆赛渝深科技有限公司 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
CN113641681B (en) * 2021-10-13 2022-02-22 南京大数据集团有限公司 Space self-adaptive mass data query method
CN113641681A (en) * 2021-10-13 2021-11-12 南京大数据集团有限公司 Space self-adaptive mass data query method
CN114268501A (en) * 2021-12-24 2022-04-01 深信服科技股份有限公司 Data processing method, firewall generation method, computing device and storage medium
CN114268501B (en) * 2021-12-24 2024-02-23 深信服科技股份有限公司 Data processing method, firewall generating method, computing device and storage medium
CN117891858A (en) * 2024-03-14 2024-04-16 苏州大学 Space-time efficient parallel approximate member query method and system
CN117891858B (en) * 2024-03-14 2024-07-05 苏州大学 Space-time efficient parallel approximate member query method and system

Similar Documents

Publication Publication Date Title
CN111858651A (en) Data processing method and data processing device
KR101367450B1 (en) Performing concurrent rehashing of a hash table for multithreaded applications
CN110489405B (en) Data processing method, device and server
CN108920412B (en) Algorithm automatic tuning method for heterogeneous computer system structure
CN107133228A (en) A kind of method and device of fast resampling
CN112882663B (en) Random writing method, electronic equipment and storage medium
CN115510092A (en) Approximate member query optimization method based on cuckoo filter
CN114556309A (en) Memory space allocation method and device and storage medium
CN106980471B (en) Method and device for improving hard disk writing performance of intelligent equipment
CN110008382B (en) Method, system and equipment for determining TopN data
US9298505B2 (en) Time and space-deterministic task scheduling apparatus and method using multi-dimensional scheme
CN116775695A (en) Dynamic combination query optimization method and device based on index and storage medium
CN108647289B (en) Hash table building method based on valley Hash and bloom filter
CN112269947B (en) Caching method and device for space text data, electronic equipment and storage medium
CN112800057B (en) Fingerprint table management method and device
CN107341113B (en) Cache compression method and device
CN111104435B (en) Metadata organization method, device and equipment and computer readable storage medium
CN112100446B (en) Search method, readable storage medium, and electronic device
CN115221360A (en) Tree structure configuration method and system
CN112506440A (en) Data searching method and equipment based on dichotomy
CN113127694A (en) Data storage method and device, electronic equipment and storage medium
US9824105B2 (en) Adaptive probabilistic indexing with skip lists
CN117891858B (en) Space-time efficient parallel approximate member query method and system
CN113177224B (en) Block chain based data sealing method, device, equipment and storage medium
CN116149573B (en) Method, system, equipment and medium for processing queue by RAID card cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination