CN111723266A - Mass data processing method and device - Google Patents

Publication number
CN111723266A
Authority
CN
China
Prior art keywords
hash
storage
values
hash value
storage block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910207946.8A
Other languages
Chinese (zh)
Inventor
余伟伟
闫创
任莉强
邢淇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910207946.8A
Publication of CN111723266A
Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The disclosure provides a mass data processing method and device and relates to the field of computers. The storage location of a first element in a hash table is located through the hash value of the first element, and the first element is then searched for among the elements stored in the storage block at that location. Because the element itself, rather than its hash value, is what is searched for, whether the first element is a repeated element can be judged accurately, which solves the problem of repeated elements being misjudged because of 'hash collisions'. Because locating the storage block limits the search to a small number of elements, judgment efficiency is preserved as well. The method and device are suitable for duplicate-data judgment scenarios in mass data processing.

Description

Mass data processing method and device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a method and an apparatus for processing mass data.
Background
A web crawler fetches web page content based on the web page address, also called a Uniform Resource Locator (URL). Before crawling, it is necessary to determine whether the web page to be crawled has already been crawled, so as to avoid crawling the same web page repeatedly.
Because the number of web pages is huge, some related technologies improve judgment efficiency by calculating the hash values of the web page addresses already crawled and the hash value of the web page address to be crawled. If the hash value of the address to be crawled equals the hash value of any address already crawled, the address to be crawled is judged to be an address the crawler has already crawled, that is, a repeated web page address. Repeated web page addresses are not added to the queue of addresses to be crawled, which prevents the same web page from being crawled repeatedly.
Disclosure of Invention
The inventors found that a hash algorithm compresses an input of arbitrary length into an output of fixed length (the hash value), so a 'hash collision' can occur in which different web page addresses correspond to the same hash value. A web page address that has not actually been crawled may then be wrongly judged to be a crawled address, that is, the duplicate-data judgment is wrong. If the web page address to be crawled is instead compared directly with every crawled web page address, the misjudgment is avoided, but with a large number of web page addresses the judgment efficiency drops to an unacceptable level.
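The collision problem can be illustrated with a minimal sketch. The 8-bit hash below (the first byte of SHA-256) and the example.com addresses are assumptions chosen only to make a collision easy to find; they are not part of the disclosure.

```python
import hashlib

def tiny_hash(url: str) -> int:
    """Hypothetical 8-bit hash: the first byte of SHA-256, deliberately tiny."""
    return hashlib.sha256(url.encode("utf-8")).digest()[0]

def first_collision(urls):
    """Return two different URLs that share a hash value, or None."""
    seen = {}  # hash value -> first URL seen with that value
    for url in urls:
        value = tiny_hash(url)
        if value in seen and seen[value] != url:
            return seen[value], url
        seen.setdefault(value, url)
    return None

# 300 distinct URLs but only 256 possible hash values, so a collision must exist.
urls = [f"https://example.com/page/{i}" for i in range(300)]
crawled, pending = first_collision(urls)

# A hash-only comparison wrongly reports `pending` as already crawled.
wrongly_flagged = tiny_hash(pending) == tiny_hash(crawled)
```

With a realistic hash width collisions are far rarer, but over a very large number of addresses they cannot be ruled out, which is the misjudgment described above.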
In view of this, the present disclosure provides a solution that addresses the misjudgment of duplicate data while also preserving judgment efficiency, and that can be applied to mass data processing.
Some embodiments of the present disclosure provide a method for processing mass data, including:
acquiring a first element to be processed;
calculating a hash value of the first element;
determining, from a hash table, a storage block corresponding to the hash value of the first element;
searching the storage block corresponding to the hash value of the first element for the first element;
if the first element is found, outputting a result that the first element is a repeated element.
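The steps above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the block count, the use of SHA-256, and the names `block_index` and `BlockedHashTable` are all assumptions.

```python
import hashlib

NUM_BLOCKS = 1024  # hypothetical number of storage blocks

def block_index(element: str) -> int:
    """Hash the element and reduce the hash value to a storage block index."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BLOCKS

class BlockedHashTable:
    def __init__(self):
        # Each storage block holds the elements themselves, not their hashes.
        self.blocks = [[] for _ in range(NUM_BLOCKS)]

    def is_duplicate(self, element: str) -> bool:
        # Locate the block by hash value, then compare actual elements,
        # so two colliding elements are never confused with each other.
        return element in self.blocks[block_index(element)]

    def record(self, element: str) -> None:
        block = self.blocks[block_index(element)]
        if element not in block:
            block.append(element)

table = BlockedHashTable()
table.record("https://example.com/a")
```

Here `table.is_duplicate("https://example.com/a")` is true, while an address that was never recorded is reported as not repeated even if it happens to land in the same block.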
In some embodiments, the hash value of the first element comprises a first hash value, obtained by hashing the first element with a first hash function, and at least one second hash value, obtained by hashing the first element with at least one second hash function;
in the case where it is determined that the first element is not a repeated element:
if the values of the plurality of bits to which the first element is mapped in a bit array include an initial value, recording the first element in a first storage block corresponding to the first hash value in the hash table, and changing the values of the plurality of bits to which the first element is mapped in the bit array to non-initial values;
and if the values of the plurality of bits to which the first element is mapped in the bit array are all non-initial values, recording the first element in a second storage block corresponding to the second hash value in the hash table.
In some embodiments, the first element to be processed is read from a scratch area;
if the first storage block corresponding to the first hash value in the hash table has no empty storage bit, the element in any first storage bit of the first storage block is moved to the scratch area, and the first element is recorded in that first storage bit;
or, if the second storage block corresponding to the second hash value in the hash table has no empty storage bit, the element in any second storage bit of the second storage block is moved to a cache, and the first element is recorded in that second storage bit.
In some embodiments, the changing method comprises:
changing the values of the plurality of bits to which the first hash value or the first element is mapped in the bit array to a preset value, or
increasing the values of the plurality of bits to which the first hash value or the first element is mapped in the bit array by 1 each.
In some embodiments, in the case where the changing method is to increase the values of the plurality of bits to which the first element is mapped in the bit array by 1 each, the method further includes:
acquiring a second element to be deleted, wherein a third hash value is obtained by hashing the second element with the first hash function, and at least one fourth hash value is obtained by hashing the second element with the at least one second hash function;
judging whether the values of the plurality of bits to which the second element is mapped in the bit array are all non-initial values;
if the values of the plurality of bits to which the second element is mapped in the bit array are all non-initial values, searching a third storage block corresponding to the third hash value in the hash table for the second element, deleting the second element found from the third storage block, and decreasing the values of the plurality of bits to which the second element is mapped in the bit array by 1 each;
and if the values of the plurality of bits to which the second element is mapped in the bit array include an initial value, or the second element is not found in the third storage block, searching a fourth storage block corresponding to the fourth hash value in the hash table for the second element, and deleting the second element found from the fourth storage block.
In some embodiments, if the second element is not found in the hash table, the cache is searched for the second element, and the second element found is deleted from the cache;
or, if the second element is not found in the hash table or the cache, the database is searched for the second element, and the second element found is deleted from the database.
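The deletion embodiments above can be sketched as follows, assuming a counting bit array. The sizes, the salted SHA-256 helper `_h`, the three mapping functions, the unbounded blocks, and the use of two separate tables for the first and second hash functions are all assumptions of this sketch; the cache and database fallbacks are omitted.

```python
import hashlib

NUM_BLOCKS, NUM_BITS = 64, 256  # hypothetical sizes

def _h(element, salt, modulus):
    """Hypothetical family of hash functions selected by a salt string."""
    digest = hashlib.sha256((salt + element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus

def bit_positions(element):
    return [_h(element, f"g{i}", NUM_BITS) for i in range(3)]

class CountingStore:
    def __init__(self):
        self.counters = [0] * NUM_BITS                  # initial value is 0
        self.table1 = [[] for _ in range(NUM_BLOCKS)]   # addressed by h1
        self.table2 = [[] for _ in range(NUM_BLOCKS)]   # addressed by h2

    def record(self, e):
        positions = bit_positions(e)
        if any(self.counters[p] == 0 for p in positions):
            self.table1[_h(e, "h1", NUM_BLOCKS)].append(e)
            for p in positions:
                self.counters[p] += 1   # "increase by 1" changing method
        else:
            self.table2[_h(e, "h2", NUM_BLOCKS)].append(e)

    def delete(self, e):
        positions = bit_positions(e)
        if all(self.counters[p] > 0 for p in positions):
            third_block = self.table1[_h(e, "h1", NUM_BLOCKS)]
            if e in third_block:
                third_block.remove(e)
                for p in positions:
                    self.counters[p] -= 1   # undo the increments made on record
                return True
        fourth_block = self.table2[_h(e, "h2", NUM_BLOCKS)]
        if e in fourth_block:
            fourth_block.remove(e)
            return True
        return False

store = CountingStore()
store.record("https://example.com/a")
first_delete = store.delete("https://example.com/a")
second_delete = store.delete("https://example.com/a")
```

Counters are decremented only when the element is removed from the block addressed by the first hash function, because only that recording path incremented them.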
In some embodiments, searching the storage block corresponding to the hash value of the first element for the first element comprises:
judging whether the values of the plurality of bits to which the first element is mapped in the bit array are all non-initial values;
if the values of the plurality of bits to which the first element is mapped in the bit array are all non-initial values, searching the first storage block corresponding to the first hash value in the hash table for the first element;
and if the values of the plurality of bits to which the first element is mapped in the bit array include an initial value, or the first element is not found in the first storage block, searching the second storage block corresponding to the second hash value in the hash table for the first element.
In some embodiments, the method further comprises:
before the judging step, searching the scratch area for the first element;
or, if the first element is not found in the hash table, searching the cache for the first element;
or, if the first element is not found in the hash table or the cache, searching the database for the first element.
In some embodiments, when the element in any second storage bit of the second storage block is moved to the cache and the cache is full, the element in the cache with the fewest accesses, or the element that has gone unaccessed for the longest time, is stored in the database, and the element from the second storage bit is stored in the position that element occupied in the cache.
In some embodiments, elements in the cache that have been accessed more than a preset number of times are deleted from the cache and stored in the scratch area.
In some embodiments, the mapping of elements to bits in the bit array is implemented based on a Bloom filter, a counting Bloom filter, or a cuckoo filter;
the element is a web page address or a directory.
In some embodiments, the storage block includes a plurality of storage bits for recording elements.
Some embodiments of the present disclosure provide a mass data processing apparatus, including:
a memory; and
a processor coupled to the memory, the processor configured to execute the mass data processing method of any of the foregoing embodiments based on instructions stored in the memory.
Some embodiments of the present disclosure propose a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the mass data processing method in any of the foregoing embodiments.
Drawings
The drawings used in the description of the embodiments or the related art are briefly described below. The present disclosure will be more clearly understood from the following detailed description, which proceeds with reference to the accompanying drawings.
It is to be understood that the drawings described below are merely examples of the disclosure, and that a person of ordinary skill in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a memory structure according to some embodiments of the present disclosure.
Fig. 2 is a schematic diagram of a method for finding elements suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 3 is a schematic diagram of a method for finding elements suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 4 is a schematic diagram of a method of recording elements suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 5 is a schematic diagram of a method for deleting an element suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 6 is a schematic diagram of a mass data processing device according to some embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure.
In the present disclosure, "first", "second", "third", "fourth", and the like are described only for distinguishing different objects, and are not used to indicate the meanings of size, timing, and the like.
For ease of description, the following nomenclature is used:
An element (denoted e) is hashed with a hash function (denoted h) to obtain a hash value (denoted h(e)).
The first element (denoted e1) is hashed with a first hash function (denoted h1) to obtain a first hash value (denoted h1(e1)).
The first element is hashed with at least one second hash function (denoted h2) to obtain at least one second hash value (denoted h2(e1)).
The second element (denoted e2) is hashed with the first hash function to obtain a third hash value (denoted h1(e2)).
The second element is hashed with the at least one second hash function to obtain at least one fourth hash value (denoted h2(e2)).
The element may be, for example, a web page address or directory, etc.
Fig. 1 is a schematic diagram of a memory structure according to some embodiments of the present disclosure.
As shown in fig. 1, the memory structure of this embodiment includes a hash table 11 and may optionally further include a filter 12, a scratch area 13 (also called the temporary storage area or scratch pad), a cache 14 (also called the cache region or buffer), or a database 15.
The hash table 11 is used to store existing elements. The hash table 11 includes a large number of storage blocks (also referred to as data blocks). A storage block is addressed according to the hash value of an element. Each storage block includes one or more storage bits (slots) for recording elements. The multiple storage bits of one storage block can store multiple elements with the same hash value. In some embodiments, the hash table 11 may be, for example, a cuckoo hash table.
The hash table 11 may use one or more hash functions for addressing. When the hash table 11 uses a plurality of hash functions, which hash function is used for addressing may be determined in combination with the mapping of the element into the bit array of the filter 12. This is described in detail below in connection with the lookup and recording processes of elements. Further, the hash table may take the form of one table or a plurality of tables. A table may correspond to one or more hash functions.
The filter 12 may map any element e to a plurality of bits in a bit array through a predetermined algorithm. In some embodiments, a plurality of hash values of the element e are computed, for example by hashing e with a plurality of hash functions, and each hash value is mapped to one bit in the bit array, so that the element e is mapped to a plurality of bits. The hash function used by the filter 12 for mapping and the hash function used by the hash table 11 may be the same or different; if they are the same, the amount of calculation and the number of memory accesses can be further reduced. For example, suppose the filter 12 and the hash table 11 both use the hash function h1 and the element is a url. First, h1(url) is calculated. Then the bit-mapping functions set by the filter, such as g1, g2, and g3 (which may also be hash functions), are applied to obtain g1(h1(url)), g2(h1(url)), and g3(h1(url)), and each result is mapped to the corresponding bit in the bit array. In some embodiments, the mapping of an element to multiple bits in the bit array may be implemented based on, for example, a Bloom filter, a counting Bloom filter, or a cuckoo filter.
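The shared-hash example can be sketched as follows. The concrete choices here (SHA-256 for h1, BLAKE2b for the mapping functions g1 to g3, and the bit array size) are assumptions for illustration; the disclosure does not name specific functions.

```python
import hashlib

NUM_BITS = 1 << 16  # hypothetical bit array size

def h1(url: str) -> int:
    """Hash shared by the filter and the hash table (computed once per element)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def g(i: int, value: int) -> int:
    """i-th bit-mapping function, applied to h1(url) rather than to the url."""
    data = f"{i}:{value}".encode("ascii")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_BITS

def mapped_bits(url: str, k: int = 3) -> list:
    base = h1(url)  # reused for every g_i and available for block addressing
    return [g(i, base) for i in range(1, k + 1)]

bits_for_url = mapped_bits("https://example.com/")
```

Because `h1(url)` is computed once and reused, the potentially long address string is hashed only a single time per lookup.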
The process of recording an element e that is not a repeated element in the hash table is referred to as the recording process of the element. In the recording process, the storage location of the element e in the hash table 11 is determined according to the values of the plurality of bits in the bit array to which the element e is mapped. For example, if those values include an initial value, the element e may be recorded in the data block corresponding to h1(e) in the hash table; if those values are all non-initial values, the element e may be recorded in the data block corresponding to h2(e). In the initial state, all bits of the bit array have the same value, the initial value (e.g., 0). If the element e is recorded in the data block corresponding to h1(e) in the hash table 11, the values of the bits in the bit array to which the element e is mapped (the mapping may take either the element e itself or its hash value h1(e) as input) are changed to non-initial values. There are two changing methods: the values of the mapped bits are changed to a preset value (for example, fixed to 1), or the values of the mapped bits are each increased by 1 (the bits then act as counters, and a changed value may equal 1 or exceed 1).
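The selection rule and the two changing methods can be sketched as a small function. The name `choose_block_and_update`, the list-based bit array, and the initial value 0 are assumptions of this sketch.

```python
def choose_block_and_update(bits, positions, counting=False):
    """Pick the addressing hash for recording and update the bit array in place.

    Returns "h1" if any mapped bit still holds the initial value 0,
    otherwise "h2". The bits are changed only when "h1" is chosen.
    """
    if any(bits[p] == 0 for p in positions):
        for p in positions:
            # Changing method: fixed preset value 1, or increase by 1.
            bits[p] = bits[p] + 1 if counting else 1
        return "h1"
    return "h2"

bits = [0] * 8
first = choose_block_and_update(bits, [1, 3, 5])               # some bits are 0
second = choose_block_and_update(bits, [1, 3])                 # all bits non-zero
third = choose_block_and_update(bits, [1, 2], counting=True)   # bit 2 is still 0
```

After these calls `bits` is `[0, 2, 1, 1, 0, 1, 0, 0]`: bit 1 was set to 1 by the first call and incremented by the third.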
The process of determining whether an element is a repeated element is referred to as the lookup process of the element. In the lookup process, the presence state of the element e in the data block corresponding to h1(e) in the hash table 11 can be determined from the values of the plurality of bits in the bit array to which the element e is mapped. Given the properties of hashing and the mapping rules, the presence state is either "absent" or "possibly present". If the values of the bits to which an element e is mapped include an initial value, the element e does not exist in the data block corresponding to h1(e) in the hash table 11; it may exist in the data block corresponding to h2(e) in the hash table 11, and, if a cache and a database are provided, it may exist in the cache or the database. If the values of the bits to which an element e is mapped are all non-initial values, the element e may exist in the data block corresponding to h1(e) in the hash table 11. Its presence is only possible rather than certain, because those bits are all non-initial in two situations: when the element e itself has been recorded in the data block corresponding to h1(e), and when several other elements e' whose mapped bits together cover the bits of e have been recorded in the data blocks corresponding to h1(e').
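The two presence states, and the way other elements can make an unrecorded element look "possibly present", can be shown with a small worked example. The bit positions below are chosen by hand for illustration and are not derived from any real hash function.

```python
def presence_in_first_block(bits, positions):
    """'possibly present' if every mapped bit is non-initial, else 'absent'."""
    if all(bits[p] != 0 for p in positions):
        return "possibly present"
    return "absent"

bits = [0] * 8
e_positions = [2, 3]          # bits to which element e maps (e is never recorded)

before = presence_in_first_block(bits, e_positions)

for p in (1, 2):              # another element e' is recorded via h1
    bits[p] = 1
for p in (3, 4):              # a further element e'' is recorded via h1
    bits[p] = 1

after = presence_in_first_block(bits, e_positions)
```

Here `before` is "absent" and `after` is "possibly present", even though e was never recorded, which is why the storage block must still be searched for the element itself.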
The scratch area 13 is used to temporarily hold non-repeated elements that need to be recorded in the hash table 11. The scratch area 13 may be implemented as a First-In-First-Out (FIFO) queue.
The cache 14 is used to cache elements to be inserted into the database and elements retrieved from the database. The cache 14 may take the form of a cache table. The cache 14 may be updated with a least frequently used (LFU) algorithm or a least recently used (LRU) algorithm. Placing the cache 14 between the hash table 11 and the database 15 to hold frequently accessed elements reduces the number of database accesses.
The database 15 is used to store elements that are evicted from the cache 14. The database 15 is, for example, a key-value database.
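The interplay between the cache and the database can be sketched as follows, assuming eviction of the least-accessed element. The class name `Cache`, the dictionary of access counts, and the Python set standing in for a key-value database are assumptions of this sketch.

```python
class Cache:
    def __init__(self, capacity, database):
        self.capacity = capacity
        self.counts = {}          # element -> number of accesses
        self.database = database  # a set standing in for a key-value database

    def put(self, element):
        if element in self.counts:
            return
        if len(self.counts) >= self.capacity:
            # Evict the element with the fewest accesses to the database.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
            self.database.add(victim)
        self.counts[element] = 0  # initial access count

    def lookup(self, element):
        if element in self.counts:
            self.counts[element] += 1
            return True
        return False

db = set()
cache = Cache(capacity=2, database=db)
cache.put("a")
cache.put("b")
cache.lookup("a")   # "a" now has one access, "b" has none
cache.put("c")      # cache is full, so "b" is evicted to the database
```

Swapping the `min` over access counts for a recency ordering would give the "not accessed for the longest time" variant instead.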
Fig. 2 is a schematic diagram of a method for finding elements suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 2, the method of this embodiment includes:
In step 21, a first element to be processed is obtained in order to determine whether the first element is a repeated element.
In step 22, if a scratch area is provided, the scratch area is searched for the first element. If the first element is found in the scratch area, a result that the first element is a repeated element is output (step 29). If the first element is not found in the scratch area, or if no scratch area is provided, step 23 is performed.
At step 23, a hash value of the first element is calculated.
For example, based on a preset hash function, a hash operation is performed on the first element to obtain a hash value of the first element.
In step 24, the storage block corresponding to the hash value of the first element is determined from the hash table.
Different hash values correspond to different storage blocks, so the corresponding storage block can be determined from the hash value.
In step 25, the storage block corresponding to the hash value of the first element is searched for the first element.
If the first element is found in the storage block, a result that the first element is a repeated element is output (step 29). If the first element is not found in the storage block and neither a cache nor a database is provided, a result that the first element is not a repeated element is output (step 28). If a cache or a database is provided, step 26 or step 27 may follow when the first element is not found in the storage block. If both a cache and a database are provided, step 26 may be performed first, and step 27 is performed if the first element is not found in the cache.
In step 26, the cache is searched for the first element.
If the first element is found in the cache, a result that the first element is a repeated element is output (step 29). Where no database is provided, if the first element is not found in the cache, a result that the first element is not a repeated element is output (step 28). Where a database is provided, if the first element is not found in the cache, step 27 is performed.
Furthermore, if the first element is found in the cache, the access count of the first element is increased by 1. In some embodiments, elements in the cache whose access count exceeds a preset number may be deleted from the cache and stored in the scratch area, to further improve lookup efficiency.
In step 27, the database is searched for the first element.
If the first element is found in the database, a result that the first element is a repeated element is output (step 29). If the first element is not found in the database, a result that the first element is not a repeated element is output (step 28).
At step 28, the result is output that the first element is not a repeat element.
In step 29, the result is output that the first element is a repeated element.
In this embodiment, the storage location of the first element in the hash table is located through the hash value of the first element, and the first element is then searched for among the elements stored in the storage block at that location. Because the element itself, rather than its hash value, is what is searched for, whether the first element is a repeated element can be judged accurately, which solves the problem of repeated elements being misjudged because of hash collisions. Because locating the storage block limits the search to a small number of elements, judgment efficiency is preserved as well, so the scheme can be applied to duplicate-data judgment in mass data processing.
In addition, the hierarchical storage structure makes better use of the locality of data access, which keeps element lookup efficient, and by combining in-memory and external storage it mitigates the problem of memory overflow.
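The lookup flow of fig. 2 can be sketched as one function that walks the storage tiers in order. The container types (a deque for the scratch area, a dictionary of access counts for the cache, a set for the database) and the helper `block_of` are assumptions of this sketch; each optional tier is skipped when it is not provided.

```python
import hashlib
from collections import deque

def block_of(element, table):
    """Locate the storage block addressed by the element's hash value."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return table[int.from_bytes(digest[:8], "big") % len(table)]

def is_duplicate(element, table, scratch=None, cache=None, database=None):
    if scratch is not None and element in scratch:      # step 22
        return True
    if element in block_of(element, table):             # steps 23 to 25
        return True
    if cache is not None and element in cache:          # step 26
        cache[element] += 1                             # count the access
        return True
    if database is not None and element in database:    # step 27
        return True
    return False                                        # step 28

table = [[] for _ in range(16)]
block_of("u1", table).append("u1")
scratch = deque(["u2"])
cache = {"u3": 0}
database = {"u4"}

results = [is_duplicate(u, table, scratch, cache, database)
           for u in ("u1", "u2", "u3", "u4", "u5")]
```

Only "u5", which is stored in none of the tiers, is reported as not repeated.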
Fig. 3 is a schematic diagram of a method for finding elements suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 3, the method of this embodiment includes:
In step 31, a first element to be processed is obtained in order to determine whether the first element is a repeated element.
In step 32, if a scratch area is provided, the scratch area is searched for the first element. If the first element is found in the scratch area, a result that the first element is a repeated element is output (step 39). If the first element is not found in the scratch area, or if no scratch area is provided, step 33 is performed.
In step 33, it is determined whether the values of the bits to which the first element is mapped in the bit array are all non-initial values.
If the values of the bits to which the first element is mapped in the bit array are all non-initial values (which, according to the foregoing analysis, indicates that the first element e1 may exist in the data block corresponding to h1(e1) in the hash table), the first storage block corresponding to the first hash value h1(e1) in the hash table is searched for the first element (step 34).
If the values of the bits to which the first element is mapped in the bit array include an initial value (which, according to the foregoing analysis, indicates that the first element e1 does not exist in the data block corresponding to h1(e1) in the hash table), or if the first element is not found in the first storage block, the second storage block corresponding to the second hash value h2(e1) in the hash table is searched for the first element (step 35). If there are multiple second hash functions h2, step 35 searches all the second storage blocks corresponding to the values h2(e1) for the first element.
Accordingly, if the first element is found in the first storage block or the second storage block, a result that the first element is a repeated element is output (step 39). If the first element is not found in either storage block and neither a cache nor a database is provided, a result that the first element is not a repeated element is output (step 38). If a cache or a database is provided, step 36 or step 37 may follow when the first element is not found in the storage blocks. If both a cache and a database are provided, step 36 may be performed first, and step 37 is performed if the first element is not found in the cache.
In step 36, the cache is searched for the first element.
If the first element is found in the cache, a result that the first element is a repeated element is output (step 39). Where no database is provided, if the first element is not found in the cache, a result that the first element is not a repeated element is output (step 38). Where a database is provided, if the first element is not found in the cache, step 37 is performed.
Furthermore, if the first element is found in the cache, the access count of the first element is increased by 1. In some embodiments, elements in the cache whose access count exceeds a preset number may be deleted from the cache and stored in the scratch area, to further improve lookup efficiency.
In step 37, the database is searched for the first element.
If the first element is found in the database, a result that the first element is a repeated element is output (step 39). If the first element is not found in the database, a result that the first element is not a repeated element is output (step 38).
At step 38, the result is output that the first element is not a repeat element.
In step 39, the result is output that the first element is a repeated element.
In the above embodiment, the values of the bits to which the element is mapped in the bit array determine which hash function's storage location in the hash table is searched for the element, thereby implementing a multi-hash-function element lookup scheme controlled by the bit array.
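Steps 33 to 35 can be sketched as follows. The sketch assumes one table per hash function (a form the description allows), unbounded blocks, three mapping functions, and a salted SHA-256 helper `_h`; the scratch area, cache, and database tiers are left out.

```python
import hashlib

def _h(element, salt, modulus):
    """Hypothetical family of hash functions selected by a salt string."""
    digest = hashlib.sha256((salt + element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus

class BitControlledTable:
    def __init__(self, num_blocks=64, num_bits=256):
        self.bits = [0] * num_bits
        self.table1 = [[] for _ in range(num_blocks)]  # addressed by h1
        self.table2 = [[] for _ in range(num_blocks)]  # addressed by h2

    def _positions(self, e):
        return [_h(e, f"g{i}", len(self.bits)) for i in range(3)]

    def record(self, e):
        positions = self._positions(e)
        if any(self.bits[p] == 0 for p in positions):
            self.table1[_h(e, "h1", len(self.table1))].append(e)
            for p in positions:
                self.bits[p] = 1
        else:
            self.table2[_h(e, "h2", len(self.table2))].append(e)

    def contains(self, e):
        if all(self.bits[p] != 0 for p in self._positions(e)):    # step 33
            if e in self.table1[_h(e, "h1", len(self.table1))]:   # step 34
                return True
        return e in self.table2[_h(e, "h2", len(self.table2))]    # step 35

t = BitControlledTable()
t.record("https://example.com/a")

# Force the second path: with every bit non-initial, recording uses h2.
saturated = BitControlledTable()
saturated.bits = [1] * len(saturated.bits)
saturated.record("https://example.com/c")
```

When any mapped bit still holds the initial value, `contains` skips the first table entirely, which is where the bit array saves a block search.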
Fig. 4 is a schematic diagram of a method of recording elements suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 4, the method of this embodiment includes:
In step 41, an element that is not a repeated element (denoted as the first element e1) is acquired so that it can be recorded in the hash table.
If a scratch area is provided, a first element that the preceding embodiments have determined not to be a repeated element may be stored in the scratch area and then read from it.
In step 42, it is determined whether the values of the bits to which the first element is mapped in the bit array are all non-initial values.
If the values of the bits to which the first element is mapped in the bit array include an initial value, the first element is recorded in the first storage block corresponding to the first hash value in the hash table, and the values of the bits to which the first element is mapped in the bit array are changed to non-initial values (see steps 43 to 46).
If the values of the bits to which the first element is mapped in the bit array are all non-initial values, the first element is recorded in the second storage block corresponding to the second hash value in the hash table (see steps 47 to 49 below).
In step 43, the first storage block corresponding to the first hash value h1(e1) is located in the hash table.
If the first storage block has an empty storage bit, an empty storage bit is selected at random and the first element is recorded in it (step 44). The values of the bits to which the first element is mapped in the bit array are then changed to non-initial values, using one of the changing methods described above (step 45).
If the first storage block has no empty storage bit, the element in any first storage bit of the first storage block is moved to the scratch area, and the first element is recorded in that first storage bit (step 46). In this way, an element displaced by a hash collision gets another chance to be recorded in the hash table, rather than being discarded after repeated hash collisions.
In step 47, the second storage block corresponding to the second hash value h2(e1) is located in the hash table.
If the second storage block has an empty storage bit, an empty storage bit is selected at random and the first element is recorded in it (step 48).
If the second storage block has no empty storage bit, the element in any second storage bit of the second storage block is moved to the cache, and the first element is recorded in that second storage bit (step 49).
When the element from the second storage bit is moved to the cache and the cache is not full, the element is stored in the cache with its initial access count set to 0. If the cache is full, the element in the cache with the fewest accesses, or the element that has gone unaccessed for the longest time, is stored in the database, and the element from the second storage bit is stored in the position that element occupied in the cache.
After step 45, 46, 48 or 49, step 410 is performed: the first element is removed from the temporary storage area.
In step 411, optionally, it is determined whether any elements remain in the temporary storage area.
If elements remain and the number of iterations is less than or equal to a preset number T, the process returns to step 41 and continues; if the number of iterations is greater than T, all elements in the temporary storage area may be added to the cache region (step 412). If no elements remain in the temporary storage area, the process ends.
In the above embodiment, the values of the bits to which an element is mapped in the bit array determine which hash function is used to select the storage location in the hash table where the element is recorded. This realizes a multi-hash-function recording scheme controlled by the bit array, which can improve the space utilization of the hash table.
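As an illustrative sketch only, the recording flow of steps 43 to 412 might be expressed as below. The names (`Recorder`, `record`, `_place`), the SHA-256-based hash helper, the sizes, and the unbounded list used for the cache region are all assumptions introduced for the example. The sketch also assumes that the bits are marked whenever an element is written to its first storage block, including after a displacement.

```python
import hashlib
import random


def _h(element, salt, modulus):
    """Hypothetical hash helper: a salted SHA-256 digest reduced modulo `modulus`."""
    digest = hashlib.sha256(f"{salt}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class Recorder:
    def __init__(self, num_blocks=8, block_size=2, num_bits=64,
                 num_probes=3, max_iterations=16):
        self.blocks = [[None] * block_size for _ in range(num_blocks)]
        self.bits = [0] * num_bits          # bit array, initial value 0
        self.num_probes = num_probes
        self.max_iterations = max_iterations  # the preset number T
        self.staging = []                   # temporary storage area
        self.cache = []                     # cache region (unbounded here)

    def _bit_positions(self, e):
        return [_h(e, f"bit{i}", len(self.bits)) for i in range(self.num_probes)]

    def _place(self, block, element, overflow):
        """Use a random empty storage bit, or displace a random occupant."""
        empty = [i for i, v in enumerate(block) if v is None]
        if empty:
            block[random.choice(empty)] = element
        else:
            slot = random.randrange(len(block))
            overflow.append(block[slot])
            block[slot] = element

    def record(self, element):
        self.staging.append(element)
        iterations = 0
        while self.staging:
            iterations += 1
            if iterations > self.max_iterations:
                # Step 412: give up and move what is left to the cache region.
                self.cache.extend(self.staging)
                self.staging.clear()
                break
            e = self.staging[0]
            positions = self._bit_positions(e)
            if any(self.bits[p] == 0 for p in positions):
                # Steps 43-46: first storage block; displaced element is re-staged.
                first_block = self.blocks[_h(e, "h1", len(self.blocks))]
                self._place(first_block, e, self.staging)
                for p in positions:
                    self.bits[p] = 1
            else:
                # Steps 47-49: second storage block; displaced element is cached.
                second_block = self.blocks[_h(e, "h2", len(self.blocks))]
                self._place(second_block, e, self.cache)
            self.staging.pop(0)             # step 410
```

In this sketch an element displaced from a first storage block already has all of its bits marked, so on its next pass it is routed to its second storage block.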
When the filter is a counting Bloom filter, if element e is recorded in the storage block corresponding to h1(e) in hash table 11, the values of the bits to which element e is mapped in the bit array are each increased by 1. For this recording method, the present disclosure also proposes a method of deleting an element, described below with reference to Fig. 5. The related art offers no solution for deleting elements.
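The counter behaviour of a counting Bloom filter can be illustrated with a short sketch. The class name `CountingBitArray`, the salted SHA-256 mapping, and the default sizes are assumptions for the example and are not taken from the disclosure.

```python
import hashlib


class CountingBitArray:
    """Array of counters; each element maps to several counter positions."""

    def __init__(self, size=64, num_probes=3):
        self.counters = [0] * size     # initial value of every position is 0
        self.num_probes = num_probes

    def positions(self, element):
        """Map an element to `num_probes` positions (hypothetical hash scheme)."""
        result = []
        for i in range(self.num_probes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            result.append(int.from_bytes(digest[:8], "big") % len(self.counters))
        return result

    def increment(self, element):
        """Called when the element is recorded via its first hash value."""
        for p in self.positions(element):
            self.counters[p] += 1

    def decrement(self, element):
        """Called when that element is deleted again."""
        for p in self.positions(element):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def all_non_initial(self, element):
        """True if every mapped position holds a non-initial (non-zero) value."""
        return all(self.counters[p] != 0 for p in self.positions(element))
```

Because counters are incremented rather than set, removing one element leaves the positions shared with other recorded elements non-zero.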
Fig. 5 is a schematic diagram of a method for deleting an element suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 5, the method of this embodiment includes:
in step 51, if a temporary storage area is provided, the element to be deleted, denoted the second element e2, is searched for in the temporary storage area; if the second element e2 is found there, it is deleted.
In step 52, it is determined whether the values of the bits to which the second element is mapped in the bit array are all non-initial values.
If the values of the bits to which the second element is mapped in the bit array are all non-initial values (which, according to the foregoing analysis, indicates that the second element e2 may exist in the storage block corresponding to the third hash value h1(e2) of the hash table), the third storage block, corresponding to the third hash value h1(e2), is searched for the second element (step 53). If found, the second element is deleted from the third storage block and the values of the bits to which it is mapped in the bit array are each reduced by 1 (step 54); the process then ends.
If the values of the bits to which the second element is mapped in the bit array include an initial value (which, according to the foregoing analysis, indicates that the second element e2 does not exist in the storage block corresponding to the third hash value h1(e2) of the hash table), or if the second element is not found in the third storage block, the fourth storage block, corresponding to the fourth hash value h2(e2) of the hash table, is searched for the second element (step 55), and the second element, if found, is deleted from the fourth storage block (step 56).
If a cache region or a database is provided and the second element is not found in the fourth storage block of the hash table, the process may continue with step 57 or step 58. If both a cache region and a database are provided, step 57 may be performed before step 58.
In step 57, the cache region is searched for the second element; if it is found, it is deleted from the cache region.
In step 58, the database is searched for the second element; if it is found, it is deleted from the database.
In the above embodiment, the counting Bloom filter makes it possible to delete an element from the storage structure in which elements are recorded. Taking web crawling as an example business scenario: after web page address 1 is deleted from the storage structure that records already-crawled web page addresses, web page address 1 can be crawled again, which meets the business requirement of re-crawling a given web page.
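The deletion flow of steps 51 to 58 can be sketched as follows, under stated assumptions. The class `Store`, its helper methods `record_first` and `record_second`, the hash helper, and the use of plain lists and a set for the temporary storage area, cache region and database are all hypothetical. The sketch also assumes that deletion stops at the first location where the element is found.

```python
import hashlib


def _h(element, salt, modulus):
    """Hypothetical hash helper: a salted SHA-256 digest reduced modulo `modulus`."""
    digest = hashlib.sha256(f"{salt}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class Store:
    def __init__(self, num_blocks=8, num_counters=64, num_probes=3):
        self.blocks = [[] for _ in range(num_blocks)]
        self.counters = [0] * num_counters   # counting Bloom filter counters
        self.num_probes = num_probes
        self.staging, self.cache, self.database = [], [], set()

    def positions(self, e):
        return [_h(e, f"bit{i}", len(self.counters)) for i in range(self.num_probes)]

    def record_first(self, e):
        """Record via h1 and increase the mapped counters by 1."""
        self.blocks[_h(e, "h1", len(self.blocks))].append(e)
        for p in self.positions(e):
            self.counters[p] += 1

    def record_second(self, e):
        """Record via h2; the counters are left unchanged."""
        self.blocks[_h(e, "h2", len(self.blocks))].append(e)

    def delete(self, e):
        """Return True if the element was found and deleted somewhere."""
        if e in self.staging:                          # step 51
            self.staging.remove(e)
            return True
        positions = self.positions(e)
        if all(self.counters[p] > 0 for p in positions):   # step 52
            third_block = self.blocks[_h(e, "h1", len(self.blocks))]
            if e in third_block:                       # steps 53-54
                third_block.remove(e)
                for p in positions:
                    self.counters[p] -= 1
                return True
        fourth_block = self.blocks[_h(e, "h2", len(self.blocks))]
        if e in fourth_block:                          # steps 55-56
            fourth_block.remove(e)
            return True
        if e in self.cache:                            # step 57
            self.cache.remove(e)
            return True
        if e in self.database:                         # step 58
            self.database.remove(e)
            return True
        return False
```

The counters are reduced only when the element is removed from the block selected by h1, mirroring the fact that they were increased only when it was recorded there.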
Fig. 6 is a schematic diagram of a mass data processing device according to some embodiments of the present disclosure.
As shown in fig. 6, the apparatus of this embodiment includes:
a memory 61; and a processor 62 coupled to the memory, the processor 62 being configured to perform, based on instructions stored in the memory, the mass data processing method of any of the foregoing embodiments.
The memory 61 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (14)

1. A mass data processing method is characterized by comprising the following steps:
acquiring a first element to be processed;
calculating a hash value of the first element;
determining a storage block corresponding to the hash value of the first element from the hash table;
searching whether the first element exists in a storage block corresponding to the hash value of the first element;
if the first element is found, outputting a result that the first element is a repeated element.
2. The method of claim 1,
the hash value of the first element includes: the first element performs hash calculation based on a first hash function to obtain a first hash value, and the first element performs hash calculation based on at least one second hash function to obtain at least one second hash value;
in the case where it is determined that the first element is not a repeating element:
if the first element contains an initial value in the values of the bits mapped in the bit array, recording the first element into a first storage block corresponding to a first hash value of a hash table, and changing the values of the bits mapped in the bit array of the first element into a non-initial value;
and if the values of a plurality of bits mapped in the bit array by the first element are all non-initial values, recording the first element into a second storage block corresponding to a second hash value of the hash table.
3. The method of claim 2, wherein the first element to be processed is read from the scratch pad;
if the first storage block corresponding to the first hash value of the hash table does not have an empty storage bit, storing an element of any first storage bit in the first storage block into a temporary storage area, and recording the first element to the first storage bit;
or if the second storage block corresponding to the second hash value of the hash table does not have an empty storage bit, storing an element of any second storage bit in the second storage block into the buffer area, and recording the first element into the second storage bit.
4. The method of claim 2, wherein the altering method comprises:
changing the values of the plurality of bits to which the first hash value or the first element is mapped in the bit array to a preset value, or,
increasing by 1 each of the values of the plurality of bits to which the first hash value or the first element is mapped in the bit array.
5. The method of claim 2, wherein in a case that the changing method is to increase a value of a plurality of bits mapped in the bit array by 1 for the first element, further comprising:
acquiring a second element to be deleted, wherein the second element is subjected to hash calculation based on a first hash function to obtain a third hash value, and the second element is respectively subjected to hash calculation based on at least one second hash function to obtain at least one fourth hash value;
judging whether the values of a plurality of bits mapped in the bit array by the second element are all non-initial values;
if the values of a plurality of bits mapped in the bit array by the second element are all non-initial values, searching whether the second element exists in a third storage block corresponding to a third hash value of the hash table, deleting the searched second element from the third storage block, and respectively reducing the values of the plurality of bits mapped in the bit array by 1;
and if the second element contains an initial value in the values of the bits mapped in the bit array or the second element is not searched in the third storage block, searching whether the second element exists in a fourth storage block corresponding to a fourth hash value of the hash table, and deleting the searched second element from the fourth storage block.
6. The method of claim 5,
if the second element is not found in the hash table, whether the second element exists is found in the cache region, and the found second element is deleted from the cache region;
or if the second element is not found in the hash table or the cache region, searching whether the second element exists in the database, and deleting the found second element from the database.
7. The method of claim 2, wherein searching for the presence of the first element from the memory block corresponding to the hash value of the first element comprises:
judging whether values of a plurality of bits mapped in the bit array of the first element are all non-initial values;
if the values of a plurality of bits mapped in the bit array by the first element are all non-initial values, searching whether the first element exists in a first storage block corresponding to a first hash value of a hash table;
and if the first element contains an initial value in the values of the bits mapped in the bit array or the first element is not searched in the first storage block, searching whether the first element exists in a second storage block corresponding to a second hash value of the hash table.
8. The method of claim 7, further comprising:
before the judging step, searching whether a first element exists in a temporary storage area;
or if the first element is not searched in the hash table, searching whether the first element exists in the cache region;
or if the first element is not found in the hash table or the cache area, searching whether the first element exists in the database.
9. The method of claim 3,
when any element of the second storage bit in the second storage block is stored in the cache region, if the cache region is full, the element with the least number of accesses in the cache region or the element which is not accessed for the longest time is stored in the database, and the element of the second storage bit is stored in the position where the element with the least number of accesses in the cache region is located or the position where the element which is not accessed for the longest time is located.
10. The method of claim 3,
and deleting the elements with the access times exceeding the preset times in the cache region from the cache region and storing the elements in the temporary storage region.
11. The method of claim 2,
the mapping of elements to bits in the bit array is based on bloom filter, counting bloom filter or cuckoo filter implementations,
the element is a web page address or a directory.
12. The method of any of claims 1-11, wherein the memory block comprises a plurality of memory bits for recording the element.
13. A mass data processing apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the mass data processing method of any one of claims 1-12 based on instructions stored in the memory.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the mass data processing method of any one of claims 1 to 12.
CN201910207946.8A 2019-03-19 2019-03-19 Mass data processing method and device Pending CN111723266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207946.8A CN111723266A (en) 2019-03-19 2019-03-19 Mass data processing method and device

Publications (1)

Publication Number Publication Date
CN111723266A true CN111723266A (en) 2020-09-29

Family

ID=72563202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207946.8A Pending CN111723266A (en) 2019-03-19 2019-03-19 Mass data processing method and device

Country Status (1)

Country Link
CN (1) CN111723266A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0916474A (en) * 1995-06-30 1997-01-17 Fujitsu Ltd Device and method for controlling input/output
JPH11288387A (en) * 1998-12-11 1999-10-19 Fujitsu Ltd Disk cache device
CN101604337A (en) * 2009-07-13 2009-12-16 中兴通讯股份有限公司 Device and method is stored, searched to a kind of hash table
CN101826107A (en) * 2010-04-02 2010-09-08 华为技术有限公司 Hash data processing method and device
CN102467458A (en) * 2010-11-05 2012-05-23 英业达股份有限公司 Method for establishing index of data block
CN104794162A (en) * 2015-03-25 2015-07-22 中国人民大学 Real-time data storage and query method
CN105208075A (en) * 2015-08-12 2015-12-30 新华通讯社 Data collection method and device based on high-dispersion Hash algorithm
CN107766469A (en) * 2017-09-29 2018-03-06 北京金山安全管理系统技术有限公司 A kind of method for caching and processing and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
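张伟; 孙涛; 刘振斌: "Efficient DNS cache system based on Hash storage", Computer Engineering and Design, no. 08 *
李天亮; 石磊: "HCache, an effective hybrid P2P Web cache system", Journal of Computer Applications, no. 06 *
许亚平; 李卓; 刘开华; 马东来; 杨奕康: "Research on a PIT storage structure for named data networking based on an improved MBF", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), no. 01 *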

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114786141A (en) * 2022-04-29 2022-07-22 恒玄科技(上海)股份有限公司 Message filtering method and device in Bluetooth wireless mesh network
CN114786141B (en) * 2022-04-29 2023-11-21 恒玄科技(上海)股份有限公司 Message filtering method and device in Bluetooth wireless mesh network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination