CN111723266B - Mass data processing method and device - Google Patents


Info

Publication number
CN111723266B
Authority
CN
China
Prior art keywords
hash
value
storage
found
bit array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910207946.8A
Other languages
Chinese (zh)
Other versions
CN111723266A (en)
Inventor
余伟伟
闫创
任莉强
邢淇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910207946.8A priority Critical patent/CN111723266B/en
Publication of CN111723266A publication Critical patent/CN111723266A/en
Application granted granted Critical
Publication of CN111723266B publication Critical patent/CN111723266B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The disclosure provides a method and a device for processing mass data, and relates to the field of computers. The hash value of a first element is used to locate the storage position of the first element in a hash table, and the first element is then searched for among the elements stored in the storage block at that storage position. Because what is searched for is the element itself rather than the hash value of the element, whether the first element is a repeated element can be judged accurately, which solves the problem of repeated elements being misjudged because of "hash collision". Because the positioning restricts the search to a small number of elements, the efficiency of judging repeated elements is preserved as well. The method and the device are suitable for repeated data judgment scenarios in mass data processing.

Description

Mass data processing method and device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a method and apparatus for processing mass data.
Background
The web crawler crawls corresponding web page content based on the web page address. The web page address is also referred to as a uniform resource locator (Uniform Resource Locator, URL). Before crawling, it is necessary to determine whether the web page to be crawled has been crawled, so as to avoid repeatedly crawling the same web page.
Because the number of web pages is huge, in order to improve the judging efficiency, in some related technologies, the hash value of the web page address that the web crawler has already grabbed and the hash value of the web page address to be grabbed are calculated, and if the hash value of the web page address to be grabbed is the same as the hash value of any one of the web page addresses that the web crawler has already grabbed, it is judged that the web page address to be grabbed is the web page address that the web crawler has grabbed, and the web page address belongs to the repeated web page addresses. Accordingly, the repeated webpage address is not added to the webpage address queue to be grabbed, so that repeated grabbing of the same webpage is avoided.
Disclosure of Invention
The inventors have found that, because a hash algorithm compressively maps an input of arbitrary length to an output of fixed length (the hash value), there is a "hash collision" problem in which different web page addresses correspond to the same hash value. This may cause a web page address that has not actually been grabbed to be erroneously determined to be an already grabbed web page address, i.e., a repeated data determination error. If the web page address to be grabbed were instead compared directly with every web page address already grabbed, repeated data determination errors would be avoided, but when the number of web page addresses is large, the determination efficiency would be unacceptably low.
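By way of illustration only, the collision problem can be reproduced with a deliberately tiny hash space. In the sketch below, the 8-bit hash (the first byte of SHA-256), the function names and the example URLs are all assumptions chosen so that a collision appears quickly; they are not part of the disclosure.

```python
import hashlib

def tiny_hash(url: str) -> int:
    """Hypothetical 8-bit hash: the first byte of SHA-256 (demo assumption)."""
    return hashlib.sha256(url.encode("utf-8")).digest()[0]

def find_false_duplicate(urls):
    """Return (seen_url, new_url) for the first pair of distinct URLs whose
    hash values collide, i.e. the first URL that a hash-only comparison
    would wrongly report as already grabbed."""
    seen = {}  # hash value -> first URL that produced it
    for url in urls:
        hv = tiny_hash(url)
        if hv in seen and seen[hv] != url:
            return seen[hv], url
        seen[hv] = url
    return None

# 300 distinct URLs but only 256 possible hash values: a collision must occur.
urls = [f"https://example.com/page/{i}" for i in range(300)]
pair = find_false_duplicate(urls)
```

Real hash functions have far larger output spaces, so collisions are much rarer, but with a very large number of web page addresses they cannot be ruled out.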
In view of this, the present disclosure proposes a repeated data solution that can solve the problem of repeated data determination errors, and can give consideration to determination efficiency, and can be applied to mass data processing.
Some embodiments of the present disclosure provide a method for processing mass data, including:
Acquiring a first element to be processed;
calculating a hash value of the first element;
Determining a storage block corresponding to the hash value of the first element from the hash table;
Searching whether the first element exists in a storage block corresponding to the hash value of the first element;
If the first element is found, outputting the result that the first element is a repeating element.
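By way of a non-limiting illustration, the steps above can be sketched as follows. The table size, the use of SHA-256, and the names `BlockTable` and `process` are assumptions made for the sketch; recording a non-repeated element immediately after the search is also an assumption added so that the example is usable on its own.

```python
import hashlib

NUM_BLOCKS = 1024  # hypothetical number of storage blocks

def h(element: str) -> int:
    """Hash value of an element, reduced to a storage block index."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BLOCKS

class BlockTable:
    """Hash table whose storage blocks hold the elements themselves."""

    def __init__(self):
        self.blocks = [[] for _ in range(NUM_BLOCKS)]

    def is_duplicate(self, element: str) -> bool:
        # Locate the block by hash value, then compare actual elements,
        # so two different elements with the same hash are not confused.
        return element in self.blocks[h(element)]

    def record(self, element: str) -> None:
        self.blocks[h(element)].append(element)

def process(table: BlockTable, element: str) -> bool:
    """Return True if the element is a repeated element; otherwise record it."""
    if table.is_duplicate(element):
        return True
    table.record(element)
    return False
```

Only the elements in one storage block are compared, so the cost of each check stays small even when the table holds a very large number of elements.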
In some embodiments, the hash value of the first element comprises: the first element performs hash calculation based on the first hash function to obtain a first hash value, and performs hash calculation based on at least one second hash function to obtain at least one second hash value;
in the event that it is determined that the first element is not a repeating element:
If the value of the plurality of bits mapped in the bit array of the first element contains an initial value, recording the first element into a first storage block corresponding to a first hash value of the hash table, and changing the value of the plurality of bits mapped in the bit array of the first element into a non-initial value;
And if the values of the plurality of bits mapped in the bit array of the first element are all non-initial values, recording the first element into a second storage block corresponding to a second hash value of the hash table.
In some embodiments, a first element to be processed is read from a scratch pad;
if the first storage block corresponding to the first hash value of the hash table has no empty storage bit, storing the element of any one first storage bit in the first storage block into a temporary storage area, and recording the first element into the first storage bit;
Or if the second storage block corresponding to the second hash value of the hash table has no empty storage bit, storing the element of any second storage bit in the second storage block into the cache area, and recording the first element into the second storage bit.
In some embodiments, the altering method comprises:
Changing the first hash value or the value of the plurality of bits mapped in the bit array of the first element to a preset value, or
The first hash value or the value of the plurality of bits mapped in the bit array of the first element is increased by 1, respectively.
In some embodiments, in a case that the changing method is to increase the value of the plurality of bits mapped in the bit array by 1, the method further includes:
Obtaining a second element to be deleted, wherein the second element performs hash calculation based on the first hash function to obtain a third hash value, and performs hash calculation based on at least one second hash function to obtain at least one fourth hash value;
judging whether the values of a plurality of bits mapped in the bit array of the second element are all non-initial values;
if the values of the plurality of bits mapped in the bit array of the second element are all non-initial values, searching whether the second element exists in a third storage block corresponding to a third hash value of the hash table, deleting the searched second element from the third storage block, and respectively reducing the values of the plurality of bits mapped in the bit array of the second element by 1;
if the second element contains an initial value in the values of the plurality of bits mapped in the bit array or the second element is not found in the third storage block, searching whether the second element exists in the fourth storage block corresponding to the fourth hash value of the hash table, and deleting the found second element from the fourth storage block.
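The deletion steps above can be sketched as follows, under stated assumptions: counters with initial value 0, three mapped counters per element, unbounded storage blocks, and two separate tables (one addressed by the first hash function, one by the second) so that a deletion from the first table always matches an earlier counter increment. All names and sizes are hypothetical.

```python
import hashlib

NUM_BLOCKS, NUM_COUNTERS = 64, 256  # hypothetical sizes

def _hash(tag: str, e: str, mod: int) -> int:
    d = hashlib.sha256(f"{tag}:{e}".encode("utf-8")).digest()
    return int.from_bytes(d[:8], "big") % mod

class CountingTable:
    def __init__(self):
        self.counters = [0] * NUM_COUNTERS                # counting bit array
        self.blocks1 = [[] for _ in range(NUM_BLOCKS)]    # addressed by h1
        self.blocks2 = [[] for _ in range(NUM_BLOCKS)]    # addressed by h2

    def _idx(self, e):
        return [_hash(f"g{i}", e, NUM_COUNTERS) for i in range(3)]

    def record(self, e):
        idx = self._idx(e)
        if any(self.counters[i] == 0 for i in idx):
            # Some mapped value is still initial: use the h1 block, add 1 each.
            self.blocks1[_hash("h1", e, NUM_BLOCKS)].append(e)
            for i in idx:
                self.counters[i] += 1
        else:
            self.blocks2[_hash("h2", e, NUM_BLOCKS)].append(e)

    def delete(self, e) -> bool:
        idx = self._idx(e)
        if all(self.counters[i] != 0 for i in idx):
            block = self.blocks1[_hash("h1", e, NUM_BLOCKS)]
            if e in block:
                block.remove(e)
                for i in idx:
                    self.counters[i] -= 1   # undo the increment made on record
                return True
        # Mapped values contain an initial value, or not found via h1: try h2.
        block = self.blocks2[_hash("h2", e, NUM_BLOCKS)]
        if e in block:
            block.remove(e)
            return True
        return False
```

Because each element recorded through the first hash function contributes its own increments, its counters stay non-initial until that element itself is deleted, which is why a plain one-bit array would not support this kind of deletion.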
In some embodiments, if the second element is not found from the hash table, searching for the presence or absence of the second element from the cache, and deleting the found second element from the cache;
Or if the second element is not found from the hash table or the cache region, searching whether the second element exists in the database, and deleting the found second element from the database.
In some embodiments, searching for the presence of the first element from the storage block corresponding to the hash value of the first element includes:
Judging whether the values of a plurality of bits mapped in the bit array of the first element are all non-initial values or not;
If the values of the plurality of bits mapped in the bit array of the first element are all non-initial values, searching whether the first element exists in a first storage block corresponding to a first hash value of the hash table;
If the first element has an initial value in the values of the plurality of bits mapped in the bit array or the first element is not found in the first storage block, searching whether the first element exists in the second storage block corresponding to the second hash value of the hash table.
In some embodiments, further comprising:
Before the judging step, searching whether the first element exists in the temporary storage area;
or if the first element is not found in the hash table, searching whether the first element exists in the cache area;
or if the first element is not found in the hash table or the cache area, searching whether the first element exists in the database.
In some embodiments, when an element of any one second storage bit in the second storage block is stored into the cache area, if the cache area is full, the element with the fewest accesses or the element that has gone unaccessed for the longest time in the cache area is stored into the database, and the element of the second storage bit is stored in the cache area at the position previously occupied by that element.
In some embodiments, elements in the cache that are accessed more than a predetermined number of times are deleted from the cache and stored in the scratch pad.
In some embodiments, the mapping of elements to bits of the bit array is implemented based on a Bloom filter, a counting Bloom filter, or a cuckoo filter,
Wherein the element is a web page address or a directory.
In some embodiments, the memory block includes a plurality of memory bits used to record the elements.
Some embodiments of the present disclosure propose a mass data processing device, including:
A memory; and
A processor coupled to the memory, the processor configured to perform the mass data processing method of any of the foregoing embodiments based on instructions stored in the memory.
Some embodiments of the present disclosure propose a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the mass data processing method of any of the previous embodiments.
Drawings
The drawings required for describing the embodiments or the related art are briefly described below. The present disclosure will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings.
It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without inventive effort.
Fig. 1 is a schematic diagram of a memory structure according to some embodiments of the present disclosure.
Fig. 2 is a schematic diagram of a method of finding elements suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 3 is a schematic diagram of a method of finding elements suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 4 is a schematic diagram of a method of recording elements suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 5 is a schematic diagram of a method of deleting elements suitable for mass data processing according to some embodiments of the present disclosure.
Fig. 6 is a schematic diagram of a mass data processing device according to some embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure.
In the present disclosure, the descriptions of "first", "second", "third", "fourth", and the like are used only to distinguish different objects, and are not used to indicate meanings of size or timing, etc.
For ease of description, the following nomenclature is made:
The element (set as e) performs hash calculation based on the hash function (set as h) to obtain a hash value (set as h(e)).
The first element (set as e1) performs hash calculation based on the first hash function (set as h1) to obtain a first hash value (set as h1(e1)).
The first element performs hash calculation based on at least one second hash function (set as h2) to obtain at least one second hash value (set as h2(e1)).
The second element (set as e2) performs hash calculation based on the first hash function to obtain a third hash value (set as h1(e2)).
The second element performs hash calculation based on at least one second hash function to obtain at least one fourth hash value (set as h2(e2)).
The element may be, for example, a web page address or directory, etc.
Fig. 1 is a schematic diagram of a memory structure according to some embodiments of the present disclosure.
As shown in fig. 1, the storage structure of this embodiment includes: hash table 11, optionally, may also include a filter 12, a scratch pad 13, a cache 14, or a database 15.
The hash table 11 is used to store existing elements. The hash table 11 includes a large number of storage blocks (blocks), also referred to herein as data blocks or memory blocks. A storage block may be addressed according to the hash value of an element. Each storage block includes one or more storage bits (slots) for recording elements. Multiple storage bits of the same storage block may store multiple elements with the same hash value. In some embodiments, hash table 11 may employ, for example, a cuckoo hash table (Cuckoo Hash Table).
The hash table 11 may use one or more hash functions for addressing. When the hash table 11 uses a plurality of hash functions, which hash function to address with may be determined from the values of the bits to which the filter 12 maps the element. Details are given below in the descriptions of the element search process and the element recording process. Furthermore, the hash table may take the form of one table or of a plurality of tables. A table may correspond to one or more hash functions.
The filter 12 may map an arbitrary element e to a plurality of bits in the bit array by a preset algorithm. In some embodiments, a plurality of hash values of element e are calculated and each hash value is mapped to one bit in the bit array, thereby mapping the element e to a plurality of bits in the bit array. The plurality of hash values of element e may be obtained, for example, by performing hash calculation based on a plurality of hash functions. The hash function used for the mapping of the filter 12 may be the same as or different from the hash function used by the hash table 11; if they are the same, the amount of calculation and the number of memory accesses can be further reduced. For example, suppose the filter 12 and the hash table 11 both use the hash function h1 and the element is a url. First, h1(url) is calculated. The result is then mapped to the corresponding bits in the bit array by the bit-mapping functions set by the filter, such as g1, g2 and g3 (these functions may also be hash functions), giving g1(h1(url)), g2(h1(url)) and g3(h1(url)). In some embodiments, the mapping of elements to the plurality of bits in the bit array may be implemented based on, for example, a Bloom filter (Bloom Filter), a counting Bloom filter (Counting Bloom Filter), or a cuckoo filter (Cuckoo Filter).
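As a minimal sketch of the shared-hash example in this paragraph, the code below computes h1(url) once and derives three bit positions from it. The concrete choices (SHA-256 for h1, seeded BLAKE2b for g1 to g3, a bit array of 65536 bits) and the helper names `make_g` and `mapped_bits` are assumptions made for illustration.

```python
import hashlib

BIT_ARRAY_SIZE = 1 << 16  # hypothetical bit array length

def h1(url: str) -> int:
    """Hash function assumed to be shared by the filter and the hash table."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def make_g(seed: int):
    """Build one bit-mapping function g_seed that maps h1(url) to a bit index."""
    def g(value: int) -> int:
        data = seed.to_bytes(4, "big") + value.to_bytes(8, "big")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % BIT_ARRAY_SIZE
    return g

g1, g2, g3 = make_g(1), make_g(2), make_g(3)

def mapped_bits(url: str):
    """Return [g1(h1(url)), g2(h1(url)), g3(h1(url))]; h1 is computed once."""
    base = h1(url)
    return [g(base) for g in (g1, g2, g3)]
```

The value returned by `h1` can be reused to address the hash table, so the url itself is hashed only once per operation.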
The process of recording the element e, which is not a repeated element, to the hash table is referred to as an element recording process. In the recording process of the element, the storage position of the element e in the hash table 11 is determined according to the values of the plurality of bits mapped to the bit array by the element e. For example, if the values of the bits in the bit array to which the element e is mapped contain initial values, the element e may be recorded in the data block corresponding to h1 (e) in the hash table, and if the values of the bits in the bit array to which the element e is mapped are all non-initial values, the element e may be recorded in the data block corresponding to h2 (e) in the hash table. In the initial state, all bits of the bit array have the same value and are set to an initial value (e.g., 0). If element e is recorded in the data block corresponding to h1 (e) in hash table 11, the value of the corresponding bit in the bit array to which element e (including element e itself or hash value h1 (e) of element e) is mapped is changed to a non-initial value. The modification method is, for example: the value of the plurality of bits mapped is changed to a preset value (for example, is fixedly changed to 1), or the value of the plurality of bits mapped is respectively increased by 1 (the value of the bits has a counting function, and can be equal to 1 or more than 1 after the change).
The process of determining whether an element is a repeating element is referred to as the element's lookup process. In the process of searching the element, the existence state of the element e in the data block corresponding to h1 (e) in the hash table 11 can be determined according to the values of a plurality of bits in the bit array mapped to by the element e. Based on the hash characteristics and the mapping rules, the existence state is either absence or possible presence. If the values of the plurality of bits in the bit array mapped to by an element e contain an initial value, the element e does not exist in the data block corresponding to h1 (e) in the hash table 11; it may exist in the data block corresponding to h2 (e) in the hash table 11, and may also exist in the cache area or the database if these are provided. If the values of the bits in the bit array mapped to by an element e are all non-initial values, the element e may exist in the data block corresponding to h1 (e) in the hash table 11. The presence is only possible rather than certain, because these bits become non-initial either when the element e itself is recorded in the data block corresponding to h1 (e), or when other elements e' whose mapped bits together cover the bits mapped to by the element e are recorded in the data blocks corresponding to h1 (e').
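The recording rule and the lookup rule described above can be sketched together as follows. The sketch assumes a one-bit-per-position array with initial value 0, a single second hash function, one table addressed by both hash functions, and storage blocks of unlimited size; the names `GuidedTable`, `bits_of` and the sizes are hypothetical.

```python
import hashlib

NUM_BLOCKS = 64   # hypothetical number of storage blocks
NUM_BITS = 256    # hypothetical bit array length

def _hash(tag: str, e: str, mod: int) -> int:
    d = hashlib.sha256(f"{tag}:{e}".encode("utf-8")).digest()
    return int.from_bytes(d[:8], "big") % mod

def h1(e: str) -> int:
    return _hash("h1", e, NUM_BLOCKS)

def h2(e: str) -> int:
    return _hash("h2", e, NUM_BLOCKS)

def bits_of(e: str):
    """Bit positions to which the filter maps element e (three per element)."""
    return [_hash(f"g{i}", e, NUM_BITS) for i in range(3)]

class GuidedTable:
    def __init__(self):
        self.bits = [0] * NUM_BITS                      # 0 is the initial value
        self.blocks = [[] for _ in range(NUM_BLOCKS)]

    def record(self, e: str) -> str:
        idx = bits_of(e)
        if any(self.bits[i] == 0 for i in idx):
            # Some mapped bit is still initial: record under h1, set the bits.
            self.blocks[h1(e)].append(e)
            for i in idx:
                self.bits[i] = 1
            return "h1"
        # All mapped bits are already non-initial: record under h2.
        self.blocks[h2(e)].append(e)
        return "h2"

    def contains(self, e: str) -> bool:
        # The h1 block is only worth searching if every mapped bit is non-initial.
        if all(self.bits[i] != 0 for i in bits_of(e)):
            if e in self.blocks[h1(e)]:
                return True
        return e in self.blocks[h2(e)]
```

An element recorded under h1 always has all of its bits set, so the lookup reaches its block; an element recorded under h2 is found by the fallback search.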
The temporary storage area 13 (also referred to herein as the scratch pad) is used for temporarily storing non-duplicate elements to be recorded in the hash table 11. The temporary storage area 13 may be implemented based on a first-in-first-out (First In First Out, FIFO) queue.
The cache area 14 (also referred to herein as the buffer) is used for caching elements to be inserted into the database and elements found from the database. The cache area 14 may take the form of a cache table. The cache area 14 may be updated using a least frequently used (LFU) algorithm or a least recently used (LRU) algorithm. The cache area 14 sits between the hash table 11 and the database 15 and caches frequently accessed elements, which reduces the number of database accesses.
The database 15 is used to store elements that are removed from the buffer 14. The database 15 is, for example, a Key-Value database.
Fig. 2 is a schematic diagram of a method of finding elements suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 2, the method of this embodiment includes:
In step 21, a first element to be processed is acquired to determine whether the first element is a repeating element.
In step 22, if a scratch pad is set, the scratch pad is searched for the first element. If the first element is found in the scratch pad, the result that the first element is a repeated element is output (step 29). If the first element is not found in the scratch pad, or if no scratch pad is set, step 23 is performed.
In step 23, a hash value of the first element is calculated.
For example, based on a preset hash function, a hash operation is performed on the first element, so as to obtain a hash value of the first element.
In step 24, a memory block corresponding to the hash value of the first element is determined from the hash table.
Wherein different hash values correspond to different memory blocks, so that the corresponding memory blocks can be determined from the hash values.
In step 25, it is found whether the first element exists in the storage block corresponding to the hash value of the first element.
If the first element is found from the memory block, the result is output that the first element is a duplicate element (step 29). If the first element is not found from the memory block and no buffers or databases are provided, the result is output that the first element is not a duplicate element (step 28). If a buffer or database is provided, step 26 or step 27 may also continue if the first element is not found from the memory block. If both the buffer and the database are provided, step 26 may be performed first, and step 27 may be performed again if the first element is not found from the buffer.
In step 26, the cache is searched for the first element.
If the first element is found from the cache, the result is output that the first element is a duplicate element (step 29). In the case where the database is not set, if the first element is not found from the cache area, a result is output that the first element is not a duplicate element (step 28). In case of setting up the database, if the first element is not found from the cache, the process continues with step 27.
In addition, if the first element is found from the cache, the number of accesses to the first element is increased by 1. In some embodiments, elements with access times exceeding a preset number of times in the buffer may be deleted from the buffer and stored in the temporary storage area, so as to further improve the efficiency of element searching.
In step 27, the database is searched for the first element.
If the first element is found in the database, the result that the first element is a duplicate element is output (step 29). If the first element is not found in the database, the result that the first element is not a duplicate element is output (step 28).
In step 28, the result is output that the first element is not a repeating element.
In step 29, the result is output that the first element is a repeating element.
In the above embodiment, the hash value of the first element is used to locate the storage position of the first element in the hash table, and the first element is then searched for among the elements stored in the storage block at that storage position. Because what is searched for is the element itself rather than the hash value of the element, whether the first element is a repeated element can be judged accurately, which solves the problem of repeated elements being misjudged because of "hash collision". Because the positioning restricts the search to only a few elements, the efficiency of judging repeated elements is preserved as well, and the method is applicable to repeated data judgment scenarios in mass data processing.
In addition, by combining in-memory storage with external storage, the hierarchical storage structure can make better use of the locality of data access, maintain the element search efficiency, and mitigate the problem of memory overflow.
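The search order of this embodiment (scratch pad, hash table, cache, database) can be sketched as below. The sketch is illustrative only: a Python dictionary stands in for the hash table, a set stands in for the database, and the class name `TieredStore` and the threshold `promote_after` are assumptions.

```python
from collections import deque

class TieredStore:
    def __init__(self, promote_after: int = 3):
        self.scratch = deque()    # temporary storage area (FIFO queue)
        self.table = {}           # stand-in hash table: hash value -> block
        self.cache = {}           # element -> number of accesses
        self.database = set()     # stand-in for a key-value database
        self.promote_after = promote_after

    def is_duplicate(self, e) -> bool:
        if e in self.scratch:                      # step 22
            return True
        if e in self.table.get(hash(e), []):       # steps 23 to 25
            return True
        if e in self.cache:                        # step 26
            self.cache[e] += 1
            if self.cache[e] > self.promote_after:
                # Frequently accessed: move from the cache to the scratch pad.
                del self.cache[e]
                self.scratch.append(e)
            return True
        return e in self.database                  # step 27
```

The fast in-memory layers are consulted first, so the database is reached only when every earlier layer misses.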
Fig. 3 is a schematic diagram of a method of finding elements suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 3, the method of this embodiment includes:
in step 31, a first element to be processed is acquired to determine whether the first element is a repeating element.
In step 32, if a scratch pad is set, the scratch pad is searched for the first element. If the first element is found in the scratch pad, the result that the first element is a repeated element is output (step 39). If the first element is not found in the scratch pad, or if no scratch pad is set, step 33 is performed.
In step 33, it is determined whether the values of the plurality of bits mapped in the bit array by the first element are all non-initial values.
If the values of the bits mapped in the bit array by the first element are all non-initial values (which, per the analysis above, indicates that the first element e1 may exist in the data block corresponding to h1 (e 1) in the hash table), the first storage block corresponding to the first hash value h1 (e 1) of the hash table is searched for the first element (step 34).
If the values of the plurality of bits mapped in the bit array by the first element contain an initial value (which, per the analysis above, indicates that the first element e1 does not exist in the data block corresponding to h1 (e 1) in the hash table), or if the first element is not found in the first storage block, the second storage block corresponding to the second hash value h2 (e 1) of the hash table is searched for the first element (step 35). If there are a plurality of second hash functions h2, step 35 needs to search the second storage blocks corresponding to all of the h2 (e 1) values for the first element.
Accordingly, if the first element is found from the first memory block or the second memory block, the result is output that the first element is a duplicate element (step 39). If the first element is not found from the memory block and no buffers or databases are provided, the result is output that the first element is not a duplicate element (step 38). If a buffer or database is provided, step 36 or step 37 may also continue if the first element is not found from the memory block. If both the buffer and the database are provided, step 36 may be performed first, and step 37 may be performed again if the first element is not found from the buffer.
In step 36, the cache is searched for the first element.
If the first element is found from the cache, the result is output that the first element is a duplicate element (step 39). In the case where the database is not set, if the first element is not found from the cache area, the result that the first element is not a duplicate element is output (step 38). In case of setting up the database, if the first element is not found from the cache, the execution continues with step 37.
In addition, if the first element is found from the cache, the number of accesses to the first element is increased by 1. In some embodiments, elements with access times exceeding a preset number of times in the buffer may be deleted from the buffer and stored in the temporary storage area, so as to further improve the efficiency of element searching.
In step 37, the database is searched for the first element.
If the first element is found in the database, the result that the first element is a duplicate element is output (step 39). If the first element is not found in the database, the result that the first element is not a duplicate element is output (step 38).
In step 38, the result is output that the first element is not a repeating element.
In step 39, the result is output that the first element is a repeating element.
In the above embodiment, the values of the plurality of bits to which an element is mapped in the bit array determine which hash function's storage location in the hash table is searched for the element, thereby realizing a multi-hash-function element search scheme controlled by the bit array.
Fig. 4 is a schematic diagram of a method of recording elements suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 4, the method of this embodiment includes:
in step 41, an element (set as a first element e 1) that is not a repeating element is acquired so as to record the first element to the hash table.
If a scratch pad is provided, the first element determined in the previous embodiment to be not a duplicate element may be deposited into the scratch pad and read from the scratch pad.
In step 42, it is determined whether the values of the plurality of bits mapped in the bit array by the first element are all non-initial values.
If the value of the plurality of bits mapped in the bit array of the first element contains an initial value, the first element is recorded in a first memory block corresponding to the first hash value of the hash table, and the value of the plurality of bits mapped in the bit array of the first element is changed to a non-initial value (see steps 43-46 for details).
If the values of the bits mapped in the bit array of the first element are all non-initial values, the first element is recorded in a second storage block corresponding to a second hash value of the hash table (see step for details).
In step 43, a first memory block corresponding to the first hash value h1 (e 1) is found from the hash table.
If the first memory block has an empty memory bit, an empty memory bit is randomly selected and the first element is recorded to the memory bit (step 44). Then, the values of the plurality of bits mapped in the bit array by the first element are changed to non-initial values, and the changing method refers to the above (step 45).
If the first memory block does not have an empty memory bit, the element of any one first memory bit in the first memory block is moved into the temporary storage area and the first element is recorded into that first memory bit (step 46). In this way, an element displaced by a hash collision gets another chance to be recorded in the hash table, which avoids elements being discarded after repeated hash collisions.
In step 47, a second memory block corresponding to the second hash value h2 (e 1) is looked up from the hash table.
If the second memory block has an empty memory bit, an empty memory bit is randomly selected and the first element is recorded to the memory bit (step 48).
If the second memory block does not have an empty memory bit, an element of any one of the second memory bits in the second memory block is stored in the buffer and the first element is recorded in the second memory bit (step 49).
In addition, when the element in any second storage bit of the second storage block is to be stored in the cache area: if the cache area is not full, the element is stored in the cache area and its initial access count is set to 0; if the cache area is full, the element with the fewest accesses, or the element that has gone unaccessed for the longest time, is moved from the cache area to the database, and the element from the second storage bit takes its place in the cache area.
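The cache admission rule above can be illustrated with a short Python sketch. It assumes the "fewest accesses" policy only (the "longest unaccessed" alternative is not shown), models the cache area as a dictionary from element to access count, and models the database as a set; the name `put_in_cache` and the `capacity` parameter are hypothetical.

```python
def put_in_cache(element, cache, database, capacity):
    """Store an element displaced from the hash table in the cache area.

    cache    -- dict mapping element -> access count (assumed representation)
    database -- set standing in for the backing database
    capacity -- assumed maximum number of elements the cache area may hold
    """
    if len(cache) >= capacity:
        # Cache area full: move the least-accessed element to the database.
        victim = min(cache, key=cache.get)
        del cache[victim]
        database.add(victim)
    # The new element enters the cache area with an initial access count of 0.
    cache[element] = 0
```

A dictionary has no fixed positions, so "taking the place" of the evicted element is represented here simply by removing one entry and adding another.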
After step 45, 46, 48 or 49, step 410 is performed, that is, the first element is removed from the temporary storage area.
In step 411, it is optionally determined whether the temporary storage area still contains elements.
If it does and the number of iterations is less than or equal to the preset number T, the flow jumps back to step 41 and continues; if the number of iterations is greater than the preset number T, all elements in the temporary storage area may be added to the cache area (step 412). When the temporary storage area contains no elements, the flow ends.
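The loop formed by steps 41, 410, 411 and 412 can be sketched as below. This is an illustrative reading, not the disclosed implementation: the function name `drain_temp_area` and the `process` callback (standing in for steps 42 to 49, which may append displaced elements to the temporary storage area) are hypothetical, and the sketch assumes that the leftover elements in step 412 are moved to the cache area.

```python
def drain_temp_area(temp_area, process, cache_area, max_iterations):
    """Process elements from the temporary storage area until it is empty.

    temp_area      -- list used as a queue of elements awaiting processing
    process        -- callable(element, temp_area) standing in for steps 42-49
    cache_area     -- list receiving leftovers once the limit T is exceeded
    max_iterations -- the preset number T
    Returns the number of iterations performed.
    """
    iterations = 0
    while temp_area:
        element = temp_area[0]        # step 41: read the first element
        process(element, temp_area)   # steps 42-49 (may append displaced elements)
        del temp_area[0]              # step 410: remove the processed element
        iterations += 1
        # Steps 411-412: elements remain and the iteration limit is exceeded.
        if temp_area and iterations > max_iterations:
            cache_area.extend(temp_area)
            temp_area.clear()
    return iterations
```

The limit T bounds the work done when collisions keep displacing elements back into the temporary storage area.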
In the above embodiment, the values of the bits to which an element maps in the bit array determine which hash function's storage location in the hash table is used to record the element. This realizes a bit-array-controlled, multi-hash-function element recording scheme and improves the space utilization of the hash table.
Where the filter is a counting Bloom filter, if an element e is recorded in the storage block corresponding to h1(e) in the hash table 11, the value of each of the bits to which e maps in the bit array is increased by 1. For this method of recording elements, the present disclosure also proposes a method of deleting elements, described below with reference to Fig. 5. The related art offers no solution capable of deleting elements.
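The counter behaviour of a counting Bloom filter can be illustrated with the following Python sketch. The class name `CountingBits`, the array size, the number of hash functions and the use of salted SHA-256 are assumptions made for illustration only; 0 plays the role of the initial value.

```python
import hashlib


class CountingBits:
    """Array of counters where each element maps to several positions."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size              # assumed array length
        self.num_hashes = num_hashes  # assumed number of hash functions
        self.counters = [0] * size    # 0 is the initial value

    def positions(self, element):
        """Map an element to counter positions using salted SHA-256 digests."""
        result = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            result.append(int.from_bytes(digest[:8], "big") % self.size)
        return result

    def increment(self, element):
        """Called when the element is recorded in the block for h1(e)."""
        for p in self.positions(element):
            self.counters[p] += 1

    def decrement(self, element):
        """Called when the element is deleted from the block for h1(e)."""
        for p in self.positions(element):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def all_non_initial(self, element):
        """True if every counter the element maps to is non-zero."""
        return all(self.counters[p] > 0 for p in self.positions(element))
```

Because counters are incremented rather than merely set, removing one element does not clear positions that other recorded elements still rely on.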
Fig. 5 is a schematic diagram of a method of deleting elements suitable for mass data processing according to some embodiments of the present disclosure.
As shown in fig. 5, the method of this embodiment includes:
In step 51, if a temporary storage area is provided, the element to be deleted, referred to as the second element e2, is searched for in the temporary storage area; if the second element e2 is found there, it is deleted.
In step 52, it is determined whether the values of the bits to which the second element maps in the bit array are all non-initial values.
If the values of the bits to which the second element maps in the bit array are all non-initial values (which, according to the foregoing analysis, means that the second element e2 may exist in the storage block corresponding to the third hash value h1(e2) in the hash table), the second element is searched for in the third storage block corresponding to the third hash value h1(e2) in the hash table (step 53). If found, the second element is deleted from the third storage block, and the values of the bits to which the second element maps in the bit array are each reduced by 1 (step 54), after which the flow ends.
If the values of the bits to which the second element maps in the bit array include an initial value (which, according to the foregoing analysis, means that the second element e2 does not exist in the storage block corresponding to the third hash value h1(e2) in the hash table), or if the second element is not found in the third storage block, the second element is searched for in the fourth storage block corresponding to the fourth hash value h2(e2) in the hash table (step 55), and, if found, is deleted from the fourth storage block (step 56).
If a cache area or a database is provided and the second element is not found in the fourth storage block of the hash table, the flow may continue with step 57 or step 58. If both the cache area and the database are provided, step 57 may be performed before step 58.
In step 57, the second element is searched for in the cache area and, if found, is deleted from the cache area.
In step 58, the second element is searched for in the database and, if found, is deleted from the database.
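The deletion flow of steps 51 to 58 can be sketched in Python as follows. The function name `delete_element`, its parameters and its return value (a list of the places the element was removed from) are hypothetical; the hash functions `h1` and `h2` and the `positions` mapping are passed in as callables so that the sketch stays independent of any particular hashing scheme, and the bit array is assumed to hold counters with 0 as the initial value.

```python
def delete_element(e, temp_area, counters, positions, table, h1, h2,
                   cache_area, database):
    """Delete element e from the storage structure; return where it was found."""
    removed_from = []
    if e in temp_area:                        # step 51
        temp_area.remove(e)
        removed_from.append("temporary storage area")
    pos = positions(e)
    if all(counters[p] > 0 for p in pos):     # step 52
        block = table[h1(e)]                  # step 53: block for h1(e)
        if e in block:
            block[block.index(e)] = None      # step 54: free the storage bit
            for p in pos:
                counters[p] -= 1              # reduce each mapped value by 1
            removed_from.append("first-hash block")
            return removed_from
    block = table[h2(e)]                      # step 55: block for h2(e)
    if e in block:
        block[block.index(e)] = None          # step 56
        removed_from.append("second-hash block")
    elif e in cache_area:                     # step 57
        cache_area.remove(e)
        removed_from.append("cache area")
    elif e in database:                       # step 58
        database.discard(e)
        removed_from.append("database")
    return removed_from
```

Only deletion from the block for h1(e) decrements the counters, mirroring the recording side, where the counters are increased only when an element is recorded in that block.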
The above embodiment implements, based on a counting Bloom filter, the function of deleting elements from the storage structure that records elements. Taking web page crawling as an example business scenario: after a web page address 1 is deleted from the storage structure that records crawled web page addresses, web page address 1 can be crawled again, which meets the business requirement of re-crawling a given web page.
Fig. 6 is a schematic diagram of a mass data processing device according to some embodiments of the present disclosure.
As shown in fig. 6, the apparatus of this embodiment includes:
A memory 61; and a processor 62 coupled to the memory, the processor 62 being configured to perform the mass data processing related method of any of the foregoing embodiments based on instructions stored in the memory.
The memory 61 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The foregoing describes only preferred embodiments of the present disclosure and is not intended to limit the disclosure; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (11)

1. A method for processing mass data, comprising:
Acquiring a first element to be processed;
calculating a hash value of the first element;
Determining a storage block corresponding to a hash value of a first element from a hash table, wherein the hash value of the first element comprises: the first element performs hash calculation based on the first hash function to obtain a first hash value, and performs hash calculation based on at least one second hash function to obtain at least one second hash value; in the event that it is determined that the first element is not a repeating element: if the value of the plurality of bits mapped in the bit array of the first element contains an initial value, recording the first element into a first storage block corresponding to a first hash value of the hash table, and changing the value of the plurality of bits mapped in the bit array of the first element into a non-initial value; if the values of the plurality of bits mapped in the bit array of the first element are all non-initial values, recording the first element into a second storage block corresponding to a second hash value of the hash table;
Searching whether the first element exists in the storage block corresponding to the hash value of the first element comprises the following steps: judging whether the values of a plurality of bits mapped in the bit array of the first element are all non-initial values or not; if the values of the plurality of bits mapped in the bit array of the first element are all non-initial values, searching whether the first element exists in a first storage block corresponding to a first hash value of the hash table; if the value of the plurality of bits mapped in the bit array of the first element contains an initial value or the first element is not found in the first storage block, searching whether the first element exists in the second storage block corresponding to the second hash value of the hash table; before the judging step, searching whether the first element exists in the temporary storage area; or if the first element is not found in the hash table, searching whether the first element exists in the cache area; or if the first element is not found from the hash table or the cache region, searching whether the first element exists in the database;
If the first element is found, outputting the result that the first element is a repeating element.
2. The method of claim 1, wherein the first element to be processed is read from a scratch pad;
if the first storage block corresponding to the first hash value of the hash table has no empty storage bit, storing the element of any one first storage bit in the first storage block into a temporary storage area, and recording the first element into the first storage bit;
Or if the second storage block corresponding to the second hash value of the hash table has no empty storage bit, storing the element of any second storage bit in the second storage block into the cache area, and recording the first element into the second storage bit.
3. The method of claim 1, wherein the altering method comprises:
Changing the first hash value or the value of the plurality of bits mapped in the bit array of the first element to a preset value, or
The first hash value or the value of the plurality of bits mapped in the bit array of the first element is increased by 1, respectively.
4. The method of claim 1, further comprising, in the case where the altering method is to increase the value of the plurality of bits mapped in the bit array by the first element by 1:
Obtaining a second element to be deleted, wherein the second element performs hash calculation based on the first hash function to obtain a third hash value, and performs hash calculation based on at least one second hash function to obtain at least one fourth hash value;
judging whether the values of a plurality of bits mapped in the bit array of the second element are all non-initial values;
if the values of the plurality of bits mapped in the bit array of the second element are all non-initial values, searching whether the second element exists in a third storage block corresponding to a third hash value of the hash table, deleting the searched second element from the third storage block, and respectively reducing the values of the plurality of bits mapped in the bit array of the second element by 1;
if the second element contains an initial value in the values of the plurality of bits mapped in the bit array or the second element is not found in the third storage block, searching whether the second element exists in the fourth storage block corresponding to the fourth hash value of the hash table, and deleting the found second element from the fourth storage block.
5. The method of claim 4, wherein,
If the second element is not found in the hash table, searching whether the second element exists in the cache area, and deleting the found second element from the cache area;
Or if the second element is not found from the hash table or the cache region, searching whether the second element exists in the database, and deleting the found second element from the database.
6. The method of claim 2, wherein,
And when the buffer area is full, storing the element with the least access frequency or the longest non-access element in the buffer area into a database, and storing the element with the second storage bit into the position of the element with the least access frequency or the position of the longest non-access element in the buffer area.
7. The method of claim 2, wherein,
And deleting the elements with the access times exceeding the preset times from the cache area and storing the elements in the temporary storage area.
8. The method of claim 1, wherein,
The mapping of elements to bits of the bit array is based on a bloom filter, a count bloom filter or a cuckoo filter,
Wherein the element is a web page address or a directory.
9. A method as claimed in any one of claims 1 to 8, wherein the memory block comprises a plurality of memory bits for recording elements.
10. A mass data processing apparatus, comprising:
A memory; and
A processor coupled to the memory, the processor configured to perform the mass data processing method of any of claims 1-9 based on instructions stored in the memory.
11. A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the mass data processing method of any of claims 1-10.
CN201910207946.8A 2019-03-19 2019-03-19 Mass data processing method and device Active CN111723266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207946.8A CN111723266B (en) 2019-03-19 2019-03-19 Mass data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910207946.8A CN111723266B (en) 2019-03-19 2019-03-19 Mass data processing method and device

Publications (2)

Publication Number Publication Date
CN111723266A CN111723266A (en) 2020-09-29
CN111723266B true CN111723266B (en) 2024-08-16

Family

ID=72563202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207946.8A Active CN111723266B (en) 2019-03-19 2019-03-19 Mass data processing method and device

Country Status (1)

Country Link
CN (1) CN111723266B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114786141B (en) * 2022-04-29 2023-11-21 恒玄科技(上海)股份有限公司 Message filtering method and device in Bluetooth wireless mesh network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604337A (en) * 2009-07-13 2009-12-16 中兴通讯股份有限公司 Device and method is stored, searched to a kind of hash table
CN101826107A (en) * 2010-04-02 2010-09-08 华为技术有限公司 Hash data processing method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0916474A (en) * 1995-06-30 1997-01-17 Fujitsu Ltd Device and method for controlling input/output
JPH11288387A (en) * 1998-12-11 1999-10-19 Fujitsu Ltd Disk cache device
CN102467458B (en) * 2010-11-05 2014-08-06 英业达股份有限公司 Method for establishing index of data block
CN104794162B (en) * 2015-03-25 2018-02-23 中国人民大学 Real-time data memory and querying method
CN105208075B (en) * 2015-08-12 2018-07-31 新华通讯社 A kind of data collection strategy method and device based on high dispersive hash algorithm
CN107766469A (en) * 2017-09-29 2018-03-06 北京金山安全管理系统技术有限公司 A kind of method for caching and processing and device

Also Published As

Publication number Publication date
CN111723266A (en) 2020-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant