CN113377812B

CN113377812B - Order duplicate removal method and device for big data

Info

Publication number: CN113377812B
Application number: CN202110027862.3A
Authority: CN
Inventors: 唐明; 谭吉湘; 杨陆; 王晓宇
Original assignee: Beijing Data Driven Technology Co ltd
Current assignee: Beijing Data Driven Technology Co ltd
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2024-06-18
Anticipated expiration: 2041-01-08
Also published as: CN113377812A

Abstract

The invention provides a big data order deduplication method and device, applied to a server. The method comprises: receiving order information sent by a client, wherein the order information comprises a key field, a bill file name and order data; judging, according to the key field, whether a bloom filter exists in the cache; if not, loading the bloom filter; if so, judging whether the bill file name exists in the bloom filter; if the bill file name does not exist in the bloom filter, storing the order data in a database and the cache and adding the bill file name to the bloom filter; if the bill file name exists in the bloom filter, confirming the bill file name through the cache and the database to obtain a confirmation result. The key field comprises a hardware identification and a sales time. Using the bloom filter reduces memory occupation and removes duplicate data efficiently.

Description

Order duplicate removal method and device for big data

Technical Field

The invention relates to the technical field of deduplication, in particular to a method and a device for order deduplication of big data.

Background

In recent years, with the rapid development of the internet and the information industry, the amount of data generated each year has grown exponentially. Because of the complexity of internet services, the same data may be uploaded more than once, for example when users submit information repeatedly, clients retry, or upstream services fail.

To avoid the data disorder caused by repeated uploading, a cache layer is commonly added, and the unique identification field of each piece of data is stored in it. The cache layer is queried first: if the unique identification field is found there, the data is a duplicate; if it is not found, the database is queried for the unique identification field to confirm. With large data volumes, however, the unique identification fields stored in the cache occupy a large amount of memory, so the cost of this deduplication method is relatively high.

Disclosure of Invention

Therefore, the present invention aims to provide a method and a device for order deduplication of big data, which can reduce memory occupation through a bloom filter and achieve the purpose of efficient deduplication of data.

In a first aspect, an embodiment of the present invention provides a method for de-duplication of orders of big data, applied to a server, where the method includes:

Receiving order information sent by a client, wherein the order information comprises a key field, a bill file name and order data;

Judging whether a bloom filter exists in the cache according to the key field;

If not, loading the bloom filter;

If so, judging whether the bill file name exists in the bloom filter;

If not, storing the order data in a database and the cache, and adding the bill file name to the bloom filter;

If yes, confirming the bill file name through the cache and the database to obtain a confirmation result;

wherein the key fields include hardware identification and sales time.

Further, the step of confirming the bill file name through the cache and the database to obtain a confirmation result includes:

inquiring whether the bill file name exists in the cache;

if yes, discarding the order information;

if not, inquiring whether the bill file name exists in the database;

if yes, discarding the order information;

If not, storing the order data into the database and the cache, adding the bill file name into the bloom filter, and sending response information of successful warehousing to the client.

Further, the loading the bloom filter includes:

Judging whether the persistence information of the bloom filter exists in the database according to the key field;

if so, acquiring the time of updating the bloom filter last in the persistence information;

taking the time of updating the bloom filter last as a starting time;

Searching all increment orders from the starting time to the current time in the database, and adding the bill file names corresponding to all increment orders into the bloom filter;

if not, all orders of the current day corresponding to the hardware identification are searched from the database, and all orders of the current day are added into the bloom filter.

Further, the current time is the time at which the persistence information of the bloom filter is looked up in the database.

Further, the method further comprises:

and storing order information of the bloom filter into the database within a preset time interval.

In a second aspect, an embodiment of the present invention provides a big data order deduplication apparatus, applied to a server, where the apparatus includes:

The receiving unit is used for receiving order information sent by the client, wherein the order information comprises a key field, a bill file name and order data;

the first judging unit is used for judging whether a bloom filter exists in the cache according to the key field;

a loading unit for loading the bloom filter in the absence;

a second judging unit, configured to judge whether the bill file name exists in the bloom filter if the bloom filter exists in the cache;

a storage unit for storing the order data in a database and the cache, and adding the bill file name to the bloom filter, if the bill file name does not exist in the bloom filter;

the confirming unit is used for confirming the bill file name through the cache and the database under the condition of existence, so as to obtain a confirming result;

wherein the key fields include hardware identification and sales time.

Further, the confirmation unit is specifically configured to:

inquiring whether the bill file name exists in the cache;

if yes, discarding the order information;

if not, inquiring whether the bill file name exists in the database;

if yes, discarding the order information;

Further, the loading unit is specifically configured to:

taking the time of updating the bloom filter last as a starting time;

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, and a processor, where the memory stores a computer program executable on the processor, and where the processor implements a method as described above when executing the computer program.

In a fourth aspect, embodiments of the present invention provide a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method as described above.

The embodiment of the invention provides a big data order deduplication method and device, which are applied to a server and comprise the following steps: receiving order information sent by a client, wherein the order information comprises a key field, a bill file name and order data; judging whether a bloom filter exists in the cache according to the key field; if not, loading a bloom filter; if so, judging whether the bill file name exists in the bloom filter; if not, storing the order data in a database and cache and adding the billing filename to the bloom filter; if the bill file name exists, confirming the bill file name through the cache and the database to obtain a confirmation result; the key fields comprise hardware identification and sales time, memory occupation can be reduced through a bloom filter, and the aim of efficiently removing duplicate data is fulfilled.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a big data order deduplication method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for loading bloom filters according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of an order deduplication apparatus for big data according to a second embodiment of the present invention.

Reference numerals:

1 - receiving unit; 2 - first judging unit; 3 - loading unit; 4 - second judging unit; 5 - storage unit; 6 - confirmation unit.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The application reduces the memory occupation through the bloom filter and achieves the purpose of efficiently removing the duplicate data.

In its initial state, a bloom filter consists of a bit array in which every position is 0, together with a series of hash functions. It occupies little space while still allowing duplicate data to be removed efficiently. A hash function maps a key field to a position in the bit array, and that position is set to 1 to mark the key field as present.

Because a hash function may map different key fields to the same position of the bit array, collisions can occur. Several hash functions are therefore used to map each key field to several positions of the bit array. To check whether a key field exists, every position given by the hash functions is examined: if any position is not 1, the key field definitely does not exist; otherwise, the key field may exist. Since "may exist" carries a certain false positive rate, the cache or the database must be queried further for confirmation.
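The behaviour described above can be illustrated with a minimal Python sketch. It is not the implementation used by the invention: the class name `BloomFilter`, the default sizes, and the use of salted SHA-256 digests to stand in for "a series of hash functions" are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; every bit starts at 0."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: derive the k positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set every mapped position to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # One 0 bit proves absence; all 1 bits only means "may exist".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A `False` result from `might_contain` is definitive, while a `True` result still has to be confirmed against the cache or the database, as described above.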

In order to facilitate understanding of the present embodiment, the following describes embodiments of the present invention in detail.

Embodiment one:

FIG. 1 is a flowchart of a big data order deduplication method according to an embodiment of the present invention.

Referring to fig. 1, the execution subject is a server, and the method includes the steps of:

step S101, receiving order information sent by a client, wherein the order information comprises a key field, a bill file name and order data;

Step S102, judging whether a bloom filter exists in a cache according to the key field; if not, executing step S103; if so, executing step S104;

step S103, loading a bloom filter;

Specifically, the server receives the order information sent by the client. The order information includes a key field, a bill file name, and order data. The key field includes a hardware identification (MAC) and a sales time. The bill file name includes the hardware identification (MAC) and an interception time, where the interception time is the time at which the client collected the data, in ms. The hardware identification (MAC) typically consists of 12 characters made up of digits and capital letters. Concatenating the hardware identification and the sales time gives the key field of the bloom filter, for example BBBB5D3ED72320200718.

When uploading an order, the client sends the order information to the server. The server judges, according to the key field, whether a bloom filter exists in the cache. If one exists, the server judges whether the bill file name exists in the bloom filter: if it does not, the order is not a duplicate and can be stored directly; otherwise, the server further queries the cache and the database for confirmation. If no bloom filter exists, the bloom filter needs to be loaded.
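As an illustration of how such a key field could be assembled, the following Python sketch concatenates the hardware identification and the sales time. The function name `build_key_field`, the stripping of `:` and `-` separators, and the `YYYYMMDD` rendering of the sales time are assumptions inferred from the example value BBBB5D3ED72320200718; the description does not specify them.

```python
from datetime import datetime


def build_key_field(mac: str, sales_time: datetime) -> str:
    """Concatenate the hardware identification and the sales time."""
    # Assumption: separators are removed and letters are upper-cased.
    hardware_id = mac.replace(":", "").replace("-", "").upper()
    if len(hardware_id) != 12:
        raise ValueError("expected a 12-character hardware identification")
    # Assumption: the sales time is rendered as a calendar day (YYYYMMDD).
    return hardware_id + sales_time.strftime("%Y%m%d")


key = build_key_field("bb:bb:5d:3e:d7:23", datetime(2020, 7, 18))
print(key)  # BBBB5D3ED72320200718
```

Under these assumptions there is one bloom filter per device per day, which keeps each filter small.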

Step S104, judging whether the bill file name exists in the bloom filter; if not, executing step S105; if so, executing step S106;

Step S105, storing order data into a database and a cache, and adding a bill file name into a bloom filter;

And step S106, confirming the bill file name through the cache and the database to obtain a confirmation result.

Further, step S106 includes the steps of:

Step S201, inquiring whether a bill file name exists in the cache; if so, step S202 is performed; if not, executing step S203;

Step S202, discarding the order information;

Here, if the bill file name is found in the cache, the order is a duplicate: it is discarded and processing information is returned to the client. If the bill file name is not found in the cache, further confirmation in the database is required.

Step S203, inquiring whether a bill file name exists from a database; if so, step S204 is performed; if not, then step S205 is performed;

step S204, discarding the order information;

step S205, the order data is stored in a database and a cache, the bill file name is added in a bloom filter, and response information of successful warehousing is sent to the client.

Here, if the bill file name is found in the database, the order is a duplicate: it is discarded and processing information is returned to the client. If the bill file name is not found in the database, the bloom filter produced a false positive and the order is new; the order data is stored in the database and the cache, the bill file name is added to the bloom filter, and response information of successful warehousing is sent to the client.
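Steps S101 to S106 and S201 to S205 can be sketched together in Python as follows. This is only an illustration under stated assumptions: `TinyBloom` and `OrderDeduplicator` are hypothetical names, plain dictionaries stand in for the cache and the database, step S103 is reduced to creating an empty filter, and the flow is assumed to continue with step S104 once the filter has been loaded.

```python
import hashlib


class TinyBloom:
    """Small bloom filter used only to make this sketch self-contained."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class OrderDeduplicator:
    def __init__(self):
        self.filters = {}   # key field -> bloom filter held in the cache
        self.cache = {}     # bill file name -> order data (stand-in cache)
        self.database = {}  # bill file name -> order data (stand-in database)

    def _store(self, bloom, bill_file_name, order_data):
        self.database[bill_file_name] = order_data
        self.cache[bill_file_name] = order_data
        bloom.add(bill_file_name)

    def handle(self, key_field, bill_file_name, order_data):  # S101
        bloom = self.filters.get(key_field)                   # S102
        if bloom is None:
            # S103, simplified: a real loader would rebuild from the database.
            bloom = self.filters[key_field] = TinyBloom()
        if not bloom.might_contain(bill_file_name):           # S104
            self._store(bloom, bill_file_name, order_data)    # S105
            return "stored"
        if bill_file_name in self.cache:                      # S201
            return "duplicate"                                # S202
        if bill_file_name in self.database:                   # S203
            return "duplicate"                                # S204
        # S205: the filter gave a false positive, so the order is new.
        self._store(bloom, bill_file_name, order_data)
        return "stored"
```

The cache and the database are consulted only when the filter answers "may exist", so a new order normally costs a single in-memory check before it is stored.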

Further, referring to fig. 2, step S103 includes the steps of:

step S301, judging whether persistence information of a bloom filter exists in a database according to the key field; if so, step S302 is performed; if not, then step S305 is performed;

step S302, obtaining the last time of updating the bloom filter in the persistence information;

step S303, taking the time of last updating the bloom filter as the starting time;

step S304, searching all increment orders from the starting time to the current time in a database, and adding bill file names corresponding to all increment orders into a bloom filter;

In step S305, all orders of the same day corresponding to the hardware identifier are searched from the database, and all orders of the same day are added into the bloom filter.

Specifically, the bloom filter is held in memory, so its data is lost when the server restarts. A bloom filter newly built after the restart does not contain the bill file names of orders warehoused before the restart, which can lead to wrong judgments.

Therefore, the order information of the bloom filter is stored in the database within a preset time interval, and the time at which the bloom filter was last updated is added to this persisted information. When the bloom filter is loaded after a server restart, the persistence information is looked up in the database. If it is found, the time of the last update of the bloom filter is obtained from it and taken as the starting time; all incremental orders from the starting time to the current time are then searched in the database, and the bill file names corresponding to them are added to the bloom filter. If no persistence information is found, all orders of the current day corresponding to the hardware identification are searched from the database and added to the bloom filter.

Further, the method further comprises:

And storing order information of the bloom filter into a database within a preset time interval.

Specifically, storing the order information of the bloom filter in the database within a preset time interval means storing the key field of the bloom filter, the serialized bloom filter, the time of the last update of the bloom filter, and the like.
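The persistence step and the loading steps S301 to S305 can be sketched in Python as below. Everything named here is hypothetical: `Order`, `Store`, `persist_filter` and `load_filter` are illustrative, a plain `set` stands in for the bloom filter, `pickle` stands in for whatever serialization is actually used, and the "current day" is assumed to be the day encoded in the key field. Restricting the incremental orders of step S304 to the same hardware identification and day is also an assumption.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class Order:
    hardware_id: str
    bill_file_name: str
    stored_at: float  # time the order was warehoused (epoch seconds)
    day: str          # sales day, e.g. "20200718"


@dataclass
class Store:
    """In-memory stand-in for the database."""
    orders: list = field(default_factory=list)
    persisted: dict = field(default_factory=dict)  # key field -> (blob, last update)


def persist_filter(store: Store, key_field: str, bloom: set, now: float) -> None:
    # Save the key field, the serialized filter and its last update time.
    store.persisted[key_field] = (pickle.dumps(bloom), now)


def load_filter(store: Store, key_field: str, now: float) -> set:
    hardware_id, day = key_field[:12], key_field[12:]
    record = store.persisted.get(key_field)             # S301
    if record is not None:
        blob, last_update = record                      # S302
        bloom = pickle.loads(blob)
        start = last_update                             # S303
        for order in store.orders:                      # S304
            same_key = order.hardware_id == hardware_id and order.day == day
            if same_key and start <= order.stored_at <= now:
                bloom.add(order.bill_file_name)
        return bloom
    bloom = set()                                       # S305
    for order in store.orders:
        if order.hardware_id == hardware_id and order.day == day:
            bloom.add(order.bill_file_name)
    return bloom
```

Replaying only the orders stored since the last snapshot keeps the reload after a restart proportional to the gap between snapshots rather than to the full order history.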

The embodiment of the invention provides a big data order deduplication method, which is applied to a server and comprises the following steps: receiving order information sent by a client, wherein the order information comprises a key field, a bill file name and order data; judging whether a bloom filter exists in the cache according to the key field; if not, loading a bloom filter; if so, judging whether the bill file name exists in the bloom filter; if not, storing the order data in a database and cache and adding the billing filename to the bloom filter; if the bill file name exists, confirming the bill file name through the cache and the database to obtain a confirmation result; the key fields comprise hardware identification and sales time, memory occupation can be reduced through a bloom filter, and the aim of efficiently removing duplicate data is fulfilled.

Embodiment two:

Referring to FIG. 3, the big data order deduplication apparatus is applied to a server and includes:

The receiving unit 1 is used for receiving order information sent by the client, wherein the order information comprises a key field, a bill file name and order data;

A first judging unit 2, configured to judge whether a bloom filter exists in the cache according to the key field;

A loading unit 3 for loading the bloom filter in the absence;

A second judging unit 4 for judging whether the bill file name exists in the bloom filter in the case of existence;

A storage unit 5 for storing the order data in a database and the cache, and adding the bill file name to the bloom filter, if the bill file name does not exist in the bloom filter;

a confirmation unit 6, configured to confirm the bill file name through the cache and the database in the presence of the bill file name, to obtain a confirmation result;

wherein the key fields include hardware identification and sales time.

Further, the confirmation unit 6 is specifically configured to:

inquiring whether a bill file name exists in the cache;

if yes, discarding the order information;

if not, inquiring whether the bill file name exists in the database;

if yes, discarding the order information;

If not, the order data is stored in a database and a cache, and the bill file name is added in a bloom filter, and response information of successful warehousing is sent to the client.

Further, the loading unit 3 is specifically configured to:

judging whether persistence information of the bloom filter exists in the database according to the key field;

taking the time of last updating the bloom filter as the starting time;

Searching all increment orders from the starting time to the current time in a database, and adding bill file names corresponding to all increment orders into a bloom filter;

The embodiment of the invention provides a big data order deduplication device, applied to a server, which is configured to: receive order information sent by a client, wherein the order information comprises a key field, a bill file name and order data; judge, according to the key field, whether a bloom filter exists in the cache; if not, load the bloom filter; if so, judge whether the bill file name exists in the bloom filter; if the bill file name does not exist in the bloom filter, store the order data in a database and the cache and add the bill file name to the bloom filter; if the bill file name exists in the bloom filter, confirm the bill file name through the cache and the database to obtain a confirmation result. The key field comprises a hardware identification and a sales time. Using the bloom filter reduces memory occupation and removes duplicate data efficiently.

The embodiment of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the order duplication elimination method of big data provided by the embodiment when executing the computer program.

The embodiment of the invention also provides a computer readable medium with non-volatile program code executable by a processor, wherein the computer readable medium stores a computer program, and the computer program executes the steps of the order duplication elimination method of big data in the embodiment when being executed by the processor.

The computer program product provided by the embodiment of the present invention includes a computer readable storage medium storing a program code, where instructions included in the program code may be used to perform the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment and will not be described herein.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.

In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be, for example, fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that the above examples are only specific embodiments of the present invention, intended to illustrate its technical solutions rather than to limit them, and the scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing examples, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and are intended to be included in the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for de-duplication of orders for big data, applied to a server, the method comprising:

Judging whether a bloom filter exists in the cache according to the key field;

Loading the bloom filter if the bloom filter is not present in the cache;

if the bloom filter exists in the cache, judging whether the bill file name exists in the bloom filter or not;

storing the order data in a database and the cache if the bill file name does not exist in the bloom filter, and adding the bill file name to the bloom filter;

if the bill file name exists in the bloom filter, confirming the bill file name through the cache and the database to obtain a confirmation result;

wherein the key field comprises a hardware identifier and a sales time;

said loading said bloom filter, comprising:

If the persistence information of the bloom filter exists in the database, acquiring the time of updating the bloom filter last in the persistence information;

taking the time of updating the bloom filter last as a starting time;

if the persistence information of the bloom filter does not exist in the database, searching all orders of the same day corresponding to the hardware identifier from the database, and adding all orders of the same day into the bloom filter;

And the bill file name is confirmed through the cache and the database, and a confirmation result is obtained, wherein the confirmation result comprises the following steps:

inquiring whether the bill file name exists in the cache;

if the bill file name exists in the cache, discarding the order information;

If the bill file name does not exist in the cache, inquiring whether the bill file name exists in the database;

If the bill file name exists in the database, discarding the order information;

If the bill file name does not exist in the database, the order data are stored in the database and the cache, the bill file name is added into the bloom filter, and response information of successful warehousing is sent to the client.

2. The big data order deduplication method of claim 1, wherein the current time is a time corresponding to when the bloom filter is looked up from the database.

3. The big data order deduplication method of claim 1, further comprising:

storing order information of the bloom filter into the database within a preset time interval.

4. An order deduplication device for big data, applied to a server, the device comprising:

a loading unit, configured to load the bloom filter if the bloom filter does not exist in the cache;

the second judging unit is used for judging whether the bill file name exists in the bloom filter or not under the condition that the bloom filter exists in the cache;

a storage unit configured to store the order data in a database and the cache, and add the bill filename to the bloom filter, in a case where the bill filename does not exist in the bloom filter;

the confirming unit is used for confirming the bill file name through the cache and the database under the condition that the bill file name exists in the bloom filter, so as to obtain a confirming result;

wherein the key field comprises a hardware identifier and a sales time;

the loading unit is specifically configured to:

taking the time of updating the bloom filter last as a starting time;

The confirmation unit is specifically configured to:

inquiring whether the bill file name exists in the cache;

if the bill file name exists in the cache, discarding the order information;

If the bill file name exists in the database, discarding the order information;

5. An electronic device comprising a memory, a processor, the memory having stored thereon a computer program executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1 to 3 when the computer program is executed.

6. A computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any one of claims 1 to 3.