WO2023087769A1

WO2023087769A1 - Method for deduplicating key field in real time on basis of distributed stream calculation engine flink

Info

Publication number: WO2023087769A1
Application number: PCT/CN2022/107574
Authority: WO
Inventors: 任丽超; 张俊杰; 冯宇波; 毛勇岗
Original assignee: 北京锐安科技有限公司
Priority date: 2021-11-16
Filing date: 2022-07-25
Publication date: 2023-05-25
Also published as: CN114048201A

Abstract

A method for deduplicating a key field in real time on the basis of a distributed stream calculation engine Flink. The method comprises: receiving target data, and determining, from the target data and on the basis of a configuration file, at least one key field to be deduplicated (S110), wherein the target data is structured data, and the configuration file comprises the key field to be deduplicated, which matches the target data; on the basis of an encryption algorithm, encrypting the at least one key field to be deduplicated, so as to determine encrypted data (S120); and setting a timer on the basis of Flink, taking a start time in the timer as a start time point, and deduplicating the encrypted data within a deduplication duration in the timer by using a Bloom filter (S130).

Description

Real-time deduplication method for key fields based on distributed stream computing engine Flink

This application claims priority to Chinese patent application No. 202111352389.2, filed with the China Patent Office on November 16, 2021, the entire content of which is incorporated herein by reference.

Technical field

The embodiments of the present application relate to the technical field of big data processing, for example, to a real-time deduplication method for key fields based on the distributed stream computing engine Flink.

Background

Eliminating duplicate data is a type of requirement that is often encountered in actual business. In the field of big data, the deduplication of data helps to reduce storage space and improve server processing efficiency. In real-time computing, deduplication of key fields is an incremental and long-term process.

In the related art, real-time field deduplication is done either by sending each record in the real-time data stream to Redis for a duplicate check, or by holding the values in a HashSet, an unordered collection that stores no duplicates. With Redis, however, every check requires a network connection to the Redis service; network access is slower than local cache access, and the network may be unstable. With a HashSet, processing efficiency drops sharply as the data volume grows, and a large amount of memory is consumed.

Summary of the invention

The embodiment of this application provides a real-time deduplication method for key fields based on the distributed stream computing engine Flink.

In the first aspect, the embodiment of the present application provides a real-time deduplication method for key fields based on the distributed stream computing engine Flink, the method includes:

Receiving target data, and determining, based on a configuration file, at least one key field to be deduplicated in the target data; wherein the target data is structured data, and the configuration file includes the key field to be deduplicated that matches the target data;

Encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data;

Setting a timer based on Flink, taking the start time in the timer as the starting time point, and deduplicating the encrypted data with a Bloom filter within the deduplication duration in the timer.

In the second aspect, the embodiment of the present application also provides a real-time deduplication device for key fields based on the distributed stream computing engine Flink, which includes:

A key field determination module, configured to receive target data and determine, based on a configuration file, at least one key field to be deduplicated in the target data; wherein the target data is structured data, and the configuration file includes the key field to be deduplicated that matches the target data;

An encrypted data determination module, configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data;

A deduplication module, configured to set a timer based on Flink, take the start time in the timer as the starting time point, and deduplicate the encrypted data with a Bloom filter within the deduplication duration in the timer.

In a third aspect, the embodiment of the present application also provides an electronic device, the device includes:

one or more processors;

storage means configured to store one or more programs,

When the one or more programs are executed by the one or more processors, the one or more processors implement the real-time deduplication method for key fields based on the distributed stream computing engine Flink as described in any one of the embodiments of this application.

In the fourth aspect, the embodiment of the present application also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the real-time deduplication method for key fields based on the distributed stream computing engine Flink described in any one of the embodiments of this application is implemented.

Description of drawings

FIG. 1 is a flow chart of a real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application;

FIG. 2 is a flow chart of another real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application;

FIG. 3 is a schematic structural diagram of a real-time deduplication device for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application;

Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

The application will be described below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, but not to limit the present application. In addition, it should be noted that, for the convenience of description, only some structures related to the present application are shown in the drawings but not all structures.

Fig. 1 is a flowchart of a real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by an embodiment of the present application. The method can be executed by a real-time key field deduplication device based on the distributed stream computing engine Flink; the device may be implemented in software and/or hardware and may be configured in an electronic device for real-time deduplication of key fields based on Flink. The method is applicable to real-time deduplication of key fields in massive structured data. As shown in FIG. 1, the method of this embodiment is as follows.

S110: Receive target data, and determine at least one key field to be deduplicated in the target data based on the configuration file.

Wherein, the target data is structured data; the configuration file includes key fields to be deduplicated that match the target data.

For example, the target data may be one structured record to be deduplicated among massive data. The target data includes multiple key fields and has its own business characteristics, that is, it belongs to a certain type of business data. The key fields to be deduplicated for each type of business data that requires deduplication are pre-configured in the configuration file. Different types of business data have different key fields to be deduplicated; for example, the key fields to be deduplicated for user Internet access information differ from those for train ticket information. An embodiment may determine at least one key field to be deduplicated in the target data based on the configuration file according to the actual deduplication requirement. In one embodiment, the target data can be ingested through the Kafka message system and consumed from Kafka by Flink. Kafka is a high-throughput distributed publish-subscribe message system that can handle all the action stream data of consumers on a website. Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime system can execute both batch and stream processing programs. In addition, Flink's runtime natively supports the execution of iterative algorithms.

Structured data, also called row data, is logically expressed and implemented as a two-dimensional table structure, strictly follows data format and length specifications, and is mainly stored and managed in relational databases.
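As an illustration of step S110, the following is a minimal sketch of looking up the key fields to be deduplicated for a record from a configuration file. The JSON layout, the business type names, and the field names are assumptions made for this example only; the application does not specify the configuration format.

```python
import json

# Hypothetical configuration: business type -> key fields to deduplicate.
# The layout and all names are illustrative assumptions.
CONFIG_TEXT = """
{
  "internet_access": ["user_account", "client_mac"],
  "train_ticket": ["passenger_id", "train_number", "departure_date"]
}
"""


def load_config(text):
    """Parse the configuration into a dict of business type -> field names."""
    return json.loads(text)


def key_fields_for(record, config):
    """Return (field name, value) pairs for the key fields configured
    for this record's business type; missing fields get an empty value."""
    fields = config[record["business_type"]]
    return [(name, record.get(name, "")) for name in fields]


config = load_config(CONFIG_TEXT)
record = {
    "business_type": "train_ticket",
    "passenger_id": "P001",
    "train_number": "G101",
    "departure_date": "2021-11-16",
    "seat": "05A",  # not a key field, so it is ignored
}
selected = key_fields_for(record, config)
```

Fields that are not listed for the business type, such as `seat` here, play no part in deduplication.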

S120: Encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data.

The encryption algorithm may be the MD5 (Message Digest Algorithm 5) algorithm. In one embodiment, the key fields to be deduplicated determined from the configuration file may be spliced according to the key field configuration strategy, using logical AND or logical OR, to obtain the key field string to be deduplicated. The key field string is then encrypted with the MD5 algorithm to determine its MD5 value, that is, the encrypted data. The MD5 value is a 128-bit digest, which can be represented as a 32-character hexadecimal string.
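Step S120 can be sketched with Python's standard `hashlib` module. The input string below is a made-up key field string; only the use of MD5 and the 32-character hexadecimal form come from the text above.

```python
import hashlib


def md5_hex(key_field_string):
    """Return the MD5 digest of the string as 32 hexadecimal characters."""
    return hashlib.md5(key_field_string.encode("utf-8")).hexdigest()


# Hypothetical key field string (field name + field value).
digest = md5_hex("bank_card_number=6222000011112222")
```

Whatever the length of the key field string, the digest has a fixed length, which keeps the values handed to the later deduplication step compact.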

S130: Set a timer based on Flink, take the starting time in the timer as the starting time point, and use a Bloom filter to deduplicate the encrypted data during the deduplication duration in the timer.

Since this embodiment performs real-time deduplication of massive structured data, target data continuously flows into the Kafka message system. In one embodiment, after the encrypted data is determined, Flink's KeyBy operator is used to group the encrypted data, that is, the MD5 values of the key field strings, so that data with the same MD5 value falls into the same group. One Flink slot processes the data of one or more groups, and all data in the same group is processed in the same slot. Data with the same MD5 value can therefore be stored on one server.
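The effect of grouping by MD5 value can be illustrated with a simplified hash partitioner. This is not Flink code, and Flink's real key-to-slot assignment differs in detail; the modulo scheme and the function names below are assumptions that only demonstrate the property relied on here, namely that equal MD5 values always land in the same slot.

```python
import hashlib
from collections import defaultdict


def md5_hex(text):
    """MD5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def slot_for(md5_value, parallelism):
    """Map an MD5 value to a slot index; equal values map to the same slot."""
    return int(md5_value, 16) % parallelism


def group_by_slot(md5_values, parallelism):
    """Distribute values over slots, keeping duplicates together."""
    slots = defaultdict(list)
    for value in md5_values:
        slots[slot_for(value, parallelism)].append(value)
    return dict(slots)


values = [md5_hex(s) for s in ["a", "b", "a", "c", "b"]]
groups = group_by_slot(values, parallelism=4)
```

Because duplicates are co-located, each slot can run its own in-memory filter without coordinating with the others.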

After the encrypted data is grouped, this embodiment can deduplicate key fields in memory in real time based on a timer and a Bloom filter. For example, this embodiment can call the relevant Flink methods to set a timer. The timer may include a timing start time, such as the current time, and a deduplication duration. The deduplication duration is the length of time over which one round of deduplication is performed on the structured data ingested through the Kafka message system. It may be set according to actual needs, for example to 5 minutes or 10 minutes.

For data processing scenarios with looser accuracy requirements and stricter time and space requirements, a Bloom filter can be used for global deduplication. A Bloom filter consists of hash algorithms and a bitmap. String data is stored in the bitmap by applying multiple hash algorithms. Because each position stores only 0 or 1, a Bloom filter saves a great deal of storage space compared with a HashSet that stores the MD5 values, and it is especially suitable for deduplication over tens of billions of records.

In an embodiment, the parameters of the Bloom filter are set before it is used. For example, one parameter may be the tolerable false positive rate of the Bloom filter, and another may be the maximum amount of data that the Bloom filter can process within the deduplication duration. After the parameters are set, this embodiment uses the Bloom filter to deduplicate the encrypted data. Whenever a piece of encrypted data arrives, it is checked against the Bloom filter. If the encrypted data is determined to already exist in the Bloom filter, it is treated as a duplicate and discarded from the real-time data processing stream. If the encrypted data is determined not to exist in the Bloom filter, it is inserted into the Bloom filter, and the corresponding piece of target data is inserted, based on the encrypted data, into the database that stores the target data. When the deduplication duration in the timer elapses, this embodiment can reset the Bloom filter and start the next round of real-time key field deduplication.
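The check-then-insert flow described above can be sketched with a small, self-contained Bloom filter. This is an illustrative implementation, not the open source filter the embodiment relies on: the sizing formulas are the standard textbook ones, and deriving bit positions from a SHA-256 digest by double hashing is an assumption chosen for brevity.

```python
import hashlib
import math


class BloomFilter:
    """A minimal Bloom filter: a bitmap plus k hash positions per item."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing formulas for the bit count and the hash count.
        ln2 = math.log(2)
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (ln2 ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: combine two 64-bit halves of one digest (assumption).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, item):
        """True if the item was possibly added before; False if definitely not."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def put(self, item):
        """Set the bits for this item."""
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)


def deduplicate(md5_values, bloom):
    """Return the values the filter has not seen before, in arrival order."""
    kept = []
    for value in md5_values:
        if bloom.might_contain(value):
            continue  # treated as a duplicate and discarded
        bloom.put(value)
        kept.append(value)  # this is where the target data would be stored
    return kept


bloom = BloomFilter(expected_items=1000, false_positive_rate=0.0001)
kept = deduplicate(["a", "b", "a", "c", "b"], bloom)
```

A value the filter reports as absent is certainly new, while a value reported as present may occasionally be a false positive, which is why the false positive rate is a configurable tolerance.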

In the embodiment of the present application, target data is received, and at least one key field to be deduplicated in the target data is determined based on the configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data; each key field to be deduplicated is encrypted based on the encryption algorithm to determine the encrypted data; a timer is set based on Flink, the start time in the timer is taken as the starting time point, and a Bloom filter is used to deduplicate the encrypted data within the deduplication duration in the timer. By executing the embodiments of the present application, efficient real-time deduplication of key fields of massive data can be realized, storage space can be saved, and data processing efficiency can be improved.

Fig. 2 is a flowchart of another real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application. This embodiment is adjusted on the basis of the above-mentioned embodiments. As shown in Figure 2, the real-time deduplication method for key fields based on the distributed stream computing engine Flink in the embodiment of this application is as follows.

S210: Receive target data, and determine at least one key field to be deduplicated in the target data based on the configuration file.

S220: Use the logical relationship contained in the target data to splice the key fields to be deduplicated to obtain a string of key fields to be deduplicated.

Exemplarily, business data with different characteristics requires different key fields to be deduplicated and comes from different data sources, and business data with the same characteristics may populate different attribute fields. For example, some source data has a value for the user name, some has a value for the user login account, and some has values for both. The key field configuration for such data therefore supports the "or" relationship. The key field configuration strategy is as follows: all key fields to be deduplicated that are joined by the condition "and" are spliced as field name + field value. Among key fields to be deduplicated that are joined by the condition "or", the first non-empty field is spliced as field name + field value; if none of the fields joined by "or" has a value, the last of those fields is spliced as field name + empty value. The result is the key field string to be deduplicated. For example, the key fields to be deduplicated determined from the target data include three parts. The first part: bank card number, payment card number, receiver card number, device number, and client MAC (Media Access Control) address. The second part: payment account and payment ID (Identity). The third part: user login account and user ID. The key fields in the first part are joined by the condition "and", the key fields in the second part are joined by the condition "or", and the key fields in the third part are also joined by the condition "or". The three parts are then spliced together with the condition "and" to obtain the key field string to be deduplicated, in which each key field is the concatenation of its field name and field value. It should be noted that the above is only an example. In practical applications, the splicing strategy for the key fields to be deduplicated can be configured with logical AND or logical OR relationships according to actual needs; for example, a splicing strategy can be configured for all key fields in a piece of data, or for only some of them.
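The splicing strategy can be sketched as follows. The `name=value` form, the `&` separator, the group layout, and the field names are assumptions made for readability; the text above only specifies "field name + field value" and the and/or rules.

```python
def splice_group(record, fields, relation):
    """Splice one group of key fields under an 'and' or 'or' relation."""
    if relation == "and":
        # Every field contributes field name + field value (possibly empty).
        return "&".join(f"{name}={record.get(name, '')}" for name in fields)
    if relation == "or":
        # Only the first non-empty field contributes.
        for name in fields:
            value = record.get(name, "")
            if value:
                return f"{name}={value}"
        # No field has a value: last field name + empty value.
        return f"{fields[-1]}="
    raise ValueError(f"unknown relation: {relation}")


def splice_key_fields(record, groups):
    """Join the spliced groups with 'and' to form the key field string."""
    return "&".join(splice_group(record, fields, relation)
                    for relation, fields in groups)


# Hypothetical configuration mirroring the three-part example.
GROUPS = [
    ("and", ["bank_card_number", "device_number", "client_mac"]),
    ("or", ["payment_account", "payment_id"]),
    ("or", ["login_account", "user_id"]),
]
record = {
    "bank_card_number": "6222",
    "device_number": "D1",
    "client_mac": "",
    "payment_id": "PID9",
}
key_string = splice_key_fields(record, GROUPS)
```

In this example the second group resolves to `payment_id` because `payment_account` is empty, and the third group falls back to the last field name with an empty value because neither field is populated.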

S230: Encrypt the key field character string to be deduplicated based on the encryption algorithm to determine encrypted data.

In this embodiment, an encryption algorithm may be used to encrypt the string of key fields to be deduplicated obtained by concatenating the key fields to be deduplicated according to the key field configuration policy, to obtain encrypted data.

In one embodiment, the encryption algorithm includes MD5 encryption algorithm.

MD5 is a one-way algorithm. Two of its properties matter here. First, two different pieces of plaintext are, in practice, extremely unlikely to produce the same output; collisions are theoretically possible, since inputs of any length map to a fixed-length digest, but they are negligible for accidental (non-adversarial) input. Second, the output for a given piece of plaintext never changes, so the same data always produces the same output. The MD5 message digest algorithm belongs to the category of hash algorithms. It operates on an input message of any length and produces a 128-bit message digest, usually written as 32 hexadecimal characters.

Therefore, encrypting the key field string to be deduplicated with the MD5 algorithm shortens the string to a fixed length, avoids excessively long strings when there are many key fields to be deduplicated, and improves deduplication efficiency.

S240: Set a timer based on Flink, take the starting time in the timer as the starting time point, and use a Bloom filter to deduplicate the encrypted data during the deduplication duration in the timer.

In an embodiment, the setting process of the Bloom filter includes: setting a tolerable false positive rate of the Bloom filter; and determining a maximum data processing capacity of the Bloom filter according to the deduplication duration.

The tolerable false positive rate is the probability with which the Bloom filter is allowed to misjudge new data as a duplicate. The higher the tolerable false positive rate, the less processing time the processor needs; conversely, the lower the rate, the more processing time is required. The maximum data processing volume affects memory usage, and it can be determined from the deduplication duration and the daily volume of data processed by the Bloom filter. Exemplarily, if the deduplication duration is set to 5 minutes and the Bloom filter processes 2 billion records per day, the maximum data processing volume of the Bloom filter is configured as (2 billion ÷ (24 × 60)) × 5 ≈ 6,944,445 (rounded up). If the tolerable false positive rate is set to 0.0001, the number of hash functions to execute can be determined from that rate; each hash operation maps the data to a bit position that is stored as 0 or 1 in the Bloom filter. Furthermore, the memory usage of the Bloom filter can be calculated according to open source Bloom filter source code.
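The sizing arithmetic above can be reproduced in a few lines. The window capacity follows the example in the text; the bit count and hash count use the standard Bloom filter formulas, which are an assumption here, since the text does not say which open source implementation or formula is used.

```python
import math


def max_items_per_window(items_per_day, window_minutes):
    """Expected number of items in one deduplication window, rounded up."""
    return math.ceil(items_per_day * window_minutes / (24 * 60))


def bloom_parameters(expected_items, false_positive_rate):
    """Return (bit count, hash count) from the standard sizing formulas:
    m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2."""
    ln2 = math.log(2)
    num_bits = math.ceil(-expected_items * math.log(false_positive_rate) / (ln2 ** 2))
    num_hashes = max(1, round(num_bits / expected_items * ln2))
    return num_bits, num_hashes


n = max_items_per_window(2_000_000_000, 5)
m, k = bloom_parameters(n, 0.0001)
memory_mib = m / 8 / 1024 / 1024  # bitmap size in MiB
```

Under these formulas the example works out to a bitmap on the order of 16 MiB with 13 hash functions per item, far smaller than a set holding nearly seven million 32-character strings.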

Therefore, by setting the tolerable false positive rate of the Bloom filter and determining its maximum data processing capacity according to the deduplication duration, the Bloom filter can be configured according to actual needs. Real-time deduplication of key fields based on the Bloom filter can be realized, memory space can be saved, and data processing efficiency can be improved.

In one embodiment, using a Bloom filter to deduplicate the encrypted data includes: discarding the encrypted data if it is determined based on the Bloom filter that the encrypted data already exists.

For example, this embodiment can call the mightContain() method of the Bloom filter to determine whether the encrypted data already exists in the Bloom filter. If it does, the encrypted data is a duplicate: it is deduplicated and discarded in the real-time processing stream, and there is no need to add it to the database.

Discarding the encrypted data when the Bloom filter determines that it already exists achieves real-time deduplication of key fields through the Bloom filter, avoids adding duplicate records to the database, reduces memory usage, and improves deduplication efficiency.

In an embodiment, using a Bloom filter to deduplicate the encrypted data includes: if it is determined based on the Bloom filter that the encrypted data does not exist, adding the encrypted data to the Bloom filter, and add the target data to the target database according to the encrypted data.

For example, the mightContain() method of the Bloom filter is called to determine whether the encrypted data already exists in the Bloom filter; if the encrypted data has not appeared, the put() method is called to insert it into the Bloom filter. The target data corresponding to the encrypted data then only needs to be written into the target database according to the encrypted data. The target database may be a database storing massive data. During real-time stream processing, the MD5 value generated from the key fields to be deduplicated is assigned to a specified field. If the Bloom filter is reset before the reset time is reached, for example because of a task restart, deduplication on the real-time processing stream will be incomplete. In that case, the storage stage can decide whether to add or update the data according to the MD5 value, which provides a second layer of deduplication.
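The add-or-update decision at the storage stage can be sketched with an in-memory SQLite database keyed on the MD5 value. The table name, column names, and the choice of SQLite are assumptions for illustration; the application does not name the target database.

```python
import sqlite3


def open_store():
    """Create an in-memory table whose primary key is the MD5 value."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE target_data (md5 TEXT PRIMARY KEY, payload TEXT NOT NULL)")
    return conn


def add_or_update(conn, md5_value, payload):
    """Insert a new row, or replace the existing row with the same MD5 value."""
    conn.execute(
        "INSERT OR REPLACE INTO target_data (md5, payload) VALUES (?, ?)",
        (md5_value, payload),
    )
    conn.commit()


conn = open_store()
add_or_update(conn, "abc", "first copy")
add_or_update(conn, "abc", "copy that slipped past the filter")  # updates, no new row
add_or_update(conn, "def", "another record")
rows = conn.execute(
    "SELECT md5, payload FROM target_data ORDER BY md5").fetchall()
```

Even if a duplicate gets through the stream-side filter after an early reset, the primary key on the MD5 value keeps the table at one row per key field combination.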

Adding the encrypted data to the Bloom filter and adding the target data to the target database when the Bloom filter determines that the encrypted data does not exist achieves real-time deduplication of key fields through the Bloom filter, which can reduce memory usage and improve deduplication efficiency.

In an embodiment, after the encrypted data is deduplicated by using the Bloom filter, the method further includes: if it is detected that the count value of the timer and the deduplication duration satisfy a preset constraint condition, resetting the Bloom filter; and controlling the timer to restart timing according to the deduplication duration and controlling the Bloom filter to perform the deduplication operation.

For example, the preset constraint condition may be that the count value equals the deduplication duration, or that the difference between the count value and the timing start time equals the deduplication duration; it can be set according to actual needs. If this embodiment detects that the count value of the timer and the deduplication duration satisfy the preset constraint condition, all encrypted data in the Bloom filter is cleared, the Flink timer is controlled to start the next round of timing, and the Bloom filter is controlled to begin a new round of deduplication for the deduplication duration.
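The reset cycle can be sketched with explicit timestamps. Several simplifications are assumed: a Python set stands in for the Bloom filter, the clock is checked on each record rather than driven by a Flink timer callback, and the constraint is taken to be "elapsed time since the window start is at least the deduplication duration". The class and method names are hypothetical.

```python
class WindowedDeduplicator:
    """Deduplicate within fixed windows, clearing state when a window ends."""

    def __init__(self, dedup_duration_seconds, start_time):
        self.duration = dedup_duration_seconds
        self.window_start = start_time
        self.seen = set()  # stand-in for the Bloom filter (assumption)

    def _maybe_reset(self, now):
        # Assumed constraint: elapsed time since the start >= duration.
        if now - self.window_start >= self.duration:
            self.seen.clear()        # reset the filter
            self.window_start = now  # restart timing for the next round

    def accept(self, md5_value, now):
        """Return True if the value is new in the current window."""
        self._maybe_reset(now)
        if md5_value in self.seen:
            return False
        self.seen.add(md5_value)
        return True


dedup = WindowedDeduplicator(dedup_duration_seconds=300, start_time=0)
results = [
    dedup.accept("a", now=10),   # first sighting in window 1
    dedup.accept("a", now=20),   # duplicate in window 1
    dedup.accept("a", now=300),  # window elapsed: state reset, seen as new
    dedup.accept("a", now=310),  # duplicate in window 2
]
```

The third call shows the trade-off of periodic resets: a value repeated across a window boundary is accepted again, which is the case the storage-side check on the MD5 value is meant to catch.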

Thus, if it is detected that the count value of the timer and the deduplication duration satisfy the preset constraint condition, the Bloom filter is reset, the timer is controlled to restart timing according to the deduplication duration, and the Bloom filter is controlled to perform deduplication. This prevents the data in the Bloom filter from growing continuously and degrading memory usage, realizes efficient real-time deduplication of key fields of massive data, saves storage space, and improves data processing efficiency.

In the embodiment of the present application, the target data is received, and at least one key field to be deduplicated in the target data is determined based on the configuration file; the key fields to be deduplicated are spliced using the logical relationship contained in the target data to obtain the key field string to be deduplicated; the key field string to be deduplicated is encrypted based on the encryption algorithm to determine the encrypted data; a timer is set based on Flink, the start time in the timer is taken as the starting time point, and a Bloom filter is used to deduplicate the encrypted data within the deduplication duration in the timer. By executing this embodiment, key fields of massive data can be deduplicated efficiently in real time, storage space can be saved, and data processing efficiency can be improved.

Fig. 3 is a schematic structural diagram of a real-time deduplication device for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application. The device can be implemented in software and/or hardware, and can be configured in an electronic device for real-time deduplication of key fields based on the distributed stream computing engine Flink. As shown in Figure 3, the device includes:

The key field determination module 310 is configured to receive target data, and determine at least one key field to be deduplicated in the target data based on the configuration file; wherein the target data is structured data, and the configuration file includes the key field to be deduplicated that matches the target data;

The encrypted data determination module 320 is configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine the encrypted data;

The deduplication module 330 is configured to set a timer based on Flink, take the start time in the timer as the starting time point, and deduplicate the encrypted data with a Bloom filter within the deduplication duration in the timer.

In one embodiment, the encrypted data determination module 320 includes: a key field string determination unit, configured to splice the at least one key field to be deduplicated using the logical relationship contained in the target data to obtain the key field string to be deduplicated; and an encrypted data determination unit, configured to encrypt the key field string to be deduplicated based on the encryption algorithm to determine the encrypted data.

In one embodiment, the encryption algorithm includes MD5 encryption algorithm.

In an embodiment, the device further includes a reset module, configured to: after the encrypted data is deduplicated by the Bloom filter, if it is detected that the count value of the timer and the deduplication duration satisfy a preset constraint condition, reset the Bloom filter; and control the timer to restart timing according to the deduplication duration and control the Bloom filter to perform the deduplication operation.

In an embodiment, the deduplication module 330 is configured to discard the encrypted data if it is determined based on the Bloom filter that the encrypted data already exists.

In one embodiment, the deduplication module 330 is configured to: if it is determined based on the Bloom filter that the encrypted data does not exist, add the encrypted data to the Bloom filter, and add the target data to the target database according to the encrypted data.

The device provided in the above embodiments can execute the real-time deduplication method for key fields based on the distributed stream computing engine Flink provided in any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing the method.

Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in Fig. 4, the device includes:

One or more processors 410, one processor 410 is taken as an example in FIG. 4;

memory 420;

The device may also include: an input device 430 and an output device 440 .

The processor 410, the memory 420, the input device 430 and the output device 440 in the device may be connected through a bus or in other ways. In FIG. 4, connection through a bus is taken as an example.

The memory 420 (also referred to as a storage device) is a non-transitory computer-readable storage medium that can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the real-time deduplication method for key fields based on the distributed stream computing engine Flink in the embodiment of the present application. By running the software programs, instructions, and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the computer device, that is, implements the real-time deduplication method for key fields based on the distributed stream computing engine Flink of the above method embodiments.

The memory 420 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the computer device, and the like. In addition, the memory 420 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory 420 may optionally include memories that are remotely located relative to the processor 410, and these remote memories may be connected to the terminal device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the computer device. The output device 440 may include a display device such as a display screen.

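The embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiments of the present application is implemented.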

Any combination of one or more computer-readable media may be utilized. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. Examples (a non-exhaustive list) of computer-readable storage media include: electrical connections with one or more conductors, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage components, magnetic storage devices, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

The storage medium may be a non-transitory storage medium.

A computer-readable signal medium may include a data signal carrying computer-readable program code in baseband or as part of a carrier wave. Such propagated data signals may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.

The program code contained on the computer-readable medium can be transmitted by any appropriate medium, including, but not limited to, wireless, electric wire, optical cable, radio frequency (RF), etc., or any suitable combination of the above.

Computer program code for performing the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).

The foregoing are some embodiments of the present application. Those skilled in the art will understand that the present application is not limited to the specific embodiments described here, and that various changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present application. Therefore, although the present application has been described through the above embodiments, it is not limited to them and can include other equivalent embodiments without departing from the concept of the present application; the scope of the present application is determined by the scope of the appended claims.

Claims

A real-time deduplication method for key fields based on the distributed stream computing engine Flink, comprising:

receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data, and the configuration file includes the key field to be deduplicated that matches the target data;

encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and

setting a timer based on Flink, taking the start time in the timer as a starting time point, and deduplicating the encrypted data by using a Bloom filter within the deduplication duration in the timer.
The method according to claim 1, wherein encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine the encrypted data comprises:

splicing the at least one key field to be deduplicated according to the logical relationship contained in the target data to obtain a character string of key fields to be deduplicated; and

encrypting the character string of key fields to be deduplicated based on the encryption algorithm to determine the encrypted data.
The method according to claim 1, wherein the encryption algorithm comprises a message digest MD5 encryption algorithm.
The method according to claim 1, further comprising:

resetting the Bloom filter in response to detecting that the count value of the timer and the deduplication duration satisfy a preset constraint condition; and

controlling the timer to count again according to the deduplication duration, and controlling the Bloom filter to perform the deduplication operation.
The method according to claim 1, wherein using a Bloom filter to deduplicate the encrypted data comprises:

discarding the encrypted data in response to determining, based on the Bloom filter, that the encrypted data already exists.
The method according to claim 1, wherein using a Bloom filter to deduplicate the encrypted data comprises:

in response to determining, based on the Bloom filter, that the encrypted data does not exist, adding the encrypted data to the Bloom filter, and adding the target data to a target database based on the encrypted data.
The method according to claim 1, wherein the setting process of the Bloom filter comprises:

setting a tolerable misjudgment (false-positive) rate of the Bloom filter; and

determining the maximum data processing capacity of the Bloom filter according to the deduplication duration.
A real-time deduplication device for key fields based on the distributed stream computing engine Flink, comprising:

a key field determination module configured to receive target data, and determine at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data, and the configuration file includes the key field to be deduplicated that matches the target data;

an encrypted data determination module configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and

a deduplication module configured to set a timer based on Flink, take the start time in the timer as a starting time point, and deduplicate the encrypted data by using a Bloom filter within the deduplication duration in the timer.
An electronic device, comprising:

one or more processors; and

a storage means configured to store one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the real-time deduplication method for key fields based on the distributed stream computing engine Flink according to any one of claims 1-7.
A computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the real-time deduplication method for key fields based on the distributed stream computing engine Flink according to any one of claims 1-7 is implemented.
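
For illustration only, the steps recited in claims 1 to 7 can be sketched in plain Java as follows. This is a minimal, non-limiting sketch under several assumptions that are not stated in the application: the class name `KeyFieldDeduplicator`, the `|` delimiter used for splicing, the derivation of Bloom filter positions from the two halves of the MD5 digest, and the standard sizing formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2 are all hypothetical choices. The Flink timer is replaced by timestamps passed in by the caller so that the sketch has no Flink dependency.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: windowed Bloom-filter deduplication of MD5-digested key fields.
class KeyFieldDeduplicator {
    private final int numBits;       // m: size of the bit array
    private final int numHashes;     // k: number of bit positions per item
    private final long windowMillis; // deduplication duration
    private BitSet bits;
    private long windowStart;        // start time point of the current window

    KeyFieldDeduplicator(long expectedItems, double falsePositiveRate,
                         long windowMillis, long startMillis) {
        // Assumed standard sizing from capacity n and tolerable misjudgment rate p.
        double ln2 = Math.log(2);
        this.numBits = (int) Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedItems * ln2));
        this.windowMillis = windowMillis;
        this.windowStart = startMillis;
        this.bits = new BitSet(numBits);
    }

    // Splices the key fields with an assumed '|' delimiter and returns the MD5 digest.
    static byte[] md5(List<String> keyFields) {
        String spliced = String.join("|", keyFields);
        try {
            return MessageDigest.getInstance("MD5").digest(spliced.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Renders a digest as a lowercase hexadecimal string.
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Returns true if the record should be kept (not seen in the current window),
    // false if it should be discarded as a probable duplicate.
    boolean offer(List<String> keyFields, long nowMillis) {
        // Reset the filter and restart the window once the deduplication duration has elapsed.
        if (nowMillis - windowStart >= windowMillis) {
            bits = new BitSet(numBits);
            windowStart = nowMillis;
        }
        byte[] digest = md5(keyFields);
        long h1 = ByteBuffer.wrap(digest, 0, 8).getLong();
        long h2 = ByteBuffer.wrap(digest, 8, 8).getLong();
        boolean alreadyPresent = true;
        for (int i = 0; i < numHashes; i++) {
            // Double hashing: position i is (h1 + i * h2) mod m.
            int index = (int) Math.floorMod(h1 + i * h2, (long) numBits);
            if (!bits.get(index)) {
                alreadyPresent = false;
                bits.set(index);
            }
        }
        return !alreadyPresent;
    }
}
```

In this sketch, a caller would write the target data to the target database only when `offer` returns true. In an actual Flink job, the window reset would typically be driven by a registered timer rather than by comparing timestamps on each record; that wiring is omitted here.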