WO2023087769A1 - Method for deduplicating key field in real time on basis of distributed stream calculation engine flink - Google Patents

Info

Publication number
WO2023087769A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
deduplication
deduplicated
bloom filter
encrypted data
Application number
PCT/CN2022/107574
Other languages
French (fr)
Chinese (zh)
Inventor
任丽超
张俊杰
冯宇波
毛勇岗
Original Assignee
北京锐安科技有限公司
Application filed by 北京锐安科技有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Definitions

  • The embodiments of the present application relate to the technical field of big data processing, for example to a real-time deduplication method for key fields based on the distributed stream computing engine Flink.
  • Eliminating duplicate data is a requirement often encountered in actual business.
  • Deduplicating data helps reduce storage space and improve server processing efficiency.
  • Deduplication of key fields is an incremental, long-term process.
  • Real-time field deduplication in the related art works as follows: either each piece of data in the real-time data stream is sent to Redis for a duplicate check, or an unordered, duplicate-free HashSet is used. With Redis, however, every check requires a network connection to the Redis service; the network is slower than an in-memory cache and may be unstable. With a HashSet, processing efficiency drops sharply as the data grows, and a large amount of memory is consumed.
  • The embodiments of this application provide a real-time deduplication method for key fields based on the distributed stream computing engine Flink.
  • An embodiment of the present application provides a real-time deduplication method for key fields based on the distributed stream computing engine Flink, the method including:
  • receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
  • encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
  • setting a timer based on Flink, taking the start time in the timer as the starting time point, and deduplicating the encrypted data with a Bloom filter within the deduplication duration in the timer.
  • The embodiment of the present application also provides a real-time deduplication device for key fields based on the distributed stream computing engine Flink, which includes:
  • a key field determination module configured to receive the target data and determine at least one key field to be deduplicated in the target data based on the configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
  • an encrypted data determination module configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
  • a deduplication module configured to set a timer based on Flink, take the start time in the timer as the starting time point, and deduplicate the encrypted data with a Bloom filter within the deduplication duration in the timer.
  • The embodiment of the present application also provides an electronic device, which includes:
  • one or more processors; and
  • a storage means configured to store one or more programs.
  • The embodiment of the present application also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the real-time deduplication method for key fields based on the distributed stream computing engine Flink described in any embodiment is implemented.
  • FIG. 1 is a flow chart of a real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application;
  • FIG. 2 is a flow chart of another real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a real-time deduplication device for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application;
  • Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Fig. 1 is a flowchart of a real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by an embodiment of the present application. The method can be executed by a real-time key field deduplication device based on the distributed stream computing engine Flink; the device may be realized by software and/or hardware, and may be configured in an electronic device for real-time deduplication of key fields based on the distributed stream computing engine Flink.
  • The method can be applied to real-time deduplication of key fields in massive structured data.
  • The method of this embodiment of the present application is as follows.
  • S110 Receive target data, and determine at least one key field to be deduplicated in the target data based on the configuration file.
  • the target data is structured data
  • the configuration file includes key fields to be deduplicated that match the target data.
  • the target data may be a piece of structured data to be deduplicated in massive data.
  • the target data includes multiple key fields, and the target data has its own business characteristics, that is, the target data belongs to a certain type of business data.
  • The key fields to be deduplicated for each type of business data are pre-configured in the configuration file. Different business data have different key fields to be deduplicated.
  • For example, the key fields to be deduplicated for user Internet access information differ from those for train ticket information.
  • An embodiment may determine at least one key field to be deduplicated in the target data based on a configuration file according to an actual deduplication requirement.
  • the target data can be accessed through the Kafka message system, and the target data in the Kafka message system can be consumed through Flink.
  • Kafka is a high-throughput distributed publish-subscribe message system, which can handle all action flow data of consumers in the website.
  • Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. Its core is a distributed stream data flow engine written in Java and Scala. Flink executes arbitrary streaming data programs in a data parallel and pipeline manner, and Flink's pipeline runtime system can execute batch and stream processing programs. In addition, Flink's runtime itself supports the execution of iterative algorithms.
  • structured data is also called row data, which is logically expressed and realized by a two-dimensional table structure, strictly follows the data format and length specification, and is mainly stored and managed by a relational database.
  • S120 Encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data.
  • The encryption algorithm may be the MD5 (Message Digest Algorithm 5) algorithm.
  • The key fields to be deduplicated determined from the configuration file may be spliced, according to the key field configuration strategy, using logical AND or logical OR to obtain the key field string to be deduplicated. That string is then encrypted with the MD5 algorithm to determine its MD5 value, that is, the encrypted data.
  • The MD5 value is a 128-bit digest, which can be written as a 32-character hexadecimal string.
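As a rough illustration of S120, the following Python sketch computes the MD5 value of an already spliced key field string with the standard library. The function name `md5_of_key_string` and the sample string are assumptions made for this example; they do not come from the application.

```python
import hashlib


def md5_of_key_string(key_string: str) -> str:
    """Return the MD5 digest of the string as 32 lowercase hex characters."""
    return hashlib.md5(key_string.encode("utf-8")).hexdigest()


# Hypothetical spliced key field string (field name followed by field value).
encrypted_data = md5_of_key_string("bank_card6222device_idD1")
print(len(encrypted_data))  # a digest is always 32 hex characters long
```

Because the digest length is fixed, downstream grouping and filtering work on short, uniform keys regardless of how many key fields were spliced.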
  • S130 Set a timer based on Flink, take the starting time in the timer as the starting time point, and use a Bloom filter to deduplicate the encrypted data during the deduplication duration in the timer.
  • The target data is continuously fed into the Kafka message system.
  • Flink's KeyBy operator is used to group the determined encrypted data, that is, the MD5 values of the key field strings, so that data with the same MD5 value falls into one group. One Flink slot processes the data of one or more groups, which means that all data in the same group is handled in a single Flink slot.
  • Data with the same MD5 value can be stored on one server.
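The effect of grouping by MD5 value can be sketched in Python as follows. This only illustrates key-based routing in general; it is not Flink's actual key-group assignment, and `slot_for` together with its modulo rule is an assumption introduced for the example.

```python
import hashlib


def slot_for(md5_value: str, num_slots: int) -> int:
    """Map a 32-character hex MD5 value to a slot index in [0, num_slots)."""
    # Assumption: plain modulo routing, used here only to show the idea.
    return int(md5_value, 16) % num_slots


# Hypothetical key field strings; the first and third are identical.
key_strings = ["user_namealice", "user_namebob", "user_namealice"]
md5_values = [hashlib.md5(s.encode("utf-8")).hexdigest() for s in key_strings]
slots = [slot_for(v, 4) for v in md5_values]
```

Identical MD5 values always map to the same slot, so a per-slot filter sees every repeat of a given key.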
  • This embodiment can implement real-time deduplication of key fields in memory based on a timer and a Bloom filter.
  • This embodiment can call the relevant Flink methods to set the timer.
  • The timer may include a timing start time, such as the current time.
  • The timer may also include a deduplication duration.
  • The deduplication duration indicates how long one round of deduplication is performed on the structured data accessed through the Kafka message system. It may be set according to actual needs, for example 5 minutes or 10 minutes.
  • Bloom filters can be used for global deduplication.
  • A Bloom filter consists of hash algorithms and a bitmap. For string data, multiple hash algorithms map each value to positions in the bitmap. Because each position stores only a 0 or a 1, a Bloom filter saves a great deal of storage space compared with a HashSet that stores the MD5 values, and it is especially suitable for deduplication across tens of billions of records.
  • Before deduplication, the relevant parameters of the Bloom filter can be set, for example the tolerable false-positive rate of the Bloom filter and the maximum amount of data the Bloom filter can process within the deduplication duration.
  • This embodiment uses the Bloom filter to deduplicate the encrypted data. Whenever a piece of encrypted data arrives, it is checked against the Bloom filter. If the encrypted data is determined to already exist in the Bloom filter, it is deduplicated and discarded in the real-time data processing stream.
  • Otherwise, the encrypted data is inserted into the Bloom filter, and the corresponding piece of target data is inserted into the database storing the target data based on the encrypted data.
  • When the deduplication duration elapses, this embodiment can reset the Bloom filter and perform the next round of real-time key field deduplication.
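The check-then-insert flow described above can be sketched with a small hand-written Bloom filter. This is a minimal sketch, not the filter used in the application: the class `BloomFilter`, the SHA-256 based double hashing, and the helper `deduplicate` are all assumptions chosen to keep the example self-contained.

```python
import hashlib


class BloomFilter:
    """Bitmap plus k hash positions per item (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def put(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)


def deduplicate(records, bloom):
    """Keep a record only if its MD5 value has not been seen by the filter."""
    kept = []
    for md5_value, record in records:
        if bloom.might_contain(md5_value):
            continue  # treated as a duplicate and discarded
        bloom.put(md5_value)
        kept.append(record)  # would be written to the target database
    return kept


stream = [("a" * 32, "r1"), ("b" * 32, "r2"), ("a" * 32, "r3")]
kept = deduplicate(stream, BloomFilter(num_bits=1 << 16, num_hashes=5))
```

A Bloom filter never misses a value it has stored, but it can occasionally report an unseen value as present, which is why the tolerable false-positive rate is a configuration parameter.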
  • In this embodiment, target data is received, and at least one key field to be deduplicated in the target data is determined based on the configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data; each key field to be deduplicated is encrypted based on the encryption algorithm to determine the encrypted data; and a timer is set based on Flink, the starting time in the timer is taken as the starting time point, and the encrypted data is deduplicated with a Bloom filter within the deduplication duration in the timer.
  • Fig. 2 is a flowchart of another real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiment of the present application. This embodiment is adjusted on the basis of the above-mentioned embodiments. As shown in Figure 2, the real-time deduplication method for key fields based on the distributed flow computing engine Flink in the embodiment of this application is as follows.
  • S210 Receive target data, and determine at least one key field to be deduplicated in the target data based on the configuration file.
  • S220 Use the logical relationship contained in the target data to splice the key fields to be deduplicated to obtain a string of key fields to be deduplicated.
  • Different business data require different key fields to be deduplicated, and business data with the same characteristics from different data sources may carry different attribute fields.
  • For example, some source data has a value for the user name, some has a value for the user login account, and some has values for both the user name and the login account.
  • The key field configuration strategy is as follows: key fields to be deduplicated that are joined by the condition "and" are each concatenated as field name + field value; for key fields to be deduplicated that are joined by the condition "or", only the first non-empty field is concatenated as field name + field value.
  • For example, the key fields to be deduplicated determined according to the target data include three parts. The first part: bank card number, payment card number, receiver card number, device number, and client MAC (Media Access Control) address. The second part: payment account and payment ID (Identity). The third part: user login account and user ID.
  • The key fields in the first part are connected with the condition "and", the key fields to be deduplicated in the second part are connected with the condition "or", and the key fields to be deduplicated in the third part are also connected with each other.
  • Each key field to be deduplicated is spliced as field name plus field value. Note that the above is only an example.
  • the splicing strategy of the key fields to be deduplicated can be configured according to the actual needs using the relationship of logical and or logical or, for example, configuring for all key fields in a piece of data
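The AND/OR splicing strategy can be sketched in Python as follows. The function `splice_key_fields`, its parameters, and the sample field names are hypothetical; the application does not specify a configuration format, so a list of AND fields and a list of OR groups is assumed here.

```python
def splice_key_fields(record, and_fields, or_groups):
    """Build the key field string to be deduplicated.

    and_fields: every field is appended as field name + field value.
    or_groups:  per group, only the first non-empty field is appended.
    """
    parts = [f"{name}{record.get(name, '')}" for name in and_fields]
    for group in or_groups:
        for name in group:
            value = record.get(name)
            if value:  # skip missing or empty values
                parts.append(f"{name}{value}")
                break
    return "".join(parts)


record = {"bank_card": "6222", "device_id": "D1", "pay_account": "", "pay_id": "P9"}
key_string = splice_key_fields(
    record,
    and_fields=["bank_card", "device_id"],
    or_groups=[["pay_account", "pay_id"]],
)
print(key_string)  # bank_card6222device_idD1pay_idP9
```

Including the field name in each part keeps two records apart when the same value appears under different fields.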
  • S230 Encrypt the key field character string to be deduplicated based on the encryption algorithm to determine encrypted data.
  • an encryption algorithm may be used to encrypt the string of key fields to be deduplicated obtained by concatenating the key fields to be deduplicated according to the key field configuration policy, to obtain encrypted data.
  • the encryption algorithm includes MD5 encryption algorithm.
  • MD5 is a one-way algorithm: the original input cannot practically be recovered from its output.
  • Two of its properties matter here. First, two different pieces of plaintext are very unlikely to produce the same output, although MD5 collisions are known to exist. Second, a given piece of plaintext always produces the same output.
  • The former means that distinct key field strings can in practice be told apart by their MD5 values, and the latter means that identical key field strings always yield the same MD5 value.
  • The MD5 message digest algorithm belongs to the category of hash algorithms. It operates on an input message of any length and produces a 128-bit message digest, usually written as 32 hexadecimal characters.
  • By encrypting the key field string to be deduplicated with the MD5 algorithm to determine the encrypted data, the length of the string is reduced, excessively long strings are avoided when there are many key fields to be deduplicated, and deduplication efficiency is improved.
  • S240 Set a timer based on Flink, take the starting time in the timer as the starting time point, and use a Bloom filter to deduplicate the encrypted data during the deduplication duration in the timer.
  • The setting process of the Bloom filter includes: setting a tolerable false-positive rate of the Bloom filter; and determining the maximum data processing capacity of the Bloom filter according to the deduplication duration.
  • The tolerable false-positive rate is the probability with which the Bloom filter is allowed to wrongly report a piece of data as a duplicate.
  • The number of hash functions to execute can be determined from the tolerable false-positive rate; each hash operation yields a position whose bit (0 or 1) is stored in the Bloom filter. In addition, the memory usage of the Bloom filter can be calculated from the source code of an open-source Bloom filter.
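The relationship between the tolerable false-positive rate, the expected data volume, and memory usage can be estimated with the textbook Bloom filter formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The application does not state which formulas its open-source filter uses, so the following Python sketch is an assumption based on those standard formulas; `bloom_parameters` is a hypothetical helper.

```python
import math


def bloom_parameters(expected_items: int, false_positive_rate: float):
    """Return (number of bits, number of hash functions) for a Bloom filter."""
    num_bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes


# Example: one million values per deduplication round, 1% false positives.
num_bits, num_hashes = bloom_parameters(1_000_000, 0.01)
memory_mib = num_bits / 8 / 1024 / 1024  # a little over 1 MiB
```

For comparison, storing one million 32-character MD5 strings directly takes at least 32 MB before any container overhead.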
  • In this way, the Bloom filter can be configured according to actual needs, real-time deduplication of key fields based on the Bloom filter can be realized, memory space can be saved, and data processing efficiency can be improved.
  • Using a Bloom filter to deduplicate the encrypted data includes: discarding the encrypted data if it is determined based on the Bloom filter that the encrypted data already exists.
  • This embodiment can call the mightContain() method of the Bloom filter to determine whether the encrypted data already exists in the Bloom filter. If it does, the encrypted data is a repeat: it is deduplicated and discarded in the real-time processing stream, and there is no need to add it to the database.
  • By discarding encrypted data that the Bloom filter reports as already existing, key fields are deduplicated in real time, duplicate records are kept out of the database, memory usage is reduced, and deduplication efficiency is improved.
  • using a Bloom filter to deduplicate the encrypted data includes: if it is determined based on the Bloom filter that the encrypted data does not exist, adding the encrypted data to the Bloom filter, and add the target data to the target database according to the encrypted data.
  • the target database may be a database storing massive data.
  • The MD5 value of the encrypted data generated from the key fields to be deduplicated is assigned to a specified field. If the Bloom filter is reset before the reset time is reached, for example because the task restarts, deduplication on the real-time processing stream will be incomplete. In that case, the storage stage can decide whether to insert or update the data according to the MD5 value, which provides a second layer of deduplication.
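The insert-or-update decision at the storage stage can be sketched as follows. A Python dict keyed by MD5 value stands in for the target database, and `upsert` is a hypothetical helper; how the real storage stage is implemented is not described in the application.

```python
def upsert(store: dict, md5_value: str, record: dict) -> str:
    """Insert the record if its MD5 value is new, otherwise update it."""
    action = "update" if md5_value in store else "insert"
    store[md5_value] = record
    return action


store = {}
first = upsert(store, "a" * 32, {"user": "alice", "seen": 1})
# The same key arrives again, e.g. after the Bloom filter was reset early.
second = upsert(store, "a" * 32, {"user": "alice", "seen": 2})
```

Even when the stream-side filter lets a repeat through, the store still holds one row per MD5 value.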
  • By adding encrypted data that does not yet exist to the Bloom filter and adding the target data to the target database according to the encrypted data, key fields are deduplicated in real time through the Bloom filter, which reduces memory usage and improves deduplication efficiency.
  • After the encrypted data is deduplicated by using the Bloom filter, the method further includes: if it is detected that the count value of the timer and the deduplication duration satisfy the preset constraint condition, resetting the Bloom filter; and controlling the timer to time again according to the deduplication duration while controlling the Bloom filter to perform the deduplication operation.
  • The preset constraint condition may be that the count value equals the deduplication duration, or that the difference between the count value and the timing start time equals the deduplication duration; it can be set according to actual needs. If this embodiment detects that the count value of the timer and the deduplication duration satisfy the preset constraint condition, all encrypted data in the Bloom filter is cleared, the Flink timer is controlled to start the next round of timing, and the Bloom filter continues with a new round of deduplication for the deduplication duration.
  • Resetting the Bloom filter when the preset constraint condition is met, restarting the timer according to the deduplication duration, and controlling the Bloom filter to perform deduplication prevents the data in the Bloom filter from growing continuously and affecting memory usage, enables efficient real-time deduplication of key fields in massive data, saves storage space, and improves data processing efficiency.
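The reset cycle can be simulated outside Flink with explicit timestamps. In this sketch a plain Python set stands in for the Bloom filter, and the class `WindowedDeduplicator` and its method `accept` are hypothetical names; in a real job the reset would be driven by Flink's timers rather than by the arrival of the next record.

```python
class WindowedDeduplicator:
    """Deduplicate within one deduplication duration, then start over."""

    def __init__(self, dedup_duration_s: float, start_time_s: float):
        self.dedup_duration_s = dedup_duration_s
        self.window_start_s = start_time_s
        self.seen = set()  # stand-in for the Bloom filter

    def accept(self, md5_value: str, now_s: float) -> bool:
        # Constraint check: elapsed time has reached the deduplication duration.
        if now_s - self.window_start_s >= self.dedup_duration_s:
            self.seen.clear()          # reset the filter
            self.window_start_s = now_s  # restart the timing
        if md5_value in self.seen:
            return False  # duplicate within the current round
        self.seen.add(md5_value)
        return True


dedup = WindowedDeduplicator(dedup_duration_s=300, start_time_s=0)
results = [
    dedup.accept("a" * 32, 10),   # first occurrence
    dedup.accept("a" * 32, 20),   # repeat in the same round
    dedup.accept("a" * 32, 310),  # first occurrence after the reset
]
```

Repeats that straddle a reset pass through the stream, which is the case the MD5-based insert-or-update at the storage stage is meant to catch.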
  • In this embodiment, the target data is received, and at least one key field to be deduplicated in the target data is determined based on the configuration file; the logical relationship contained in the target data is used to splice the key fields to be deduplicated into the key field string to be deduplicated.
  • Fig. 3 is a schematic structural diagram of a real-time deduplication device for key fields based on the distributed flow computing engine Flink provided by the embodiment of the present application.
  • The device can be implemented by software and/or hardware, and can be configured in an electronic device for real-time deduplication of key fields based on the distributed stream computing engine Flink.
  • the device includes:
  • a key field determination module 310 configured to receive target data and determine at least one key field to be deduplicated in the target data based on the configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
  • an encrypted data determination module 320 configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine the encrypted data; and
  • a deduplication module 330 configured to set a timer based on Flink, take the start time in the timer as the starting time point, and deduplicate the encrypted data with a Bloom filter within the deduplication duration in the timer.
  • The encrypted data determination module 320 includes: a key field string determination unit configured to use the logical relationship contained in the target data to splice the at least one key field to be deduplicated to obtain the key field string to be deduplicated; and an encrypted data determining unit configured to encrypt the key field string to be deduplicated based on the encryption algorithm to determine the encrypted data.
  • the encryption algorithm includes MD5 encryption algorithm.
  • The device further includes a reset module configured to: after the encrypted data is deduplicated by the Bloom filter, if it is detected that the count value of the timer and the deduplication duration satisfy the preset constraint condition, reset the Bloom filter, control the timer to time again according to the deduplication duration, and control the Bloom filter to perform the deduplication operation.
  • The deduplication module 330 is configured to discard the encrypted data if it is determined based on the Bloom filter that the encrypted data already exists.
  • The deduplication module 330 is configured to: if it is determined based on the Bloom filter that the encrypted data does not exist, add the encrypted data to the Bloom filter, and add the target data to the target database according to the encrypted data.
  • The setting process of the Bloom filter includes: setting a tolerable false-positive rate of the Bloom filter; and determining the maximum data processing capacity of the Bloom filter according to the deduplication duration.
  • the device provided in the above embodiments can execute the real-time deduplication method for key fields based on the distributed flow computing engine Flink provided in any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
  • Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in Fig. 4, the device includes:
  • One or more processors 410, one processor 410 is taken as an example in FIG. 4;
  • the device may also include: an input device 430 and an output device 440 .
  • the processor 410, the memory 420, the input device 430 and the output device 440 in the device may be connected through a bus or in other ways. In FIG. 4, connection through a bus is taken as an example.
  • The memory 420 (also referred to as a storage device) is a non-transitory computer-readable storage medium that can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the real-time deduplication method for key fields based on the distributed stream computing engine Flink in the embodiments of the present application.
  • The processor 410 executes the various functional applications and data processing of the computer device by running the software programs, instructions, and modules stored in the memory 420, that is, it implements the real-time deduplication method for key fields based on the distributed stream computing engine Flink of the above method embodiments, namely:
  • receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
  • encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
  • setting a timer based on Flink, taking the start time in the timer as the starting time point, and deduplicating the encrypted data with a Bloom filter within the deduplication duration in the timer.
  • the memory 420 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the computer device, and the like.
  • the memory 420 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • The memory 420 may optionally include memories that are remotely located relative to the processor 410, and these remote memories may be connected to the terminal device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 430 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the computer device.
  • the output device 440 may include a display device such as a display screen.
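  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by the embodiments of the present application is implemented, that is:
  • receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
  • encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
  • setting a timer based on Flink, taking the start time in the timer as the starting time point, and deduplicating the encrypted data with a Bloom filter within the deduplication duration in the timer.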
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
  • Examples (non-exhaustive list) of computer-readable storage media include: electrical connections with one or more conductors, portable computer disks, hard disks, Random Access Memory (RAM, Random Access Memory), Read Only Memory (ROM, Read-Only Memory), erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), flash memory, optical fiber, portable compact disk read-only memory (CD-ROM, Compact Disc Read-Only Memory), optical storage components, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • the storage medium may be a non-transitory storage medium.
  • a computer readable signal medium may include a data signal carrying computer readable program code in baseband or as part of a carrier wave. Such propagated data signals may take many forms, including - but not limited to - electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • The program code contained on the computer-readable medium can be transmitted by any appropriate medium, including, but not limited to, wireless, electric wire, optical cable, RF (Radio Frequency), etc., or any suitable combination of the above.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or a similar programming language.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN, Local Area Network) or a wide area network (WAN, Wide Area Network), or it can be connected to an external computer (for example, use an Internet service provider to connect via the Internet).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for deduplicating a key field in real time on the basis of a distributed stream calculation engine Flink. The method comprises: receiving target data, and determining, from the target data and on the basis of a configuration file, at least one key field to be deduplicated (S110), wherein the target data is structured data, and the configuration file comprises the key field to be deduplicated, which matches the target data; on the basis of an encryption algorithm, encrypting the at least one key field to be deduplicated, so as to determine encrypted data (S120); and setting a timer on the basis of Flink, taking a start time in the timer as a start time point, and deduplicating the encrypted data within a deduplication duration in the timer by using a Bloom filter (S130).

Description

Real-time deduplication method for key fields based on the distributed stream computing engine Flink
This application claims priority to Chinese patent application No. 202111352389.2, filed with the China Patent Office on November 16, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of big data processing, for example, to a real-time deduplication method for key fields based on the distributed stream computing engine Flink.
Background
Eliminating duplicate data is a requirement that often arises in real business scenarios. In the big data field, removing duplicate data helps reduce storage space and improves server processing efficiency. In real-time computing, deduplicating key fields is an incremental, long-running process.
Real-time field deduplication in the related art works as follows: either each record in the real-time data stream is checked against Redis, or an unordered, duplicate-free HashSet is used. With Redis, however, every check requires a network connection to the Redis service; the network is slower than a local cache and may be unstable. With a HashSet, network factors do not matter, but once tens of millions or hundreds of millions of records are stored in it, processing efficiency drops sharply as the data grows, and a large amount of memory is consumed.
Summary
The embodiments of the present application provide a real-time deduplication method for key fields based on the distributed stream computing engine Flink.
In a first aspect, an embodiment of the present application provides a real-time deduplication method for key fields based on the distributed stream computing engine Flink, the method including:
receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data, and the configuration file includes the key fields to be deduplicated that match the target data;
encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
setting a timer based on Flink, taking the start time in the timer as the starting time point, and deduplicating the encrypted data with a Bloom filter within the deduplication duration in the timer.
In a second aspect, an embodiment of the present application further provides a real-time deduplication apparatus for key fields based on the distributed stream computing engine Flink, the apparatus including:
a key field determination module, configured to receive target data and determine at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data, and the configuration file includes the key fields to be deduplicated that match the target data;
an encrypted data determination module, configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
a deduplication module, configured to set a timer based on Flink, take the start time in the timer as the starting time point, and deduplicate the encrypted data with a Bloom filter within the deduplication duration in the timer.
In a third aspect, an embodiment of the present application further provides an electronic device, the device including:
one or more processors; and
a storage apparatus configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the real-time deduplication method for key fields based on the distributed stream computing engine Flink according to any embodiment of the present application.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the real-time deduplication method for key fields based on the distributed stream computing engine Flink according to any embodiment of the present application.
Brief Description of the Drawings
FIG. 1 is a flowchart of a real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by an embodiment of the present application;
FIG. 2 is a flowchart of another real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a real-time deduplication apparatus for key fields based on the distributed stream computing engine Flink provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The present application is described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it. It should also be noted that, for ease of description, the drawings show only the structures related to the present application rather than all structures.
FIG. 1 is a flowchart of a real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by an embodiment of the present application. The method may be executed by a real-time key field deduplication apparatus based on the distributed stream computing engine Flink. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device used for real-time deduplication of key fields based on Flink. The method is applicable to scenarios in which key fields of massive structured data are deduplicated in real time. As shown in FIG. 1, the embodiment of the present application is as follows.
S110: Receive target data, and determine at least one key field to be deduplicated in the target data based on a configuration file.
The target data is structured data, and the configuration file includes the key fields to be deduplicated that match the target data.
For example, the target data may be one piece of structured data to be deduplicated among massive data. The target data includes multiple key fields and has its own business characteristics, that is, it belongs to a particular business data type. The configuration file is preconfigured with the key fields to be deduplicated for each type of business data that requires key field deduplication. Different business data require different key fields to be deduplicated. For example, the key fields to be deduplicated for user Internet access information differ from those for train ticket information. In an embodiment, at least one key field to be deduplicated in the target data may be determined from the configuration file according to the actual deduplication requirement. In an embodiment, the target data may be ingested through the Kafka messaging system and consumed from Kafka by Flink. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all user activity stream data on a website. Apache Flink is an open-source stream processing framework developed by the Apache Software Foundation; its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can execute both batch and stream processing programs. In addition, Flink's runtime natively supports the execution of iterative algorithms.
Structured data, also called row data, is data that is logically expressed and implemented as a two-dimensional table structure. It strictly follows data format and length specifications and is mainly stored and managed in relational databases.
S120: Encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data.
The encryption algorithm may be the MD5 (Message Digest Algorithm 5) algorithm. In an embodiment, the key fields to be deduplicated that are determined from the configuration file are concatenated using logical AND or logical OR according to the key field configuration strategy, yielding a key field string to be deduplicated. The key field string is then processed with the MD5 algorithm to obtain its MD5 value, that is, the encrypted data. The MD5 value is a 128-bit digest, typically written as 32 hexadecimal characters.
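Step S120 can be illustrated with a minimal Python sketch that uses the standard-library `hashlib` module. The function name `md5_hex` and the sample key field string are illustrative assumptions, not part of the application; the sketch only shows that a key field string of any length maps to a fixed 32-character hexadecimal value.

```python
import hashlib


def md5_hex(key_field_string: str) -> str:
    """Return the MD5 digest of a key field string as 32 hexadecimal characters."""
    # The string is encoded as UTF-8 before hashing (an assumption of this sketch).
    return hashlib.md5(key_field_string.encode("utf-8")).hexdigest()


# Hypothetical key field string built from field names and field values.
sample = "bank_card6222000011112222device_idA1B2C3"
encrypted_data = md5_hex(sample)
print(encrypted_data, len(encrypted_data))
```

Because the digest has a fixed length, the value used for grouping and for the Bloom filter stays short no matter how many key fields are concatenated.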
S130: Set a timer based on Flink, take the start time in the timer as the starting time point, and deduplicate the encrypted data with a Bloom filter within the deduplication duration in the timer.
Because this embodiment deduplicates massive structured data in real time, target data flows continuously into the Kafka messaging system. In an embodiment, after the encrypted data is determined, Flink's keyBy operator groups records by the encrypted data, that is, by the MD5 value of the key field string. Records with the same MD5 value fall into the same group, and one Flink slot processes one or more groups, so all records in a group are guaranteed to be handled in the same slot. As a result, records with the same MD5 value are processed on the same server.
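The grouping idea can be sketched in plain Python. This is a simplified stand-in and not Flink's actual key-group assignment: it assumes a slot is chosen by taking the MD5 value modulo the parallelism, and the names `slot_for` and `group_by_slot` are hypothetical. It only demonstrates the property the text relies on, namely that equal MD5 values always land in the same slot.

```python
from collections import defaultdict


def slot_for(md5_value: str, parallelism: int) -> int:
    """Map a hexadecimal MD5 value to a slot index (simplified assumption)."""
    return int(md5_value, 16) % parallelism


def group_by_slot(md5_values, parallelism):
    """Group MD5 values by slot; equal values always share a slot."""
    slots = defaultdict(list)
    for value in md5_values:
        slots[slot_for(value, parallelism)].append(value)
    return dict(slots)


# Two identical values and one different value, distributed over 4 slots.
print(group_by_slot(["ff", "ff", "01"], 4))
```

Since duplicates always meet in one slot, each slot can keep its own in-memory filter without any cross-server lookup.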
After the encrypted data is grouped, this embodiment can deduplicate key fields in memory in real time based on a timer and a Bloom filter. For example, this embodiment may call the relevant Flink methods to set a timer. The timer may include a timing start time, such as the current time, and a deduplication duration. The deduplication duration is how long deduplication is performed on the structured data ingested through the Kafka messaging system. It can be chosen according to actual needs, for example 5 minutes or 10 minutes.
For data processing scenarios in which accuracy requirements are less strict but time and space requirements are demanding, a Bloom filter can be used for global deduplication. A Bloom filter consists of hash functions and a bitmap. String data is hashed multiple times, and the results are stored in the bitmap. Because each position stores only 0 or 1, a Bloom filter saves a large amount of storage space compared with a HashSet that stores MD5 values, which makes it particularly suitable for deduplicating tens of billions of records.
In an embodiment, the parameters of the Bloom filter are set before it is used. For example, the parameters may include the tolerable false positive rate of the Bloom filter and the maximum number of records the Bloom filter can process within the deduplication duration. After the parameters are set, this embodiment uses the Bloom filter to deduplicate the encrypted data. Each time a piece of encrypted data arrives, it is checked whether that encrypted data has appeared before. If the Bloom filter indicates that the encrypted data already exists, the encrypted data is discarded from the real-time processing stream as a duplicate. If the Bloom filter indicates that the encrypted data does not exist, the encrypted data is inserted into the Bloom filter, and a record of target data is inserted, based on that encrypted data, into the database that stores the target data. When the deduplication duration in the timer elapses, this embodiment can reset the Bloom filter and begin the next round of real-time key field deduplication.
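The check-then-insert flow described above can be sketched with a small Bloom filter written in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in the application: the class, its method names, and the double-hashing scheme built on SHA-256 are all choices made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit bitmap."""

    def __init__(self, expected_insertions: int, false_positive_rate: float):
        # Standard sizing formulas for the bit count m and hash count k.
        self.num_bits = max(1, math.ceil(
            -expected_insertions * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_insertions * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    def put(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)


def deduplicate(md5_values, bloom):
    """Keep only values the filter has not seen; duplicates are dropped."""
    kept = []
    for value in md5_values:
        if bloom.might_contain(value):
            continue  # treated as a duplicate and discarded
        bloom.put(value)
        kept.append(value)  # this is where the record would be written to the database
    return kept
```

A value that was inserted is always reported as present, so true duplicates are never let through while the filter is live; the trade-off is that a small fraction of new values may be wrongly discarded, bounded by the configured false positive rate.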
In this embodiment of the present application, target data is received, and at least one key field to be deduplicated in the target data is determined based on a configuration file, wherein the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data. Each key field to be deduplicated is encrypted based on an encryption algorithm to determine encrypted data. A timer is set based on Flink, the start time in the timer is taken as the starting time point, and the encrypted data is deduplicated with a Bloom filter within the deduplication duration in the timer. This embodiment enables efficient real-time deduplication of key fields in massive data, saves storage space, and improves data processing efficiency.
FIG. 2 is a flowchart of another real-time deduplication method for key fields based on the distributed stream computing engine Flink provided by an embodiment of the present application. This embodiment builds on the embodiment above. As shown in FIG. 2, the real-time deduplication method for key fields based on the distributed stream computing engine Flink in this embodiment is as follows.
S210: Receive target data, and determine at least one key field to be deduplicated in the target data based on a configuration file.
S220: Concatenate the key fields to be deduplicated using the logical relationship contained in the target data to obtain a key field string to be deduplicated.
For example, business data with different characteristics have different key fields to be deduplicated and come from different data sources. Business data with the same characteristics may still have different attribute fields populated: in some sources the user name has a value, in some the user login account has a value, and in some both have values. For such data, the key field configuration supports an OR relationship. The key field configuration strategy is as follows. All key fields to be deduplicated under an "and" condition are concatenated as field name + field value. Key fields to be deduplicated under an "or" condition are reduced to the first non-empty field, concatenated as field name + field value; if none of the fields under the "or" condition has a value, the name of the last key field is concatenated with an empty value. The result is the concatenated string of key fields to be deduplicated. For example, the key fields to be deduplicated determined from the target data consist of three parts. Part 1: bank card number, payment card number, receiver card number, device number, and client MAC (Media Access Control) address. Part 2: payment account and payment ID (Identity). Part 3: user login account and user ID. The key fields in part 1 are joined with the "and" condition, the key fields in part 2 are joined with the "or" condition, and the key fields in part 3 are also joined with the "or" condition. Parts 1, 2, and 3 are then joined with the "and" condition to obtain the key field string to be deduplicated. Each key field to be deduplicated contributes the concatenation of its field name and field value. Note that this is only an example. In practice, the concatenation strategy can be configured with logical AND or logical OR relationships as needed, for example for all key fields in a record or for only some of the key fields in a record.
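The concatenation strategy above can be sketched as a short Python function. The function name `build_key_string`, the field names, and the representation of the configuration as a list of (condition, fields) pairs are assumptions made for illustration; the application does not specify a configuration format or a separator, so the sketch concatenates field name and field value directly.

```python
def build_key_string(record, groups):
    """Concatenate key fields according to "and" / "or" groups.

    record: dict mapping field name to value (missing or empty means no value).
    groups: list of (condition, [field names]); groups themselves are joined with "and".
    """
    parts = []
    for condition, fields in groups:
        if condition == "and":
            # Every field contributes field name + field value.
            for name in fields:
                parts.append(f"{name}{record.get(name) or ''}")
        elif condition == "or":
            # Only the first non-empty field contributes; if all are empty,
            # fall back to the last field name with an empty value.
            chosen = next((n for n in fields if record.get(n)), fields[-1])
            parts.append(f"{chosen}{record.get(chosen) or ''}")
        else:
            raise ValueError(f"unknown condition: {condition}")
    return "".join(parts)


# Hypothetical record and configuration mirroring the three-part example.
record = {"bank_card": "6222", "device_id": "D1", "pay_account": "", "pay_id": "P9"}
groups = [
    ("and", ["bank_card", "device_id"]),
    ("or", ["pay_account", "pay_id"]),
    ("or", ["login_account", "user_id"]),
]
print(build_key_string(record, groups))
```

Including the field name in each part keeps two records apart when the same value appears under different fields, for example a payment account in one source and a payment ID in another.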
S230: Encrypt the key field string to be deduplicated based on the encryption algorithm to determine encrypted data.
In this embodiment, the encryption algorithm is applied to the key field string to be deduplicated, which is obtained by concatenating the key fields to be deduplicated according to the key field configuration strategy, to obtain the encrypted data.
In an embodiment, the encryption algorithm includes the MD5 algorithm.
MD5 is a one-way algorithm, and two of its properties matter here. First, two different plaintexts should not produce the same output; collisions are possible in principle but negligibly rare for ordinary business data. Second, the same plaintext always produces the same output. MD5 is a message digest algorithm and belongs to the family of hash algorithms. It takes a message of arbitrary length as input and produces a 128-bit message digest, usually written as 32 hexadecimal characters.
Therefore, processing the key field string to be deduplicated with MD5 to determine the encrypted data shortens the string, avoids overly long strings when there are many key fields to be deduplicated, and improves deduplication efficiency.
S240: Set a timer based on Flink, take the start time in the timer as the starting time point, and deduplicate the encrypted data with a Bloom filter within the deduplication duration in the timer.
In an embodiment, setting up the Bloom filter includes: setting the tolerable false positive rate of the Bloom filter; and determining the maximum data processing volume of the Bloom filter according to the deduplication duration.
The tolerable false positive rate is the probability with which the Bloom filter is allowed to make a deduplication error. The higher this rate is set, the less processing time the processor needs; conversely, the lower it is set, the more processing time is needed. The maximum data processing volume affects memory usage and can be determined from the deduplication duration and the amount of data the Bloom filter processes per day. For example, if the deduplication duration is 5 minutes and the Bloom filter processes 2 billion records per day, the maximum data processing volume is configured as (2 billion ÷ (24 × 60)) × 5 = 6944445, rounded up. With the tolerable false positive rate set to 0.0001, the number of hash functions to execute can be determined from that rate; each hash operation corresponds to one bit (0 or 1) stored in the Bloom filter. The memory usage of the Bloom filter can then be calculated according to open-source Bloom filter source code.
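The sizing arithmetic above can be reproduced with a short Python sketch. The capacity formula follows the example in the text; the bit-count and hash-count formulas are the standard Bloom filter sizing formulas, which are assumed here because the application only refers to open-source Bloom filter source code without naming one. The function names are hypothetical.

```python
import math


def max_insertions(records_per_day: int, window_minutes: int) -> int:
    """Records expected within one deduplication window, rounded up."""
    return math.ceil(records_per_day * window_minutes / (24 * 60))


def bloom_parameters(expected_insertions: int, false_positive_rate: float):
    """Return (number of bits, number of hash functions) for a Bloom filter."""
    bits = math.ceil(
        -expected_insertions * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / expected_insertions * math.log(2)))
    return bits, hashes


capacity = max_insertions(2_000_000_000, 5)
bits, hashes = bloom_parameters(capacity, 0.0001)
print(capacity, hashes, round(bits / 8 / 1_000_000, 1), "MB")
```

Under these assumptions, a 5-minute window at 2 billion records per day needs a bitmap of roughly 16 to 17 MB with 13 hash functions, which is far smaller than holding about 6.9 million 32-character MD5 strings in a HashSet.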
Therefore, by setting the tolerable false positive rate of the Bloom filter and determining its maximum data processing volume according to the deduplication duration, the Bloom filter can be configured according to actual needs, key fields can be deduplicated in real time based on the Bloom filter, memory space is saved, and data processing efficiency is improved.
In an embodiment, deduplicating the encrypted data with a Bloom filter includes: if it is determined based on the Bloom filter that the encrypted data already exists, discarding the encrypted data.
For example, this embodiment may call the Bloom filter's mightContain() method to determine whether the encrypted data already exists in the Bloom filter. If it does, the encrypted data is a duplicate; it is discarded from the real-time processing stream, and there is no need to add it to the database.
Discarding encrypted data that the Bloom filter reports as already existing makes it possible to deduplicate the key fields in real time with the Bloom filter, avoids adding duplicate records to the database, reduces memory usage, and improves deduplication efficiency.
In an embodiment, deduplicating the encrypted data with a Bloom filter includes: if it is determined based on the Bloom filter that the encrypted data does not exist, adding the encrypted data to the Bloom filter, and adding the target data to a target database according to the encrypted data.
For example, the Bloom filter's mightContain() method is called to determine whether the encrypted data already exists in the Bloom filter. If the encrypted data has not appeared before, the put() method is called to insert it into the Bloom filter, and the target data corresponding to that encrypted data is written to the target database. The target database may be a database that stores massive data. During real-time stream processing, the MD5 value generated from the key fields to be deduplicated is assigned to a designated field. If the Bloom filter is reset before its reset time because of an operation such as a task restart, deduplication on the real-time processing stream will be incomplete. In that case, the database-write stage can use the MD5 value to decide whether to insert the data as a new record or update an existing one, which provides a second layer of deduplication.
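The insert-or-update decision at the database-write stage can be sketched with Python's standard-library `sqlite3` module. SQLite, the table name `target_data`, and the column names are assumptions chosen only to keep the example self-contained; the application does not name the target database or its schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed by the MD5 value (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target_data (md5_key TEXT PRIMARY KEY, payload TEXT)")
    return conn


def upsert(conn: sqlite3.Connection, md5_value: str, payload: str) -> str:
    """Update the row with this MD5 value if it exists, otherwise insert it."""
    cursor = conn.execute(
        "UPDATE target_data SET payload = ? WHERE md5_key = ?", (payload, md5_value))
    if cursor.rowcount == 0:
        conn.execute(
            "INSERT INTO target_data (md5_key, payload) VALUES (?, ?)",
            (md5_value, payload))
        return "inserted"
    return "updated"


conn = open_store()
print(upsert(conn, "900150983cd24fb0d6963f7d28e17f72", "first arrival"))
print(upsert(conn, "900150983cd24fb0d6963f7d28e17f72", "arrival after a restart"))
```

Keying the table on the MD5 value means a record that slips past a prematurely reset filter overwrites the existing row instead of creating a duplicate.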
Adding encrypted data that the Bloom filter reports as not existing to the Bloom filter, and adding the target data to the target database according to the encrypted data, makes it possible to deduplicate the key fields in real time with the Bloom filter, reduces memory usage, and improves deduplication efficiency.
In an embodiment, after the encrypted data is deduplicated with the Bloom filter, the method further includes: if it is detected that the count value of the timer and the deduplication duration satisfy a preset constraint, resetting the Bloom filter; and controlling the timer to time again according to the deduplication duration, and controlling the Bloom filter to perform deduplication.
For example, the preset constraint may be that the count value equals the deduplication duration, or that the difference between the count value and the timing start time equals the deduplication duration; it can be set according to actual needs. If this embodiment detects that the count value of the timer and the deduplication duration satisfy the preset constraint, all encrypted data in the Bloom filter is cleared, the Flink timer is restarted for the next round of timing, and the Bloom filter continues with a new round of deduplication for the deduplication duration.
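The reset cycle can be sketched in plain Python without Flink. In this sketch the class name `WindowedDeduplicator` is hypothetical, a Python set stands in for the Bloom filter, timestamps are passed in explicitly instead of coming from a Flink timer, and the constraint is assumed to be "elapsed time since the window start is at least the deduplication duration".

```python
class WindowedDeduplicator:
    """Deduplicate within a time window and reset when the window elapses."""

    def __init__(self, window_seconds: float, start_time: float):
        self.window_seconds = window_seconds
        self.window_start = start_time
        self.seen = set()  # stand-in for the Bloom filter in this sketch

    def accept(self, md5_value: str, now: float) -> bool:
        """Return True if the value is new in the current window."""
        if now - self.window_start >= self.window_seconds:
            # The deduplication duration has elapsed: clear the filter
            # and start the next round of timing from the current time.
            self.seen.clear()
            self.window_start = now
        if md5_value in self.seen:
            return False  # duplicate within the current window
        self.seen.add(md5_value)
        return True


dedup = WindowedDeduplicator(window_seconds=300, start_time=0.0)
print(dedup.accept("a", now=1.0), dedup.accept("a", now=100.0), dedup.accept("a", now=300.0))
```

Resetting on a fixed cycle bounds the filter's memory, at the cost that a value repeated across two windows is accepted again; the MD5-keyed database write described earlier is what catches those cases.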
由此,通过若检测到计时器的计数值与去重持续时间满足预设约束条件,则将布隆过滤器进行重置;控制计时器根据去重持续时间再次计时,并控制布隆过滤器执行去重操作。可以实现避免布隆过滤器中的数据持续增长而影响内存占用率,可以实现对海量数据的关键字段进行高效地实时去重,可以节约存储空间,提高数据处理效率。Thus, if it is detected that the count value of the timer and the deduplication duration meet the preset constraint conditions, the Bloom filter is reset; the timer is controlled to time again according to the deduplication duration, and the Bloom filter is controlled Perform deduplication. It can avoid the continuous growth of data in the Bloom filter and affect the memory usage rate, can realize efficient real-time deduplication of key fields of massive data, can save storage space, and improve data processing efficiency.
In this embodiment of the present application, target data is received, and at least one key field to be deduplicated in the target data is determined based on a configuration file; the key fields to be deduplicated are concatenated according to the logical relationship contained in the target data to obtain a key field string to be deduplicated; the key field string is encrypted based on an encryption algorithm to determine encrypted data; and a timer is set based on Flink, so that, starting from the start time in the timer and within the deduplication duration in the timer, the encrypted data is deduplicated using a Bloom filter. By carrying out this embodiment, key fields in massive data can be deduplicated efficiently in real time, storage space can be saved, and data processing efficiency can be improved.
Fig. 3 is a schematic structural diagram of an apparatus for real-time deduplication of key fields based on the distributed stream computing engine Flink, provided by an embodiment of the present application. The apparatus may be implemented in software and/or hardware, and may be configured in an electronic device used for real-time deduplication of key fields based on the distributed stream computing engine Flink. As shown in Fig. 3, the apparatus includes:
a key field determination module 310, configured to receive target data and determine at least one key field to be deduplicated in the target data based on a configuration file, where the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
an encrypted data determination module 320, configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
a deduplication module 330, configured to set a timer based on Flink and, starting from the start time in the timer and within the deduplication duration in the timer, deduplicate the encrypted data using a Bloom filter.
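As a rough illustration of what the key field determination module 310 does, the Python sketch below looks up the key fields for a record in a configuration. The JSON layout, the `dedup_key_fields` key, the record type `login_event`, and both function names are hypothetical; the application does not specify the format of the configuration file.

```python
import json

# Hypothetical configuration: maps a record type to its key fields.
CONFIG_TEXT = '{"dedup_key_fields": {"login_event": ["user_id", "device_id"]}}'


def key_fields_for(record_type, config):
    """Return the configured key fields for a record type (empty if none)."""
    return config["dedup_key_fields"].get(record_type, [])


def extract_key_values(record, fields):
    """Pull the key field values out of a structured record, in config order.

    Raises KeyError if the record lacks a configured field.
    """
    return [str(record[name]) for name in fields]
```

For example, with `config = json.loads(CONFIG_TEXT)`, a record `{"user_id": 7, "device_id": "d1", "ts": 1}` of type `login_event` yields the values `["7", "d1"]`; the `ts` field is ignored because it is not configured as a key field.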
In an embodiment, the encrypted data determination module 320 includes: a key field string determination unit, configured to concatenate the at least one key field to be deduplicated according to the logical relationship contained in the target data to obtain a key field string to be deduplicated; and an encrypted data determination unit, configured to encrypt the key field string to be deduplicated based on the encryption algorithm to determine the encrypted data.
In an embodiment, the encryption algorithm includes the MD5 encryption algorithm.
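The concatenate-then-MD5 step can be sketched in Python with the standard `hashlib` module. MD5 is a one-way message digest, so equal key field strings always produce equal 32-character values, which is what the later Bloom filter lookup relies on. The function name `dedup_digest` and the unit-separator character used to join the fields are assumptions; the application only says that the fields are concatenated according to the logical relationship in the target data.

```python
import hashlib


def dedup_digest(values, separator="\x1f"):
    """Join key field values and return the MD5 hex digest of the result.

    The separator (an assumption) keeps ("ab", "c") distinct from ("a", "bc").
    """
    joined = separator.join(values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

Joining with a separator that cannot occur in the field values avoids accidental collisions between different field combinations whose plain concatenations happen to be identical.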
In an embodiment, the apparatus further includes a reset module, configured to: after the encrypted data is deduplicated using the Bloom filter, if it is detected that the count value of the timer and the deduplication duration satisfy the preset constraint condition, reset the Bloom filter; and control the timer to start timing again according to the deduplication duration, and control the Bloom filter to perform the deduplication operation.
In an embodiment, the deduplication module 330 is configured to: if it is determined based on the Bloom filter that the encrypted data already exists, discard the encrypted data.
In an embodiment, the deduplication module 330 is configured to: if it is determined based on the Bloom filter that the encrypted data does not exist, add the encrypted data to the Bloom filter, and add the target data to a target database according to the encrypted data.
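The two branches of the deduplication module (discard when the filter reports the digest as present; otherwise add it and write the record) can be sketched together in Python. This is an illustrative sketch only: the `BloomFilter` class, its double-hashing scheme built on SHA-256, the `process` function, and the list used as a stand-in for the target database are all assumptions, not details taken from the application.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over strings, using double hashing for k positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive two 64-bit values from one SHA-256 digest, then combine them.
        h = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1   # force odd so steps vary
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)


def process(digest, record, bloom, sink):
    """Discard the record if its digest may already exist; otherwise store it."""
    if bloom.might_contain(digest):
        return False          # treated as a duplicate and dropped
    bloom.add(digest)         # remember the digest for later lookups
    sink.append(record)       # stand-in for writing to the target database
    return True
```

A Bloom filter never reports a stored item as absent, but it can report an unseen item as present. With this design an occasional new record may therefore be dropped as a false duplicate, at a rate governed by the configured false-positive tolerance.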
In an embodiment, the process of configuring the Bloom filter includes: setting the tolerable false-positive rate of the Bloom filter; and determining the maximum data processing capacity of the Bloom filter according to the deduplication duration.
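One common way to turn these two settings into concrete filter parameters uses the standard Bloom filter sizing formulas $m = \lceil -n \ln p / (\ln 2)^2 \rceil$ bits and $k \approx (m/n)\ln 2$ hash functions, where $n$ is the maximum number of items and $p$ the tolerated false-positive rate. The Python sketch below applies them; estimating $n$ as an expected record rate multiplied by the deduplication duration is an assumption, since the application only states that the capacity is determined from the duration. Both function names are hypothetical.

```python
import math


def expected_items_in_window(records_per_second, dedup_duration_s):
    """Estimate the maximum item count for one deduplication window (assumption)."""
    return math.ceil(records_per_second * dedup_duration_s)


def bloom_parameters(expected_items, false_positive_rate):
    """Return (num_bits, num_hashes) from the standard sizing formulas."""
    if expected_items <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("need expected_items > 0 and 0 < false_positive_rate < 1")
    num_bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes
```

A longer deduplication duration raises the item count and therefore the memory needed for the same false-positive rate, which is one reason to reset the filter at the end of each window.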
The apparatus provided in the above embodiments can execute the method for real-time deduplication of key fields based on the distributed stream computing engine Flink provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing the method.
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in Fig. 4, the device includes:
one or more processors 410, with one processor 410 taken as an example in Fig. 4; and
a memory 420.
The device may further include an input apparatus 430 and an output apparatus 440.
The processor 410, the memory 420, the input apparatus 430, and the output apparatus 440 in the device may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 4.
The memory 420 (which may also be referred to as a storage apparatus), as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for real-time deduplication of key fields based on the distributed stream computing engine Flink in the embodiments of the present application. By running the software programs, instructions, and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the computer device, that is, implements the method for real-time deduplication of key fields based on the distributed stream computing engine Flink of the above method embodiments, namely:
receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, where the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
setting a timer based on Flink and, starting from the start time in the timer and within the deduplication duration in the timer, deduplicating the encrypted data using a Bloom filter.
The memory 420 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through use of the computer device, and the like. In addition, the memory 420 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 420 optionally includes memory located remotely from the processor 410, and such remote memory may be connected to the terminal device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input apparatus 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output apparatus 440 may include a display device such as a display screen.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the method for real-time deduplication of key fields based on the distributed stream computing engine Flink provided by the embodiments of the present application, namely:
receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, where the target data is structured data and the configuration file includes the key fields to be deduplicated that match the target data;
encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
setting a timer based on Flink and, starting from the start time in the timer and within the deduplication duration in the timer, deduplicating the encrypted data using a Bloom filter.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. Examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device.
The storage medium may be a non-transitory storage medium.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
Computer program code for carrying out the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The foregoing describes some embodiments of the present application. Those skilled in the art will understand that the present application is not limited to the specific embodiments described herein, and that various changes, readjustments, and substitutions can be made by those skilled in the art without departing from the scope of protection of the present application. Therefore, although the present application has been described through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the inventive concept; the scope of the present application is determined by the scope of the appended claims.

Claims (10)

  1. A method for real-time deduplication of key fields based on the distributed stream computing engine Flink, comprising:
    receiving target data, and determining at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data, and the configuration file includes key fields to be deduplicated that match the target data;
    encrypting the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
    setting a timer based on Flink and, starting from a start time in the timer and within a deduplication duration in the timer, deduplicating the encrypted data using a Bloom filter.
  2. The method according to claim 1, wherein encrypting the at least one key field to be deduplicated based on the encryption algorithm to determine the encrypted data comprises:
    concatenating the at least one key field to be deduplicated according to a logical relationship contained in the target data to obtain a key field string to be deduplicated; and
    encrypting the key field string to be deduplicated based on the encryption algorithm to determine the encrypted data.
  3. The method according to claim 1, wherein the encryption algorithm comprises the message-digest MD5 encryption algorithm.
  4. The method according to claim 1, further comprising:
    resetting the Bloom filter in response to detecting that a count value of the timer and the deduplication duration satisfy a preset constraint condition; and
    controlling the timer to start timing again according to the deduplication duration, and controlling the Bloom filter to perform the deduplication operation.
  5. The method according to claim 1, wherein deduplicating the encrypted data using the Bloom filter comprises:
    discarding the encrypted data in response to determining, based on the Bloom filter, that the encrypted data already exists.
  6. The method according to claim 1, wherein deduplicating the encrypted data using the Bloom filter comprises:
    in response to determining, based on the Bloom filter, that the encrypted data does not exist, adding the encrypted data to the Bloom filter, and adding the target data to a target database according to the encrypted data.
  7. The method according to claim 1, wherein a process of configuring the Bloom filter comprises:
    setting a tolerable false-positive rate of the Bloom filter; and
    determining a maximum data processing capacity of the Bloom filter according to the deduplication duration.
  8. An apparatus for real-time deduplication of key fields based on the distributed stream computing engine Flink, comprising:
    a key field determination module, configured to receive target data and determine at least one key field to be deduplicated in the target data based on a configuration file, wherein the target data is structured data, and the configuration file includes key fields to be deduplicated that match the target data;
    an encrypted data determination module, configured to encrypt the at least one key field to be deduplicated based on an encryption algorithm to determine encrypted data; and
    a deduplication module, configured to set a timer based on Flink and, starting from a start time in the timer and within a deduplication duration in the timer, deduplicate the encrypted data using a Bloom filter.
  9. An electronic device, comprising:
    one or more processors; and
    a storage apparatus configured to store one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for real-time deduplication of key fields based on the distributed stream computing engine Flink according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for real-time deduplication of key fields based on the distributed stream computing engine Flink according to any one of claims 1-7.
PCT/CN2022/107574 2021-11-16 2022-07-25 Method for deduplicating key field in real time on basis of distributed stream calculation engine flink WO2023087769A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111352389.2A CN114048201A (en) 2021-11-16 2021-11-16 Distributed stream computing engine Flink-based key field real-time deduplication method
CN202111352389.2 2021-11-16

Publications (1)

Publication Number Publication Date
WO2023087769A1 true WO2023087769A1 (en) 2023-05-25

Family

ID=80209065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107574 WO2023087769A1 (en) 2021-11-16 2022-07-25 Method for deduplicating key field in real time on basis of distributed stream calculation engine flink

Country Status (2)

Country Link
CN (1) CN114048201A (en)
WO (1) WO2023087769A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892727A (en) * 2024-03-14 2024-04-16 中国电子科技集团公司第三十研究所 Real-time text data stream deduplication system and method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048201A (en) * 2021-11-16 2022-02-15 北京锐安科技有限公司 Distributed stream computing engine Flink-based key field real-time deduplication method
CN115086195B (en) * 2022-06-09 2024-02-02 北京锐安科技有限公司 Method, device, equipment and medium for determining message de-duplication time of shunt equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121810A (en) * 2017-12-26 2018-06-05 北京锐安科技有限公司 A kind of data duplicate removal method, system, central server and distributed server
CN109828721A (en) * 2019-01-23 2019-05-31 平安科技(深圳)有限公司 Data-erasure method, device, computer equipment and storage medium
CN111258966A (en) * 2020-01-14 2020-06-09 软通动力信息技术有限公司 Data deduplication method, device, equipment and storage medium
US20200226112A1 (en) * 2019-01-16 2020-07-16 Sqream Technologies Ltd. System and method of Bloom Filter for Big Data
CN112491650A (en) * 2020-11-17 2021-03-12 中国平安财产保险股份有限公司 Method for dynamically analyzing call loop condition between services and related equipment
CN113377812A (en) * 2021-01-08 2021-09-10 北京数衍科技有限公司 Order duplication eliminating method and device for big data
CN113392082A (en) * 2021-04-06 2021-09-14 北京沃东天骏信息技术有限公司 Log duplicate removal method and device, electronic equipment and storage medium
CN114048201A (en) * 2021-11-16 2022-02-15 北京锐安科技有限公司 Distributed stream computing engine Flink-based key field real-time deduplication method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892727A (en) * 2024-03-14 2024-04-16 中国电子科技集团公司第三十研究所 Real-time text data stream deduplication system and method
CN117892727B (en) * 2024-03-14 2024-05-17 中国电子科技集团公司第三十研究所 Real-time text data stream deduplication system and method

Also Published As

Publication number Publication date
CN114048201A (en) 2022-02-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22894312

Country of ref document: EP

Kind code of ref document: A1