CN115794783A - Data deduplication method, device, equipment and medium

Info

Publication number: CN115794783A
Application number: CN202211139209.7A
Authority: CN
Inventors: 潘永克; 简瑞峰
Original assignee: Traffic Control Technology TCT Co Ltd
Current assignee: Traffic Control Technology TCT Co Ltd
Priority date: 2022-09-19
Filing date: 2022-09-19
Publication date: 2023-03-14

Abstract

The invention provides a data deduplication method, device, equipment and medium, relating to the technical field of rail transit. The method comprises: in the case that a real-time service requirement is received, consuming service data in a distributed message system based on the real-time service requirement, and storing the service result data obtained after consumption in an analytical database; in the case that a preset deduplication condition is met, acquiring a preset configuration file matched with the real-time service requirement, configuring a target storage engine of the analytical database based on the preset configuration file, and deduplicating the service result data stored in the analytical database based on the configured target storage engine. In this way, the configured preset deduplication condition, the preset configuration file and the target storage engine together realize automatic deduplication after the data lands in the analytical database, which improves data deduplication efficiency.

Description

Data deduplication method, device, equipment and medium

Technical Field

The invention relates to the technical field of rail transit, in particular to a data deduplication method, a data deduplication device, data deduplication equipment and a data deduplication medium.

Background

With the development of the domestic economy, urban rail transit in China is becoming increasingly informatized, digitalized and intelligent. During urban rail transit operation, various urban rail service data must be acquired and processed in real time or near real time, and the result data must be stored. Processing and storing massive real-time data for urban rail services can involve replacing and updating identical or partially identical data, which gives rise to a service requirement for data deduplication.

In the traditional approach, data deduplication generally relies on developing deduplication program code and writing deduplication query scripts, with a timer starting the deduplication task at regular intervals.

Disclosure of Invention

The invention provides a data deduplication method, device, equipment and medium, which are used to overcome the defect in the prior art that deduplication is implemented by developing deduplication program code, writing deduplication query scripts and starting deduplication tasks periodically by means of a timer, which results in low processing efficiency.

The invention provides a data deduplication method, which comprises the following steps:

under the condition that a real-time service demand is received, service data stored in a distributed message system are consumed based on the real-time service demand, and service result data obtained after consumption are stored in an analytical database;

under the condition that a preset duplicate removal condition is met, acquiring a preset configuration file matched with the real-time service requirement, and configuring a target storage engine of the analytical database based on the preset configuration file;

and deduplicating the service result data stored in the analytical database based on the configured target storage engine.

According to the data deduplication method provided by the present invention, the configuring the target storage engine of the analytic database based on the preset configuration file includes:

acquiring a preset configuration file matched with the real-time service requirement;

determining a pre-configured duplicate removal range and a duplicate removal field in the preset configuration file;

and configuring a partition key of a target storage engine of the analytical database based on the deduplication range, and configuring a sorting key of the target storage engine based on the deduplication field.

According to the data deduplication method provided by the invention, determining that the preset deduplication condition is met comprises:

in the case that abnormal consumption of the service data in the distributed message system is detected, determining that the preset deduplication condition is met; or,

in the case that a database restart backup of the analytical database is detected, determining that the preset deduplication condition is met.

According to the data deduplication method provided by the present invention, before storing the service result data obtained after consumption into the analytic database, the method further includes:

selecting a target storage engine as a database engine of the analytical database at the initialization stage of the analytical database;

and receiving a deduplication rule configuration for the target storage engine from the user.

According to the data deduplication method provided by the invention, the consumption of the service data in the distributed message system based on the real-time service requirement comprises the following steps:

generating a query statement corresponding to the real-time service requirement;

and controlling a real-time computing engine to consume the service data in the distributed message system based on the query statement.

According to the data deduplication method provided by the present invention, before consuming the service data stored in the distributed message system based on the real-time service requirement, the method further includes:

and writing the service data collected from the real-time data system into the distributed message system in a partition mode.

The present invention also provides a data deduplication device, comprising:

the consumption unit is used for consuming the service data in the distributed message system based on the real-time service demand under the condition of receiving the real-time service demand and storing the service result data obtained after consumption into an analytical database;

the configuration unit is used for acquiring a preset configuration file matched with the real-time service requirement under the condition that a preset deduplication condition is achieved, and configuring a target storage engine of the analytical database based on the preset configuration file;

and the duplication removing unit is used for removing duplication of the business result data stored in the analytical database based on the configured target storage engine.

According to the data deduplication device provided by the present invention, the configuration unit is further configured to:

acquiring a preset configuration file matched with the real-time service requirement; determining a pre-configured duplicate removal range and a duplicate removal field in the preset configuration file;

configuring a partition key of a target storage engine of the analytical database based on the deduplication range, and configuring a sort key of the target storage engine based on the deduplication field.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the data deduplication method as described in any one of the above when executing the program.

The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data deduplication method as described in any of the above.

The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a data deduplication method as described in any of the above.

According to the data deduplication method, device, equipment and medium provided by the invention, in the case that a real-time service requirement is received, the service data in the distributed message system are consumed based on the real-time service requirement, and the service result data obtained after consumption are stored in the analytical database; in the case that a preset deduplication condition is met, a preset configuration file matched with the real-time service requirement is acquired, a target storage engine of the analytical database is configured based on the preset configuration file, and the service result data stored in the analytical database are deduplicated based on the configured target storage engine. In this way, the configured preset deduplication condition, the preset configuration file and the target storage engine together realize automatic deduplication after the data lands in the analytical database, which improves data deduplication efficiency.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a data deduplication method provided by the present invention;

FIG. 2 is a schematic structural diagram of a data deduplication apparatus provided in the present invention;

FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A data deduplication method of the present invention is described below with reference to FIG. 1.

FIG. 1 is a schematic flow chart of the data deduplication method provided by the present invention. As shown in FIG. 1, the method includes:

Step 100, in the case that a real-time service requirement is received, consuming service data stored in a distributed message system based on the real-time service requirement, and storing the service result data obtained after consumption in an analytical database;

The distributed message system in this embodiment refers to a Kafka cluster, which is a high-throughput distributed message queue system characterized by horizontal scalability and high throughput.

In the Kafka cluster, there is no concept of a "central master node"; all nodes in the cluster are peers. Kafka classifies messages, that is, the service data of different service types in the real-time services are divided by category. Each category of message is called a topic, and consumers can process different topics differently.

In this step, the topic to be consumed in the distributed message system is determined based on the real-time service requirement. For example, when the real-time service requirement is to analyze and mine the early and late arrival information of trains, the corresponding line driving log topic data in the distributed message system are consumed by performing extraction, conversion and dimension association operations, and the service result data obtained after consumption are stored in the analytical database.

It should be noted that the analytical database refers to a database that can perform work for discovering the value of information data, such as online statistics, online data analysis and ad hoc queries.

Preferably, the ClickHouse database is selected as the analytical database in this embodiment. Compared with a conventional database, the ClickHouse database has the following advantages:

first, lower hardware resource cost, a smaller resource footprint and higher write performance, with each server able to process hundreds of millions of rows and tens of gigabytes of data per second;

second, the ClickHouse database is a distributed real-time analytical database that supports linear scaling and has high reliability;

and third, it supports real-time data updates and near-real-time computation, and provides a rich set of SQL functions and flexible DDL configuration.

Therefore, by combining these characteristics, the ClickHouse database can process and store massive real-time service data during peak hours of urban rail passenger flow prediction.

In this embodiment, when data are stored in the ClickHouse database, the data under each topic are stored in the form of a table.

Step 200, in the case that a preset deduplication condition is met, acquiring a preset configuration file matched with the real-time service requirement, and configuring a target storage engine of the analytical database based on the preset configuration file;

The preset deduplication condition refers to a deduplication condition configured in advance by a user; data deduplication in the analytical database is executed automatically when the state of the data stored in the analytical database meets that condition.

In this embodiment, when abnormal consumption of the service data in the distributed message system is detected, it is determined that the preset deduplication condition is met.

Specifically, abnormal consumption of the service data includes, but is not limited to, a task restart occurring during consumption, and the quantity of service data consumed within a time period reaching a consumption threshold.

In this embodiment, it may also be determined that the preset deduplication condition is met when a database restart backup of the analytical database is detected.

In addition, in this embodiment, data deduplication in the analytical database may also be executed automatically when a preset deduplication time is reached, which is not limited herein.

Therefore, the user can configure the preset deduplication condition before the operation of the analytical database, and the automatic deduplication operation of the analytical database in the data storage process can be realized.
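
As an illustration only, the trigger conditions described above can be combined into a single predicate. The sketch below is in Python; the `MonitorSnapshot` type, its field names and the convention that a threshold of 0 disables the threshold check are assumptions made for this example and do not appear in the embodiment.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MonitorSnapshot:
    """Hypothetical monitoring state; all field names are illustrative."""
    consumer_task_restarted: bool = False      # a task restart occurred during consumption
    records_consumed_in_window: int = 0        # quantity consumed within the time period
    consumption_threshold: int = 0             # assumed convention: 0 disables this check
    database_restart_backup: bool = False      # restart backup detected on the database
    now: Optional[datetime] = None
    scheduled_dedup_time: Optional[datetime] = None  # optional preset deduplication time


def dedup_condition_reached(s: MonitorSnapshot) -> bool:
    """Return True when any of the described trigger conditions holds."""
    abnormal_consumption = s.consumer_task_restarted or (
        s.consumption_threshold > 0
        and s.records_consumed_in_window >= s.consumption_threshold
    )
    scheduled = (
        s.now is not None
        and s.scheduled_dedup_time is not None
        and s.now >= s.scheduled_dedup_time
    )
    return bool(abnormal_consumption or s.database_restart_backup or scheduled)
```

A monitoring loop could evaluate this predicate on each snapshot and start the configuration of Step 200 only when it returns `True`.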

The preset configuration file refers to a file that is matched with the topic corresponding to the real-time service requirement and that contains engine configuration parameters such as the deduplication range and the deduplication field.

In this embodiment, the analytical database stores service data of a plurality of topics, and in practice the service result data obtained after consuming the service data under each topic also differ. The user can therefore set, before the data are stored in the analytical database, the engine configuration parameters required for deduplicating the data under each topic, so that duplicate data of every topic can be removed.
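
One possible shape for such a per-topic configuration file is sketched below in Python. The JSON layout, the topic names, the column names and the keys `dedup_range` and `dedup_fields` are all assumptions chosen for illustration; the embodiment does not specify a file format.

```python
import json

# Hypothetical preset configuration: one entry per topic, each holding the
# deduplication range (partition expression) and the deduplication fields.
CONFIG_TEXT = """
{
  "line_driving_log": {
    "dedup_range": "toYYYYMMDD(event_date)",
    "dedup_fields": ["train_id", "station_id", "event_time"]
  },
  "passenger_flow_prediction": {
    "dedup_range": "toYYYYMMDD(stat_date)",
    "dedup_fields": ["station_id", "time_slot"]
  }
}
"""

REQUIRED_KEYS = {"dedup_range", "dedup_fields"}


def load_preset_config(config_text: str, topic: str) -> dict:
    """Return the engine configuration parameters matched with a topic."""
    configs = json.loads(config_text)
    if topic not in configs:
        raise KeyError(f"no preset configuration for topic {topic!r}")
    cfg = configs[topic]
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"topic {topic!r} is missing keys: {sorted(missing)}")
    return cfg
```

Keeping the parameters keyed by topic mirrors the text: each topic is stored as its own table, so each can carry its own deduplication range and fields.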

Further, in this embodiment, in order to filter data automatically, the ReplacingMergeTree engine is preferably selected as the target storage engine, so that duplicates with the same sorting key value are deleted asynchronously by the background function of the database, completing the automatic filtering of data.

Step 300, based on the configured target storage engine, deduplicating the service result data stored in the analytical database.

In this step, during deduplication, the configured target storage engine obtains from the analytical database all data within the specified deduplication range, and then deduplicates all data within that range based on the specified deduplication field.

For example, when the specified deduplication fields are field 1, field 2 and field 3, the values of these three fields are sorted and compared across all the data. When the values of all three fields are the same, the data are determined to be duplicates; the one record with the latest data storage timestamp is retained, and the other duplicate records are deleted.

When the specified deduplication fields are all fields, for example all fields in the passenger flow prediction table structure: if the 11 field values are the same, that is, multiple identical predicted values exist for the same time period on the same day, the latest predicted value data are retained, so that the data stored in the analytical database are not duplicated and are of the latest version.
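
The keep-the-latest rule described in these examples can be modeled in a few lines of Python. This is only an in-memory sketch of the rule, not the storage engine's implementation; the function name `deduplicate` and the timestamp column name `stored_at` are assumptions.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def deduplicate(rows: Iterable[Row], dedup_fields: List[str],
                ts_field: str = "stored_at") -> List[Row]:
    """Keep, for each combination of deduplication field values,
    only the row with the latest storage timestamp."""
    latest: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        key = tuple(row[f] for f in dedup_fields)
        kept = latest.get(key)
        # On equal timestamps the later-seen row wins (an assumed tie-break).
        if kept is None or row[ts_field] >= kept[ts_field]:
            latest[key] = row
    return list(latest.values())
```

Passing every column name except the timestamp as `dedup_fields` corresponds to the "all fields" case in the passenger flow prediction example.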

In the data deduplication method provided by this embodiment, while the service result data obtained after consumption are being stored in the analytical database, the stored data and the consumption process of the service data are monitored in real time. When the preset deduplication condition is met, the deduplication rule of the target storage engine of the analytical database is configured through the preset configuration file. In this way, the configured preset deduplication condition, the preset configuration file and the target storage engine together realize automatic deduplication after the data lands in the analytical database, which improves data deduplication efficiency.

Based on the above embodiment, the configuring the target storage engine of the analytical database based on the preset configuration file includes:

In this step, the deduplication range refers to the range into which the service result data stored in the analytical database, under the topic matched with the real-time service requirement, are partitioned; the deduplication field refers to a field, in the table structure to be deduplicated, on which the partitioned data of that table structure are deduplicated.

For example, when the field specified for the deduplication range is the date field, the deduplication rule is to deduplicate within the same date range, using the deduplication field to complete the deduplication.

Therefore, in this embodiment, the partition key and the sorting key of the target storage engine are configured from the deduplication range and the deduplication field in the preset configuration file set by the user, so that the configured deduplication rule is applied automatically and the data are deduplicated automatically after landing in the analytical database, which improves data deduplication efficiency.
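
To make the mapping concrete, the sketch below renders a ClickHouse engine clause from a deduplication range and deduplication fields. It is written in Python as a string builder; the `DedupConfig` type, the example column names and the use of a storage timestamp as the engine's optional version column are assumptions, not details taken from the embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DedupConfig:
    """Hypothetical in-memory form of one topic's preset configuration."""
    dedup_range: str         # expression used as the partition key
    dedup_fields: List[str]  # columns used as the sorting key
    version_column: str      # assumed: storage timestamp, so the newest row survives


def build_engine_clause(cfg: DedupConfig) -> str:
    """Render ENGINE / PARTITION BY / ORDER BY for a ReplacingMergeTree table."""
    if not cfg.dedup_fields:
        raise ValueError("at least one deduplication field is required")
    order_by = ", ".join(cfg.dedup_fields)
    return (
        f"ENGINE = ReplacingMergeTree({cfg.version_column})\n"
        f"PARTITION BY {cfg.dedup_range}\n"
        f"ORDER BY ({order_by})"
    )
```

The resulting clause would be appended to a `CREATE TABLE` statement for the topic's table. Because ReplacingMergeTree merges rows only within a partition, the partition key bounds the range in which duplicates are removed, which matches the role the text gives to the deduplication range.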

Based on the above embodiment, before storing the service result data obtained after consumption into the analytical database, the method further includes:

In this step, the database engine is set to the ReplacingMergeTree engine in the initialization phase. After the engine has been set, when an engine configuration request triggered by the user is detected, a deduplication rule configuration interface corresponding to the ReplacingMergeTree engine is displayed, and the deduplication parameters set by the user through that interface are obtained, namely the preset deduplication condition and the deduplication range and deduplication field under each topic, so that the partition deduplication rule of the ReplacingMergeTree engine is configured for each topic according to the deduplication parameters.

Therefore, in this embodiment, the database engine is configured as the ReplacingMergeTree engine in the initialization stage, which enables the subsequent automatic deduplication after the data lands in the analytical database.

Based on the above embodiment, the consuming the service data stored in the distributed message system based on the real-time service requirement includes:

Preferably, the real-time computing engine in this embodiment refers to the Flink computing engine, a distributed big data processing engine that can perform stateful computation over bounded and unbounded data streams, and that can be deployed in various cluster environments to compute quickly on data of any scale.

In this step, an extraction object corresponding to the real-time service requirement is determined, a query condition for the extraction object is determined according to the attribute dimension corresponding to the extraction object, and a query statement corresponding to the query condition is generated. The query statement is then submitted to the Flink computing engine, which converts it into a Flink statement for execution.

When the Flink computing engine executes the converted Flink statement, the target service data are first extracted from the distributed message system, and conversion and dimension association are then performed in sequence to obtain the service result data.
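
The generation of a query statement from an extraction object and its query conditions might look like the following Python sketch. The function name, the table and column names and the restriction to equality conditions are assumptions; the embodiment does not describe the statement syntax, and a production system would use parameter binding rather than string assembly.

```python
from typing import Dict, List


def build_query(source_table: str, columns: List[str],
                conditions: Dict[str, str]) -> str:
    """Assemble a simple SELECT statement for the extraction object.

    Only equality conditions on string values are supported in this sketch;
    single quotes inside values are doubled, as in standard SQL.
    """
    if not columns:
        raise ValueError("at least one column must be selected")
    sql = f"SELECT {', '.join(columns)} FROM {source_table}"
    if conditions:
        predicates = []
        for column, value in conditions.items():
            escaped = value.replace("'", "''")
            predicates.append(f"{column} = '{escaped}'")
        sql += " WHERE " + " AND ".join(predicates)
    return sql
```

For instance, `build_query("line_driving_log", ["train_id", "delay_seconds"], {"line_id": "L1"})` yields a statement selecting two columns from a hypothetical driving log source filtered by line.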

For ease of understanding, take driving log data in a rail transit system as the service data. When the real-time service requirement is to broadcast the early or late arrival information of a train on the corresponding station platform in real time, the early/late and arrival identification data are first extracted from the driving log data and converted into the corresponding data format; the converted data are then associated with a station dimension table and a platform dimension table to obtain unified detail data; finally, the unified detail data are integrated and stored in the corresponding table structure.
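
The extraction, conversion and dimension association in this example can be sketched in plain Python as follows. The record layout (`arrived`, `delay_seconds`, `station_id`, `platform_id`) and the in-memory dimension tables are assumptions for illustration; in the embodiment this work is performed by the real-time computing engine.

```python
from typing import Any, Dict, List


def consume_driving_logs(log_records: List[Dict[str, Any]],
                         station_dim: Dict[str, str],
                         platform_dim: Dict[str, str]) -> List[Dict[str, Any]]:
    """Turn raw driving log records into unified detail rows."""
    details = []
    for rec in log_records:
        # Extraction: keep only records that carry the arrival identification.
        if not rec.get("arrived"):
            continue
        # Conversion: normalize the delay to an integer and derive a status.
        delay = int(rec["delay_seconds"])
        status = "late" if delay > 0 else "early" if delay < 0 else "on_time"
        # Dimension association: resolve IDs through the dimension tables.
        details.append({
            "train_id": rec["train_id"],
            "station_name": station_dim[rec["station_id"]],
            "platform_name": platform_dim[rec["platform_id"]],
            "delay_seconds": delay,
            "status": status,
        })
    return details
```

The returned rows correspond to the unified detail data that would then be written to the topic's table in the analytical database.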

Further, after the deduplication is performed on the business result data stored in the analytic database based on the configured target storage engine, the method further includes:

generating a deduplication log that records the deduplication range, the deduplication field, the deduplication operation timestamp and the deduplication result.

That is, after each deduplication, a deduplication log of corresponding deduplication process data is recorded, so that data tracing can be performed based on the deduplication log subsequently.
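
A deduplication log entry carrying the four recorded items could be represented as in the Python sketch below. The JSON serialization, the field names and the choice to express the deduplication result as row counts before and after are assumptions; the embodiment does not define the log format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class DedupLogEntry:
    dedup_range: str          # deduplication range (partition expression)
    dedup_fields: List[str]   # deduplication fields (sorting key columns)
    operated_at: str          # deduplication operation timestamp, ISO 8601
    rows_before: int          # assumed form of the "deduplication result"
    rows_after: int
    rows_removed: int


def make_log_entry(dedup_range: str, dedup_fields: List[str],
                   rows_before: int, rows_after: int,
                   now: Optional[datetime] = None) -> str:
    """Serialize one deduplication run as a JSON line for later tracing."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    entry = DedupLogEntry(dedup_range, list(dedup_fields), timestamp,
                          rows_before, rows_after, rows_before - rows_after)
    return json.dumps(asdict(entry), ensure_ascii=False)
```

Appending one such line per deduplication run gives a simple trail from which a later reader can trace which range and fields were used and how many rows were removed.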

Based on the above embodiment, before consuming the service data stored in the distributed message system based on the real-time service requirement, the method further includes:

writing the service data collected from the real-time data system into the distributed message system in a partitioned manner;

It should be noted that, in practical applications, various urban rail service data must be acquired and processed in real time or near real time during urban rail transit operation, and the result data must be stored. Processing and storing massive real-time data for urban rail services can involve replacing and updating identical or partially identical data, which gives rise to a service requirement for data deduplication.

In this embodiment, the real-time data system records the real-time service data generated during urban rail transit operation, such as line driving log data. It should be noted that the service data may come from different fields: they may be user behavior data of a shopping website, or line driving log data in a rail transit system, and so on, which is not specifically limited herein.

A brief introduction to the Kafka cluster: it is a distributed messaging system that supports partitions and multiple replicas, with features such as high throughput, low latency, scalability, persistence and high concurrency.

Therefore, in the embodiment, the service data is accessed into the Kafka cluster for partitioned storage, so that the purpose that a large amount of service data can be subsequently processed in real time to meet various demand scenarios is achieved.
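
The idea of partitioned writing can be illustrated with a small key-based partitioner in Python. This is a stand-in that groups records in memory; it is not Kafka's own partitioner, and the CRC32 hash, the function names and the record key are assumptions made for the example.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index with a stable hash."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def write_partitioned(records: List[Dict[str, Any]], key_field: str,
                      num_partitions: int) -> Dict[int, List[Dict[str, Any]]]:
    """Group records by partition, preserving arrival order within each one."""
    partitions: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for rec in records:
        index = choose_partition(str(rec[key_field]), num_partitions)
        partitions[index].append(rec)
    return dict(partitions)
```

Because the hash is stable, records sharing a key (for example the same line) always land in the same partition and keep their relative order there, which is the property that keyed partitioning is meant to provide.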

The following describes the data deduplication device provided by the present invention, and the data deduplication device described below and the data deduplication method described above may be referred to correspondingly.

Referring to fig. 2, fig. 2 is a schematic structural diagram of a data deduplication apparatus provided in the present invention, and as shown in fig. 2, the data deduplication apparatus includes: the consumption unit 210 is configured to, in a case that a real-time service demand is received, consume service data in the distributed message system based on the real-time service demand, and store service result data obtained after consumption into the analytic database; a configuration unit 220, configured to obtain a preset configuration file matched with the real-time service requirement when a preset deduplication condition is met, and configure a target storage engine of the analytics database based on the preset configuration file; and a deduplication unit 230, configured to perform deduplication on the service result data stored in the analytic database based on the configured target storage engine.

Further, the configuration unit 220 is further configured to obtain a preset configuration file matched with the real-time service requirement; determining a duplication removal range and duplication removal fields which are configured in advance in the preset configuration file; configuring a partition key of a target storage engine of the analytical database based on the deduplication range, and configuring a sort key of the target storage engine based on the deduplication field.

Further, the configuration unit 220 is further configured to determine that a preset deduplication condition is reached under the condition that abnormal consumption of the service data in the distributed message system is monitored; or, under the condition that the database restart backup of the analytical database is monitored, judging that a preset deduplication condition is reached.

Further, the consuming unit 210 is further configured to select a target storage engine as a database engine of the analytic database in an initialization stage of the analytic database; and receiving the deduplication rule configuration of the target storage engine by the user.

Further, the consuming unit 210 is further configured to generate a query statement corresponding to the real-time service requirement; and controlling a real-time computing engine to consume the service data in the distributed message system based on the query statement.

Further, the deduplication unit 230 is further configured to generate a deduplication log in which a deduplication range, a deduplication field, a deduplication operation timestamp, and a deduplication result corresponding to the deduplication operation are recorded.

Further, the consuming unit 210 is also configured to write the service data collected from the real-time data system into the distributed message system in a partitioned manner.

The data deduplication device provided by the invention, in the case that a real-time service requirement is received, consumes the service data in the distributed message system based on the real-time service requirement, and stores the service result data obtained after consumption in the analytical database; in the case that a preset deduplication condition is met, it acquires a preset configuration file matched with the real-time service requirement, configures a target storage engine of the analytical database based on the preset configuration file, and deduplicates the service result data stored in the analytical database based on the configured target storage engine. In this way, the configured preset deduplication condition, the preset configuration file and the target storage engine together realize automatic deduplication after the data lands in the analytical database, which improves data deduplication efficiency.

FIG. 3 illustrates the physical structure of an electronic device. As shown in FIG. 3, the electronic device may include a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call logic instructions in the memory 330 to perform a data deduplication method comprising: in the case that a real-time service requirement is received, consuming service data stored in a distributed message system based on the real-time service requirement, and storing the service result data obtained after consumption in an analytical database; in the case that a preset deduplication condition is met, acquiring a preset configuration file matched with the real-time service requirement, and configuring a target storage engine of the analytical database based on the preset configuration file; and deduplicating the service result data stored in the analytical database based on the configured target storage engine.

In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, a computer is capable of executing the data deduplication method provided by the above methods, and the method comprises: under the condition that a real-time service demand is received, service data stored in a distributed message system are consumed based on the real-time service demand, and service result data obtained after consumption are stored in an analytical database; under the condition that a preset duplicate removal condition is met, acquiring a preset configuration file matched with the real-time service requirement, and configuring a target storage engine of the analytical database based on the preset configuration file; and removing the duplicate of the business result data stored in the analytical database based on the configured target storage engine.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the data deduplication method provided above, the method comprising: when a real-time service demand is received, consuming service data stored in a distributed message system based on the real-time service demand, and storing the service result data obtained after consumption in an analytical database; when a preset deduplication condition is met, acquiring a preset configuration file matching the real-time service demand, and configuring a target storage engine of the analytical database based on the preset configuration file; and deduplicating the service result data stored in the analytical database based on the configured target storage engine.
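For illustration only, the three steps recited above (consume and store, configure the storage engine once the deduplication condition is met, then deduplicate) can be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the implementation of the invention: the names `Record`, `EngineConfig`, `AnalyticalStore`, `consume` and `run_pipeline` are hypothetical, the message system is modelled as a plain list of dictionaries, the deduplication condition is passed in as a boolean, and the rule "keep the row with the highest version for each key" is an assumed deduplication policy.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    key: str      # hypothetical business key: rows sharing it are duplicates
    version: int  # hypothetical version column: the larger value is newer
    value: str


@dataclass(frozen=True)
class EngineConfig:
    """Hypothetical contents of the preset configuration file."""
    dedup_column: str = "key"
    version_column: str = "version"


@dataclass
class AnalyticalStore:
    """In-memory stand-in for the analytical database."""
    rows: List[Record] = field(default_factory=list)
    config: Optional[EngineConfig] = None

    def insert(self, records: Iterable[Record]) -> None:
        self.rows.extend(records)

    def configure(self, config: EngineConfig) -> None:
        # Stand-in for configuring the target storage engine.
        self.config = config

    def deduplicate(self) -> None:
        # Assumed policy: per dedup key, keep the row with the highest
        # version (ties go to the row inserted later).
        if self.config is None:
            raise RuntimeError("storage engine is not configured")
        latest: Dict[object, Record] = {}
        for row in self.rows:
            k = getattr(row, self.config.dedup_column)
            kept = latest.get(k)
            if kept is None or (
                getattr(row, self.config.version_column)
                >= getattr(kept, self.config.version_column)
            ):
                latest[k] = row
        self.rows = list(latest.values())


def consume(messages: Iterable[dict], topic: str) -> List[Record]:
    """Select the messages that match the (hypothetical) service demand."""
    return [
        Record(m["key"], m["version"], m["value"])
        for m in messages
        if m["topic"] == topic
    ]


def run_pipeline(
    messages: Iterable[dict],
    topic: str,
    config: EngineConfig,
    condition_met: bool,
) -> AnalyticalStore:
    store = AnalyticalStore()
    store.insert(consume(messages, topic))  # step 1: consume and store
    if condition_met:                       # step 2: condition reached
        store.configure(config)
        store.deduplicate()                 # step 3: deduplicate
    return store
```

In this sketch the rows are written without any check at insertion time, and duplicates are only collapsed after the store has been configured, which mirrors the order of the steps recited above.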

The above-described embodiments of the apparatus are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for data deduplication, comprising:

2. The data deduplication method of claim 1, wherein the configuring of the target storage engine of the analytical database based on the preset configuration file comprises:

3. The data deduplication method of claim 1, wherein the manner of determining that the preset deduplication condition is reached comprises:

and, when a database restart backup of the analytical database is detected, determining that the preset deduplication condition is reached.

4. The data deduplication method of claim 1, wherein, before the storing of the service result data obtained after the consumption into the analytical database, the method further comprises:

selecting a target storage engine as a database engine of the analytical database in an initialization stage of the analytical database;

5. The data deduplication method of claim 1, wherein the consuming the service data stored in the distributed message system based on the real-time service requirement comprises:

6. The data deduplication method according to any one of claims 1 to 5, wherein, before the consuming of the service data stored in the distributed message system based on the real-time service demand, the method further comprises:

and writing the service data collected from the real-time data system into the distributed message system by partition.

7. A data deduplication apparatus, comprising:

8. The data deduplication apparatus of claim 7, wherein the configuration unit is further configured to:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the data deduplication method of any one of claims 1 to 6 when executing the program.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data deduplication method of any one of claims 1 to 6.