CN116303259A - Method, device, equipment and medium for avoiding data repetition - Google Patents


Info

Publication number
CN116303259A
CN116303259A (application CN202310294581.3A)
Authority
CN
China
Prior art keywords
data
file
data source
source file
search engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310294581.3A
Other languages
Chinese (zh)
Inventor
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202310294581.3A priority Critical patent/CN116303259A/en
Publication of CN116303259A publication Critical patent/CN116303259A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/152 File search processing using file content signatures, e.g. hash values
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of big data computing, and in particular to a method, apparatus, device and medium for avoiding data duplication, which solve the problem that duplicate data may be read back in a scenario where multiple data search engines write the same data one after another. The method comprises: performing data searches with a plurality of data search engines to obtain the data source file corresponding to each data search engine; in response to a directory index determination instruction, invoking the directory index determination unit preset for each data search engine to parse the data source files and generate the directory index corresponding to each data source file, where the parsing rules of the directory index determination units preset for the data search engines are identical; and invoking each data search engine to write its data source file, according to the directory index determined for that file, into the data storage unit corresponding to the index.

Description

Method, device, equipment and medium for avoiding data repetition
Technical Field
The present application relates to the field of big data computing technologies, and in particular, to a method, an apparatus, a device, and a medium for avoiding data repetition.
Background
Data on the Hadoop Distributed File System (Hadoop Distributed File System, HDFS) is typically updated, inserted and deleted through a single data engine. Hudi, a storage format for data lakes, provides the ability to update and delete data on HDFS and to consume change data.
However, in a special service scenario in which multiple engines write data, a real-time stream processing data engine first processes, in real time, streaming data acquired in a specific order (such as the order in which events occur) and writes it into the Hudi data lake; a batch processing data engine then batch-processes data that the stream engine missed, or data that needs to be updated, and writes it into the Hudi data lake as well.
For this multi-engine write scenario, taking a record D1 as an example, several situations can arise:
1) The real-time stream processing data engine writes D1 successfully and the batch run of the batch processing data engine does not include D1; no duplicate data appears on a read query.
2) The real-time stream processing data engine writes D1 successfully and the batch run of the batch processing data engine also writes D1; even when the batch-processed D1 is exactly identical to the D1 already in the data lake, D1 is written into the data lake again, so duplicate data appears on a read query.
3) The real-time stream processing data engine fails to write D1; no duplication occurs.
For the second situation, a solution is needed that avoids reading duplicate, identical data from the data lake.
Disclosure of Invention
The embodiments of the application provide a method, apparatus, device and medium for avoiding data duplication, which solve the problem that duplicate data may be read back when multiple data search engines write the same data one after another.
In a first aspect, the present application provides a method for avoiding data duplication, the method comprising:
performing data search by using a plurality of data search engines to obtain a data source file corresponding to each data search engine;
responding to a directory index determining instruction, calling a directory index determining unit preset for each data search engine to analyze the data source files, and generating directory indexes corresponding to each data source file, wherein the analysis rules of the directory index determining units preset for each data search engine are the same;
and invoking each data search engine to write the data source file, according to the directory index determined for that file, into the data storage unit corresponding to the index.
In a possible embodiment, the directory index includes a Bucket partition identification Bucket id and a File group identification File id, and the method further includes:
and responding to the data reading instruction, determining a socket and a File Group where the read data source File is located, and calling each data search engine to read the data source File indexed by the socket id and the File id based on the data writing type of each data source File.
In one possible embodiment, a data search using a plurality of data search engines includes:
responding to a real-time data writing instruction, and utilizing a real-time stream processing data search engine to acquire a data source file corresponding to real-time data from a front-end server in real time according to a specific sequence;
and in response to a delayed-data write instruction, obtaining the data source file corresponding to the delayed data from a background storage space by using a batch processing data search engine.
In one possible embodiment of the present invention,
a directory index determining unit independent of each data search engine is preset and serves as a directory index determining unit shared by all data search engines; or
a corresponding directory index determining unit is preset in each data search engine.
In one possible embodiment, invoking a directory index determining unit preset for each data search engine to parse the data source file, and determining a directory index corresponding to each data source file includes:
invoking the directory index determining unit preset for each data search engine, parsing the attribute information of the data source files according to a preset hash algorithm, and generating the directory index corresponding to each data source file.
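As a sketch of this step (not the patent's actual implementation), the following derives a directory index from a record's attribute information with a fixed hash; the field names, the MD5 hash, and the bucket counts are assumptions for illustration, since the embodiment only requires that every engine apply the same parsing rule:

```python
import hashlib

NUM_BUCKETS = 2             # assumed bucket count; the embodiment leaves N configurable
FILE_GROUPS_PER_BUCKET = 2  # assumed File Groups per bucket

def directory_index(record_attrs: dict) -> tuple[str, str]:
    """Parse a record's attribute information into a (Bucket id, File id)
    directory index using a deterministic hash."""
    # Build a stable key from the attributes (e.g. transaction serial number,
    # user ID, transaction type); sorting makes the key order-independent.
    key = "|".join(f"{k}={record_attrs[k]}" for k in sorted(record_attrs))
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    bucket_id = digest % NUM_BUCKETS
    file_id = (digest // NUM_BUCKETS) % FILE_GROUPS_PER_BUCKET
    return f"bucket-{bucket_id}", f"fg-{bucket_id}-{file_id}"

# The same record maps to the same index no matter which engine computes it,
# which is what lets duplicate writes land in the same File Group.
rec = {"txn_no": "T0001", "user_id": "U42", "txn_type": "deposit"}
assert directory_index(rec) == directory_index(dict(rec))
```

Determinism is the essential property here: as long as both engines share the parsing rule, a record written twice resolves to one File Group.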
In a possible embodiment, the Bucket id identifies the Bucket partition in which the data storage unit corresponding to the data source File is located, and the File id identifies the File Group corresponding to the data source File within that Bucket partition, where each Bucket partition includes multiple File Groups.
In one possible embodiment, after generating the directory index corresponding to each data source file, the method further includes:
storing the directory index corresponding to each data source file in a memory variable;
invoking each data search engine to determine a directory index corresponding to the data source file, including:
and calling each data search engine to access the memory variable, and determining the directory index corresponding to each data source file.
In one possible embodiment, based on the data writing type of each data source file, calling each data search engine to write the data source file into a data storage unit corresponding to the directory index according to the directory index corresponding to the determined data source file, including:
when the data write type of the data source File is determined to be copy-on-write (COW), invoking the corresponding data search engine to treat the data source File as a File Slice and write it, in Base File format, into the File Group indexed by the Bucket id and the File id.
In one possible embodiment, based on the data writing type of each data source file, calling each data search engine to write the data source file into a data storage unit corresponding to the directory index according to the directory index corresponding to the determined data source file, including:
when the data write type of the data source File is determined to be merge-on-read (MOR), invoking the corresponding data search engine to write the data source File, in the form of a Log File, into the File Group indexed by the Bucket id and the File id.
In one possible embodiment, based on the data write type of each data source File, invoking each data search engine to read the data source File indexed by the Bucket id and the File id includes:
when the data write type of the data source File is determined to be copy-on-write (COW), invoking the corresponding data search engine to read the latest Base File in the File Group indexed by the Bucket id and the File id.
In one possible embodiment, the method further comprises:
when it is determined that the data merge period has been reached, merging the latest Log File with the Base File in each File Group whose data write type is merge-on-read (MOR), to obtain a new Base File;
based on the data write type of each data source File, invoking each data search engine to read the data source File indexed by the Bucket id and the File id includes:
when the data write type of the data source File is determined to be merge-on-read (MOR), invoking the corresponding data search engine to read the latest Base File in the File Group indexed by the Bucket id and the File id.
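The write and read behaviour of the two write types described above can be sketched as follows; this is a simplified in-memory model for illustration, not Hudi's actual implementation, and it treats each record as a small dictionary:

```python
class FileGroup:
    """Toy model of a File Group holding Base Files and Log Files."""

    def __init__(self):
        self.base_files = []  # newest Base File last
        self.log_files = []   # Log File entries pending compaction

    def write(self, record: dict, write_type: str) -> None:
        if write_type == "COW":
            # Copy-on-write: the record is written as a new Base File version.
            self.base_files.append(dict(record))
        elif write_type == "MOR":
            # Merge-on-read: the record is appended as a Log File entry.
            self.log_files.append(dict(record))

    def compact(self) -> None:
        # At the data merge period, fold the Log Files into a new Base File.
        merged = dict(self.base_files[-1]) if self.base_files else {}
        for entry in self.log_files:
            merged.update(entry)
        self.base_files.append(merged)
        self.log_files.clear()

    def read_latest(self) -> dict:
        # Both COW and (post-compaction) MOR reads return the newest Base File.
        return self.base_files[-1]

fg = FileGroup()
fg.write({"id": "D1", "amount": 100}, "COW")
fg.write({"amount": 120}, "MOR")   # a later update lands as a Log File
fg.compact()                       # merge period reached
assert fg.read_latest() == {"id": "D1", "amount": 120}
```

Because every reader returns only the newest Base File of the File Group, a duplicate write to the same File Group is absorbed rather than surfaced twice.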
In a second aspect, the present application provides an apparatus for avoiding data repetition, the apparatus comprising:
the data searching module is used for searching data by utilizing a plurality of data searching engines to obtain a data source file corresponding to each data searching engine;
the directory index determining module is used for responding to the directory index determining instruction, calling a directory index determining unit preset for each data search engine to analyze the data source files and determining directory indexes corresponding to each data source file, wherein the analysis rules of the directory index determining units preset for each data search engine are the same;
and the data writing module, configured to invoke each data search engine to write the data source file, according to the directory index determined for that file, into the data storage unit corresponding to the index.
In a possible embodiment, the directory index includes a Bucket partition identification Bucket id and a File group identification File id, and the apparatus further includes:
the data reading module, configured to respond to a data read instruction, determine the Bucket and File Group in which the data source File to be read is located, and, based on the data write type of each data source File, invoke each data search engine to read the data source File indexed by the Bucket id and the File id.
In a possible embodiment, the data search module is further configured to:
responding to a real-time data writing instruction, and utilizing a real-time stream processing data search engine to acquire a data source file corresponding to real-time data from a front-end server in real time according to a specific sequence;
and respond to a delayed-data write instruction by obtaining the data source file corresponding to the delayed data from a background storage space with a batch processing data search engine.
In a possible embodiment, the apparatus presets a directory index determining unit independent of each data search engine as a directory index determining unit shared by all data search engines; or
Corresponding directory index determining units are set in each data search engine in advance.
In a possible embodiment, the directory index determining module is further configured to call a directory index determining unit preset for each data search engine, parse attribute information of the data source file according to a preset hash algorithm, and generate a directory index corresponding to each data source file.
In a possible embodiment, the Bucket id identifies the Bucket partition in which the data storage unit corresponding to the data source File is located, and the File id identifies the File Group corresponding to the data source File within that Bucket partition, where each Bucket partition includes multiple File Groups.
In one possible embodiment, after generating the directory index corresponding to each data source file, the directory index determining module is further configured to:
storing the directory index corresponding to each data source file in a memory variable;
and calling each data search engine to access the memory variable, and determining the directory index corresponding to each data source file.
In a possible embodiment, the data writing module is further configured to:
when the data write type of the data source File is determined to be copy-on-write (COW), invoke the corresponding data search engine to treat the data source File as a File Slice and write it, in Base File format, into the File Group indexed by the Bucket id and the File id.
In a possible embodiment, the data writing module is further configured to:
when the data write type of the data source File is determined to be merge-on-read (MOR), invoke the corresponding data search engine to write the data source File, in the form of a Log File, into the File Group indexed by the Bucket id and the File id.
In a possible embodiment, the data reading module is further configured to:
when the data write type of the data source File is determined to be copy-on-write (COW), invoke the corresponding data search engine to read the latest Base File in the File Group indexed by the Bucket id and the File id.
In a possible embodiment, the data reading module is further configured to:
when it is determined that the data merge period has been reached, merge the latest Log File with the Base File in each File Group whose data write type is merge-on-read (MOR), to obtain a new Base File;
and when the data write type of the data source File is determined to be merge-on-read (MOR), invoke the corresponding data search engine to read the latest Base File in the File Group indexed by the Bucket id and the File id.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing program instructions;
a processor, configured to invoke the program instructions stored in the memory and, in accordance with the obtained program instructions, execute the steps of the method of any one of the first aspects.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any one of the first aspects.
In a fifth aspect, the present application provides a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the first aspects.
The technical scheme provided by the embodiment of the application at least brings the following beneficial effects:
the application provides a method, apparatus, device and medium for avoiding data duplication which cause duplicate data to be assigned the same directory index even when multiple data search engines write data one after another, write the duplicate data produced by the multiple data search engines into the same data storage unit on the basis of that shared index, and rely on Hudi's read-time merge mechanism to ensure that the duplicate data is not read back.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, and it is obvious that the drawings that are described below are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for avoiding data duplication according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of Hudi file layout according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of an apparatus for avoiding data duplication according to an embodiment of the present application;
fig. 4 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure. Embodiments and features of embodiments in this application may be combined with each other arbitrarily without conflict. Also, while a logical order of illustration is depicted in the flowchart, in some cases the steps shown or described may be performed in a different order than presented.
The terms "first" and "second" in the description, claims and drawings of the present application are used to distinguish between different objects, not to describe a particular sequential order. Furthermore, the term "include" and any variation thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements not listed or inherent to such a process, method, article or apparatus. The term "plurality" in the present application may mean at least two, for example two, three or more; the embodiments of the present application are not limited in this respect.
In this technical solution, the collection, transmission, use and other processing of data all comply with the relevant national laws and regulations.
Before describing the method for avoiding data repetition provided in the embodiments of the present application, for convenience of understanding, the following detailed description is first provided for the technical background of the embodiments of the present application.
For the purpose of illustrating the present embodiment, some concepts are explained below.
(1) Data lake
Enterprise-generated data is maintained on a platform referred to as a "data lake": a data management scheme that supports storing multiple raw data formats and serving multiple computing engines.
(2)Hudi(Hadoop Upserts anD Incrementals)
Hudi is a storage format for data lakes used to manage the storage of large analytics datasets on the Hadoop Distributed File System (Hadoop Distributed File System, HDFS); it provides the ability to update and delete data on HDFS and to consume change data.
In the Hudi data storage format, data files are stored as a number of File Groups, each identified by a unique File id. Each File Group may hold several file versions, so a File Group contains multiple File Slices; each File Slice consists of a Base File stored in columnar Parquet format and Log Files stored in row format. The Log Files contain the inserts and updates applied since the Base File was generated, and Hudi periodically merges the Base File and Log Files to produce a new Base File.
When Hudi data is read, the data within the same File Group is merged, ensuring that the latest Base File data is read.
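A minimal sketch of this read-time merge (the field names are illustrative): the records of one File Group are combined by key, so a record that was written twice is returned once:

```python
def merge_file_group(records: list[dict]) -> list[dict]:
    """Combine the records of one File Group, keeping one latest version
    per record key; "key" is an assumed field name for illustration."""
    latest: dict[str, dict] = {}
    for rec in records:           # records ordered oldest to newest
        latest[rec["key"]] = rec  # a later duplicate overwrites the earlier one
    return list(latest.values())

# Two engines wrote the same record D1 into the same File Group;
# the merge returns a single copy on read.
writes = [{"key": "D1", "value": 10}, {"key": "D1", "value": 10}]
assert merge_file_group(writes) == [{"key": "D1", "value": 10}]
```

This is why the method below only has to guarantee that duplicates land in the same File Group: the existing merge step then deduplicates them for free.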
(3) Stream processing and batch processing
Stream processing: processing of unbounded data streams is commonly referred to as stream processing, and is typically performed in real-time as the data is generated. Because the data input of an unbounded data stream is infinite, it must be processed continuously. The data needs to be processed immediately after being acquired, and it is impossible to wait until all the data arrives before processing. Processing unbounded data streams typically requires that events be acquired in a particular order (e.g., the order in which the events occur) so that the integrity of the inferred results can be guaranteed.
Batch processing: the processing of bounded data streams is often referred to as batch processing. Batch processing does not require orderly retrieval of data. In batch mode, the data stream is first persisted to a storage system (file system or object store), then the data of the entire dataset is read, sorted, counted or summarized, and finally the result is output.
(4) Data engine
Flink: a stream processing framework engine for distributed, high-performance, highly available data stream applications. It can process both bounded and unbounded data streams and is typically used for real-time streaming computation.
Spark: a big data parallel computing framework based on in-memory computation, which can be used to build large-scale, low-latency data analysis applications and is typically used for batch processing.
Data on HDFS is updated, inserted and deleted through Spark or Flink.
Typically, data on HDFS is updated, inserted and deleted through a single data engine. Hudi, a storage format for data lakes, provides the ability to update and delete data on HDFS and to consume change data.
However, in a special service scenario in which multiple engines write data, a real-time stream processing data engine such as Flink first processes, in real time, streaming data acquired in a specific order (such as the order in which events occur) and writes it into the Hudi data lake; a batch processing data engine such as Spark then batch-processes data that Flink missed, or data that needs to be updated, and writes it into the Hudi data lake as well.
For this multi-engine write scenario, taking a record D1 as an example, several situations can arise:
1) Flink writes D1 successfully and the Spark batch run does not include D1; no duplicate data appears on a read query.
2) Flink writes D1 successfully and the Spark batch run also writes D1. Even when the D1 processed by Spark is exactly identical to the D1 already in the data lake, D1 is written again into a File Group different from the one holding the existing D1; the two copies therefore cannot be merged, and duplicate data appears on a read query.
3) Flink fails to write D1; no duplication occurs.
To address the problem that duplicate, identical data can be read from the data lake in the second situation, the application provides a method for avoiding data duplication: when multiple data search engines write data files into the data lake, each engine uses the same directory index for identical data files, so that duplicate data files are written into the same File Group; each engine then reads only the latest data in that File Group, which prevents the duplicate data from being read.
As shown in fig. 1, a flowchart of a method for avoiding data repetition according to an embodiment of the present application is provided, where the method includes steps 11-13 as follows.
Step 11, performing data search by using a plurality of data search engines to obtain a data source file corresponding to each data search engine;
in the embodiment of the application, different data search engines are used for realizing different data search functions so as to obtain the data source files corresponding to the data search engines.
As a possible implementation, the data search is performed using a plurality of data search engines, including the following two cases:
Case 1: responding to a real-time data writing instruction, and utilizing a real-time stream processing data search engine to acquire a data source file corresponding to real-time data from a front-end server in real time according to a specific sequence;
for example, in a business scenario in which bank transaction flow records are captured and then processed by the core system in a specific period, a real-time stream processing data search engine such as Flink is required; Flink writes the bank transaction flow record data obtained in real time from a front-end server into the data lake.
Case 2: responding to the time delay data writing instruction, and acquiring a data source file corresponding to the time delay data from a background storage space by utilizing a batch processing data search engine.
In the foregoing service scenario, Flink may miss some real-time data because of disconnections or other causes, so the bank transaction flow records cannot be reconciled. A batch processing data search engine such as Spark is therefore required to obtain the data source files corresponding to the delayed data from the background storage space for batch processing; after batch-processing the delayed data that Flink missed, or the delayed data to be updated, Spark likewise writes it into the Hudi data lake.
Step 12, responding to a directory index determining instruction, calling a directory index determining unit preset for each data search engine to analyze the data source files and generating directory indexes corresponding to each data source file, wherein the analysis rules of the directory index determining units preset for each data search engine are the same;
Hudi consistently maps a given Hudi record to a File Group/File id through its directory indexing mechanism, providing efficient data writes and updates. In this embodiment, after each data search engine has retrieved its corresponding data source File, the directory index determining unit preset for each data search engine is invoked to parse the data source File and determine the directory index corresponding to each data source File, so that the data source File is written into the data storage unit corresponding to that index.
As a possible implementation manner, the directory index determining unit provided in the present application may be a directory index determining unit preset independently of each data search engine, where the directory index determining unit may be regarded as an independent pre-module that needs to be executed before each data search engine writes a data source file, where the pre-module may make the same data source file correspondingly generate the same directory index.
As a possible implementation manner, the directory index determining unit provided in the present application may be a corresponding directory index determining unit set in each data search engine in advance, where the directory index determining unit having the same parsing rule is configured for each data search engine, and the same data source file may be correspondingly generated to the same directory index.
In the embodiment of the application, the parsing rules of the directory index determining unit set for each data search engine are the same, and when a plurality of data search engines write data source files into Hudi, directory indexes corresponding to the data source files repeatedly written by the plurality of data search engines are the same.
As described above, when the Hudi data is read, the data in the same File Group are merged to ensure that the latest Base File data is read, so that based on the same directory index, multiple data search engines write the repeated data source files into the same File Group to ensure that the repeated Base File data is not read.
In the embodiment of the application, the directory Index corresponding to each data source file is generated based on a Bucket Index pattern. The Bucket Index is a hash-based index: N buckets are set, Hudi is divided into N bucket partitions, and each bucket includes a plurality of File Groups. The directory index calculated by the hash function decides which File Group of which bucket a given Hudi record belongs to, where the Hudi record is the data source File that needs to be written into Hudi.
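The bucket assignment described above can be expressed as a minimal Python sketch. This is an illustration of the hash-based principle only, not the actual Hudi implementation; the function name `bucket_id_for`, the use of MD5, and the bucket count are all assumptions chosen for the example.

```python
import hashlib

NUM_BUCKETS = 4  # N bucket partitions configured for the table (illustrative)

def bucket_id_for(record_key: str) -> int:
    """Map a record key to a bucket partition via a hash function.

    A deterministic hash means every engine that writes the same record
    key lands in the same bucket, regardless of which engine computed it.
    """
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# The same record key always maps to the same bucket partition:
assert bucket_id_for("txn-0001") == bucket_id_for("txn-0001")
assert 0 <= bucket_id_for("txn-0001") < NUM_BUCKETS
```

Because the mapping depends only on the record key and the fixed bucket count, no coordination between writers is needed to agree on a target bucket.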
In this embodiment of the present application, the attribute information of the data source file is parsed according to a preset hash algorithm to generate the directory index corresponding to each data source file. In the foregoing business scenario of recording a bank transaction flow, the data source file may be understood as a transaction record, and its attribute information as the transaction type, transaction flow number, user ID, and the like of that record. Fig. 2 is a schematic layout of the Hudi files provided in the embodiment of the present application: two buckets are set, each bucket partition includes two File Groups, each File Group includes File Slices (File Slice1, File Slice2 … File Slice n), and each File Slice includes a Base File and at least one Log File. In the embodiment of the application, the directory Index generated for each data source File based on the Bucket Index mode comprises a bucket partition identifier (Bucket id) and a File Group identifier (File id): the Bucket id identifies the bucket partition where the data storage unit corresponding to the data source File is located, and the File id identifies the File Group corresponding to the data source File within that bucket partition.
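The file layout described above (buckets containing File Groups, File Groups containing File Slices of one Base File plus Log Files) can be modeled as a small data structure. This is a sketch of the hierarchy for illustration; the class and field names are invented for the example and do not come from Hudi itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FileSlice:
    base_file: str                                      # columnar Base File
    log_files: List[str] = field(default_factory=list)  # row-based Log Files

@dataclass
class FileGroup:
    file_id: str
    slices: List[FileSlice] = field(default_factory=list)

@dataclass
class BucketPartition:
    bucket_id: str
    file_groups: List[FileGroup] = field(default_factory=list)

# Mirror of the layout in fig. 2: two buckets, each with two File Groups.
table = [
    BucketPartition("00000001", [FileGroup("fg-a"), FileGroup("fg-b")]),
    BucketPartition("00000002", [FileGroup("fg-c"), FileGroup("fg-d")]),
]
assert len(table) == 2 and all(len(b.file_groups) == 2 for b in table)
```

A directory index (Bucket id, File id) is then simply a path through the first two levels of this hierarchy.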
In one or more embodiments, after generating the directory index corresponding to each data source file, the method further includes:
storing the directory index corresponding to each data source file in a memory variable; thus, invoking each data search engine to determine the directory index corresponding to the data source file includes: calling each data search engine to access the memory variable and determine the directory index corresponding to each data source file.
For example, in the foregoing business scenario of using Flink and Spark to record banking transaction flow records, if Flink and Spark successively write the same data source file D1, the process of generating the directory index corresponding to the data source file D1 is as follows:
(1) After receiving the data source File D1, Flink executes the directory index determining unit, calculates the Bucket id and the File id based on the hash algorithm, generates the directory index corresponding to the data source File, stores it in a memory variable, and then writes to Hudi. The generated directory index corresponding to the data source File D1 has the following format:
00000002-c1b6-4b39-b8b3-5aa9c10fbdf1_20221209180912789.Log.1_0-1-0; here, "00000002" is the Bucket id, and the rest is the File id.
(2) After Spark receives the data source File D1, it also executes the directory index determining unit, calculates the Bucket id and the File id based on the hash algorithm, generates the directory index corresponding to the data source File, and stores it in a memory variable. It should be noted that, because the attribute information of the data source File is the same as in (1), the generated Bucket id and File id are the same as in (1).
Based on the directory index generated for the data source File D1, both Flink and Spark can access the corresponding memory variables and write the data source File D1 into the same File Group.
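The two-engine process above can be sketched as follows. This is a simulation of the idea only: the attribute fields, the MD5 hash, and the `shared_index` dictionary (standing in for the memory variable) are all assumptions for illustration, not the production parsing rule.

```python
import hashlib

NUM_BUCKETS = 4  # illustrative bucket count

def directory_index(txn_type: str, flow_no: str, user_id: str) -> tuple:
    """Derive (Bucket id, File id) from a record's attribute information.

    Every engine runs the same parsing rule, so a duplicate record
    always yields the same directory index.
    """
    key = f"{txn_type}|{flow_no}|{user_id}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket_id = f"{int(digest, 16) % NUM_BUCKETS:08d}"
    file_id = digest  # stand-in for the File Group identifier
    return bucket_id, file_id

shared_index = {}  # stand-in for the memory variable both engines access

# "Flink" computes and stores the index for record D1 ...
shared_index["D1"] = directory_index("transfer", "20221209-001", "user-42")
# ... and "Spark", given the same attribute information, derives the same one,
# so both write D1 into the same File Group.
assert directory_index("transfer", "20221209-001", "user-42") == shared_index["D1"]
```

The key property is determinism: identical attribute information in, identical (Bucket id, File id) out, regardless of which engine runs the rule.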
In the embodiment of the application, based on the Bucket Index mode, a directory index determining unit with the same determination rule is set for each data search engine, and the directory index corresponding to each data source File is generated accordingly, so that duplicate data source files are written into the same File Group in the same bucket partition. This prevents duplicate data source files from being written into different File Groups, which would cause duplicate data to be read out when the Hudi data is read and affect the user experience.
And step 13, calling each data search engine to write the data source file into a data storage unit corresponding to the directory index according to the directory index corresponding to the determined data source file.
As a possible embodiment, the method further includes:
and responding to the data reading instruction, determining the Bucket and File Group where the data source File to be read is located, and calling each data search engine to read the data source File indexed by the Bucket id and the File id based on the data writing type of each data source File.
Hudi supports two data write types: Copy On Write (COW) and Merge On Read (MOR). Data writing and data reading in Hudi based on these two write types are described below.
(1) Copy-on-write COW:
as a possible implementation manner, based on a data writing type of each data source file, calling each data search engine to write the data source file into a data storage unit corresponding to a directory index according to the determined directory index corresponding to the data source file, including:
when the data writing type of the data source File is determined to be copy-on-write (COW), a corresponding data search engine is called to take the data source File as a File Slice, and to write the data source File into the File Group indexed by the Bucket id and the File id in the format of a Base File.
(1) Data writing: copy-on-write (COW) means that when data is written, the old Base File is copied and merged with the newly written data source File to generate a new Base File, and the new Base File is written as a new File Slice into the File Group indexed by the Bucket id and the File id.
Copy-on-write COW creates a new version of the corresponding data File, i.e., a File in the format of a new Base File, for each new batch of written data source files. That is, when the data write type is copy-on-write COW, each File Group contains only Base Files, and each write of a data file creates a new Base File.
It should be noted that the new Base File is obtained by merging the old-version Base File with the newly written data source File. If this is the first write, the corresponding data search engine is directly called to take the data source File as a File Slice and write it into the File Group indexed by the Bucket id and the File id in the format of a Base File.
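The copy-on-write path above can be sketched in a few lines. This is a toy model for illustration, assuming each Base File is represented as a dict keyed by record key and a File Group as the list of its Base File versions; the function name `cow_write` is invented for the example.

```python
def cow_write(file_group: list, new_records: dict) -> None:
    """Copy-on-write: copy the old Base File, merge in the new records,
    and append the result as a NEW Base File (old versions are kept)."""
    old_base = file_group[-1] if file_group else {}   # first write: nothing to copy
    new_base = {**old_base, **new_records}            # copy old data, apply updates
    file_group.append(new_base)

fg = []                                        # empty File Group: first write
cow_write(fg, {"txn-1": "v1"})                 # creates the initial Base File
cow_write(fg, {"txn-1": "v2", "txn-2": "v1"})  # copy + merge -> new Base File
assert fg[-1] == {"txn-1": "v2", "txn-2": "v1"}
assert len(fg) == 2                            # every batch produced a new version
```

The write cost here is rewriting the whole Base File per batch, which is what makes COW relatively expensive on the write side.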
(2) Data reading: when the data writing type of the data source File is copy-on-write (COW), the configured data search engine directly reads the latest Base File in the corresponding File Group.
In one or more embodiments, based on the data write type of each data source File, invoking each data search engine to read the data source File indexed by the Bucket id and File id, including:
when the data writing type of the data source File is determined to be copy-on-write (COW), a corresponding data search engine is called to read the latest Base File in the File Group indexed by the Bucket id and the File id.
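The COW read path is correspondingly simple, which is its advantage. A minimal sketch, under the same toy model as above (a File Group as a list of Base File versions; the name `cow_read` is invented here):

```python
def cow_read(file_group: list) -> dict:
    """Under COW, reading needs no merge: the newest Base File already
    holds the complete, current data for the File Group."""
    if not file_group:
        return {}
    return file_group[-1]

fg = [{"txn-1": "v1"}, {"txn-1": "v2"}]  # two Base File versions
assert cow_read(fg) == {"txn-1": "v2"}   # only the latest version is read
```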
(2) Merge on read (MOR):
as a possible implementation manner, based on a data writing type of each data source file, calling each data search engine to write the data source file into a data storage unit corresponding to a directory index according to the determined directory index corresponding to the data source file, including:
When the data writing type of the data source File is determined to be the merging MOR during reading, a corresponding data search engine is called to write the data source File into a File Group indexed by the Bucket id and the File id in the form of a Log File.
(1) And (3) data writing: when the MOR is read, the data source File is firstly written into the Bucket id and the File Group indexed by the File id in the form of Log File Log File.
When Hudi determines that the data merging period has been reached, it merges the latest Log File in each File Group whose data writing type is merge-on-read (MOR) with the Base File to obtain a new Base File, where the latest Log File is obtained by merging all Log Files in the File Group.
That is, when the data writing type is merge-on-read (MOR), each File Slice consists of a Base File stored in the columnar parquet format and Log Files stored in a row format. The Log Files contain the inserts and updates applied since the Base File was generated, and Hudi periodically merges the Base File and Log Files to generate a new Base File.
For an existing Base File to be updated, Hudi stores the data source File representing the updated data in the form of a Log File; no new Base File is merged or created at write time, so the write cost is lower than that of copy-on-write (COW), making MOR suitable for write-heavy, read-light scenarios.
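The MOR write path and the periodic compaction described above can be sketched together. This is a toy model for illustration: a File Group is a dict with a `base_file` and a `log_files` list, and the names `mor_write` and `compact` are invented for the example.

```python
def mor_write(file_group: dict, new_records: dict) -> None:
    """Merge-on-read write path: append the records as a Log File
    instead of rewriting the Base File (cheap writes)."""
    file_group["log_files"].append(dict(new_records))

def compact(file_group: dict) -> None:
    """Periodic compaction: fold all Log Files into the Base File,
    producing a new Base File and clearing the logs."""
    for log in file_group["log_files"]:        # oldest to newest
        file_group["base_file"].update(log)
    file_group["log_files"].clear()

fg = {"base_file": {"txn-1": "v1"}, "log_files": []}
mor_write(fg, {"txn-1": "v2"})
mor_write(fg, {"txn-2": "v1"})
assert len(fg["log_files"]) == 2               # writes only appended Log Files
compact(fg)
assert fg["base_file"] == {"txn-1": "v2", "txn-2": "v1"}
assert fg["log_files"] == []
```

Note how each write only appends a small Log File; the expensive merge is deferred to the compaction period.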
(2) Reading data: when the MOR type data is merged when the data is read in Hudi, the Hudi can enable the data search engine to read the real-time Base File by merging and outputting the Base File and the latest Log File.
As a possible implementation manner, based on the data writing type of each data source File, calling each data search engine to read the data source File indexed by the Bucket id and the File id includes:
when the data writing type of the data source File is determined to be the merging MOR during reading, a corresponding data search engine is called to read the latest Base File in the File Group indexed by the Bucket id and the File id.
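The MOR read path, which overlays the Log Files on the Base File at query time, can be sketched as follows. Same toy model and invented function name (`mor_read`) as in the write sketch above.

```python
def mor_read(file_group: dict) -> dict:
    """Merge-on-read read path: overlay the Log Files on the Base File
    at query time, so the reader sees up-to-date data even before the
    next compaction has produced a new Base File."""
    view = dict(file_group["base_file"])
    for log in file_group["log_files"]:        # applied oldest to newest
        view.update(log)
    return view

fg = {"base_file": {"txn-1": "v1"}, "log_files": [{"txn-1": "v2"}]}
assert mor_read(fg) == {"txn-1": "v2"}         # reader sees the update
assert fg["base_file"] == {"txn-1": "v1"}      # Base File itself untouched
```

This is the trade-off inverse of COW: MOR makes writes cheap and pays the merge cost on each read instead.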
According to the method for avoiding data repetition provided in the embodiments of the present application, directory index determining units with the same parsing rules are configured for the data search engines, so that duplicate data source files written by different data search engines have the same directory index and are written into the same File Group. Each data search engine then reads the latest data in that File Group, thereby avoiding reading out duplicate data.
Based on the same inventive concept, the embodiment of the present application further provides a device for avoiding data repetition, referring to fig. 3, the device includes:
The data search module 301 is configured to perform data search by using a plurality of data search engines, so as to obtain a data source file corresponding to each data search engine;
a directory index determining module 302, configured to invoke a directory index determining unit preset for each data search engine to parse the data source file in response to a directory index determining instruction, and determine a directory index corresponding to each data source file, where parsing rules of the directory index determining unit set for each data search engine are the same;
and the data writing module 303 is used for calling each data search engine to write the data source file into a data storage unit corresponding to the directory index according to the directory index corresponding to the determined data source file.

In a possible embodiment, the directory index includes a Bucket partition identification Bucket id and a File group identification File id, and the apparatus further includes:
the data reading module is used for responding to the data reading instruction, determining the Bucket and File Group where the data source File to be read is located, and calling each data search engine to read the data source File indexed by the Bucket id and the File id based on the data writing type of each data source File.
In a possible embodiment, the data search module is further configured to:
responding to a real-time data writing instruction, and utilizing a real-time stream processing data search engine to acquire a data source file corresponding to real-time data from a front-end server in real time according to a specific sequence;
responding to the time delay data writing instruction, and acquiring a data source file corresponding to the time delay data from a background storage space by utilizing a batch processing data search engine.
In a possible embodiment, the apparatus presets a directory index determining unit independent of each data search engine, serving as a directory index determining unit shared by all data search engines; or alternatively
Corresponding directory index determining units are set in each data search engine in advance.
In a possible embodiment, the directory index determining module is further configured to call a directory index determining unit preset for each data search engine, parse attribute information of the data source file according to a preset hash algorithm, and generate a directory index corresponding to each data source file.
In a possible embodiment, the Bucket id is used to identify the Bucket partition where the data storage unit corresponding to the data source File is located, and the File id is used to identify the File Group corresponding to the data source File in the Bucket partition to which the data source File belongs, where a Bucket partition includes multiple File Groups.
In one possible embodiment, after generating the directory index corresponding to each data source file, the directory index determining module is further configured to:
storing the directory index corresponding to each data source file in a memory variable;
and calling each data search engine to access the memory variable, and determining the directory index corresponding to each data source file.
In a possible embodiment, the data writing module is further configured to:
when the data writing type of the data source File is determined to be copy-on-write (COW), a corresponding data search engine is called to take the data source File as a File Slice, and to write the data source File into the File Group indexed by the Bucket id and the File id in the format of a Base File.
In a possible embodiment, the data writing module is further configured to:
when the data writing type of the data source File is determined to be the merging MOR during reading, a corresponding data search engine is called to write the data source File into a File Group indexed by the Bucket id and the File id in the form of a Log File.
In a possible embodiment, the data reading module is further configured to:
when the data writing type of the data source File is determined to be copy-on-write (COW), a corresponding data search engine is called to read the latest Base File in the File Group indexed by the Bucket id and the File id.
In a possible embodiment, the data reading module is further configured to:
when it is determined that the data merging period has been reached, merging the latest Log File with the Base File in each File Group whose data writing type is merge-on-read (MOR) to obtain a new Base File;
when the data writing type of the data source File is determined to be the merging MOR during reading, a corresponding data search engine is called to read the latest Base File in the File Group indexed by the Bucket id and the File id.
Based on the same inventive concept, the embodiment of the present application provides an electronic device, which may implement the function of avoiding data repetition discussed above, and referring to fig. 4, the device includes a processor 401 and a memory 402. Wherein the memory 402 is configured to store program instructions. And a processor 401 for calling the program instructions stored in the memory 402, and executing the method for avoiding data repetition according to the obtained program instructions.
The memory 402 is used to store programs. In particular, the program may include program code including computer-operating instructions. The memory 402 may be a volatile memory (RAM), such as a random-access memory (RAM); the memory may also be a nonvolatile memory (non-volatile memory), such as a flash memory (flash memory), a Hard Disk Drive (HDD) or a Solid State Drive (SSD); but may be any one or a combination of any of the above volatile and nonvolatile memories.
The processor 401 may be a central processing unit (central processing unit, CPU for short), a network processor (network processor, NP for short) or a combination of CPU and NP. But also a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (programmable logic device, PLD), or a combination thereof. The PLD may be a complex programmable logic device (complex programmable logic device, CPLD for short), a field-programmable gate array (field-programmable gate array, FPGA for short), general-purpose array logic (generic array logic, GAL for short), or any combination thereof.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing computer program code which, when run on a computer, causes the computer to perform the method of avoiding data repetition as discussed above. Since the principle by which the computer-readable storage medium solves the problem is similar to that of the method for avoiding data repetition, the implementation of the computer-readable storage medium can refer to the implementation of the method, and repetition is omitted.
Based on the same inventive concept, embodiments of the present application also provide a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of avoiding data repetition as discussed above. Since the principle by which the computer program product solves the problem is similar to that of the method for avoiding data repetition, the implementation of the computer program product can refer to the implementation of the method, and repetition is omitted.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of user operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (15)

1. A method of avoiding duplication of data, the method comprising:
performing data search by using a plurality of data search engines to obtain a data source file corresponding to each data search engine;
responding to a directory index determining instruction, calling a directory index determining unit preset for each data search engine to analyze the data source files, and generating directory indexes corresponding to each data source file, wherein the analysis rules of the directory index determining units preset for each data search engine are the same;
and calling each data search engine to write the data source file into a data storage unit corresponding to the directory index according to the directory index corresponding to the determined data source file.
2. The method of claim 1, wherein the directory index comprises a Bucket partition identification Bucket id and a File group identification File id, the method further comprising:
and responding to the data reading instruction, determining the Bucket and File Group where the data source File to be read is located, and calling each data search engine to read the data source File indexed by the Bucket id and the File id based on the data writing type of each data source File.
3. The method of claim 1, wherein performing the data search using the plurality of data search engines comprises:
Responding to a real-time data writing instruction, and utilizing a real-time stream processing data search engine to acquire a data source file corresponding to real-time data from a front-end server in real time according to a specific sequence;
responding to the time delay data writing instruction, and acquiring a data source file corresponding to the time delay data from a background storage space by utilizing a batch processing data search engine.
4. The method of claim 1, wherein,
a catalog index determining unit independent of each data search engine is preset and used as a catalog index determining unit shared by each data search engine; or alternatively
Corresponding directory index determining units are set in each data search engine in advance.
5. The method according to claim 1 or 2, wherein calling a directory index determination unit set in advance for each data search engine to parse the data source file, determining a directory index corresponding to each data source file, comprises:
and calling a directory index determining unit preset for each data search engine, analyzing attribute information of the data source files according to a preset hash algorithm, and generating directory indexes corresponding to each data source file.
6. The method of claim 2, wherein the Bucket id is used to identify the Bucket partition in which the data storage unit corresponding to the data source File is located, and the File id is used to identify the File Group corresponding to the data source File in the Bucket partition, where a Bucket partition includes a plurality of File Groups.
7. The method of claim 1, further comprising, after generating the directory index corresponding to each data source file:
storing the catalog index corresponding to each data source file in a memory variable;
invoking each data search engine to determine a directory index corresponding to the data source file, including:
and calling each data search engine to access the memory variable, and determining the directory index corresponding to each data source file.
8. The method of claim 6, wherein invoking each data search engine to write the data source file to a data storage unit corresponding to a directory index corresponding to a determined data source file based on a data write type of each data source file comprises:
when the data writing type of the data source File is determined to be copy-on-write (COW), a corresponding data search engine is called to take the data source File as a File Slice, and to write the data source File into the File Group indexed by the Bucket id and the File id in the format of a Base File.
9. The method of claim 6, wherein invoking each data search engine to write the data source file to a data storage unit corresponding to a directory index corresponding to a determined data source file based on a data write type of each data source file comprises:
When the data writing type of the data source File is determined to be the merging MOR during reading, a corresponding data search engine is called to write the data source File into a File Group indexed by the Bucket id and the File id in the form of a Log File.
10. The method of claim 8 or 9, wherein invoking each data search engine to read the data source File indexed by the Bucket id and File id based on the data write type of each data source File comprises:
when the data writing type of the data source File is determined to be copy-on-write (COW), a corresponding data search engine is called to read the latest Base File in the File Group indexed by the Bucket id and the File id.
11. The method according to claim 8 or 9, further comprising:
when it is determined that the data merging period has been reached, merging the latest Log File with the Base File in each File Group whose data writing type is merge-on-read (MOR) to obtain a new Base File;
based on the data writing type of each data source File, calling each data search engine to read the data source File indexed by the Bucket id and the File id, including:
when the data writing type of the data source File is determined to be the merging MOR during reading, a corresponding data search engine is called to read the latest Base File in the File Group indexed by the Bucket id and the File id.
12. An apparatus for avoiding duplication of data, comprising:
the data searching module is used for searching data by utilizing a plurality of data searching engines to obtain a data source file corresponding to each data searching engine;
the directory index determining module is used for responding to the directory index determining instruction, calling a directory index determining unit preset for each data search engine to analyze the data source files and determining directory indexes corresponding to each data source file, wherein the analysis rules of the directory index determining units preset for each data search engine are the same;
and the data writing module is used for calling each data search engine to write the data source file into the data storage unit corresponding to the directory index according to the directory index corresponding to the determined data source file.
13. An electronic device, comprising:
a memory for storing program instructions;
a processor for invoking program instructions stored in the memory and for performing the steps comprised in the method according to any of claims 1-11 in accordance with the obtained program instructions.
14. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-11.
15. A computer program product, the computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the preceding claims 1-11.
CN202310294581.3A 2023-03-24 2023-03-24 Method, device, equipment and medium for avoiding data repetition Pending CN116303259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310294581.3A CN116303259A (en) 2023-03-24 2023-03-24 Method, device, equipment and medium for avoiding data repetition


Publications (1)

Publication Number Publication Date
CN116303259A true CN116303259A (en) 2023-06-23

Family

ID=86812983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310294581.3A Pending CN116303259A (en) 2023-03-24 2023-03-24 Method, device, equipment and medium for avoiding data repetition

Country Status (1)

Country Link
CN (1) CN116303259A (en)

Similar Documents

Publication Publication Date Title
US11455217B2 (en) Transaction consistency query support for replicated data from recovery log to external data stores
CN108694195B (en) Management method and system of distributed data warehouse
US20140101167A1 (en) Creation of Inverted Index System, and Data Processing Method and Apparatus
CN109298835B (en) Data archiving processing method, device, equipment and storage medium of block chain
CN107665219B (en) Log management method and device
CN111680017A (en) Data synchronization method and device
GB2529436A (en) Data processing apparatus and method
CN114329096A (en) Method and system for processing native map database
CN107609011B (en) Database record maintenance method and device
CN114528127A (en) Data processing method and device, storage medium and electronic equipment
US10089350B2 (en) Proactive query migration to prevent failures
CN112965939A (en) File merging method, device and equipment
CN116132448B (en) Data distribution method based on artificial intelligence and related equipment
CN110851437A (en) Storage method, device and equipment
CN116303259A (en) Method, device, equipment and medium for avoiding data repetition
CN113220530B (en) Data quality monitoring method and platform
CN115858471A (en) Service data change recording method, device, computer equipment and medium
CN112988696B (en) File sorting method and device and related equipment
CN116628042A (en) Data processing method, device, equipment and medium
CN114490865A (en) Database synchronization method, device, equipment and computer storage medium
CN114385188A (en) Code workload statistical method and device and electronic equipment
CN114297196A (en) Metadata storage method and device, electronic equipment and storage medium
CN111400370A (en) Data monitoring method and device in data circulation, storage medium and server
CN113419896A (en) Data recovery method and device, electronic equipment and computer readable medium
CN115878563B (en) Method for realizing directory-level snapshot of distributed file system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination