CN117632860A - Method and device for merging small files based on Flink engine and electronic equipment

Info

Publication number
CN117632860A
Authority
CN
China
Prior art keywords
file
information
hive
merging
metadata information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410104745.6A
Other languages
Chinese (zh)
Inventor
徐铭贝
郑杨勇
付大伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunli Intelligent Technology Co ltd
Original Assignee
Yunli Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunli Intelligent Technology Co ltd
Priority to CN202410104745.6A
Publication of CN117632860A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/144 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for merging small files based on a Flink engine, and electronic equipment, and relates to the technical field of data processing. The method comprises the following steps: in the process of writing streaming data into Hive through a Flink engine, monitoring metadata information written into Hive in real time; and, when the metadata information reaches at least one of a configured time threshold, a configured partition file number threshold and a configured file size threshold, merging the small files corresponding to the metadata information to obtain a target file. The method ensures that streaming data can be written into Hive in real time without requiring any other external service to merge small files: only one Flink streaming task is started, and the small files are merged automatically once the monitored metadata information written into Hive reaches the merging standard, which avoids an excess of small files and relieves the pressure on Hadoop.

Description

Method and device for merging small files based on Flink engine and electronic equipment
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for merging small files based on a Flink engine and electronic equipment.
Background
Apache Flink is a unified analysis engine for stream processing and batch processing; Hive is data warehouse software built on the distributed system infrastructure Hadoop and is used for storing and processing large-scale data. The technique of writing into Hive in real time with the Flink engine not only provides an efficient real-time data writing scheme but also makes it more convenient for users to process and analyze large amounts of real-time data.
In general, the technique of writing into Hive in real time with the Flink engine can only ensure that file merging is performed within one checkpoint period. If the checkpoint period is too short, a large number of small files are still generated, and these small files increase the pressure on Hadoop; if the checkpoint period is long, the real-time performance of the data in Hive is reduced.
Disclosure of Invention
The invention provides a small file merging method and device based on a Flink engine, and electronic equipment, to overcome the defects of the existing technique of writing into Hive in real time with the Flink engine: a large number of small files increases the pressure on Hadoop, while a long checkpoint period reduces the real-time performance of the data in Hive.
The invention provides a small file merging method based on a Flink engine, which comprises the following steps:
in the process of writing streaming data into Hive through a Flink engine, monitoring metadata information written into the Hive in real time;
and merging the small files corresponding to the metadata information to obtain the target file under the condition that the metadata information reaches at least one of a configured time threshold, a configured partition file number threshold and a configured file size threshold.
According to the method for merging small files based on the Flink engine provided by the invention, the real-time monitoring of metadata information written into the Hive comprises the following steps: the partition information and the file information written in the Hive are monitored in real time; and determining the metadata information according to the partition information and the file information, wherein the metadata information is used for representing the corresponding relation between the partition information and the file information.
According to the method for merging small files based on the Flink engine provided by the invention, the method for merging the small files corresponding to the metadata information comprises the following steps: and calling a query language in the Hive through the Flink engine, and carrying out offline merging on the small files corresponding to the metadata information.
According to the method for merging small files based on the Flink engine provided by the invention, the real-time monitoring of the partition information and the file information written into the Hive comprises the following steps: monitoring the file information written into the Hive in real time, and recording the writing duration; and, when the writing duration reaches a preset duration threshold, executing a checkpoint operation and automatically creating the partition information.
According to the method for merging small files based on the Flink engine, which is provided by the invention, the metadata information is determined according to the partition information and the file information, and the method comprises the following steps: responding to data corresponding operation input by a user, determining target partition information from the partition information, and determining a corresponding relation between the target partition information and the file information; and determining the metadata information according to the target partition information, the file information and the corresponding relation.
According to the method for merging the small files based on the Flink engine, which is provided by the invention, the method further comprises the following steps: determining a processing result corresponding to the streaming data under the condition that the metadata information does not reach a configured time threshold, a configured partition file quantity threshold and a configured file size threshold, or after the target file is obtained; and under the condition that the processing result indicates that the streaming data processing is successful, the streaming data of the next round is read through the Flink engine.
According to the method for merging small files based on the Flink engine, which is provided by the invention, in the process of writing streaming data into Hive through the Flink engine, the method further comprises the following steps: modifying the current name corresponding to the file information to obtain a first file name, wherein the first file name is used for representing that the corresponding small file is an invalid file; and under the condition that the writing time length of the file information reaches a preset time length threshold value, modifying the first file name to obtain a second file name, wherein the second file name is used for representing that the corresponding small file is a valid file.
The invention also provides a small file merging device based on the Flink engine, which comprises:
the real-time monitoring module is used for monitoring metadata information written in the Hive in real time in the process of writing the streaming data into the Hive through the Flink engine;
and the file merging module is used for merging the small files corresponding to the metadata information to obtain the target file under the condition that the metadata information reaches at least one of a configured time threshold, a configured partition file quantity threshold and a configured file size threshold.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the file merging method based on the Flink engine when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of file merging based on a Flink engine as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the method of file merging based on a Flink engine as described in any of the above.
According to the method, the device and the electronic equipment for merging small files based on the Flink engine, metadata information written into Hive is monitored in real time in the process of writing streaming data into Hive through the Flink engine; and, when the metadata information reaches at least one of the configured time threshold, the configured partition file number threshold and the configured file size threshold, the small files corresponding to the metadata information are merged to obtain the target file. The method ensures that streaming data can be written into Hive in real time without requiring any other external service to merge small files: only one Flink streaming task is started, and the small files are merged automatically once the monitored metadata information written into Hive reaches the merging standard, which avoids an excess of small files and relieves the pressure on Hadoop.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a method for merging small files based on a Flink engine provided by the invention;
FIG. 2 is a schematic diagram of a file merging device based on a Flink engine;
FIG. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For a better understanding of the embodiments of the present invention, first, the background art will be described in detail:
The increase of small files in Hive causes the following problems:
1. From the Hive perspective, small files may start a large number of mapping (Map) tasks. Each Map task starts a Java virtual machine (Java Virtual Machine, JVM) to execute, so initializing, starting and executing these Map tasks consumes a lot of resources, which seriously affects Hive's performance.
2. From the metadata perspective, in the Hadoop Distributed File System (HDFS), the metadata object of each small file occupies about 150 bytes, so a large number of small files occupies a large amount of memory. This puts pressure on merging the namespace image file (fsImage) in the memory of the NameNode, which manages the file system namespace; as metadata grows, available memory shrinks, which severely restricts cluster expansion and severely affects HDFS performance.
3. From the task execution perspective, when a MapReduce (MR) task reads a directory containing a large number of small files, MR generates more Map tasks, which causes frequent garbage collection (Garbage Collection, GC) and wastes cluster resources.
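As a rough illustration of the metadata point above, the following sketch estimates how much NameNode memory the metadata objects of small files would occupy, using only the figure of about 150 bytes per object quoted above. The file counts and the 100-to-1 merge ratio are hypothetical values chosen for the example, not measurements.

```python
# Rough estimate only: ~150 bytes per metadata object is the figure quoted above.
BYTES_PER_OBJECT = 150


def namenode_metadata_bytes(num_files: int, bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Approximate NameNode memory consumed by the metadata objects of num_files files."""
    return num_files * bytes_per_object


# Hypothetical scenario: 10 million small files, merged 100-to-1 into larger files.
small_file_count = 10_000_000
merged_file_count = small_file_count // 100

before = namenode_metadata_bytes(small_file_count)
after = namenode_metadata_bytes(merged_file_count)

print(f"before merging: {before} bytes, after merging: {after} bytes")
```

Under these assumptions the estimate falls from 1,500,000,000 bytes to 15,000,000 bytes, which is why reducing the number of files, not the volume of data, is what relieves the NameNode.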
In general, the merging scheme of the small files is as follows:
1. Flink streaming merge
Writing into Hive in real time with the Flink engine is an important function added in version 1.11 by the Apache Flink community. To meet the requirements of a real-time data warehouse, the file system streaming sink (FileSystem Streaming Sink) was substantially modified and optimized, and partition commit and rolling policy mechanisms were added.
The technique of writing into Hive in real time with the Flink engine can only ensure that file merging is performed within one checkpoint period. If the checkpoint period is too short, a large number of small files are still generated, and these small files increase the pressure on Hadoop; if the checkpoint period is long, the real-time performance of the data in Hive is reduced.
2. Custom MR task merging
MR is a programming model and distributed computing framework that is primarily used to process large-scale data sets of greater than 1 TeraByte (TB), split the data into multiple small blocks, and process the small blocks in parallel on multiple compute nodes.
The core idea of MR is to divide and solve a complex large problem into many small problems for processing. MR consists essentially of two phases: map phase and Reduce phase. In the Map phase, the input data is split into a plurality of small blocks, which are then processed in parallel to generate a set of intermediate Key-Value pairs (Key-values). Next, in the Reduce phase, the Key-value pairs are aggregated according to keys (Keys) to produce the final result. The user can realize distributed computation by only realizing two functions of map () and reduce ().
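The two phases described above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model (a word count), not Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(block: str) -> List[Tuple[str, int]]:
    """Map: turn one input block into intermediate (key, value) pairs."""
    return [(word, 1) for word in block.split()]


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the intermediate values by key before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def run_job(blocks: List[str]) -> Dict[str, int]:
    # In a real cluster each block would be mapped on a different node in parallel.
    intermediate = [pair for block in blocks for pair in map_phase(block)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


result = run_job(["a b a", "b c"])
print(result)
```

The user-supplied logic lives entirely in `map_phase` and `reduce_phase`; splitting, grouping and distribution are what the framework provides.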
MR tasks can use the combine-file input format (CombineFileInputFormat) class: CombineFileInputFormat is a built-in Hadoop class with which multiple small files can be combined into one large file. In addition, a custom input format (InputFormat) class can be written to merge small files. Specifically, the get-splits (getSplits) method of the InputFormat class returns an array containing the paths of all the small files, and these small files are then read and processed in the Mapper class. However, invoking MR tasks in a streaming computing scenario is inefficient and complex to execute.
Merging some of the files separately by developing an offline task requires various computation logic to be developed by hand. Not only the Flink streaming task but also the offline task must then be maintained separately. As more and more data is written into Hive, multiple tasks have to be started in one-to-one correspondence to keep merging efficient, so the maintenance cost is high; and if the parameters are set unreasonably, MR task performance may degrade and the merge may take too long.
3. Hive query language task merge
In Hive, a query language HiveQL (short: hive SQL) similar to the structured query language (Structured Query Language, SQL) is used for querying, analyzing, and reporting data. Hive SQL is a powerful and flexible tool that makes data manipulation and analysis over large data sets simple and efficient.
Hive may overwrite files by executing an INSERT OVERWRITE statement, or merge them with the concatenate (CONCATENATE) operation. However, INSERT OVERWRITE cannot be executed in a stream computing scenario, because executing it would lose streaming data. For files in the Optimized Row Columnar (ORC) file format and the Record Columnar File (RCFile) file format, the CONCATENATE operation may be used to merge. For non-partitioned tables, it can be executed while Flink is streaming writes into Hive. For partitioned tables, one CONCATENATE operation can only merge the small files of one partition at a time, and it cannot be determined accurately which partitions Flink has recently written new files into, or their file sizes and file counts; the latest partition files therefore have to be queried separately to find out how many small files exist, and only then is the CONCATENATE merge executed for the corresponding partition. With this approach, not only the Flink streaming task but also a task that checks the file changes of each partition and merges small files must be maintained, so the maintenance cost is high.
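To make the one-partition-at-a-time limitation concrete, the sketch below builds one Hive merge statement per partition, assuming the merge operation in question is Hive's `ALTER TABLE ... PARTITION (...) CONCATENATE` statement. The table name, the partition values and the helper names are hypothetical, and the code only constructs the statement text; it does not connect to Hive or escape values.

```python
from typing import Dict, List


def build_concatenate_sql(table: str, partition_spec: Dict[str, str]) -> str:
    """Build the merge statement for exactly one partition of a partitioned table."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition_spec.items())
    return f"ALTER TABLE {table} PARTITION ({spec}) CONCATENATE"


def build_statements(table: str, partitions: List[Dict[str, str]]) -> List[str]:
    # One statement per partition: several partitions cannot be merged in one call.
    return [build_concatenate_sql(table, spec) for spec in partitions]


# Hypothetical table and partitions, for illustration only.
statements = build_statements(
    "logs",
    [{"dt": "2024-01-01", "hr": "00"}, {"dt": "2024-01-01", "hr": "01"}],
)
for statement in statements:
    print(statement)
```

The list of partitions passed in is exactly the information that is hard to obtain in this scheme: something has to discover which partitions recently received new small files before any statement can be issued.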
To solve the problem that a large number of small files increases the pressure on Hadoop while a long checkpoint period reduces the real-time performance of the data in Hive, the embodiment of the invention provides a small file merging method and device based on a Flink engine, and electronic equipment. They ensure that streaming data can be written into Hive in real time without any other external service to merge small files: only one Flink streaming task is started, and the small files are merged automatically once the monitored metadata information written into Hive reaches the merging standard, which avoids an excess of small files and reduces the pressure on Hadoop.
It should be noted that the execution body of the embodiment of the present invention may be a small file merging device based on a Flink engine, or may be an electronic device, where the electronic device may include a computer, a mobile terminal, a wearable device, etc.
The following further describes embodiments of the present invention by taking an electronic device as an example.
As shown in FIG. 1, a flow chart of a method for merging small files based on a Flink engine provided by the invention can comprise:
101. during the process of writing streaming data into Hive through the Flink engine, metadata information written into Hive is monitored in real time.
Streaming data refers to a real-time data stream, which has a large, continuous, fast and unreproducible nature. Unlike batch data, streaming data does not have a fixed time interval, but is processed in real time in units of events.
The electronic equipment can read the streaming data through the Flink engine, then write the streaming data into the Hive through the Flink engine, and in the writing process, the electronic equipment monitors the metadata information written into the Hive in real time, so that when the metadata information written into the Hive in real time reaches the merging standard, the small files are automatically merged.
In some embodiments, the electronic device monitors the metadata information written in Hive in real time, and may include: the electronic equipment monitors the partition information and the file information written in the Hive in real time; the electronic device determines metadata information according to the partition information and the file information, wherein the metadata information is used for representing the corresponding relation between the partition information and the file information.
The Hive has a plurality of partitions, each partition corresponds to partition information, and each partition is used for storing a small file and file information corresponding to the small file.
Alternatively, the partition information may include: partition name and/or partition location, etc.; the file information may include at least one of: the number of files, the size of the files, the scheduling time of the files, etc.
It is understood that the metadata information may include: partition information, file information, and correspondence between partition information and file information.
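One possible in-memory shape for the metadata information just described is sketched below: partition information, file information, and a mapping that records which files belong to which partition. All class and field names are hypothetical; the description does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class PartitionInfo:
    """Partition name and location (frozen so it can be used as a dict key)."""
    name: str
    location: str


@dataclass
class FileInfo:
    """Information about one small file written into a partition."""
    path: str
    size_bytes: int
    written_at: float  # hypothetical: time the file was written, in epoch seconds


@dataclass
class MetadataInfo:
    """Correspondence between partition information and file information."""
    files_by_partition: Dict[PartitionInfo, List[FileInfo]] = field(default_factory=dict)

    def record(self, partition: PartitionInfo, file_info: FileInfo) -> None:
        self.files_by_partition.setdefault(partition, []).append(file_info)

    def file_count(self, partition: PartitionInfo) -> int:
        return len(self.files_by_partition.get(partition, []))

    def total_size(self, partition: PartitionInfo) -> int:
        return sum(f.size_bytes for f in self.files_by_partition.get(partition, []))


metadata = MetadataInfo()
p = PartitionInfo(name="dt=2024-01-01", location="/warehouse/logs/dt=2024-01-01")
metadata.record(p, FileInfo(path=p.location + "/part-0-0", size_bytes=1024, written_at=0.0))
metadata.record(p, FileInfo(path=p.location + "/part-0-1", size_bytes=2048, written_at=60.0))
print(metadata.file_count(p), metadata.total_size(p))
```

Keeping the per-partition file list in one place is what later allows the file count and size of a partition to be compared against thresholds without querying the file system again.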
In the process of monitoring the metadata information written in the Hive in real time, the electronic equipment can determine the metadata information according to the partition information and the file information which are monitored in real time and written in the Hive.
In some embodiments, the electronic device monitoring the partition information and the file information written into Hive in real time may include: the electronic device monitors the file information written into Hive in real time and records the writing duration; when the writing duration reaches a preset duration threshold, the electronic device executes a checkpoint operation and automatically creates the partition information.
Here, the checkpoint operation refers to the mechanism, as used in database management systems, that writes in-memory data to disk to ensure the durability and consistency of the data.
Optionally, the preset duration threshold may be set before the electronic device leaves the factory, or may be user-defined, which is not specifically limited herein.
In the process of monitoring the partition information and the file information written into Hive in real time, the electronic device can monitor the file information written into Hive in real time, record the writing duration, and compare the writing duration with the preset duration threshold. When the writing duration reaches the preset duration threshold, the checkpoint time has arrived; the checkpoint operation can then be performed and the partition information created automatically.
When the writing duration has not reached the preset duration threshold, the checkpoint time has not yet arrived, and the electronic device continues to read the streaming data through the Flink engine.
In some embodiments, the determining metadata information by the electronic device according to the partition information and the file information may include: the electronic equipment responds to the data corresponding operation input by the user, determines target partition information from partition information, and determines the corresponding relation between the target partition information and file information; the electronic equipment determines metadata information according to the target partition information, the file information and the corresponding relation.
Wherein the number of target partition information is at least one.
When the user needs to merge the small files corresponding to the target partition information, the user can input the data corresponding operation into the electronic device. In response to this operation, the electronic device determines the target partition information from the partition information that it has monitored being written into Hive in real time, and determines the corresponding relation between the target partition information and the file information. The electronic device can then determine the metadata information according to the target partition information, the file information and the corresponding relation between them, so that the small files corresponding to the metadata information can be merged later.
102. And merging the small files corresponding to the metadata information to obtain the target file under the condition that the metadata information reaches at least one of the configured time threshold, the configured partition file number threshold and the configured file size threshold.
Optionally, the time threshold, the partition file number threshold, and the file size threshold may be set before the electronic device leaves the factory, or may be user-defined, which is not specifically limited herein.
In the process of determining the target file, when the metadata information reaches at least one of the configured time threshold, the configured partition file number threshold and the configured file size threshold, the file information corresponding to the metadata information has reached the merging standard; the small files can then be merged, and the merged file is determined to be the target file. Throughout this process, streaming data can still be written into Hive in real time and no other external service is needed to merge small files: only one Flink streaming task is started, and the small files are merged automatically once the monitored metadata information written into Hive reaches the merging standard, which avoids an excess of small files and reduces the pressure on Hadoop.
In addition, the merging standard is rich, so that the electronic equipment can accurately execute the small file merging process.
Illustratively, after all file changes and partition changes are committed during a checkpoint by the Flink engine, the electronic device may check, based on the configuration information, whether metadata information determined during the Flink streaming task includes at least one of the following file information: the number of files, the size of the files, the scheduling time of the files and the like, and if the metadata information comprises at least one item of file information, whether the file information corresponding to the metadata information reaches the merging standard can be further judged according to a threshold value corresponding to the file information.
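The merging standard described above (at least one of three configured thresholds is reached) can be sketched as follows. The field names are hypothetical, and two interpretations are assumptions made for the example: the time threshold is compared with the age of the oldest unmerged file, and the size threshold with the total size of the unmerged files in a partition. A threshold left as `None` is treated as not configured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergeThresholds:
    """Configured thresholds; None means the threshold is not configured."""
    max_age_seconds: Optional[float] = None
    max_file_count: Optional[int] = None
    max_total_bytes: Optional[int] = None


@dataclass
class PartitionStats:
    """File information gathered for one partition (hypothetical fields)."""
    oldest_file_age_seconds: float
    file_count: int
    total_bytes: int


def should_merge(stats: PartitionStats, thresholds: MergeThresholds) -> bool:
    """Return True when at least one configured threshold has been reached."""
    checks = [
        thresholds.max_age_seconds is not None
        and stats.oldest_file_age_seconds >= thresholds.max_age_seconds,
        thresholds.max_file_count is not None
        and stats.file_count >= thresholds.max_file_count,
        thresholds.max_total_bytes is not None
        and stats.total_bytes >= thresholds.max_total_bytes,
    ]
    return any(checks)


thresholds = MergeThresholds(max_age_seconds=600, max_file_count=20)
print(should_merge(PartitionStats(30.0, 25, 4096), thresholds))
print(should_merge(PartitionStats(30.0, 5, 4096), thresholds))
```

Because the conditions are combined with "any", configuring more thresholds can only make merging trigger earlier, never later.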
In some embodiments, the merging, by the electronic device, the small file corresponding to the metadata information may include: and the electronic equipment invokes a query language (Hive SQL) in Hive through the Flink engine to offline merge the small files corresponding to the metadata information.
To solve the problem of high maintenance cost, when merging the small files corresponding to the metadata information, the electronic device may call the Hive query language (Hive SQL) through the Flink engine to merge those small files offline. Throughout this process, only one Flink streaming task is started; the checkpoint time does not need to be prolonged and no additional offline task needs to be maintained, which effectively reduces the maintenance cost.
In some embodiments, the method may further comprise: under the condition that the metadata information does not reach the configured time threshold, the configured partition file quantity threshold and the configured file size threshold, or after the target file is obtained, the electronic equipment determines a processing result corresponding to the streaming data; and under the condition that the processing result indicates that the streaming data processing is successful, the electronic equipment reads the streaming data of the next round through the Flink engine.
Under the condition that the metadata information does not reach the configured time threshold, the configured partition file quantity threshold and the configured file size threshold, the file information corresponding to the metadata information does not reach the merging standard; under the condition that the file information corresponding to the metadata information does not reach the merging standard, or after the target file is obtained, the electronic equipment determines a processing result corresponding to the streaming data: and under the condition that the processing result indicates that the streaming data processing is successful, the electronic equipment reads the streaming data of the next round through the Flink engine.
And the electronic equipment ends the small file merging process under the condition that the processing result indicates that the streaming data processing is unsuccessful.
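The round-by-round control flow described above can be summarised in a short sketch: write a round of streaming data, merge if the merging standard is reached, then continue with the next round only if the round was processed successfully. The function `run_rounds` and its callback parameters are hypothetical stand-ins for the Flink write, the threshold check and the Hive SQL merge.

```python
from typing import Callable, Iterable, List


def run_rounds(
    rounds: Iterable[List[str]],
    write: Callable[[List[str]], bool],
    reached_standard: Callable[[], bool],
    merge: Callable[[], None],
) -> int:
    """Process rounds of streaming data and return how many rounds succeeded."""
    completed = 0
    for batch in rounds:
        succeeded = write(batch)      # write this round of streaming data
        if reached_standard():        # at least one configured threshold reached?
            merge()                   # merge the small files into the target file
        if not succeeded:             # processing failed: end the merging process
            break
        completed += 1                # processing succeeded: read the next round
    return completed


# Minimal stand-ins so the sketch can be run on its own.
pending: List[str] = []
merged: List[List[str]] = []


def demo_write(batch: List[str]) -> bool:
    pending.extend(batch)
    return "bad" not in batch


def demo_reached() -> bool:
    return len(pending) >= 3


def demo_merge() -> None:
    merged.append(list(pending))
    pending.clear()


completed_rounds = run_rounds([["a", "b"], ["c"], ["bad"], ["d"]], demo_write, demo_reached, demo_merge)
print(completed_rounds, merged)
```

In this run the third round fails, so the fourth round is never read, which mirrors the rule that the process ends when processing is unsuccessful.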
In some embodiments, in the process of writing streaming data into Hive through the Flink engine, the method may further comprise: the electronic device modifies the current name corresponding to the file information to obtain a first file name, wherein the first file name indicates that the corresponding small file is an invalid file; and when the writing duration of the file information reaches a preset duration threshold, the electronic device modifies the first file name to obtain a second file name, wherein the second file name indicates that the corresponding small file is a valid file.
In the process of writing streaming data into Hive through the Flink engine, the electronic device modifies the current name corresponding to the file information to obtain the first file name. Specifically, the electronic device may write the streaming data into a temporary file on HDFS through the Flink engine, where the file name of the temporary file starts with ".part", and Hive treats such a temporary file as an invalid file when executing. That is, the modified name, i.e., the first file name, starts with ".part", and the small file corresponding to the first file name is an invalid file. Then, when the writing duration reaches the preset duration threshold, the electronic device modifies the first file name to obtain the second file name. Specifically, the electronic device changes the ".part" prefix of the first file name to "part" and takes the modified name as the second file name. That is, the second file name starts with "part", and the small file corresponding to the second file name is a valid file.
The above procedure improves the flexibility of data processing: before the checkpoint time arrives, the streaming data is not processed; once the checkpoint time arrives, the checkpoint operation is performed and the streaming data is processed.
Illustratively, during execution of the checkpoint operation: for a non-partitioned table in Hive, the ".part" prefix of the first file name is changed to "part"; for a partitioned table in Hive, the ".part" prefix of the first file name is likewise changed to "part", and the partition creation operation is committed.
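The renaming step performed at checkpoint time can be sketched as below. The sketch uses a local directory as a stand-in for an HDFS directory, which is an assumption made only so that the example is self-contained; the function name commit_files is hypothetical, and the partition creation commit for partitioned tables is not modeled.

```python
import os
from typing import List

IN_PROGRESS_PREFIX = ".part"  # first file name: treated as an invalid file
FINISHED_PREFIX = "part"      # second file name: treated as a valid file


def commit_files(directory: str) -> List[str]:
    """Rename every '.part*' file in the directory to 'part*'.

    Returns the new names of the files that were renamed, in sorted order.
    """
    committed = []
    for name in sorted(os.listdir(directory)):
        if name.startswith(IN_PROGRESS_PREFIX):
            # Replace only the leading '.part' prefix and keep the suffix.
            final_name = FINISHED_PREFIX + name[len(IN_PROGRESS_PREFIX):]
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, final_name))
            committed.append(final_name)
    return committed
```

Files that already start with "part" are left untouched, so calling the function again after a checkpoint renames nothing.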
In the embodiment of the invention, in the process of writing streaming data into Hive through the Flink engine, the metadata information written into Hive is monitored in real time; and when the metadata information reaches at least one of the configured time threshold, the configured partition file number threshold, and the configured file size threshold, the small files corresponding to the metadata information are merged to obtain the target file. With this method, while streaming data is written into Hive in real time, no other external service is needed to merge small files: only one Flink streaming task is started, and the small files are merged automatically once the metadata information written into Hive in real time is found to meet the merging criteria. This avoids an excessive number of small files and relieves the pressure on Hadoop.
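The merging criterion, namely that at least one of the three configured thresholds is reached, can be written as a single predicate. The sketch below is illustrative only: the names MergeThresholds and should_merge are hypothetical, and the file size threshold is interpreted as a limit on the total size of the files in a partition, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class MergeThresholds:
    interval_ms: int      # configured time threshold, in milliseconds
    max_files: int        # configured partition file number threshold
    max_total_bytes: int  # configured file size threshold (assumed: total bytes)


def should_merge(elapsed_ms: int, file_count: int, total_bytes: int,
                 thresholds: MergeThresholds) -> bool:
    """Return True when at least one configured threshold is reached."""
    return (elapsed_ms >= thresholds.interval_ms
            or file_count > thresholds.max_files
            or total_bytes >= thresholds.max_total_bytes)
```

With an interval of 20000 ms and a file number threshold of 30, a partition holding 31 files qualifies for merging even if only one second has elapsed since the last merge.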
Embodiments of the invention are further described with reference to the following example.
By way of example, part of the configuration involved in the embodiment of the present invention is as follows:
-- 'sink.hive.con-cate.state': whether to enable the small file merging function
-- 'sink.hive.con-cate.time': interval between small file merges, in ms
-- 'sink.hive.con-cate.file.num': merge when the corresponding HDFS directory holds more than N small files
insert into hive.ns_zhangjiakou_test_xmb.cdc_fenqu
/*+ OPTIONS(
'sink.hive.con-cate.state' = 'true',
'sink.hive.con-cate.time' = '20000',
'sink.hive.con-cate.file.num' = '30') */
select * from stream_source;
The small file merging device based on the Flink engine provided by the invention is described below. The device described below and the small file merging method based on the Flink engine described above may be referred to in correspondence with each other.
As shown in FIG. 2, the small file merging device based on the Flink engine provided by the invention may comprise:
the real-time monitoring module 201, configured to monitor, in real time, metadata information written into Hive in the process of writing streaming data into Hive through a Flink engine;
the file merging module 202, configured to merge the small files corresponding to the metadata information to obtain the target file when the metadata information reaches at least one of a configured time threshold, a configured partition file number threshold, and a configured file size threshold.
Optionally, the real-time monitoring module 201 is specifically configured to monitor, in real time, partition information and file information written into Hive; and to determine the metadata information according to the partition information and the file information, wherein the metadata information represents the correspondence between the partition information and the file information.
Optionally, the file merging module 202 is specifically configured to invoke the query language of Hive through the Flink engine and merge the small files corresponding to the metadata information offline.
Optionally, the real-time monitoring module 201 is specifically configured to monitor, in real time, file information written into Hive and record the writing duration; and, when the writing duration reaches a preset duration threshold, to execute a checkpoint operation and automatically create the partition information.
Optionally, the real-time monitoring module 201 is specifically configured to determine, in response to a data correspondence operation input by a user, target partition information from the partition information, and determine the correspondence between the target partition information and the file information; and to determine the metadata information according to the target partition information, the file information, and the correspondence.
Optionally, the file merging module 202 is further configured to determine a processing result corresponding to the streaming data when the metadata information reaches none of the configured time threshold, the configured partition file number threshold, and the configured file size threshold, or after the target file is obtained; and, if the processing result indicates that the streaming data was processed successfully, to read the next round of streaming data through the Flink engine.
Optionally, in the process of writing the streaming data into Hive through the Flink engine, the file merging module 202 is further configured to modify the current name corresponding to the file information to obtain a first file name, where the first file name indicates that the corresponding small file is an invalid file; and, when the writing duration of the file information reaches a preset duration threshold, to modify the first file name to obtain a second file name, where the second file name indicates that the corresponding small file is a valid file.
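The metadata information maintained by the real-time monitoring module, i.e., the correspondence between partition information and file information, can be pictured as a mapping from each partition to the files written into it. The following sketch assumes that each file is identified by a path of the form partition directory plus file name; the function name build_metadata and the sample paths in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_metadata(file_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group written files by the partition directory that contains them.

    A path without a directory part is recorded under the empty string,
    which stands for a non-partitioned table in this sketch.
    """
    metadata: Dict[str, List[str]] = defaultdict(list)
    for path in file_paths:
        # Split at the last '/' into partition directory and file name.
        partition, _, file_name = path.rpartition("/")
        metadata[partition].append(file_name)
    return dict(metadata)
```

Given the paths "dt=2024-01-25/part-0-0" and "dt=2024-01-25/part-0-1", the mapping holds one partition with two files, and the length of each file list is what a partition file number threshold would be compared against.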
FIG. 3 is a schematic structural diagram of an electronic device provided by the present invention. As shown in FIG. 3, the electronic device may include: a processor 310, a communication interface (Communications Interface) 320, a memory 330, and a communication bus 340, wherein the processor 310, the communication interface 320, and the memory 330 communicate with each other through the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a small file merging method based on a Flink engine, the method comprising: in the process of writing streaming data into Hive through a Flink engine, monitoring metadata information written into Hive in real time; and merging the small files corresponding to the metadata information to obtain the target file when the metadata information reaches at least one of the configured time threshold, the configured partition file number threshold, and the configured file size threshold.
Further, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, where the computer program can be stored on a non-transitory computer readable storage medium and, when executed by a processor, performs the small file merging method based on a Flink engine provided by the above methods, the method comprising: in the process of writing streaming data into Hive through a Flink engine, monitoring metadata information written into Hive in real time; and merging the small files corresponding to the metadata information to obtain the target file when the metadata information reaches at least one of the configured time threshold, the configured partition file number threshold, and the configured file size threshold.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the small file merging method based on a Flink engine provided by the above methods, the method comprising: in the process of writing streaming data into Hive through a Flink engine, monitoring metadata information written into Hive in real time; and merging the small files corresponding to the metadata information to obtain the target file when the metadata information reaches at least one of the configured time threshold, the configured partition file number threshold, and the configured file size threshold.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for merging small files based on a Flink engine, characterized by comprising:
in the process of writing streaming data into Hive through a Flink engine, monitoring metadata information written into the Hive in real time;
and merging the small files corresponding to the metadata information to obtain the target file under the condition that the metadata information reaches at least one of a configured time threshold, a configured partition file number threshold and a configured file size threshold.
2. The method of claim 1, wherein the real-time monitoring of metadata information written in Hive comprises:
the partition information and the file information written in the Hive are monitored in real time;
and determining the metadata information according to the partition information and the file information, wherein the metadata information is used for representing the corresponding relation between the partition information and the file information.
3. The method of claim 1, wherein merging the small files corresponding to the metadata information comprises:
and calling a query language in the Hive through the Flink engine, and carrying out offline merging on the small files corresponding to the metadata information.
4. The method of claim 2, wherein the real-time monitoring of partition information and file information written in Hive comprises:
the file information written in the Hive is monitored in real time, and the writing duration is recorded;
and when the writing duration reaches a preset duration threshold, executing a checkpoint operation and automatically creating the partition information.
5. The method of claim 2, wherein said determining said metadata information from said partition information and said file information comprises:
in response to a data correspondence operation input by a user, determining target partition information from the partition information, and determining a correspondence between the target partition information and the file information;
and determining the metadata information according to the target partition information, the file information and the corresponding relation.
6. The method according to any one of claims 1-5, further comprising:
determining a processing result corresponding to the streaming data when the metadata information reaches none of the configured time threshold, the configured partition file number threshold, and the configured file size threshold, or after the target file is obtained;
and under the condition that the processing result indicates that the streaming data processing is successful, the streaming data of the next round is read through the Flink engine.
7. The method of any one of claims 2, 4, or 5, wherein during the writing of streaming data to Hive by the Flink engine, the method further comprises:
modifying the current name corresponding to the file information to obtain a first file name, wherein the first file name is used for representing that the corresponding small file is an invalid file;
and under the condition that the writing time length of the file information reaches a preset time length threshold value, modifying the first file name to obtain a second file name, wherein the second file name is used for representing that the corresponding small file is a valid file.
8. A device for merging small files based on a Flink engine, characterized by comprising:
the real-time monitoring module is used for monitoring metadata information written in the Hive in real time in the process of writing the streaming data into the Hive through the Flink engine;
and the file merging module is used for merging the small files corresponding to the metadata information to obtain the target file under the condition that the metadata information reaches at least one of a configured time threshold, a configured partition file quantity threshold and a configured file size threshold.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for merging small files based on a Flink engine according to any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for merging small files based on a Flink engine according to any one of claims 1 to 7.
CN202410104745.6A 2024-01-25 2024-01-25 Method and device for merging small files based on Flink engine and electronic equipment Pending CN117632860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410104745.6A CN117632860A (en) 2024-01-25 2024-01-25 Method and device for merging small files based on Flink engine and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410104745.6A CN117632860A (en) 2024-01-25 2024-01-25 Method and device for merging small files based on Flink engine and electronic equipment

Publications (1)

Publication Number Publication Date
CN117632860A true CN117632860A (en) 2024-03-01

Family

ID=90018489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410104745.6A Pending CN117632860A (en) 2024-01-25 2024-01-25 Method and device for merging small files based on Flink engine and electronic equipment

Country Status (1)

Country Link
CN (1) CN117632860A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106843763A (en) * 2017-01-19 2017-06-13 北京神州绿盟信息安全科技股份有限公司 A kind of Piece file mergence method and device based on HDFS systems
US10838931B1 (en) * 2017-04-28 2020-11-17 EMC IP Holding Company LLC Use of stream-oriented log data structure for full-text search oriented inverted index metadata
CN112019605A (en) * 2020-08-13 2020-12-01 上海哔哩哔哩科技有限公司 Data distribution method and system of data stream
CN112965939A (en) * 2021-02-07 2021-06-15 中国工商银行股份有限公司 File merging method, device and equipment
CN113612832A (en) * 2021-07-29 2021-11-05 上海哔哩哔哩科技有限公司 Streaming data distribution method and system
CN114416655A (en) * 2021-12-13 2022-04-29 珠海格力电器股份有限公司 Hive file processing method and device, computer equipment and storage medium
CN115473858A (en) * 2022-09-05 2022-12-13 上海哔哩哔哩科技有限公司 Data transmission method and streaming data transmission system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination