CN113836224A - Method and device for processing synchronous files from OGG (Oracle GoldenGate) to HDFS (Hadoop Distributed File System) and computer equipment

Method and device for processing synchronous files from OGG (Oracle GoldenGate) to HDFS (Hadoop Distributed File System) and computer equipment

Info

Publication number
CN113836224A
Authority
CN
China
Prior art keywords
files
file
synchronous
ogg
hdfs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111044619.9A
Other languages
Chinese (zh)
Inventor
张志亮
赵永国
杨荣霞
曹熙
曾祥清
黎名航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Big Data Service Co ltd
Original Assignee
China Southern Power Grid Big Data Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Southern Power Grid Big Data Service Co ltd
Priority to CN202111044619.9A
Publication of CN113836224A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a device for processing synchronous files from OGG to HDFS, computer equipment, and a storage medium. The method comprises the following steps: configuring the OGG software to limit the rolling switching time and the file size of the synchronous files that it synchronizes to the HDFS; using the configured OGG software to take incremental data generated by a service system as synchronous files and synchronize them to the HDFS in a quasi-real-time manner; and traversing, based on the HDFS, the synchronous files obtained through the OGG software, screening out the small files among them, and merging the small files. The method can improve the stability of a Hadoop cluster.

Description

Method and device for processing synchronous files from OGG (Oracle GoldenGate) to HDFS (Hadoop Distributed File System) and computer equipment
Technical Field
The present application relates to the field of distributed storage technologies, and in particular, to a method and an apparatus for processing a synchronization file from an OGG to an HDFS, a computer device, and a storage medium.
Background
With the rapid development and adoption of Hadoop technology, more and more enterprises are building decision-analysis applications on the Hadoop platform. Many important business production systems, however, are deployed on relational databases, which means that data must be exchanged between two platforms with different architectures. A common approach is to use change data capture (CDC) techniques to synchronize the data in near real time.
Oracle GoldenGate (OGG for short) is mature CDC software. It can synchronize data from most types of relational databases to a Hadoop big data platform by incrementally capturing the database logs and pushing the changes to Hadoop HDFS distributed storage in a quasi-real-time manner.
In the prior art, Hadoop is good at storing large files, because large files require relatively little metadata. Each small file, however, needs its own piece of metadata, so when a Hadoop cluster holds a large number of small files, the memory pressure of managing the cluster metadata increases greatly, which reduces the stability of the Hadoop cluster.
Disclosure of Invention
Therefore, in order to solve the above technical problems, it is necessary to provide an OGG-to-HDFS synchronization file processing method, an OGG-to-HDFS synchronization file processing apparatus, a computer device, and a storage medium, which can improve stability of a Hadoop cluster.
An OGG-to-HDFS synchronized file processing method, the method comprising:
limiting, based on the OGG software, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software;
based on the OGG software which is configured, incremental data generated by a service system is used as a synchronous file, and the synchronous file is synchronized to the HDFS in a quasi-real-time manner;
and traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files, and merging the small files.
In one embodiment, the method further comprises the following steps: based on OGG software, limiting the rolling switching time of the synchronous files from the OGG software to the HDFS to be a first preset time length;
the file size of the synchronized file to the HDFS through the OGG software is limited to a first preset threshold.
In one embodiment, the method further comprises the following steps: obtaining incremental data generated by a service system where OGG software is located, and taking the incremental data as a synchronous file;
judging whether the rolling switching time of the synchronous data reaches the first preset time length or not based on the configured OGG software;
and if the rolling switching time of the synchronous data reaches the first preset time length, synchronizing the synchronous file to the HDFS.
In one embodiment, the method further comprises the following steps: judging whether the file size of the synchronous data reaches the first preset threshold value or not based on the configured OGG software;
and if the file size of the synchronous data reaches the first preset threshold value, synchronizing the synchronous file to the HDFS.
In one embodiment, the method further comprises the following steps: traversing the directory of the synchronous files based on the HDFS, and screening out small files in the synchronous files;
judging whether the small files in each subdirectory meet the preset small file merging condition or not in each subdirectory of the synchronous file;
if the small files in the same subdirectory meet the merging condition, merging the small files in the same subdirectory, and generating at least one merged file in the same subdirectory;
and deleting the small files after the merging processing.
In one embodiment, the method further comprises the following steps: traversing the directory of the synchronous files based on the HDFS, and judging whether the size of each file in the synchronous files exceeds a second preset threshold;
and taking the files in the synchronous files whose size does not exceed the second preset threshold as small files.
In one embodiment, the method further comprises the following steps: judging whether the ratio of the number of small files in each subdirectory to the total number of files in the current subdirectory reaches a preset first ratio threshold or not in each subdirectory of the synchronous files;
and if the ratio of the number of the small files in the current subdirectory to the total number of the files in the current subdirectory reaches a preset first ratio threshold, the small files in the current subdirectory accord with the small file merging condition.
An OGG-to-HDFS synchronized file processing apparatus, the apparatus comprising:
the configuration module is used for limiting the rolling switching time and the file size of the synchronous file from the OGG software to the HDFS based on the OGG software;
the synchronization module is used for synchronizing the synchronization file to the HDFS in a quasi-real-time manner by taking incremental data generated by a service system as a synchronization file based on the configured OGG software;
and the merging module is used for traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files and merging the small files.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
limiting, based on the OGG software, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software;
based on the OGG software which is configured, incremental data generated by a service system is used as a synchronous file, and the synchronous file is synchronized to the HDFS in a quasi-real-time manner;
and traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files, and merging the small files.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
limiting, based on the OGG software, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software;
based on the OGG software which is configured, incremental data generated by a service system is used as a synchronous file, and the synchronous file is synchronized to the HDFS in a quasi-real-time manner;
and traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files, and merging the small files.
According to the method, the device, the computer equipment and the storage medium for processing the synchronous files from the OGG to the HDFS, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software are limited based on the OGG software. Based on the configured OGG software, incremental data generated by a service system is taken as the synchronous files, which are synchronized to the HDFS in a quasi-real-time manner. The synchronous files obtained through the OGG software are then traversed based on the HDFS, and the small files among them are screened out and merged, so that the small files in the synchronous files are eliminated and the stability of the Hadoop cluster is improved.
Drawings
FIG. 1 is a diagram of an application environment of a method for processing an OGG-to-HDFS synchronous file in one embodiment;
FIG. 2 is a flow chart illustrating a method for processing an OGG-to-HDFS synchronous file according to an embodiment;
FIG. 3 is a schematic flowchart of the processing steps of the OGG-to-HDFS synchronous file in another embodiment;
FIG. 4 is a block diagram showing the structure of an OGG-to-HDFS synchronous file processing apparatus according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The OGG-to-HDFS synchronous file processing method can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. For example, the server 104 is configured to limit, based on the OGG software, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software; to take, based on the configured OGG software, incremental data generated by a service system as synchronous files and synchronize them to the HDFS in a quasi-real-time manner; and to traverse the synchronous files obtained through the OGG software based on the HDFS, screen out the small files among them, and merge the small files.
The terminal 102 may be, but is not limited to, any of various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in FIG. 2, a method for processing an OGG-to-HDFS synchronization file is provided. The method is described, by way of example, as applied to the terminal in FIG. 1, and includes the following steps:
Step 202: limiting, based on the OGG software, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software.
The OGG software is log-based structured data replication software. It captures data changes, such as insertions and deletions, by parsing the online logs or archived logs of a source database (the captured data volume is only about one fourth of the log). The Hadoop Distributed File System (HDFS) is a Java-based distributed file system that is fault tolerant, scalable, and easy to expand, and it can run on commodity hardware or be deployed on low-cost hardware. HDFS provides distributed storage for Hadoop applications, with interfaces that bring applications closer to the data.
Specifically, the OGG software is used to configure the rolling switching time and the file size of the synchronous files synchronized to the HDFS. Limiting the rolling switching time lowers the frequency at which synchronous files are generated and therefore the number of small files produced. Limiting the file size likewise reduces the number of small files among the synchronous files, which improves the stability of the Hadoop cluster.
Step 204: based on the configured OGG software, taking incremental data generated by the service system as a synchronous file, and synchronizing the synchronous file to the HDFS in a quasi-real-time manner.
Specifically, after the configuration of the OGG software is completed, incremental data generated by the service system on the device where the OGG software is deployed is obtained through the configured OGG software and taken as a synchronization file, and the synchronization file is synchronized to the HDFS according to the configured rolling switching time and file size limits. For example, FIG. 3 is a schematic flowchart of the processing steps for synchronizing files from the OGG to the HDFS in another embodiment. As shown in FIG. 3, the OGG side is configured to synchronize incremental data from the service system side to the HDFS in a quasi-real-time manner. To reduce the generation of small files, and taking into account the specific service conditions and the storage characteristics of the HDFS, the HDFS file rolling switching time is configured as 1800 seconds and the maximum size of a single file as 1 GB.
Step 206: traversing the synchronous files obtained through the OGG software based on the HDFS, screening out small files in the synchronous files, and merging the small files.
Specifically, after the incremental data generated by the service system has been synchronized to the HDFS, the synchronization files obtained through the OGG software are traversed based on the HDFS, directory by directory. Small files are screened out by traversing every level of the directory tree and are then merged, and the original small files are deleted once merging is complete. For example, the HDFS small-file merging program cyclically scans the OGG synchronization directory and traverses all subdirectories under the root directory. When it finds that the proportion of small files in the current directory exceeds 30%, a child process merges the files in that directory; after the merge completes normally, the corresponding small files are deleted and the new merged file is moved into the directory to replace them, which ensures the accuracy of the data service.
According to the method for processing the synchronous files from the OGG to the HDFS, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software are limited based on the OGG software. Based on the configured OGG software, incremental data generated by a service system is taken as the synchronous files, which are synchronized to the HDFS in a quasi-real-time manner. The synchronous files obtained through the OGG software are then traversed based on the HDFS, and the small files among them are screened out and merged, so that the small files in the synchronous files are eliminated and the stability of the Hadoop cluster is improved.
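The scan, merge, delete, and replace sequence of step 206 can be illustrated with a minimal sketch. It is not the patented implementation: a local directory tree stands in for the HDFS, and the names `merge_small_files`, `SMALL_BYTES`, and `merged-0001` are hypothetical. The sketch also assumes that a small file is one whose size is below a threshold, that the 30% figure from the example is the merging ratio, and that a directory with fewer than two small files is skipped.

```python
import os
import tempfile
from pathlib import Path

SMALL_BYTES = 1024   # hypothetical small-file threshold, in bytes
RATIO = 0.30         # example ratio taken from the text


def merge_small_files(root: Path, small_bytes: int = SMALL_BYTES,
                      ratio: float = RATIO) -> list:
    """Scan every directory under root and merge small files where the ratio is reached."""
    merged_dirs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        directory = Path(dirpath)
        files = sorted(directory / name for name in filenames)
        small = [f for f in files if f.stat().st_size < small_bytes]
        # Assumption: merging fewer than two files is pointless, so skip.
        if len(small) < 2 or len(small) / len(files) < ratio:
            continue
        # Write to a temporary name first, so a failed merge leaves the originals intact.
        tmp = directory / ".merged.tmp"
        with tmp.open("wb") as out:
            for f in small:
                out.write(f.read_bytes())
        # Only after the merge has completed are the small files deleted.
        for f in small:
            f.unlink()
        # Move the merged file into place under its final (hypothetical) name.
        tmp.replace(directory / "merged-0001")
        merged_dirs.append(directory)
    return merged_dirs


with tempfile.TemporaryDirectory() as tmpdir:
    table_dir = Path(tmpdir) / "table_a"
    table_dir.mkdir()
    for i in range(3):
        (table_dir / f"part-{i}").write_bytes(b"row\n" * 10)  # 40 bytes each
    (table_dir / "part-big").write_bytes(b"x" * 4096)         # above the threshold
    merge_small_files(Path(tmpdir))
    remaining = sorted(p.name for p in table_dir.iterdir())
    merged_size = (table_dir / "merged-0001").stat().st_size
    print(remaining, merged_size)
```

Writing the merged file under a temporary name and deleting the originals only after the write succeeds mirrors the ordering described above, in which the small files are removed only after the merge completes normally.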
In one embodiment, limiting, based on the OGG software, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software comprises:
based on OGG software, limiting the rolling switching time of the synchronous files from the OGG software to the HDFS to be a first preset time length;
the file size of the synchronized file to the HDFS through the OGG software is limited to a first preset threshold.
Specifically, based on the OGG software, the rolling switching time of the synchronous files synchronized to the HDFS is limited to a first preset time length, which lowers the frequency at which synchronous files are generated and can therefore reduce the number of small files produced. In addition, the file size of the synchronous files synchronized to the HDFS is limited to a first preset threshold, which reduces the number of small files among the synchronous files.
In this embodiment, based on the OGG software, the rolling switching time of the synchronization file from the OGG software to the HDFS is limited to a first preset time length, and the file size of the synchronization file from the OGG software to the HDFS is limited to a first preset threshold, so that the generation of small files in the synchronization file is limited and the stability of the Hadoop cluster is improved.
In one embodiment, synchronizing the synchronization file to the HDFS in near real time, based on the configured OGG software and with incremental data generated by a service system taken as the synchronization file, includes:
obtaining incremental data generated by a service system where OGG software is located, and taking the incremental data as a synchronous file;
judging whether the rolling switching time of the synchronous data reaches the first preset time length or not based on the configured OGG software;
and if the rolling switching time of the synchronous data reaches the first preset time length, synchronizing the synchronous file to the HDFS.
Specifically, incremental data generated by the service system where the OGG software is located is obtained and taken as the synchronization file to be synchronized to the HDFS. Based on the configured OGG software, it is then judged whether the rolling switching time of the synchronization data has reached the first preset time length; if it has, the synchronization file is synchronized to the HDFS. To avoid generating small files, the first preset time length should not be too short. For example, the first preset time length, that is, the rolling switching time of the HDFS file, may be configured as 1800 seconds; when the rolling switching time of the synchronization file reaches 1800 seconds, the synchronization file is synchronized to the HDFS.
In this embodiment, incremental data generated by the service system in which the OGG software is located is obtained and taken as a synchronization file, and it is judged, based on the configured OGG software, whether the rolling switching time of the synchronization data has reached the first preset time length; if it has, the synchronization file is synchronized to the HDFS. Setting the first preset time length limits the generation of small files, reduces their number, and improves the stability of the Hadoop cluster.
In one embodiment, synchronizing the synchronization file to the HDFS in near real time, based on the configured OGG software and with incremental data generated by a service system taken as the synchronization file, further includes:
judging whether the file size of the synchronous data reaches the first preset threshold value or not based on the configured OGG software;
and if the file size of the synchronous data reaches the first preset threshold value, synchronizing the synchronous file to the HDFS.
Specifically, to limit the generation of small files, in addition to the first preset time length, it is also judged, based on the configured OGG software, whether the file size of the synchronization data has reached the first preset threshold; if it has, the synchronization file is synchronized to the HDFS. For example, a maximum single-file size of 1 GB is set as the first preset threshold, and when the size of the synchronization file reaches 1 GB, the synchronization file is synchronized to the HDFS.
In this embodiment, whether the file size of the synchronized data reaches the first preset threshold is determined based on the configured OGG software, and if the file size of the synchronized data reaches the first preset threshold, the synchronized file is synchronized to the HDFS.
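The two triggers described in these embodiments, elapsed time and file size, can be combined into a single roll decision. The following is a minimal sketch that uses the example values of 1800 seconds and 1 GB from the text. The class `RollPolicy` and its field names are hypothetical and do not represent OGG configuration syntax; 1 GB is assumed here to mean 1024³ bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollPolicy:
    """Hypothetical holder for the two limits described in the text."""
    roll_interval_s: int = 1800       # first preset time length (example value)
    max_file_bytes: int = 1024 ** 3   # first preset threshold, assumed 1 GB = 1024^3 bytes

    def should_roll(self, open_seconds: float, size_bytes: int) -> bool:
        """Return True when the current file should be closed and synchronized."""
        return (open_seconds >= self.roll_interval_s
                or size_bytes >= self.max_file_bytes)


policy = RollPolicy()
MB = 1024 ** 2
neither = policy.should_roll(open_seconds=600, size_bytes=200 * MB)
by_time = policy.should_roll(open_seconds=1800, size_bytes=200 * MB)
by_size = policy.should_roll(open_seconds=600, size_bytes=1024 ** 3)
print(neither, by_time, by_size)
```

Either limit alone is enough to trigger synchronization, which is why the two checks are joined with a logical "or".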
In one embodiment, the traversing, based on the HDFS, the synchronization files obtained by the OGG software, screening out small files from the synchronization files, and merging the small files includes:
traversing the directory of the synchronous files based on the HDFS, and screening out small files in the synchronous files;
judging whether the small files in each subdirectory meet the preset small file merging condition or not in each subdirectory of the synchronous file;
if the small files in the same subdirectory meet the merging condition, merging the small files in the same subdirectory, and generating at least one merged file in the same subdirectory;
and deleting the small files after the merging processing.
Specifically, after the synchronous files have been synchronized to the HDFS, the directories of the synchronous files are traversed based on the HDFS to screen out the small files among them. After the screening is complete, it is judged, in each subdirectory of the synchronous files, whether the small files in that subdirectory meet a preset small file merging condition, and the small files in the same subdirectory are merged when they meet the condition. When merging, the small files are combined according to the total size of the small files in the same directory and the maximum size of a merged file, and at least one merged file is generated in the same subdirectory. After merging is complete, the small files that were merged are deleted, which eliminates the small files.
In this embodiment, the directory of the synchronous files is traversed based on the HDFS and the small files among them are screened out; in each subdirectory of the synchronous files it is judged whether the small files in that subdirectory meet a preset small file merging condition; when they do, the small files in the same subdirectory are merged and at least one merged file is generated in that subdirectory; and finally the small files that were merged are deleted. In this way the small files among the synchronous files are eliminated, fewer MR processes need to be started, and the utilization of cluster resources is optimized.
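The statement that merging follows the total size of the small files and the maximum size of a merged file, and yields at least one merged file, can be read as a packing problem. The sketch below shows one possible reading, a simple greedy grouping, and is not taken from the patent. The function name `plan_merge_groups`, the file names, and the sizes are all hypothetical.

```python
from typing import List, Tuple


def plan_merge_groups(files: List[Tuple[str, int]],
                      max_merged_bytes: int) -> List[List[str]]:
    """Greedily pack (name, size) pairs into groups whose total stays within the cap."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sorted(files):  # sort by name to keep a stable order
        # Start a new merged file when adding this one would exceed the cap.
        if current and current_size + size > max_merged_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [("part-0001", 40), ("part-0002", 50),
               ("part-0003", 30), ("part-0004", 20)]
groups = plan_merge_groups(small_files, max_merged_bytes=100)
print(groups)
```

With a total of 140 units and a cap of 100, the four small files are planned into two merged files, which is consistent with "at least one merged file" per subdirectory.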
In one embodiment, the traversing the directory of the synchronized files based on the HDFS, and the filtering out small files of the synchronized files comprises:
traversing the directory of the synchronous files based on the HDFS, and judging whether the size of each file in the synchronous files exceeds a second preset threshold;
and taking the files in the synchronous files whose size does not exceed the second preset threshold as small files.
Specifically, when small files are screened, the directory of the synchronous files is traversed based on the HDFS, it is judged whether the size of each file in the synchronous files exceeds the second preset threshold, and the files whose size does not exceed the second preset threshold are taken as small files.
In this embodiment, the directory of the synchronous files is traversed based on the HDFS, it is judged whether the size of each file exceeds the second preset threshold, and, according to the result, the files whose size does not exceed the second preset threshold are taken as small files. The small files among the synchronous files are thereby identified, which provides the basis for merging them.
In an embodiment, the determining, in each sub-directory of the synchronous file, whether a small file in each sub-directory meets a preset small file merging condition includes:
judging whether the ratio of the number of small files in each subdirectory to the total number of files in the current subdirectory reaches a preset first ratio threshold or not in each subdirectory of the synchronous files;
and if the ratio of the number of the small files in the current subdirectory to the total number of the files in the current subdirectory reaches a preset first ratio threshold, the small files in the current subdirectory accord with the small file merging condition.
Specifically, when the number of small files in the synchronization file is small, processing is not usually performed, and only when the ratio of the number of small files to the total number of files in the synchronization file reaches a certain threshold, the small files are merged. In each sub-directory of the synchronous files, judging whether the ratio of the number of the small files in each sub-directory to the total number of the files in the current sub-directory reaches a preset first ratio threshold, for example, 30%; and when the ratio of the number of the small files in the current subdirectory to the total number of the files in the current subdirectory reaches a preset first ratio threshold, judging that the small files in the current subdirectory meet the small file merging condition. For example, when the ratio of the number of the small files to the total number of the synchronization files is 40%, the small files of the current subdirectory meet the small file merging condition at this time because the first ratio threshold is exceeded.
In this embodiment, in each sub-directory of a synchronous file, it is determined whether a ratio of the number of small files in each sub-directory to the total number of files in the current sub-directory reaches a preset first ratio threshold, and when the ratio of the number of small files in the current sub-directory to the total number of files in the current sub-directory reaches the preset first ratio threshold, it is determined that the small files in the current sub-directory meet a small file merging condition.
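The ratio test can be written out as a short worked example. This is a sketch under stated assumptions: a small file is taken to be one whose size is below a threshold, the threshold of 128 MB is a hypothetical value not given in the text, and the function names are illustrative. The 30% ratio is the example value from the text.

```python
from typing import Sequence

MB = 1024 ** 2


def small_file_ratio(sizes: Sequence[int], small_threshold: int) -> float:
    """Share of files in one subdirectory whose size is below the threshold."""
    if not sizes:
        return 0.0
    small_count = sum(1 for size in sizes if size < small_threshold)
    return small_count / len(sizes)


def meets_merge_condition(sizes: Sequence[int], small_threshold: int,
                          ratio_threshold: float = 0.30) -> bool:
    """True when the small-file ratio reaches the first ratio threshold."""
    return small_file_ratio(sizes, small_threshold) >= ratio_threshold


# Ten files in one subdirectory, four of them below the hypothetical 128 MB threshold.
sizes = [5 * MB, 8 * MB, 12 * MB, 20 * MB,
         300 * MB, 400 * MB, 512 * MB, 700 * MB, 900 * MB, 1000 * MB]
ratio = small_file_ratio(sizes, small_threshold=128 * MB)
should_merge = meets_merge_condition(sizes, small_threshold=128 * MB)
print(ratio, should_merge)
```

Here four of ten files are small, giving a ratio of 40%, which reaches the 30% threshold, so the subdirectory qualifies for merging as in the example in the text.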
It should be understood that although the steps in the flowcharts of FIGS. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided an OGG-to-HDFS synchronized file processing apparatus including: a configuration module 401, a synchronization module 402, and a merge module 403, wherein:
A configuration module 401, configured to limit, based on the OGG software, the rolling switching time and the file size of the synchronous files synchronized to the HDFS through the OGG software.
And the synchronization module 402 is configured to synchronize the synchronization file to the HDFS in a near-real-time manner by using incremental data generated by the service system as a synchronization file based on the configured OGG software.
The merging module 403 is configured to traverse the synchronization files obtained through the OGG software based on the HDFS, screen out small files in the synchronization files, and merge the small files.
In one embodiment, the configuration module 401 is further configured to: based on OGG software, limiting the rolling switching time of the synchronous files from the OGG software to the HDFS to be a first preset time length; the file size of the synchronized file to the HDFS through the OGG software is limited to a first preset threshold.
In one embodiment, the synchronization module 402 is further configured to: obtaining incremental data generated by a service system where OGG software is located, and taking the incremental data as a synchronous file; judging whether the rolling switching time of the synchronous data reaches the first preset time length or not based on the configured OGG software; and if the rolling switching time of the synchronous data reaches the first preset time length, synchronizing the synchronous file to the HDFS.
In one embodiment, the synchronization module 402 is further configured to: judging whether the file size of the synchronous data reaches the first preset threshold value or not based on the configured OGG software; and if the file size of the synchronous data reaches the first preset threshold value, synchronizing the synchronous file to the HDFS.
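The two triggers described above (elapsed rolling switching time and accumulated file size) can be combined into a single check. The sketch below is not OGG configuration syntax; `RollPolicy`, `max_age_seconds`, and `max_size_bytes` are hypothetical names standing in for the first preset time length and the first preset threshold.

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    max_age_seconds: float  # hypothetical stand-in for the first preset time length
    max_size_bytes: int     # hypothetical stand-in for the first preset threshold

    def should_roll(self, opened_at, now, current_size):
        """Return True when the current file should be closed and synchronized."""
        age_reached = (now - opened_at) >= self.max_age_seconds
        size_reached = current_size >= self.max_size_bytes
        # Either condition alone is enough to trigger synchronization.
        return age_reached or size_reached
```

For example, `RollPolicy(600, 128 * 1024 * 1024)` would roll a file after ten minutes or once it reaches 128 MiB, whichever comes first; both values are illustrative and are not taken from the patent.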
In one embodiment, the merging module 403 is further configured to: traversing the directory of the synchronous files based on the HDFS, and screening out small files in the synchronous files; judging whether the small files in each subdirectory meet the preset small file merging condition or not in each subdirectory of the synchronous file; if the small files in the same subdirectory meet the merging condition, merging the small files in the same subdirectory, and generating at least one merged file in the same subdirectory; and deleting the small files after the merging processing.
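The merge flow above (screen, check the condition per subdirectory, merge, then delete) can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses the local filesystem as a stand-in for HDFS, treats a file as small when its size does not exceed `small_file_limit`, assumes the ratio-based merging condition, and concatenates bytes without regard to file format. All function and file names are hypothetical.

```python
from pathlib import Path


def _unused_merged_path(subdir):
    """Pick a merged-file name that does not collide with an existing file."""
    n = 0
    while (subdir / f"merged-{n}.dat").exists():
        n += 1
    return subdir / f"merged-{n}.dat"


def merge_small_files(root, small_file_limit, ratio_threshold):
    """Merge the small files of each subdirectory of root into one file.

    Returns the list of merged files that were written.
    """
    merged = []
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(f for f in subdir.iterdir() if f.is_file())
        small = [f for f in files if f.stat().st_size <= small_file_limit]
        # Skip subdirectories with fewer than two small files or a low ratio.
        if len(small) < 2 or len(small) / len(files) < ratio_threshold:
            continue
        target = _unused_merged_path(subdir)
        with target.open("wb") as out:
            for f in small:
                out.write(f.read_bytes())
        # Delete the originals only after the merged file is fully written.
        for f in small:
            f.unlink()
        merged.append(target)
    return merged
```

Writing the merged file before deleting the originals means an interrupted run leaves duplicate data rather than lost data, which is the safer failure mode for a cleanup job.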
In one embodiment, the merging module 403 is further configured to: traversing the directory of the synchronous files based on the HDFS, and judging whether the size of each file in the synchronous files exceeds a preset second threshold; and taking each file in the synchronous files whose size does not exceed the preset second threshold as a small file.
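The screening step can be illustrated with a directory walk. The sketch assumes that a small file is one whose size does not exceed the second threshold, and it again uses the local filesystem in place of HDFS; `find_small_files` and `small_file_limit` are hypothetical names.

```python
import os
from collections import defaultdict


def find_small_files(root, small_file_limit):
    """Map each directory under root to the names of the files in it
    whose size in bytes does not exceed small_file_limit."""
    small_by_dir = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) <= small_file_limit:
                small_by_dir[dirpath].append(name)
    return dict(small_by_dir)
```

Grouping the result by directory keeps the output ready for a per-subdirectory merging decision.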
In one embodiment, the merging module 403 is further configured to: judging whether the ratio of the number of small files in each subdirectory to the total number of files in the current subdirectory reaches a preset first ratio threshold or not in each subdirectory of the synchronous files; and if the ratio of the number of the small files in the current subdirectory to the total number of the files in the current subdirectory reaches a preset first ratio threshold, the small files in the current subdirectory accord with the small file merging condition.
The OGG-to-HDFS synchronous file processing device limits, based on the OGG software, the rolling switching time and the file size of the synchronous files sent from the OGG software to the HDFS; uses the configured OGG software to take incremental data generated by a service system as synchronous files and synchronize them to the HDFS in a quasi-real-time manner; and, based on the HDFS, traverses the synchronous files obtained through the OGG software, screens out the small files among them, and merges the small files. Small files are thereby eliminated from the synchronous files, which improves the stability of the Hadoop cluster.
For specific limitations of the OGG-to-HDFS synchronous file processing apparatus, reference may be made to the above limitations of the OGG-to-HDFS synchronous file processing method, which are not repeated here. The modules in the above OGG-to-HDFS synchronous file processing apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements an OGG-to-HDFS synchronous file processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
limiting the rolling switching time and the file size of the synchronous file from the OGG software to the HDFS based on the OGG software;
based on the OGG software which is configured, incremental data generated by a service system is used as a synchronous file, and the synchronous file is synchronized to the HDFS in a quasi-real-time manner;
and traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files, and merging the small files.
In one embodiment, the processor, when executing the computer program, further performs the steps of: based on OGG software, limiting the rolling switching time of the synchronous files from the OGG software to the HDFS to be a first preset time length; the file size of the synchronized file to the HDFS through the OGG software is limited to a first preset threshold.
In one embodiment, the processor, when executing the computer program, further performs the steps of: obtaining incremental data generated by a service system where OGG software is located, and taking the incremental data as a synchronous file; judging whether the rolling switching time of the synchronous data reaches the first preset time length or not based on the configured OGG software; and if the rolling switching time of the synchronous data reaches the first preset time length, synchronizing the synchronous file to the HDFS.
In one embodiment, the processor, when executing the computer program, further performs the steps of: judging whether the file size of the synchronous data reaches the first preset threshold value or not based on the configured OGG software; and if the file size of the synchronous data reaches the first preset threshold value, synchronizing the synchronous file to the HDFS.
In one embodiment, the processor, when executing the computer program, further performs the steps of: traversing the directory of the synchronous files based on the HDFS, and screening out small files in the synchronous files; judging whether the small files in each subdirectory meet the preset small file merging condition or not in each subdirectory of the synchronous file; if the small files in the same subdirectory meet the merging condition, merging the small files in the same subdirectory, and generating at least one merged file in the same subdirectory; and deleting the small files after the merging processing.
In one embodiment, the processor, when executing the computer program, further performs the steps of: traversing the directory of the synchronous files based on the HDFS, and judging whether the size of each file in the synchronous files exceeds a preset second threshold; and taking each file in the synchronous files whose size does not exceed the preset second threshold as a small file.
In one embodiment, the processor, when executing the computer program, further performs the steps of: judging whether the ratio of the number of small files in each subdirectory to the total number of files in the current subdirectory reaches a preset first ratio threshold or not in each subdirectory of the synchronous files; and if the ratio of the number of the small files in the current subdirectory to the total number of the files in the current subdirectory reaches a preset first ratio threshold, the small files in the current subdirectory accord with the small file merging condition.
The computer device limits, based on the OGG software, the rolling switching time and the file size of the synchronous files sent from the OGG software to the HDFS; uses the configured OGG software to take incremental data generated by a service system as synchronous files and synchronize them to the HDFS in a quasi-real-time manner; and, based on the HDFS, traverses the synchronous files obtained through the OGG software, screens out the small files among them, and merges the small files. Small files are thereby eliminated from the synchronous files, which improves the stability of the Hadoop cluster.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
limiting the rolling switching time and the file size of the synchronous file from the OGG software to the HDFS based on the OGG software;
based on the OGG software which is configured, incremental data generated by a service system is used as a synchronous file, and the synchronous file is synchronized to the HDFS in a quasi-real-time manner;
and traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files, and merging the small files.
In one embodiment, the computer program when executed by the processor further performs the steps of: obtaining incremental data generated by a service system where OGG software is located, and taking the incremental data as a synchronous file; judging whether the rolling switching time of the synchronous data reaches the first preset time length or not based on the configured OGG software; and if the rolling switching time of the synchronous data reaches the first preset time length, synchronizing the synchronous file to the HDFS.
In one embodiment, the computer program when executed by the processor further performs the steps of: judging whether the file size of the synchronous data reaches the first preset threshold value or not based on the configured OGG software; and if the file size of the synchronous data reaches the first preset threshold value, synchronizing the synchronous file to the HDFS.
In one embodiment, the computer program when executed by the processor further performs the steps of: traversing the directory of the synchronous files based on the HDFS, and screening out small files in the synchronous files; judging whether the small files in each subdirectory meet the preset small file merging condition or not in each subdirectory of the synchronous file; if the small files in the same subdirectory meet the merging condition, merging the small files in the same subdirectory, and generating at least one merged file in the same subdirectory; and deleting the small files after the merging processing.
In one embodiment, the computer program when executed by the processor further performs the steps of: traversing the directory of the synchronous files based on the HDFS, and judging whether the size of each file in the synchronous files exceeds a preset second threshold; and taking each file in the synchronous files whose size does not exceed the preset second threshold as a small file.
In one embodiment, the computer program when executed by the processor further performs the steps of: judging whether the ratio of the number of small files in each subdirectory to the total number of files in the current subdirectory reaches a preset first ratio threshold or not in each subdirectory of the synchronous files; and if the ratio of the number of the small files in the current subdirectory to the total number of the files in the current subdirectory reaches a preset first ratio threshold, the small files in the current subdirectory accord with the small file merging condition.
The storage medium limits, based on the OGG software, the rolling switching time and the file size of the synchronous files sent from the OGG software to the HDFS; uses the configured OGG software to take incremental data generated by a service system as synchronous files and synchronize them to the HDFS in a quasi-real-time manner; and, based on the HDFS, traverses the synchronous files obtained through the OGG software, screens out the small files among them, and merges the small files. Small files are thereby eliminated from the synchronous files, which improves the stability of the Hadoop cluster.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of the technical features should be considered to fall within the scope of the present specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An OGG-to-HDFS synchronous file processing method, comprising:
limiting the rolling switching time and the file size of the synchronous file from the OGG software to the HDFS based on the OGG software;
based on the OGG software which is configured, incremental data generated by a service system is used as a synchronous file, and the synchronous file is synchronized to the HDFS in a quasi-real-time manner;
and traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files, and merging the small files.
2. The method of claim 1, wherein the limiting of the rolling switching time and the file size of the synchronous file from the OGG software to the HDFS based on the OGG software comprises:
based on OGG software, limiting the rolling switching time of the synchronous files from the OGG software to the HDFS to be a first preset time length;
the file size of the synchronized file to the HDFS through the OGG software is limited to a first preset threshold.
3. The method of claim 2, wherein the synchronizing of the synchronous file to the HDFS in a quasi-real-time manner, based on the configured OGG software and with incremental data generated by a service system as the synchronous file, comprises:
obtaining incremental data generated by a service system where OGG software is located, and taking the incremental data as a synchronous file;
judging whether the rolling switching time of the synchronous data reaches the first preset time length or not based on the configured OGG software;
and if the rolling switching time of the synchronous data reaches the first preset time length, synchronizing the synchronous file to the HDFS.
4. The method of claim 3, wherein the synchronizing of the synchronous file to the HDFS in a quasi-real-time manner, based on the configured OGG software and with incremental data generated by a service system as the synchronous file, further comprises:
judging whether the file size of the synchronous data reaches the first preset threshold value or not based on the configured OGG software;
and if the file size of the synchronous data reaches the first preset threshold value, synchronizing the synchronous file to the HDFS.
5. The method according to claim 1, wherein the traversing the synchronization files obtained through the OGG software based on the HDFS, screening out small files from the synchronization files, and merging the small files comprises:
traversing the directory of the synchronous files based on the HDFS, and screening out small files in the synchronous files;
judging whether the small files in each subdirectory meet the preset small file merging condition or not in each subdirectory of the synchronous file;
if the small files in the same subdirectory meet the merging condition, merging the small files in the same subdirectory, and generating at least one merged file in the same subdirectory;
and deleting the small files after the merging processing.
6. The method of claim 5, wherein the traversing of the directory of the synchronous files based on the HDFS and the screening out of small files in the synchronous files comprises:
traversing the directory of the synchronous files based on the HDFS, and judging whether the size of each file in the synchronous files exceeds a preset second threshold;
and taking each file in the synchronous files whose size does not exceed the preset second threshold as a small file.
7. The method of claim 5, wherein the determining whether the small files in each sub-directory of the synchronous file meet a preset small file merging condition comprises:
judging whether the ratio of the number of small files in each subdirectory to the total number of files in the current subdirectory reaches a preset first ratio threshold or not in each subdirectory of the synchronous files;
and if the ratio of the number of the small files in the current subdirectory to the total number of the files in the current subdirectory reaches a preset first ratio threshold, the small files in the current subdirectory accord with the small file merging condition.
8. An OGG-to-HDFS synchronized file processing apparatus, comprising:
the configuration module is used for limiting the rolling switching time and the file size of the synchronous file from the OGG software to the HDFS based on the OGG software;
the synchronization module is used for synchronizing the synchronization file to the HDFS in a quasi-real-time manner by taking incremental data generated by a service system as a synchronization file based on the configured OGG software;
and the merging module is used for traversing the synchronous files acquired by the OGG software based on the HDFS, screening out small files in the synchronous files and merging the small files.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111044619.9A 2021-09-07 2021-09-07 Method and device for processing synchronous files from OGG (Oracle GoldenGate) to HDFS (Hadoop Distributed File System) and computer equipment Pending CN113836224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111044619.9A CN113836224A (en) 2021-09-07 2021-09-07 Method and device for processing synchronous files from OGG (Oracle GoldenGate) to HDFS (Hadoop Distributed File System) and computer equipment

Publications (1)

Publication Number Publication Date
CN113836224A (en) 2021-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination