WO2014019349A1 - File merging method and apparatus - Google Patents

File merging method and apparatus

Info

Publication number
WO2014019349A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
files
merging
merge
type
Prior art date
Application number
PCT/CN2013/070619
Other languages
English (en)
French (fr)
Inventor
程实
梁晓豪
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2014019349A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/217: Database tuning

Definitions

  • The present invention relates to the field of data processing technologies and, in particular, to a file merging method and apparatus.
  • Background
  • An incremental database is a database technology based on incremental files. It persists new data by appending to files rather than modifying them, which avoids random writes to the storage medium.
  • In an incremental database, update and delete operations on the same record may leave that record's data distributed across multiple files. As update and delete operations generate more files, more data files must be searched when data is read, which degrades the random read performance of the disk.
  • To address the random read performance problem, the incremental database introduces a file merging mechanism, which combines records scattered across multiple files into one file.
  • The file merge process involves reading old files, computing, and writing new files, so it incurs its own CPU, memory, and disk read and write overhead.
  • The larger the files involved in the merge, the greater the overhead.
  • To reduce the impact of file merging on the performance of the incremental database, a reasonable file merge trigger and overhead control mechanism must be designed.
  • The file merging methods in the prior art mainly adopt a trigger mechanism based on the real-time number of files.
  • In this method, when the number of files reaches a certain threshold, a merge of these files is triggered, and a new file is generated to replace the old ones.
  • However, the prior-art method always aims to merge all historical data together, so all historical files take part in real-time merging. The overhead of a file merge is proportional to the data capacity of the merged files, so as file data accumulates, the overhead of merging data into a new file keeps increasing until the data capacity reaches the upper limit of storage.
  • To solve the above technical problem, an embodiment of the present invention provides a file merging method and apparatus that can control and reduce the overhead of file merging.
  • According to a first aspect of the embodiments of the present invention, a file merging method is disclosed. The method comprises: when a new file is generated, determining the category of the new file and obtaining, according to a pre-stored correspondence between file categories and file merging policies, the file merging policy corresponding to that category; triggering a merge judgment according to the file merging policy and determining whether the file merge trigger condition corresponding to the policy is satisfied; and, if it is satisfied, selecting the files that satisfy the trigger condition and performing file merge processing.
  • In a first possible implementation, the file merging policy includes any one or more of the following policies:
  • the first file merging policy, which takes the number of files reaching a first set threshold as its trigger condition;
  • the second file merging policy, which takes time as its trigger condition.
  • In a second possible implementation, the file category includes a first type of file, a second type of file, and a third type of file, wherein
  • the first type of file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
  • the second type of file is a file generated according to the second file merging policy;
  • the third type of file is a file whose data capacity is greater than a second set threshold.
  • When a new first-type file is generated, the merge judgment is triggered, and whether the merge trigger condition is met is determined according to the first file merging policy;
  • In a fourth possible implementation, triggering the merge judgment according to the file merging policy and determining whether the file merge trigger condition corresponding to the file merging policy is satisfied includes:
  • selecting the files that satisfy the file merge trigger condition and performing file merge processing is specifically:
  • In a fifth possible implementation, the method further comprises:
  • treating a file in the merged output whose data capacity is greater than the second set threshold as a third-type file, and archiving the third-type file.
  • A file merging device, comprising: an obtaining unit, configured to determine the category of a new file when the new file is generated, and to obtain, according to a pre-stored correspondence between file categories and file merging policies, the file merging policy corresponding to the category of the new file;
  • a trigger determining unit, configured to trigger a merge judgment according to the file merging policy sent by the obtaining unit, and to determine whether the file merge trigger condition corresponding to the file merging policy is satisfied;
  • a merge execution unit, configured to select the files that satisfy the file merge trigger condition and perform file merge processing when the trigger determining unit determines that the file merge trigger condition corresponding to the file merging policy is satisfied.
  • In a sixth possible implementation, the file merging policy includes any one or more of the following policies:
  • the first file merging policy, which takes the number of files reaching a first set threshold as its trigger condition;
  • the second file merging policy, which takes time as its trigger condition.
  • In a seventh possible implementation, the file category includes a first type of file, a second type of file, and a third type of file, wherein
  • the first type of file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
  • the second type of file is a file generated according to the second file merging policy;
  • the third type of file is a file whose data capacity is greater than a second set threshold.
  • The trigger determining unit is specifically:
  • a first trigger determining subunit, configured to trigger a merge judgment when a new first-type file is generated, to determine according to the first file merging policy whether the merge trigger condition is met, and to determine that the merge trigger condition is satisfied when, among all first-type files, the number of files whose data capacity satisfies the preset capacity condition is greater than the first set threshold.
  • In a ninth possible implementation, the trigger determining unit is specifically:
  • a second trigger determining subunit, configured to determine, according to the second file merging policy, whether the preset time trigger condition is met;
  • the merge execution unit is configured to merge the first-type files and the second-type files when the second trigger determining subunit determines that the preset time trigger condition is met.
  • In a tenth possible implementation, the apparatus further comprises:
  • an archiving processing unit, configured to, after the first-type files and the second-type files are merged, treat a file in the merged output whose data capacity is greater than the second set threshold as a third-type file, and archive the third-type file.
  • The beneficial effects that can be achieved by the embodiments of the present invention are as follows:
  • Unlike the prior art, which performs real-time merge processing on all files, the files in the incremental database are classified, and different file categories have different merge strategies.
  • When a new file is generated, its category is first determined, and the corresponding merge policy is obtained according to that category.
  • The first type of file participates in the real-time merge, with the number of files as the trigger condition.
  • The first type of file and the second type of file participate in the timed merge, with time as the trigger condition, and the third type of file is archived and does not participate in merging, so that the file merge overhead is always controllable. Because files are classified and different files have different merge processing strategies, the merge overhead stays controllable compared with the methods provided by the prior art.
  • FIG. 1 is a flowchart of a first embodiment of a file merging method according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of a second embodiment of a file merging method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of an incremental database according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a third embodiment of a file merging method according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a file merging apparatus according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a file merging apparatus according to another embodiment of the present invention.
  • Detailed description
  • The embodiments of the invention provide a file merging method and apparatus that apply different file merging policies according to the category of each file, so that the merge overhead is always controllable and is further reduced.
  • Examples of applicable data include disk files, data stored in non-disk files, incremental data of file systems, incremental data of atypical file systems, and more.
  • In an incremental database, data is stored by appending rather than by modifying. Data modification, deletion, and update operations all result in new files being generated.
  • Data warehousing and file generation form a continuous process. As the amount of data increases, the cost of file merging increases. A file merging method is therefore needed that effectively reduces random disk input and output as data keeps growing, while affecting database read and write performance as little as possible.
  • FIG. 1 is a flowchart of a first embodiment of a file merging method provided by the present invention, where the method includes:
  • The new file is a newly generated file. It may be a file generated by persisting in-memory data (that is, a file generated directly when data enters the database), or a file generated by a file merge.
  • Files can be classified according to the manner in which they are generated. They can also be classified according to their data size.
  • The present invention does not limit the specific file classification manner.
  • The main purpose of file classification is to split files into separate streams, so that the number of files participating in a given type of merge (for example, real-time merge) stays small and does not tie up system resources, which keeps the merge overhead controllable.
  • A file merging policy corresponding to the category of the new file is obtained.
  • Different file categories correspond to different file merging policies. Associating each category of file with its own merging policy keeps the number of files handled under any one policy controllable, which in turn keeps the merge overhead controllable.
  • The merging of eligible files can be triggered by sending a merge operation instruction.
  • The merge operation instruction may include the range of files participating in the merge and the type of merge operation, such as timed merge or real-time merge.
  • A time-triggered file merging (timed merging) mechanism is introduced, that is, a file merge operation is triggered at a specified time.
  • Different merging strategies are formulated for different types of files, to keep the file merging overhead controllable and improve database read and write performance.
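The pre-stored correspondence between file categories and merging policies described for the first embodiment can be sketched as a small lookup table. This is only an illustration: the names `FileCategory`, `MergePolicy`, `POLICIES_BY_CATEGORY`, and `policies_for` are hypothetical, and the mapping follows the summary given in this text (first-type files take part in real-time and timed merges, second-type files in timed merges only, third-type files in none).

```python
from enum import Enum


class FileCategory(Enum):
    FIRST = 1   # newly persisted, or produced by a real-time merge
    SECOND = 2  # produced by a timed merge and below the archiving threshold
    THIRD = 3   # archived; takes no further part in merging


class MergePolicy(Enum):
    REAL_TIME = "real-time"  # triggered by the number of files
    TIMED = "timed"          # triggered by time


# Hypothetical pre-stored correspondence between categories and policies.
POLICIES_BY_CATEGORY = {
    FileCategory.FIRST: (MergePolicy.REAL_TIME, MergePolicy.TIMED),
    FileCategory.SECOND: (MergePolicy.TIMED,),
    FileCategory.THIRD: (),
}


def policies_for(category):
    """Return the merge policies that apply to a file of this category."""
    return POLICIES_BY_CATEGORY[category]
```

Keeping the correspondence in data rather than in branching code means the set of files eligible for each kind of merge can be read off directly from a file's category.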
  • FIG. 2 is a flowchart of a second embodiment of a file merging method according to an embodiment of the present invention.
  • S201: Trigger a merge judgment when a new first-type file is generated or when a preset time trigger condition is met.
  • The incremental database is again taken as an example for description.
  • When a write operation is performed on the database, the data buffer module 301 stores the newly written data in the memory buffer and triggers the persistence of all or part of the in-memory data to a non-volatile storage medium, for example by generating a disk file.
  • The trigger condition for persisting buffer data may be that the buffer's data capacity, duration, number of operations, or the like reaches a certain condition.
  • A file generated by buffer-triggered persistence is classified as a first-type file.
  • The file storage module 302 holds the persistent data files generated by the data buffer module 301 and maintains the classification information of the data files. Whenever a new data file is generated, the file storage module 302 obtains the classification information of the file and persists it in sync with the file.
  • The classification information may be persisted by writing it into the file name, by generating a companion file, by writing it at the same time to an independent classification information file, or by adding identification information to the file to represent its classification. The method of classifying files in the embodiment of the present invention is described below.
  • The file merging strategy includes:
  • The first file merging policy, which takes the number of files reaching a first set threshold as its trigger condition, that is, the real-time merge policy.
  • The first set threshold N is the threshold used to determine whether the trigger condition is met during real-time merge processing. When the number of files is greater than the first set threshold N, the trigger condition is satisfied; when the number of files is smaller than the first set threshold N, the trigger condition is not satisfied. If too few files take part in each merge, merges become too frequent, which in turn ties up resources.
  • The second file merging policy, which takes time as its trigger condition, that is, the timed merge policy.
  • An archiving policy is further included: when the data capacity of a file is larger than the second set threshold, the file does not participate in merging and is archived.
  • The second set threshold is an archiving threshold. When the data capacity of a file is greater than the second set threshold A, the file is archived, and archived files are not merged.
  • The files are classified according to the manner in which they are generated. Specifically, they can be divided into the following three categories:
  • The first type of file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy. That is, the first type of file includes files generated by persisting in-memory data, which are the files generated directly when data is stored.
  • The first type of file also includes files generated according to the first file merging policy, that is, new files generated by a real-time merge.
  • The second type of file is a file generated according to the second file merging policy whose data capacity is smaller than the third set threshold. Specifically, a file generated by the timed merge is marked as a second-type file if its size is smaller than the third set threshold A.
  • The third type of file is a file generated according to the second file merging policy whose data capacity is greater than the third set threshold. That is, a file generated by the timed merge whose capacity is greater than or equal to the third set threshold A is marked as a third-type file (archive class).
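The three-way classification above can be sketched as a single function of how a file was produced and how large it is. The function name `classify`, the origin labels, and the parameter names are hypothetical; the boundary case follows the wording above, in which a timed-merge output at or above the threshold A is treated as archive class.

```python
def classify(origin, size, archive_threshold):
    """Return "first", "second", or "third" for a newly generated file.

    origin is one of the hypothetical labels "persisted" (in-memory data
    written out), "real_time_merge", or "timed_merge". size and
    archive_threshold (the threshold A) use the same unit.
    """
    if origin in ("persisted", "real_time_merge"):
        return "first"
    if origin == "timed_merge":
        # At or above the archiving threshold the file becomes archive class.
        return "third" if size >= archive_threshold else "second"
    raise ValueError(f"unknown origin: {origin!r}")
```

For example, with a threshold of 200, `classify("timed_merge", 199, 200)` yields a second-type file, while the same call with a size of 200 yields a third-type file.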
  • The newly generated file replaces the old files that took part in the merge and becomes the object read by the data read module.
  • The file merge management module obtains the category of each file from the file storage module and generates a real-time merge operation instruction.
  • A trigger judging step is included: when a new first-type file is generated or a preset time trigger condition is satisfied, the merge judgment is triggered.
  • The category of a file is determined when the file is generated, according to the way in which it is generated.
  • The preset time trigger condition may be that a preset time is reached or that a preset time interval has elapsed, which the present invention does not limit.
  • The first file merging policy takes the number of files reaching the first set threshold as its trigger condition, that is, it is the real-time merge policy. Whenever a new first-type file is generated, a real-time merge judgment is triggered. When, among all first-type files, the number of files whose data capacity satisfies the preset capacity condition is greater than the first set threshold, the merge trigger condition is determined to be satisfied, and an instruction to "merge these files into a first-type file" is generated and sent to the file storage module.
  • Determining that the merge trigger condition is met requires the following two conditions to hold simultaneously:
  • (1) The data capacity of the file satisfies the preset capacity condition.
  • (2) The number of files satisfying condition (1) is greater than the first set threshold N.
  • The preset capacity condition is that the data capacity of the file is greater than 0.5S and less than 1.5S.
  • S is a set capacity value. Generally, S is greater than 50 MB.
  • The preset capacity condition can be set by the system, or other conditions can be set as needed.
  • The purpose of the preset capacity condition is to merge files of similar size preferentially, which reduces the number of file merges and thus the merge overhead.
  • The first set threshold N can be set by the system to ensure that each real-time merge includes at least N files, which prevents merges from becoming too frequent because too few files take part. Setting the capacity value S lets files of similar size be merged preferentially, which helps to reduce the number of file merges.
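The two-part trigger condition above (a capacity window around S, then a count compared with N) can be sketched as follows. The function name and parameters are hypothetical, and sizes are plain numbers in whatever unit S uses.

```python
def real_time_merge_candidates(first_type_sizes, s, n):
    """Return the sizes of first-type files to merge, or [] if not triggered.

    Condition (1): a file qualifies when 0.5 * s < size < 1.5 * s.
    Condition (2): the merge triggers only when more than n files qualify.
    """
    candidates = [size for size in first_type_sizes if 0.5 * s < size < 1.5 * s]
    return candidates if len(candidates) > n else []
```

With S = 100 and N = 2, the sizes `[60, 100, 140, 10, 200]` give three qualifying files (60, 100, 140), so the merge is triggered; with N = 3 the same input does not trigger a merge.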
  • S204: Select the first-type files that meet the merge trigger condition, and perform file merge processing.
  • The merge processing flow is as follows. First, the data in each file is sorted, and a read stream is opened for each file to be merged along with a write stream for one new file. Each file stream has a cursor, so that data records can be read in order from beginning to end.
  • The merge process looks across all open file streams for the record with the smallest primary key value (or the largest, depending on the sort order). If several records share that primary key value (for example, update information for the same record stored in two files), those records are merged, with non-primary-key fields chosen according to the timestamp priority principle; otherwise the single record is selected directly. The record selected in this step is appended to the new file, and repeating this step completes the file merge.
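The cursor-based merge described above is essentially a k-way merge of sorted streams. A minimal in-memory sketch follows, using Python's standard `heapq.merge` and `itertools.groupby`. It assumes, as an interpretation not stated in the text, that "timestamp priority" means the value from the record with the newest timestamp wins for each field; the record layout `(primary_key, timestamp, fields)` and the name `merge_files` are hypothetical, and delete markers are not modeled.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_files(streams):
    """Merge record streams that are each sorted by ascending primary key.

    Each record is a tuple (primary_key, timestamp, fields), where fields
    is a dict of non-primary-key values. Returns the merged record list.
    """
    # heapq.merge repeatedly yields the record with the smallest primary key
    # across all streams, like advancing one cursor per open file.
    ordered = heapq.merge(*streams, key=itemgetter(0))
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        merged_fields = {}
        newest = None
        # Apply older records first so that newer values overwrite them.
        for _, timestamp, fields in sorted(group, key=itemgetter(1)):
            merged_fields.update(fields)
            newest = timestamp
        output.append((key, newest, merged_fields))
    return output
```

A real implementation would stream records to the new file instead of building a list, but the selection order and the per-key merge would be the same.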
  • A timed merging mechanism is introduced: when the preset time trigger condition is satisfied, the merge judgment is triggered.
  • The preset time trigger condition may be that the timed merge is triggered when a system preset time T1 arrives, or that it is performed once every time period T2, or it may be some other time trigger condition, which the present invention does not limit. Because the timed merge involves both first-type and second-type files, the total amount of data to be merged is large, so the overhead of a timed merge is also large. It is therefore possible to choose the time when the database service is most idle, for example running the timed merge at a set time each day, such as late at night.
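The two time trigger forms mentioned above (a preset time T1, or a period T2) can be sketched with the standard `datetime` module. The function name `timed_trigger_due` and its parameters are hypothetical, and reading T1 as a daily clock time is an assumption made for illustration.

```python
from datetime import datetime, time, timedelta


def timed_trigger_due(now, last_run, daily_time=None, interval=None):
    """Return True when a timed merge should be triggered.

    interval models "once every period T2"; daily_time models "when the
    preset time T1 arrives", assumed here to recur once per day.
    """
    if interval is not None:
        return now - last_run >= interval
    if daily_time is not None:
        scheduled = datetime.combine(now.date(), daily_time)
        # Due if today's scheduled moment has passed and has not yet been run.
        return last_run < scheduled <= now
    return False
```

A scheduler would call this periodically and record `last_run` after each timed merge, so that the merge fires once per scheduled moment rather than on every check.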
  • Determine whether the data capacity of the merged file generated by the timed merge is greater than the second set threshold A. If it is greater, the process proceeds to step S208; if it is not greater, the process proceeds to step S207.
  • S207: If not, store the merged file as a second-type file.
  • If the data capacity of the file generated by the timed merge is greater than the second set threshold, the file is archived as a third-type file.
  • Third-type files no longer participate in file merging.
  • The second set threshold is a relatively large threshold, such as 200G. The purpose of setting this parameter is to keep overly large files from participating in merges, so that CPU and disk I/O overhead does not grow without bound as the database capacity grows.
  • A time-triggered file merging (timed merging) mechanism is introduced, that is, a file merge operation is triggered at a specified time. For example, file merging can be triggered when the database service is most idle, which effectively relieves the competition for hardware resources that file merging causes while the database service is busy, and improves database performance.
  • In the second embodiment of the present invention, only newly warehoused files and files generated by real-time merging take part in real-time merging; the large-capacity files generated by timed merging do not. The number of files taking part in real-time merging is therefore greatly reduced, and the amount of file data remains controllable, which further ensures that the merge overhead is controllable.
  • By setting the third set threshold A, that is, the archiving threshold, files whose data capacity is larger than the archiving threshold are archived and no longer take part in file merging. This keeps large files out of merges and prevents the CPU and disk I/O overhead of merging from increasing indefinitely as the database capacity grows, thereby ensuring that the merge overhead is controllable.
  • The timed merge is performed only when the database is relatively idle.
  • When the data capacity of the file generated by the timed merge is larger than the third set threshold A, the file is archived.
  • The timed merge policy and the archive processing policy are included.
  • FIG. 4 is a flowchart of a third embodiment of a file merging method according to an embodiment of the present invention.
  • The files are divided into archive files and non-archive files.
  • A file whose data capacity is larger than the set threshold is marked as an archive file and does not participate in merge processing. Files whose data capacity is smaller than the set threshold participate in the timed merge.
  • Different merge strategies are formulated for these two categories of files.
  • Archive files do not participate in timed merge processing.
  • Non-archive files participate in timed merges triggered by time thresholds.
  • The preset time trigger condition may be that the timed merge is triggered when a system preset time T1 arrives, or that it is performed once every time period T2, or it may be some other time trigger condition, which the present invention does not limit.
  • Determine whether the data capacity of the merged file generated by the timed merge is greater than the second set threshold A. If it is not greater, the process proceeds to step S404; if it is greater, the process proceeds to step S405.
  • The merged file is archived as an archive file, and archive files do not participate in file merging.
  • The files are merged when the database service is idle, which overcomes the prior-art shortcoming of resource competition caused by merging while the service is busy.
  • Files in the merged output whose data capacity is larger than the set threshold are archived, so the merge overhead increases within one archiving period and falls back to its lowest value once the archiving condition is reached, which keeps the merge overhead controllable.
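The sawtooth behavior of the merge overhead under timed merging with archiving can be illustrated with a toy simulation. Everything here is a simplifying assumption made for illustration: a constant amount of new data per cycle, a single non-archived merged file, and overhead measured as the amount of data rewritten in each timed merge. The names are hypothetical.

```python
def simulate_timed_merges(new_data_per_cycle, archive_threshold, cycles):
    """Return (overhead_per_cycle, archived_file_sizes).

    Each cycle merges the current non-archived file with the newly arrived
    data. When the merged output exceeds the archiving threshold it is
    archived, and the next cycle starts again from new data only.
    """
    active = 0
    overhead = []
    archived = []
    for _ in range(cycles):
        merged = active + new_data_per_cycle  # data rewritten in this merge
        overhead.append(merged)
        if merged > archive_threshold:
            archived.append(merged)
            active = 0
        else:
            active = merged
    return overhead, archived
```

With 50 units of new data per cycle and a threshold of 200, the per-cycle overhead runs 50, 100, 150, 200, 250 and then drops back to 50, so it never exceeds the threshold plus one cycle of new data, whereas merging all history every time would grow without bound.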
  • FIG. 5 is a schematic diagram of an apparatus for file merging according to an embodiment of the present invention.
  • The device includes:
  • The obtaining unit 501, configured to determine the category of a new file when the new file is generated, and to obtain the file merging policy corresponding to the category of the new file according to the pre-stored correspondence between file categories and file merging policies.
  • The trigger determining unit 502, configured to trigger a merge judgment according to the file merging policy sent by the obtaining unit, and to determine whether the file merge trigger condition corresponding to the file merging policy is satisfied.
  • The merge execution unit 503, configured to select the files that satisfy the file merge trigger condition and perform file merge processing when the trigger determining unit 502 determines that the file merge trigger condition corresponding to the file merging policy is satisfied.
  • The file merging policy includes any one or more of the following policies:
  • the first file merging policy, which takes the number of files reaching a first set threshold as its trigger condition;
  • the second file merging policy, which takes time as its trigger condition.
  • The file category includes a first type of file, a second type of file, and a third type of file, wherein
  • the first type of file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy, and the first file merging policy takes the number of files reaching a first set threshold as its trigger condition;
  • the second type of file is a file generated according to the second file merging policy, and the second file merging policy takes time as its trigger condition;
  • the third type of file is a file whose data capacity is greater than a second set threshold.
  • The trigger determining unit is specifically:
  • a first trigger determining subunit, configured to trigger a merge judgment when a new first-type file is generated, to determine according to the first file merging policy whether the merge trigger condition is met, and to determine that the merge trigger condition is satisfied when, among all first-type files, the number of files whose data capacity satisfies the preset capacity condition is greater than the first set threshold.
  • The merge execution unit is configured to, when the first trigger determining subunit determines according to the first file merging policy that the merge trigger condition is met, select the first-type files that satisfy the condition, trigger a merge process, and merge the files that satisfy the merge trigger condition.
  • The trigger determining unit is specifically:
  • a second trigger determining subunit, configured to determine, according to the second file merging policy, whether the preset time trigger condition is met.
  • The merge execution unit is configured to merge the first-type files and the second-type files when the second trigger determining subunit determines that the preset time trigger condition is met.
  • The device further includes:
  • an archiving processing unit, configured to, after the first-type files and the second-type files are merged, treat a file in the merged output whose data capacity is greater than the second set threshold as a third-type file, and archive the third-type file.
  • FIG. 6 is a schematic diagram of a file merging apparatus according to another embodiment of the present invention.
  • The device includes:
  • a memory 601, configured to store a correspondence between file categories and file merging policies;
  • a processor 602, configured to: determine the category of a new file when the new file is generated; obtain the file merging policy corresponding to the category of the new file according to the correspondence between file categories and file merging policies stored in the memory 601; trigger a merge judgment according to the file merging policy and determine whether the file merge trigger condition corresponding to the file merging policy is satisfied; and, if it is satisfied, select the files that satisfy the file merge trigger condition and perform file merge processing.
  • The file merging strategy includes:
  • the first file merging policy, which takes the number of files reaching a first set threshold as its trigger condition;
  • the second file merging policy, which takes time as its trigger condition.
  • The file category includes a first type of file, a second type of file, and a third type of file, wherein the first type of file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
  • the second type of file is a file generated according to the second file merging policy;
  • the third type of file is a file whose data capacity is greater than a second set threshold.
  • The processor 602 is specifically configured to: trigger a merge judgment when a new first-type file is generated; determine according to the first file merging policy whether the merge trigger condition is met; determine that the merge trigger condition is met when, among all first-type files, the number of files whose data capacity satisfies the preset capacity condition is greater than the first set threshold; and select the first-type files that satisfy the condition, trigger a merge process, and merge the files that satisfy the merge trigger condition.
  • The processor 602 is further configured to determine, according to the second file merging policy, whether the preset time trigger condition is met and, when it is met, to merge the first-type files and the second-type files.
  • The processor 602 is further configured to, after the first-type files and the second-type files are merged, treat a file in the merged output whose data capacity is greater than the second set threshold as a third-type file, and archive the third-type file.
  • The invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules.
  • Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network.
  • In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

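A file merging method and apparatus. The method includes: when a new file is generated, determining the category of the new file and, according to a pre-stored correspondence between file categories and file merging policies, obtaining the file merging policy corresponding to the category of the new file; triggering a merge determination according to the file merging policy, and determining whether the file merge trigger condition corresponding to the file merging policy is met; and, if it is met, selecting the files that meet the file merge trigger condition and performing file merging. The method provided by the embodiments of the present invention classifies files so that different files have different merging policies; compared with the methods provided by the prior art, this keeps the file merging overhead controllable at all times.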

Description

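A File Merging Method and Apparatus
This application claims priority to Chinese Patent Application No. 201210270365.7, filed with the Chinese Patent Office on August 1, 2012 and entitled "A File Merging Method and Apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a file merging method and apparatus.
Background
An incremental database is a database technology based on incremental files. Its defining feature is that new data is persisted by appending to files rather than modifying them, which avoids random writes to the storage medium. In an incremental database, however, updates and deletions of the same record may leave that record's data spread across multiple files. The more files the update and delete operations produce, the more data files a read has to search, which degrades random-read performance on disk.
To address the random-read performance problem, incremental databases introduce a file merging mechanism, which merges records scattered across multiple files into a single file. The merging process includes reading the old files, computing, and writing the new file, so it incurs CPU, memory, and disk I/O overhead of its own. The larger the capacity of the files taking part in a merge, the greater the overhead. To reduce the impact of file merging on the performance of an incremental database, a sound merge-triggering and overhead-control mechanism must be designed.
Existing file merging methods mainly use a trigger mechanism based on the real-time number of files. In such a method, when the number of files reaches a certain threshold, a merge of those files is triggered and a new file is generated to replace the old ones. However, the prior-art method always aims to merge all historical data together, so all historical files take part in real-time merging. Because the merging overhead is proportional to the data capacity of the merged files, the cost of merging data into a new file keeps growing as data accumulates, until the data capacity reaches the storage limit. As a result, an incremental database deployed on large-capacity disks eventually cannot bear the growth in merging overhead caused by the growth in data volume. The prior art therefore has the defect that the merging overhead is uncontrollable.
Summary of the Invention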
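To solve the above technical problem, embodiments of the present invention provide a file merging method and apparatus that can control and reduce the overhead of file merging.
According to a first aspect of the embodiments of the present invention, a file merging method is disclosed, the method including:
when a new file is generated, determining the category of the new file and, according to a pre-stored correspondence between file categories and file merging policies, obtaining the file merging policy corresponding to the category of the new file;
triggering a merge determination according to the file merging policy, and determining whether the file merge trigger condition corresponding to the file merging policy is met;
if it is met, selecting the files that meet the file merge trigger condition and performing file merging.
In the first aspect, the present invention further has a first possible implementation, in which the file merging policy includes any one or more of the following policies:
a first file merging policy, which uses the number of files reaching a first set threshold as the trigger condition;
a second file merging policy, which uses time as the trigger condition.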
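With reference to the first possible implementation of the first aspect, the present invention further has a second possible implementation, in which the file categories include first-type files, second-type files, and third-type files, where:
a first-type file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
a second-type file is a file generated according to the second file merging policy;
a third-type file is a file whose data capacity is greater than a second set threshold.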
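With reference to the second possible implementation of the first aspect, the present invention further has a third possible implementation, in which triggering a merge determination according to the file merging policy and determining whether the file merge trigger condition corresponding to the file merging policy is met includes:
when a new first-type file is generated, triggering a merge determination and determining, according to the first file merging policy, whether the merge trigger condition is met;
determining that the merge trigger condition is met when, among all first-type files, the number of files whose data capacity meets a preset capacity condition is greater than the first set threshold.
With reference to the second possible implementation of the first aspect, the present invention further has a fourth possible implementation, in which triggering a merge determination according to the file merging policy and determining whether the file merge trigger condition corresponding to the file merging policy is met includes:
determining, according to the second file merging policy, whether a preset time trigger condition is met;
and selecting the files that meet the file merge trigger condition and performing file merging is:
merging the first-type files and the second-type files when the preset time trigger condition is met.
With reference to the fourth possible implementation of the first aspect, the present invention further has a fifth possible implementation, in which the method further includes:
after the first-type files and the second-type files are merged, treating any merged file whose data capacity is greater than the second set threshold as a third-type file, and archiving the third-type file.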
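According to a second aspect of the embodiments of the present invention, a file merging apparatus is disclosed, the apparatus including:
an obtaining unit, configured to, when a new file is generated, determine the category of the new file and, according to a pre-stored correspondence between file categories and file merging policies, obtain the file merging policy corresponding to the category of the new file;
a trigger determining unit, configured to trigger a merge determination according to the file merging policy sent by the obtaining unit, and determine whether the file merge trigger condition corresponding to the file merging policy is met;
a merge executing unit, configured to, when the trigger determining unit determines that the file merge trigger condition corresponding to the file merging policy is met, select the files that meet the file merge trigger condition and perform file merging.
In the second aspect, the present invention further has a sixth possible implementation, in which the file merging policy includes any one or more of the following policies:
a first file merging policy, which uses the number of files reaching a first set threshold as the trigger condition;
a second file merging policy, which uses time as the trigger condition.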
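With reference to the sixth possible implementation of the second aspect, the present invention further has a seventh possible implementation, in which the file categories include first-type files, second-type files, and third-type files, where:
a first-type file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
a second-type file is a file generated according to the second file merging policy;
a third-type file is a file whose data capacity is greater than a second set threshold.
With reference to the seventh possible implementation of the second aspect, the present invention further has an eighth possible implementation, in which the trigger determining unit is:
a first trigger determining subunit, configured to, when a new first-type file is generated, trigger a merge determination and determine, according to the first file merging policy, whether the merge trigger condition is met; and to determine that the merge trigger condition is met when, among all first-type files, the number of files whose data capacity meets a preset capacity condition is greater than the first set threshold.
With reference to the seventh possible implementation of the second aspect, the present invention further has a ninth possible implementation, in which the trigger determining unit is specifically:
a second trigger determining subunit, configured to determine, according to the second file merging policy, whether a preset time trigger condition is met;
in which case the merge executing unit is configured to merge the first-type files and the second-type files when the second trigger determining subunit determines that the preset trigger condition is met.
With reference to the ninth possible implementation of the second aspect, the present invention further has a tenth possible implementation, in which the apparatus further includes:
an archiving unit, configured to, after the first-type files and the second-type files are merged, treat any merged file whose data capacity is greater than the second set threshold as a third-type file, and archive the third-type file.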
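The beneficial effects achievable by the embodiments of the present invention are as follows. Unlike the prior-art scheme in which all files are merged in real time, the embodiments classify the files in an incremental database and define a different merging policy for each file category. When a new file is generated, its category is determined first, and the merging policy corresponding to that category is then obtained. First-type files take part in real-time merging, which is triggered by the number of files; first-type and second-type files take part in timed merging, which is triggered by time; third-type files are archived and do not take part in merging. Because files are classified and different files therefore have different merging policies, the file merging overhead remains controllable at all times, in contrast to the methods provided by the prior art.
Brief Description of the Drawings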
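To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some of the embodiments recorded in the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a first embodiment of the file merging method provided by the embodiments of the present invention;
FIG. 2 is a flowchart of a second embodiment of the file merging method provided by the embodiments of the present invention;
FIG. 3 is a schematic diagram of an incremental database according to an embodiment of the present invention;
FIG. 4 is a flowchart of a third embodiment of the file merging method provided by the embodiments of the present invention;
FIG. 5 is a schematic diagram of a file merging apparatus provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a file merging apparatus provided by another embodiment of the present invention.
Detailed Description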
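Embodiments of the present invention provide a file merging method and apparatus that apply different file merging policies according to the category of a file, so that the file merging overhead remains controllable at all times and is further reduced.
To help a person skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
... disk files, data stored as non-disk files, incremental data of a file system, incremental data of an atypical file system, and the like.
As mentioned above, in an incremental database data is stored by appending rather than by modifying, so modify, delete, and update operations all cause new files to be generated. Loading data and generating files is a continuous process, and as the data volume grows, the file merging overhead grows with it. A file merging method is therefore urgently needed that effectively reduces random disk input and output while data keeps growing, and at the same time avoids affecting database read and write performance as far as possible.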
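FIG. 1 is a flowchart of the first embodiment of the file merging method provided by the present invention. The method includes:
S101: When a new file is generated, determine the category of the new file and, according to a pre-stored correspondence between file categories and file merging policies, obtain the file merging policy corresponding to the category of the new file.
In the first embodiment of the present invention, a new file is a newly generated file. It may be a file generated by persisting in-memory data (that is, a file generated directly when data is loaded into the database), or a file generated by a file merge. In the first embodiment, files may be classified according to how they were generated. They may also be classified according to their data capacity. The present invention does not limit the specific classification method. The main purpose of classification is to split files into separate processing paths, so that the number of files taking part in any one kind of merge (for example, real-time merging) stays small and does not tie up system resources, which keeps the merging overhead controllable.
Once the category of the new file has been determined, the file merging policy corresponding to that category is obtained according to the pre-stored correspondence between file categories and file merging policies. In the first embodiment, different file categories correspond to different file merging policies. Files of a given category therefore take part only in the corresponding merging policy, so the number of files involved in any one merging policy remains controllable, and so does the merging overhead.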
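The pre-stored correspondence between file categories and merging policies can be pictured as a simple lookup table. The following is a minimal sketch, not the patent's implementation: the names `FileCategory`, `MergePolicy`, `POLICY_TABLE`, and `policies_for` are hypothetical, and the three categories are borrowed from the second embodiment described later.

```python
from enum import Enum


class FileCategory(Enum):
    FIRST = 1   # newly generated, or produced by real-time merging
    SECOND = 2  # produced by timed merging, not yet archived
    THIRD = 3   # archived; no longer merged


class MergePolicy(Enum):
    REAL_TIME = "real_time"  # triggered by the number of files
    TIMED = "timed"          # triggered by time


# Hypothetical pre-stored correspondence: category -> applicable policies.
POLICY_TABLE = {
    FileCategory.FIRST: (MergePolicy.REAL_TIME, MergePolicy.TIMED),
    FileCategory.SECOND: (MergePolicy.TIMED,),
    FileCategory.THIRD: (),  # archived files take part in no merge
}


def policies_for(category):
    """Look up the merging policies that apply to a file category."""
    return POLICY_TABLE[category]
```

Keeping the mapping in data rather than in branching code means a deployment could change which categories feed which merge without touching the trigger logic.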
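S102: Trigger a merge determination according to the file merging policy, and determine whether the file merge trigger condition is met.
S103: If it is met, select the files that meet the file merge trigger condition and perform file merging.
Different file merging policies have different merge trigger conditions. When a merge trigger condition is met, the files that meet it are selected and the merge process is triggered to merge them. Specifically, the files that meet the file merge trigger condition belong to the same category as the new file, and they include all files of that category that meet the merge trigger condition. In the first embodiment, the merge of the qualifying files may be triggered by sending a merge operation instruction. The merge operation instruction may include the range of files taking part in the merge and the type of merge operation, for example timed merging or real-time merging. The present invention does not limit how a merge is triggered, and other implementations obtained by a person skilled in the art without creative effort fall within the protection scope of the present invention.
In the first embodiment of the present invention, unlike the prior-art scheme in which all files are merged in real time, the files in the incremental database are classified and a different merging policy is defined for each file category. When a new file is generated, its category is determined first, and the merging policy corresponding to that category is then obtained. Because files are classified, different files have different merging policies and are split into separate processing paths. This keeps the number of files taking part in any one kind of merge (for example, real-time merging) small and avoids tying up system resources, so that, compared with the methods provided by the prior art, the file merging overhead remains controllable at all times.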
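A merge operation instruction of the kind described above carries two things: the range of files to merge and the type of merge. A minimal sketch follows, assuming a plain in-process data structure; `MergeInstruction`, `build_instruction`, and the type strings are illustrative names, not taken from the source.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

MERGE_TYPES = ("real_time", "timed")  # assumed labels for the two merge types


@dataclass(frozen=True)
class MergeInstruction:
    files: Tuple[str, ...]  # range of files taking part in the merge
    merge_type: str         # "real_time" or "timed"


def build_instruction(candidates: Iterable[str], merge_type: str) -> MergeInstruction:
    """Package the selected files and the merge type into one instruction."""
    if merge_type not in MERGE_TYPES:
        raise ValueError(f"unknown merge type: {merge_type}")
    # Sort so that the same set of files always yields the same instruction.
    return MergeInstruction(tuple(sorted(candidates)), merge_type)
```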
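The second embodiment of the present invention is described in detail below with reference to FIG. 2.
In the second embodiment, on top of the prior-art file merging triggered by the number of files (also called real-time merging), a time-triggered file merging mechanism (timed merging) is introduced, in which a file merge is triggered at a specified moment. Different merging policies are defined for different file categories, which keeps the file merging overhead controllable and improves database read and write performance.
FIG. 2 is a flowchart of the second embodiment of the file merging method provided by the embodiments of the present invention.
S201: Trigger a merge determination when a new first-type file is generated or when the preset time trigger condition is met.
The second embodiment is again described using an incremental database as an example. FIG. 3 shows four functional modules of the incremental database: a data buffer module 301, a file storage module 302, a data reading module 303, and a file merging management module 304. As shown in FIG. 3, when a write to the database is performed, the data buffer module 301 stores the newly written data in a memory buffer and triggers persistence of all or part of the in-memory data to a non-volatile storage medium, for example by generating a disk file. The trigger for persisting buffer data may be that the buffer's data capacity, duration, number of operations, or the like reaches a specific condition. Once buffer data has been persisted, it is cleared from the buffer, which keeps the capacity of the data buffer module continuously available. In the second embodiment, files generated by buffer-triggered persistence are classified as first-type files. The file storage module 302 stores the persistent data files generated by the data buffer module 301 and maintains the classification information of the data files. Whenever a new data file is generated, the file storage module 302 obtains the file's classification information and persists it together with the file. The classification information may be persisted by writing it into the file name, by generating a companion file, or by synchronously writing it to a separate classification information file; alternatively, identification information may be added to the file to indicate its category. The file classification method used in the embodiments is described below.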
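In the second embodiment of the present invention, the file merging policies include:
The first file merging policy, which uses the number of files reaching the first set threshold as the trigger condition; this is the real-time merging policy. The first set threshold N is the threshold used during real-time merging to decide whether the trigger condition is met: when the number of files is greater than N, the trigger condition is met; when the number of files is less than N, it is not. If too few files took part in each merge, merging would be too frequent and would tie up resources.
The second file merging policy, which uses time as the trigger condition; this is the timed merging policy.
The second embodiment also includes an archiving policy: files whose data capacity is greater than the second set threshold do not take part in merging and are archived. The second set threshold is the archive threshold. When a file's data capacity is greater than the second set threshold A, the file is archived, and archived files do not take part in merging.
Corresponding to the file merging policies, the second embodiment divides files into three categories according to how they were generated. Specifically, the three categories are:
(1) A first-type file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy. In other words, first-type files include files generated by persisting in-memory data, that is, files generated directly when data is loaded. First-type files also include files generated according to the first file merging policy, that is, new files produced by real-time merging.
(2) A second-type file is a file that is generated according to the second file merging policy and whose data capacity is less than the second set threshold A (the archive threshold; the source text calls it the "third set threshold" at this point). Specifically, a file produced by timed merging is marked as a second-type file if its size is less than A.
(3) A third-type file is a file that is generated according to the second file merging policy and whose data capacity is greater than the archive threshold A. That is, a file produced by timed merging whose capacity is greater than or equal to A is marked as a third-type file (the archive category). At the end of a file merge operation, the newly generated file replaces the old files that took part in the merge and becomes what the data reading module reads.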
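The three-way classification above depends only on how a file was produced and, for timed-merge output, on its size. A minimal sketch under stated assumptions: the origin labels `"flush"`, `"real_time_merge"`, and `"timed_merge"` and the function name `classify` are hypothetical. The text gives the archive boundary both as "greater than" and as "greater than or equal to"; this sketch uses the latter.

```python
def classify(origin, size_bytes, archive_threshold):
    """Return the file category (1, 2 or 3) for a newly generated file.

    origin: "flush" (in-memory data persisted), "real_time_merge",
            or "timed_merge" (assumed labels for the generation mode).
    """
    if origin in ("flush", "real_time_merge"):
        return 1  # first type: takes part in real-time and timed merging
    if origin == "timed_merge":
        # Large timed-merge output is archived; the rest stays mergeable.
        return 3 if size_bytes >= archive_threshold else 2
    raise ValueError(f"unknown origin: {origin}")
```

Note that under this reading only timed-merge output can become a third-type file, so a file is never archived before it has passed through at least one timed merge.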
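The file merging management module obtains file categories from the file storage module and generates real-time merge operation instructions, as described in detail below. The second embodiment includes a trigger determination step: a merge determination is triggered when a new first-type file is generated or when the preset time trigger condition is met. A file's category is determined at the time the file is generated, according to how it was generated. The preset time trigger condition may be that a preset moment arrives, that a preset time interval elapses, or the like; the present invention places no limitation on this.
S202: When the category of the new file is determined to be the first type, determine, according to the first file merging policy, whether the merge trigger condition is met.
Specifically, in the second embodiment, the first file merging policy uses the number of files reaching the first set threshold as the trigger condition; this is the real-time merging policy. Each time a new first-type file is generated, one real-time merge determination is triggered. If, among all first-type files, the number of files whose data capacity meets the preset capacity condition is greater than the first set threshold, the merge trigger condition is determined to be met, and an instruction to "merge these files into a first-type file" is generated and sent to the file storage module.
In other words, the merge trigger condition is met only when both of the following conditions hold:
(1) The data capacity of a file meets the preset capacity condition.
(2) The number of files that meet condition (1) is greater than the first set threshold N.
Specifically, in this embodiment, the preset capacity condition is that the data capacity of a file is greater than 0.5S and less than 1.5S, where S is a set capacity value that is generally greater than 50 MB. The preset capacity condition may be set by the system, or other conditions may be set as needed. The purpose of the preset capacity condition is to have files of similar size merged first, which helps reduce the number of merges and therefore the merging overhead. The first set threshold N may likewise be set by the system, so that every real-time merge includes at least N files and merging does not become too frequent because too few files take part.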
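The two conditions of the real-time trigger translate directly into a size filter followed by a count check. The sketch below assumes file sizes are available as a name-to-bytes mapping; the function name and parameter names are illustrative only.

```python
def real_time_merge_candidates(first_type_sizes, s, n):
    """Return the first-type files to merge, or [] if the trigger is not met.

    first_type_sizes: dict mapping file name -> data capacity in bytes
    s: the set capacity value S; n: the first set threshold N
    """
    # Condition (1): capacity strictly between 0.5*S and 1.5*S.
    eligible = [name for name, size in first_type_sizes.items()
                if 0.5 * s < size < 1.5 * s]
    # Condition (2): the number of such files must be greater than N.
    return sorted(eligible) if len(eligible) > n else []
```

For example, with S = 100 and N = 2, files of sizes 60, 100, and 140 qualify, while files of sizes 10 and 150 are left out of the merge.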
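S203: When, among all first-type files, the number of files whose data capacity meets the preset capacity condition is greater than the first set threshold, determine that the merge trigger condition is met.
S204: Select the first-type files that meet the merge trigger condition and perform file merging.
Specifically, in the second embodiment, the merge process is as follows. The data in each file is sorted. A read stream is opened for each file to be merged, and a write stream is opened for a new file. Each file stream has a cursor so that data records can be fetched in order from beginning to end. The merge process looks across all open streams for the data with the smallest primary key value (or the largest, depending on the sort order). If several pieces of data have the same primary key value (for example, update information for the same record in two files), those pieces of data are merged, with each non-primary-key field taken from the version with the larger timestamp; otherwise the data is selected directly. The data selected in the previous step is then appended to the new file, which accomplishes the merge.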
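The merge process described above is a k-way merge over sorted streams, with records that share a primary key combined field by field. The following sketch assumes each record is a `(key, timestamp, fields)` tuple with `fields` a dict, and that every stream is sorted by ascending key; this record layout and the function name are assumptions, and deletion markers are not modelled.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_files(streams):
    """Merge key-sorted record streams into one key-sorted stream.

    Each record is (key, timestamp, fields). For equal keys, every
    non-key field is taken from the record with the larger timestamp.
    """
    # heapq.merge advances one cursor per stream and always yields
    # the record with the smallest key next.
    merged = heapq.merge(*streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        fields = {}
        latest = None
        # Apply older versions first so newer values overwrite them.
        for _, ts, rec in sorted(group, key=itemgetter(1)):
            fields.update(rec)
            latest = ts
        yield key, latest, fields
```

Because `heapq.merge` is lazy, only one record per input stream is held in memory at a time, which matches the cursor-based reading described in the text.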
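S205: When the preset time trigger condition is met, merge the first-type files and the second-type files.
In the second embodiment, a timed merging mechanism is introduced on top of real-time merging: when the preset time trigger condition is met, a merge determination is triggered. The preset time trigger condition may be that timed merging is triggered when a system-preset moment T1 arrives, that timed merging is performed once every T2 time periods, or some other time-based condition; the present invention places no limitation on this. Because timed merging involves both first-type and second-type files, the total amount of data merged is relatively large, and so is the overhead of a timed merge. Specifically, timed merging may be scheduled for the moment when the database service is least busy, for example late at night each day.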
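The two example time triggers (a preset moment T1, or a fixed interval T2) can be checked with a small predicate. This is a sketch under assumptions: the function name, the `last_run` bookkeeping, and the choice to fire at most once per day for T1 are illustrative and not specified in the source.

```python
from datetime import datetime, time, timedelta


def timed_trigger_met(now, last_run, at=None, every=None):
    """Decide whether a timed merge should fire.

    at:    time of day T1 at which the merge should run, or None
    every: interval T2 (timedelta) between merges, or None
    """
    if at is not None:
        scheduled = datetime.combine(now.date(), at)
        # Fire once after today's scheduled moment has passed.
        if last_run < scheduled <= now:
            return True
    if every is not None and now - last_run >= every:
        return True
    return False
```

A deployment that wants the merge to run when the service is least busy could pass, say, `at=time(2, 0)`; the hour itself is a tuning choice outside the scope of the text.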
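S206: Determine whether the data capacity of the merged file produced by timed merging is greater than the second set threshold A. If it is, go to step S208; if it is not, go to step S207.
S207: If not, store the merged file as a second-type file.
When the preset time condition is next met, a timed merge that includes the newly generated second-type file is triggered.
S208: If so, archive the merged file as a third-type file.
If the data capacity of a file produced by timed merging is greater than the second set threshold, the file is treated as a third-type file and archived. Third-type files no longer take part in file merging. The second set threshold is generally a large value, for example 200 GB. The purpose of this parameter is to keep oversized files out of merging, so that the CPU and disk I/O overhead of merging does not grow without bound as the database grows.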
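Steps S205 to S208 amount to selecting the non-archived files as merge input and then routing the merged output by size. A minimal sketch, assuming files are described by `(name, category, size_bytes)` tuples; the function names and the returned labels are hypothetical. The comparison is strict, as step S206 states.

```python
def timed_merge_inputs(files):
    """Select the inputs of a timed merge: first- and second-type files only.

    files: iterable of (name, category, size_bytes) tuples.
    """
    return [name for name, category, _ in files if category in (1, 2)]


def store_or_archive(merged_size, archive_threshold):
    """Route the output of a timed merge (steps S206 to S208)."""
    if merged_size > archive_threshold:
        return "archive_as_third_type"
    return "store_as_second_type"
```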
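In the second embodiment, on top of the prior-art file merging triggered by the number of files (also called real-time merging), a time-triggered file merging mechanism (timed merging) is introduced, in which a file merge is triggered at a specified moment. For example, a merge can be triggered when the database service is least busy, which effectively relieves the contention for hardware resources that file merging causes when the database is busy, and improves database performance.
In addition, in the second embodiment only newly loaded files and files produced by real-time merging take part in real-time merging, while the large files produced by timed merging do not. This greatly reduces the number of files taking part in real-time merging and keeps their data volume controllable, which further ensures that the merging overhead is controllable.
Furthermore, because the archive threshold A is set, files whose data capacity exceeds the archive threshold are archived and do not take part in file merging. This keeps large files out of merging, so the CPU and disk I/O overhead of merging does not grow without bound as the database grows, which ensures that the merging overhead is controllable.
The prior-art method always aims to merge all historical data together, and all historical files take part in real-time merging. A further problem with this approach is that it intensifies contention for hardware resources when the service is busy. New data files usually grow fastest during the periods when the database is busiest, so under the prior-art method the merges triggered by the count threshold are most frequent at exactly those times. As a result, prior-art merge operations compete with the database's main functions for hardware resources such as CPU and disk I/O during peak hours, which seriously affects the performance of the database itself. Conversely, when the database service is idle, the merge task is also relatively idle, so the hardware's idle-time processing capacity is wasted.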
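To solve this problem, the third embodiment of the present invention differs from the second embodiment in that timed merging is performed only when the database is relatively idle. When the data capacity of a file produced by timed merging is greater than the archive threshold A, the file is archived. Whereas the second embodiment combines real-time merging with timed merging, the third embodiment includes only the timed merging policy and the archiving policy.
FIG. 4 is a flowchart of the third embodiment of the file merging method provided by the embodiments of the present invention.
S401: When a new file is generated, determine the category of the new file.
In this embodiment, files are divided into archived files and non-archived files. Files whose data capacity is greater than the set threshold are marked as archived files and do not take part in merging. Only files whose data capacity is less than the set threshold take part in timed merging.
S402: When the preset time trigger condition is met, merge the non-archived files.
In the third embodiment, files are divided into archived files and non-archived files, and different merging policies are defined for the different categories. Archived files do not take part in timed merging. Non-archived files take part in timed merging triggered by a time threshold. Here, the preset time trigger condition may be that timed merging is triggered when a system-preset moment T1 arrives, that timed merging is performed once every T2 time periods, or some other time-based condition; the present invention places no limitation on this.
S403: Determine whether the data capacity of the merged file produced by timed merging is greater than the second set threshold A. If it is not, go to step S404; if it is, go to step S405.
S404: If not, store the merged file as a non-archived file.
When the preset time condition is next met, a timed merge that includes the newly generated non-archived file is triggered.
S405: If so, archive the merged file as an archived file. Archived files do not take part in file merging.
In the third embodiment, files are merged when the database service is idle, which overcomes the prior-art drawback of resource contention when the service is busy. In addition, merged files whose data capacity is greater than the set threshold are archived, so the merging overhead rises within one archive cycle and falls back to its lowest value once the archive condition is reached, which keeps the merging overhead controllable.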
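The rise-and-reset behaviour of the merging overhead within an archive cycle can be illustrated with a toy simulation. It rests on simplifying assumptions that are not in the source: a constant amount of new data arrives between timed merges, records do not overlap, and the cost of a merge is taken to be the amount of data rewritten. The function name is hypothetical.

```python
def timed_merge_costs(inflow_per_period, archive_threshold, periods):
    """Simulate the data volume rewritten by each successive timed merge."""
    carry = 0  # size of the non-archived file left by the previous merge
    costs = []
    for _ in range(periods):
        merged = carry + inflow_per_period  # everything non-archived is rewritten
        costs.append(merged)
        # Above the threshold the merged file is archived and leaves the cycle.
        carry = 0 if merged > archive_threshold else merged
    return costs
```

With an inflow of 40 units per period and a threshold of 100, the per-merge cost runs 40, 80, 120 and then drops back to 40, so it is bounded by roughly the threshold plus one period's inflow instead of growing with the total size of the database.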
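FIG. 5 is a schematic diagram of the file merging apparatus provided by an embodiment of the present invention.
The apparatus includes:
An obtaining unit 501, configured to, when a new file is generated, determine the category of the new file and, according to a pre-stored correspondence between file categories and file merging policies, obtain the file merging policy corresponding to the category of the new file.
A trigger determining unit 502, configured to trigger a merge determination according to the file merging policy sent by the obtaining unit, and determine whether the file merge trigger condition corresponding to the file merging policy is met.
A merge executing unit 503, configured to, when the trigger determining unit 502 determines that the file merge trigger condition corresponding to the file merging policy is met, select the files that meet the file merge trigger condition and perform file merging.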
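The division of labour among the three units can be sketched as three small classes wired together. This is only an illustration of the data flow between units 501, 502, and 503: the class names, method names, and policy labels are hypothetical, and the merge itself is replaced by a placeholder string.

```python
class ObtainingUnit:
    """Unit 501: maps a file category to its merging policy."""

    def __init__(self, policy_table):
        self.policy_table = policy_table

    def policy_for(self, category):
        return self.policy_table[category]


class TriggerDeterminingUnit:
    """Unit 502: decides whether the policy's trigger condition is met."""

    def __init__(self, count_threshold):
        self.count_threshold = count_threshold

    def is_met(self, policy, file_count, time_due):
        if policy == "real_time":
            return file_count > self.count_threshold
        if policy == "timed":
            return time_due
        return False


class MergeExecutingUnit:
    """Unit 503: performs the merge (placeholder implementation)."""

    def run(self, files):
        return "merged(" + ",".join(files) + ")"


def on_new_file(category, files, time_due, obtain, judge, execute):
    """Drive one pass: look up policy, test trigger, merge if met."""
    policy = obtain.policy_for(category)
    if judge.is_met(policy, len(files), time_due):
        return execute.run(files)
    return None
```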
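Further, the file merging policy includes any one or more of the following policies:
a first file merging policy, which uses the number of files reaching a first set threshold as the trigger condition;
a second file merging policy, which uses time as the trigger condition.
Further, the file categories include first-type files, second-type files, and third-type files, where:
a first-type file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy, the first file merging policy using the number of files reaching the first set threshold as the trigger condition;
a second-type file is a file generated according to the second file merging policy, the second file merging policy using time as the trigger condition;
a third-type file is a file whose data capacity is greater than a second set threshold.
Further, the trigger determining unit is:
A first trigger determining subunit, configured to, when a new first-type file is generated, trigger a merge determination and determine, according to the first file merging policy, whether the merge trigger condition is met; and to determine that the merge trigger condition is met when, among all first-type files, the number of files whose data capacity meets the preset capacity condition is greater than the first set threshold.
The merge executing unit is configured to, when the first trigger determining subunit determines according to the first file merging policy that the merge trigger condition is met, select the first-type files that meet the condition, trigger the merge process, and merge the files that meet the merge trigger condition.
Further, the trigger determining unit is specifically:
A second trigger determining subunit, configured to determine, according to the second file merging policy, whether the preset time trigger condition is met.
The merge executing unit is configured to merge the first-type files and the second-type files when the second trigger determining subunit determines that the preset time trigger condition is met.
Further, the apparatus also includes:
An archiving unit, configured to, after the first-type files and the second-type files are merged, treat any merged file whose data capacity is greater than the second set threshold as a third-type file, and archive the third-type file.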
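FIG. 6 is a schematic diagram of a file merging apparatus provided by another embodiment of the present invention.
The apparatus includes:
A memory 601, configured to store the correspondence between file categories and file merging policies.
A processor 602, configured to: when a new file is generated, determine the category of the new file and, according to the correspondence between file categories and file merging policies stored in the memory 601, obtain the file merging policy corresponding to the category of the new file; trigger a merge determination according to the file merging policy and determine whether the file merge trigger condition corresponding to the file merging policy is met; and, if it is met, select the files that meet the file merge trigger condition and perform file merging.
Further, the file merging policy includes:
a first file merging policy, which uses the number of files reaching a first set threshold as the trigger condition;
a second file merging policy, which uses time as the trigger condition.
The file categories include first-type files, second-type files, and third-type files, where:
a first-type file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
a second-type file is a file generated according to the second file merging policy;
a third-type file is a file whose data capacity is greater than a second set threshold.
Further, the processor 602 is specifically configured to: trigger a merge determination when a new first-type file is generated, and determine, according to the first file merging policy, whether the merge trigger condition is met; determine that the merge trigger condition is met when, among all first-type files, the number of files whose data capacity meets the preset capacity condition is greater than the first set threshold; and select the first-type files that meet the condition, trigger the merge process, and merge the files that meet the merge trigger condition.
Further, the processor 602 is also configured to determine, according to the second file merging policy, whether the preset time trigger condition is met, and to merge the first-type files and the second-type files when it determines that the preset time trigger condition is met.
Further, the processor 602 is also configured to, after the first-type files and the second-type files are merged, treat any merged file whose data capacity is greater than the second set threshold as a third-type file, and archive the third-type file.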
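It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The foregoing describes only specific implementations of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.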

Claims

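1. A file merging method, characterized in that the method includes:
when a new file is generated, determining the category of the new file and, according to a pre-stored correspondence between file categories and file merging policies, obtaining the file merging policy corresponding to the category of the new file;
triggering a merge determination according to the file merging policy, and determining whether the file merge trigger condition corresponding to the file merging policy is met;
if it is met, selecting the files that meet the file merge trigger condition and performing file merging.
2. The method according to claim 1, characterized in that the file merging policy includes any one or more of the following policies:
a first file merging policy, which uses the number of files reaching a first set threshold as the trigger condition;
a second file merging policy, which uses time as the trigger condition.
3. The method according to claim 2, characterized in that the file categories include first-type files, second-type files, and third-type files, where:
a first-type file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
a second-type file is a file generated according to the second file merging policy;
a third-type file is a file whose data capacity is greater than a second set threshold.
4. The method according to claim 3, characterized in that triggering a merge determination according to the file merging policy and determining whether the file merge trigger condition corresponding to the file merging policy is met includes:
when a new first-type file is generated, triggering a merge determination and determining, according to the first file merging policy, whether the merge trigger condition is met;
determining that the merge trigger condition is met when, among all first-type files, the number of files whose data capacity meets a preset capacity condition is greater than the first set threshold.
5. The method according to claim 3, characterized in that triggering a merge determination according to the file merging policy and determining whether the file merge trigger condition corresponding to the file merging policy is met includes:
determining, according to the second file merging policy, whether a preset time trigger condition is met;
and selecting the files that meet the file merge trigger condition and performing file merging is:
merging the first-type files and the second-type files when the preset time trigger condition is met.
6. The method according to claim 5, characterized in that the method further includes:
after the first-type files and the second-type files are merged, treating any merged file whose data capacity is greater than the second set threshold as a third-type file, and archiving the third-type file.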
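7. A file merging apparatus, characterized in that the apparatus includes:
an obtaining unit, configured to, when a new file is generated, determine the category of the new file and, according to a pre-stored correspondence between file categories and file merging policies, obtain the file merging policy corresponding to the category of the new file;
a trigger determining unit, configured to trigger a merge determination according to the file merging policy sent by the obtaining unit, and determine whether the file merge trigger condition corresponding to the file merging policy is met;
a merge executing unit, configured to, when the trigger determining unit determines that the file merge trigger condition corresponding to the file merging policy is met, select the files that meet the file merge trigger condition and perform file merging.
8. The apparatus according to claim 7, characterized in that the file merging policy includes any one or more of the following policies:
a first file merging policy, which uses the number of files reaching a first set threshold as the trigger condition;
a second file merging policy, which uses time as the trigger condition.
9. The apparatus according to claim 8, characterized in that the file categories include first-type files, second-type files, and third-type files, where:
a first-type file is a newly generated file that has not taken part in file merging, or a file generated according to the first file merging policy;
a second-type file is a file generated according to the second file merging policy;
a third-type file is a file whose data capacity is greater than a second set threshold.
10. The apparatus according to claim 9, characterized in that the trigger determining unit is:
a first trigger determining subunit, configured to, when a new first-type file is generated, trigger a merge determination and determine, according to the first file merging policy, whether the merge trigger condition is met; and to determine that the merge trigger condition is met when, among all first-type files, the number of files whose data capacity meets a preset capacity condition is greater than the first set threshold.
11. The apparatus according to claim 9, characterized in that the trigger determining unit is specifically:
a second trigger determining subunit, configured to determine, according to the second file merging policy, whether a preset time trigger condition is met;
in which case the merge executing unit is configured to merge the first-type files and the second-type files when the second trigger determining subunit determines that the preset time trigger condition is met.
12. The apparatus according to claim 11, characterized in that the apparatus further includes:
an archiving unit, configured to, after the first-type files and the second-type files are merged, treat any merged file whose data capacity is greater than the second set threshold as a third-type file, and archive the third-type file.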
PCT/CN2013/070619 2012-08-01 2013-01-17 一种文件合并方法和装置 WO2014019349A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210270365.7 2012-08-01
CN201210270365.7A CN103577454B (zh) 2012-08-01 2012-08-01 一种文件合并方法和装置

Publications (1)

Publication Number Publication Date
WO2014019349A1 true WO2014019349A1 (zh) 2014-02-06

Family

ID=50027187

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/070619 WO2014019349A1 (zh) 2012-08-01 2013-01-17 一种文件合并方法和装置

Country Status (2)

Country Link
CN (2) CN109960688A (zh)
WO (1) WO2014019349A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3279813A1 (en) * 2016-08-02 2018-02-07 Palantir Technologies Inc. Time-series data storage and processing database system
US10216695B1 (en) 2017-09-21 2019-02-26 Palantir Technologies Inc. Database system for time series data storage, processing, and analysis
US10417224B2 (en) 2017-08-14 2019-09-17 Palantir Technologies Inc. Time series database processing system
US10585907B2 (en) 2015-06-05 2020-03-10 Palantir Technologies Inc. Time-series data storage and processing database system
US11016986B2 (en) 2017-12-04 2021-05-25 Palantir Technologies Inc. Query-based time-series data display and processing system
US11281726B2 (en) 2017-12-01 2022-03-22 Palantir Technologies Inc. System and methods for faster processor comparisons of visual graph features
US11314738B2 (en) 2014-12-23 2022-04-26 Palantir Technologies Inc. Searching charts
US11379453B2 (en) 2017-06-02 2022-07-05 Palantir Technologies Inc. Systems and methods for retrieving and processing data
US12124467B2 (en) 2023-06-07 2024-10-22 Palantir Technologies Inc. Query-based time-series data display and processing system

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021213B (zh) * 2014-06-20 2017-06-16 中国银行股份有限公司 一种合并关联记录的方法及装置
US9503847B2 (en) * 2015-04-23 2016-11-22 Htc Corporation Electronic apparatus, uploading method and non-transitory computer readable storage medium thereof
CN107861959A (zh) * 2016-09-22 2018-03-30 阿里巴巴集团控股有限公司 数据处理方法、装置及系统
CN108021702A (zh) * 2017-12-26 2018-05-11 百度在线网络技术(北京)有限公司 基于LSM-tree的分级存储方法、装置、OLAP数据库系统及介质
CN108376169A (zh) * 2018-02-26 2018-08-07 众安信息技术服务有限公司 一种用于联机分析处理的数据处理方法和装置
CN110874349A (zh) * 2018-08-13 2020-03-10 北京京东尚科信息技术有限公司 一种文件整理方法和装置
CN110888837B (zh) * 2019-11-15 2021-01-22 星辰天合(北京)数据科技有限公司 对象存储小文件归并方法及装置
CN112925759B (zh) * 2021-03-31 2024-05-31 北京金山云网络技术有限公司 数据文件的处理方法和装置、存储介质、电子装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101018121A (zh) * 2007-03-15 2007-08-15 杭州华为三康技术有限公司 日志的聚合处理方法及聚合处理装置
CN101605028A (zh) * 2009-02-17 2009-12-16 北京安天电子设备有限公司 一种日志记录合并方法和系统
US20100223231A1 (en) * 2009-03-02 2010-09-02 Thales-Raytheon Systems Company Llc Merging Records From Different Databases
CN101902335A (zh) * 2009-05-27 2010-12-01 北京启明星辰信息技术股份有限公司 一种数据过滤与合并的方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7702666B2 (en) * 2002-06-06 2010-04-20 Ricoh Company, Ltd. Full-text search device performing merge processing by using full-text index-for-registration/deletion storage part with performing registration/deletion processing by using other full-text index-for-registration/deletion storage part
CN101571827A (zh) * 2008-04-30 2009-11-04 国际商业机器公司 保存日志的方法和日志系统
US8495316B2 (en) * 2008-08-25 2013-07-23 Symantec Operating Corporation Efficient management of archival images of virtual machines having incremental snapshots
CN102023991A (zh) * 2009-09-21 2011-04-20 中兴通讯股份有限公司 在终端上更新索引并基于其对搜索结果排序的方法及装置
CN102087646B (zh) * 2009-12-07 2013-03-20 北大方正集团有限公司 一种索引建立方法及装置

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11314738B2 (en) 2014-12-23 2022-04-26 Palantir Technologies Inc. Searching charts
US10585907B2 (en) 2015-06-05 2020-03-10 Palantir Technologies Inc. Time-series data storage and processing database system
EP3279813A1 (en) * 2016-08-02 2018-02-07 Palantir Technologies Inc. Time-series data storage and processing database system
US10664444B2 (en) 2016-08-02 2020-05-26 Palantir Technologies Inc. Time-series data storage and processing database system
US11379453B2 (en) 2017-06-02 2022-07-05 Palantir Technologies Inc. Systems and methods for retrieving and processing data
US11397730B2 (en) 2017-08-14 2022-07-26 Palantir Technologies Inc. Time series database processing system
US10417224B2 (en) 2017-08-14 2019-09-17 Palantir Technologies Inc. Time series database processing system
US10216695B1 (en) 2017-09-21 2019-02-26 Palantir Technologies Inc. Database system for time series data storage, processing, and analysis
US11573970B2 (en) 2017-09-21 2023-02-07 Palantir Technologies Inc. Database system for time series data storage, processing, and analysis
US11914605B2 (en) 2017-09-21 2024-02-27 Palantir Technologies Inc. Database system for time series data storage, processing, and analysis
US11281726B2 (en) 2017-12-01 2022-03-22 Palantir Technologies Inc. System and methods for faster processor comparisons of visual graph features
US12099570B2 (en) 2017-12-01 2024-09-24 Palantir Technologies Inc. System and methods for faster processor comparisons of visual graph features
US11016986B2 (en) 2017-12-04 2021-05-25 Palantir Technologies Inc. Query-based time-series data display and processing system
US12124467B2 (en) 2023-06-07 2024-10-22 Palantir Technologies Inc. Query-based time-series data display and processing system

Also Published As

Publication number Publication date
CN109960688A (zh) 2019-07-02
CN103577454B (zh) 2019-03-01
CN103577454A (zh) 2014-02-12

Similar Documents

Publication Publication Date Title
WO2014019349A1 (zh) 一种文件合并方法和装置
CN103412916B (zh) 一种监控系统的多维度数据存储、检索方法及装置
US9672267B2 (en) Hybrid data management system and method for managing large, varying datasets
WO2009119811A1 (ja) 情報再構成システム、情報再構成方法及び情報再構成用プログラム
US20160306810A1 (en) Big data statistics at data-block level
US9792231B1 (en) Computer system for managing I/O metric information by identifying one or more outliers and comparing set of aggregated I/O metrics
CN103631940A (zh) 一种应用于hbase数据库的数据写入方法及系统
US10884667B2 (en) Storage controller and IO request processing method
WO2017107812A1 (zh) 一种用户日志存储方法及设备
CN112866136B (zh) 业务数据处理方法和装置
CN109445702A (zh) 一种块级数据去重存储系统
CN101866359A (zh) 一种机群文件系统中的小文件存储和访问方法
WO2016070529A1 (zh) 一种实现重复数据删除的方法及装置
WO2022111733A1 (zh) 消息处理方法、装置及电子设备
WO2018068714A1 (zh) 重删处理方法及存储设备
CN101673192A (zh) 时序化的数据处理方法、装置及系统
WO2016197814A1 (zh) 垃圾文件识别及管理方法、识别装置、管理装置和终端
CN111443867A (zh) 一种数据存储方法、装置、设备及存储介质
US20150261439A1 (en) Tier Aware Caching Solution To Increase Application Performance
JP6060276B2 (ja) 監視レコード管理方法及びデバイス
US10671636B2 (en) In-memory DB connection support type scheduling method and system for real-time big data analysis in distributed computing environment
CN103647824A (zh) 一种存储资源优化调度发现算法
KR101830504B1 (ko) 분산 환경 기반 빅데이터 실시간 분석을 위한 인-메모리 db 연결 지원형 스케줄링 방법 및 시스템
CN106161056B (zh) 周期型数据的分布式缓存运维方法及装置
KR20160091471A (ko) 환형큐 기반의 인-메모리 데이터베이스 시스템에서의 데이터 처리방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13826217

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13826217

Country of ref document: EP

Kind code of ref document: A1