CN111881092A - Method and device for merging files based on cassandra database - Google Patents

Method and device for merging files based on cassandra database

Info

Publication number
CN111881092A
CN111881092A CN202010576064.1A CN202010576064A CN111881092A CN 111881092 A CN111881092 A CN 111881092A CN 202010576064 A CN202010576064 A CN 202010576064A CN 111881092 A CN111881092 A CN 111881092A
Authority
CN
China
Prior art keywords
merging
file
merged
files
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010576064.1A
Other languages
Chinese (zh)
Other versions
CN111881092B (en)
Inventor
叶志钢
王化民
张本军
王赟
谭国权
赵雨佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd
Priority to CN202010576064.1A
Publication of CN111881092A
Application granted
Publication of CN111881092B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of databases, and in particular to a method and a device for merging files based on a cassandra database. The method mainly comprises: receiving data files generated by the database and generating a merged file list for each disk; having the merging process of each disk acquire the merged file list of its disk and the sizes of the data files to be merged in that list; starting a parallel merging process of the database, which calculates the sum of the data file sizes obtained by the per-disk merging processes; and, when that sum reaches a merged file threshold, having the parallel merging process merge the data files to be merged on all disks at one time. The invention merges small files promptly while using fewer merge levels and temporary files, which reduces the number of merges, the disk space occupied by files, the number of disk IO operations, and contention for disk IO, thereby improving file merging performance and the read-write stability of the database.

Description

Method and device for merging files based on cassandra database
[ technical field ]
The invention relates to the field of databases, in particular to a method and a device for merging files based on a cassandra database.
[ background of the invention ]
The Cassandra database is an open-source distributed hybrid storage system characterized by decentralization, scalability, high availability, fault tolerance, and configurable consistency. When cassandra sequentially flushes cached data to disk, it generates many data files (Sorted String Tables, abbreviated sstable) of about 1-10 MB each. When the cassandra database processes a large volume of data, the huge number of files seriously degrades database stability and slows down queries.
To reduce the number of files that must be handled, the cassandra database provides a file merging (compaction) mechanism that merges many small files into a small number of large files. The commonly used file merging strategies currently include:
(1)Size Tiered Compaction Strategy
Multi-level merging is performed based on file size: small files are placed in a small-file merging thread, large files are placed in a large-file merging thread, and merging continues until the required file size is obtained. The number of files merged at a time is determined by the max_threshold parameter, which in theory gives the user a means to control the scale of each merge. In a real production environment, however, the effect is as follows. When a smaller max_threshold parameter (e.g., 32) is used, the number of files drops quickly, but the merged files are still small and trigger secondary merging; the secondary merging needs more disk IO resources and often coincides with the writing of the next batch, competing with it for disk IO. When a larger max_threshold parameter (e.g., 128) is used, the number of files cannot be reduced quickly, and once the merge condition is reached the merge often coincides with the writing of the next batch, again competing for disk IO and causing severe jitter in write performance. The disadvantage of this strategy is that several small files may still fall short of the required file size after merging and become a large file of the required size only after repeated rolling merges, so one copy of data is merged many times and the repeated reads and writes increase the read-write burden on the disk.
(2)Leveled Compaction Strategy
Scanning starts from the highest level to determine whether compaction is needed; if so, the sstables of that level form a task. Preferentially compacting the higher-level sstables effectively reduces the number of sstables that are merged upward from the lower levels. The disadvantages of this strategy are: 1. Compacting the high-level sstables first does reduce the number of sstables participating in compaction, but it lets sstables accumulate at the bottom level L0, and too many sstables under single-point management degrade read and write performance. 2. Until the merging of several large files completes, the data being merged remains on disk alongside the merge output; in the extreme case twice the disk space is needed, so disk space may run out.
(3)Time Window Compaction Strategy
Merging is based on time windows: data within a time window is merged according to strategy 1, and files outside the time window are no longer merged. In essence, merging is abandoned whenever it cannot be completed in time. The disadvantages of this strategy are: 1. Because strategy 1 is used, it has the same disadvantages as strategy 1. 2. When data is loaded frequently, for example once every 5 minutes, small files increase rapidly. With a large time window (1 day), the advantage of the window strategy is lost and heavy merge IO pressure arises as in strategy 1, which seriously affects data loading and makes the service unavailable. With a small time window (1 hour), writes are given priority, but merging of files not merged within the window is abandoned, so the number of files grows rapidly; the number of data files per day still approaches 100,000, and when data is retained for a long time the number of open files exceeds the limit the database process can handle, causing the database process to stop.
In view of this, overcoming the defects of the existing file merging strategies is a problem to be solved in this technical field.
[ summary of the invention ]
Aiming at the defects or improvement requirements of the prior art, the invention solves the problems of large disk read-write burden, more disk space occupation, large merging IO pressure and the like caused by triggering merging actions according to the number of files, merging levels or time windows in the existing file merging strategy.
The embodiment of the invention adopts the following technical scheme:
In a first aspect, the present invention provides a method for merging files based on a cassandra database, which specifically comprises: receiving data files generated by the database, and generating a merged file list for each disk; the merging process of each disk acquires the merged file list of its disk and the sizes of the data files to be merged in that list; starting a parallel merging process of the database, and calculating the sum of the data file sizes obtained by the merging processes of the disks; and when the sum of the data file sizes reaches a merged file threshold, the parallel merging process merges the data files to be merged on all disks at one time.
Preferably, if the total size of the files to be merged does not reach the threshold value of the merged files, the merging is not performed, and the data files generated next time by the database are waited to be received.
Preferably, whether the size of each data file is smaller than a file data volume threshold value is judged; if the data volume is smaller than the threshold value of the file data volume, the threshold value of the merged file uses a first merged file threshold value; if the data volume is larger than the threshold value of the file data volume, the threshold value of the merged file uses a second threshold value of the merged file; wherein the first merged file threshold is greater than the second merged file threshold.
Preferably, the merging strategies built into the system are disabled before the generated data files are received.
Preferably, if the size of the data file to be merged exceeds the file size threshold, the data file to be merged is merged into at least one file according to the file size threshold, and the rest part smaller than the file size threshold is not merged.
Preferably, the directories to be merged are specified, and only the directory to be merged starts a merging process to merge files.
Preferably, after merging all the data files to be merged at one time, the method further includes: the parallel merging process counts merging time; judging whether the merging time exceeds a merging time threshold value; and alarming the disk with the merging time exceeding the merging time threshold.
Preferably, if the merging time exceeds the merging time threshold, the database process is automatically restarted.
Preferably, the source files of the data files to be merged are marked as deleted, and after all the data files to be merged are merged at one time, the data files marked as deleted are deleted.
In a second aspect, the invention provides a device for merging files based on a cassandra database, which specifically comprises at least one processor and a memory connected through a data bus, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the processor, carry out the method for merging files based on a cassandra database of the first aspect.
Compared with the prior art, the embodiments of the invention have the following beneficial effects: by merging dynamically and using a merged file threshold as the trigger for the merging action, the generated small files are merged promptly and at one time once their total size reaches the merged file threshold, which provides a fast and stable small-file merging method. In the preferred schemes, adjusting the merged file threshold yields the best balance between the number of merged files and write performance, merging time detection and alarms report abnormal states of the merging process promptly, and a garbage collection mechanism optimizes the file storage structure and IO efficiency.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a flowchart of a method for merging files based on a cassandra database according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for merging files based on a cassandra database according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for merging files based on a cassandra database according to an embodiment of the present invention;
Fig. 4 is a flowchart of a method for merging files based on a cassandra database according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for merging files based on a cassandra database according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present invention concerns the system structure of a specific functional system, so the embodiments mainly explain the functional logic relationships among the structural modules and do not limit the specific software and hardware implementation.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention will be described in detail below with reference to the figures and examples.
Example 1:
In the cassandra database, when a client writes data, the client program determines the server node to which the data should be sent according to the token ranges on the cluster. The server receives the data with multiple threads in parallel; each thread sorts the data it receives and generates a data file smaller than 10 MB. In some implementation scenarios, when a single cassandra machine processes 4 TB of data per day, the number of data files the database process needs to open may exceed 200,000, and when the data must be retained for 7 days it reaches 1,400,000, so the data files must be merged to reduce their number. This embodiment therefore provides a new dynamic small-file merging method that avoids the defects of the existing file merging strategies.
Cassandra is a NoSQL distributed database that adopts a Log-Structured Merge Tree (LSM tree) architecture. The LSM tree is a layered, ordered, disk-oriented data structure that exploits the fact that batched sequential disk writes perform far better than random writes, and it reads and writes in layered batches. To implement batched file reads and writes, the LSM tree has two important processes: flushing data sequentially to disk to generate data files (sstables), and merging the data files (compaction). The Cassandra database receives the data written by the user and caches it in memory. When the cache is full, the cached data is flushed to disk and an sstable file is generated. The write cache allows data to be written in batches, which exploits the better sequential write performance of the disk. In addition, the cached data is sorted before being written to disk, so it can be located quickly by binary search, which improves lookup efficiency.
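The write-cache behavior described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not Cassandra's implementation: the `MemTable` class, its `capacity` parameter, and the `lookup` helper are hypothetical names, and the "flush" simply returns sorted in-memory lists instead of writing an sstable to disk.

```python
import bisect


class MemTable:
    """Toy write cache (hypothetical): buffers writes, flushes one sorted run."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of rows that counts as "cache full"
        self.rows = {}

    def put(self, key, value):
        """Buffer a write; return (keys, values) sorted by key when full, else None."""
        self.rows[key] = value
        if len(self.rows) < self.capacity:
            return None
        items = sorted(self.rows.items())  # sort before the sequential flush
        self.rows = {}
        return [k for k, _ in items], [v for _, v in items]


def lookup(keys, values, key):
    """Binary search in a flushed, sorted run; return the value or None."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```

Because each flushed run is sorted once at flush time, every later lookup in that run needs only a logarithmic number of comparisons.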
As shown in fig. 1, the method for merging files based on a cassandra database according to the embodiment of the present invention includes the following specific steps:
Step 101: receiving the data files generated by the database, and generating a merged file list for each disk.
The Cassandra database uses a distributed storage engine to store massive structured data across a large number of ordinary commercial-grade servers, which avoids data loss or service abnormality caused by a single point of failure. Each time a client writes data, the client program of the cassandra database obtains the token range information of the cluster and determines the server node to which the data should be sent according to the token ranges; the server receives the data with multiple threads in parallel, each thread sorts the data it receives, and the data is flushed sequentially to disk, generating a large number of sstables. Before files are merged, the files to be merged on each disk of each node in the distributed storage must be obtained and a merged file list generated for each disk, so that the files can be fetched according to the list during subsequent merging. In a specific embodiment, the files to be merged may be index files, data files, bloomfilter files, and the like.
In the existing LSM architecture, once data has been flushed sequentially to disk to generate data files, merging of the data files under the existing merging strategy starts automatically. So that merging in the cassandra database uses only the file merging method provided in this embodiment, all merging strategies built into the system must be disabled before the flush process starts, that is, before the generated data files are received, to avoid conflicts between merging strategies. Specifically, all built-in merging strategies can be disabled by specifying the parameter {'enabled': 'false'} among the table creation parameters when the database table is created.
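As a hedged illustration of the table creation parameter mentioned above, the following Python sketch only builds a CQL statement as a string; it does not connect to a database. The keyspace, table, and column names are hypothetical, and the choice of `SizeTieredCompactionStrategy` as the `class` value is an assumption made so that the `compaction` map is complete.

```python
def create_table_cql(keyspace, table):
    """Build a CREATE TABLE statement with automatic compaction disabled.

    The schema (id, payload) is a placeholder for illustration only.
    """
    return (
        f"CREATE TABLE {keyspace}.{table} ("
        "id uuid PRIMARY KEY, payload blob) "
        "WITH compaction = {'class': 'SizeTieredCompactionStrategy', "
        "'enabled': 'false'};"
    )


# Hypothetical keyspace and table names.
statement = create_table_cql("demo_ks", "events")
```

With the built-in strategy switched off in this way, merging is left entirely to the external merging processes that the embodiment describes.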
In some embodiments, only the files in some directories need to be merged, or the merged files are stored in a designated directory. Therefore, before generating the merged file list of each disk, the directory to be merged can be designated according to the actual needs of the implementation scenario, and the merging process is started only for the directory to be merged to merge the files, so as to reduce the number of the files to be processed and improve the file processing efficiency.
Step 102: the merging process of each disk acquires the merged file list of its disk and the sizes of the data files to be merged in that list.
Because the cassandra database organizes the stored files into a Distributed File System (DFS), the physical storage resources used by each client or server are not necessarily directly connected to the local nodes, but are connected to the nodes through a computer network; or a complete hierarchical file system formed by combining several different logical disk partitions or volume labels. DFS provides a logical tree file system structure for resources distributed at any position on the network, so that users can access shared files distributed on the network more conveniently. When merging files, in order to merge files on different nodes, merged file directories of different disks in each node need to be read. In this embodiment, a merging process is respectively started in each disk, the merged file list in the disk is read, and the size of the data file to be merged is obtained, which is convenient for use in the subsequent steps.
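Steps 101 and 102 can be sketched in Python as follows. This is a minimal sketch under assumptions: each disk is represented by one local directory, `build_merge_lists` is a hypothetical helper, and filtering on the `-Data.db` file name suffix is an illustrative choice (the text notes that index and bloomfilter files may also be merged).

```python
import os


def build_merge_lists(disk_dirs, suffix="-Data.db"):
    """Return {disk_dir: [(path, size_in_bytes), ...]} for files ending in suffix."""
    merge_lists = {}
    for disk_dir in disk_dirs:
        entries = []
        with os.scandir(disk_dir) as it:
            for entry in it:
                # Only regular files with the chosen suffix join the merge list.
                if entry.is_file() and entry.name.endswith(suffix):
                    entries.append((entry.path, entry.stat().st_size))
        merge_lists[disk_dir] = sorted(entries)
    return merge_lists
```

Restricting `disk_dirs` to the directories that actually need merging corresponds to the optional directory designation described in step 101.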
Step 103: starting a parallel merging process of the database, and calculating the sum of the data file sizes obtained by the merging processes of the disks.
To manage and merge the files to be merged on all disks of the cluster in a unified way, a single parallel merging process is started in the database system. The parallel merging process scans the merging process of each disk, gathers the data file sizes obtained by each of them, and calculates the sum of the data file sizes as the basis for deciding whether to start merging the data files. In an actual usage scenario of this embodiment, the data files on each disk change only after the client writes data and data files are generated, so the sum of the data file sizes obtained by the per-disk merging processes is calculated, and the need for merging is judged, only after each such write.
Step 104: when the sum of the data file sizes reaches the merged file threshold, the parallel merging process merges the data files to be merged on all disks at one time.
To reduce the number of file merges and strictly ensure that one copy of data is merged only once, the size of the merge target file, namely the merged file threshold, must be specified. Each time the client writes data and data files are generated, the parallel merging process of the database calculates the sum of the data file sizes obtained by the per-disk merging processes and dynamically judges whether the files on the disks have reached the merged file threshold. Once the total size of the files to be merged reaches the merged file threshold, the merging of the data files corresponding to that write starts immediately, the generated data files are merged, and the merged file is moved to the next level. If the total size of the files to be merged does not reach the merged file threshold, no merging is performed; the process waits for the data files generated next by the database, calculates whether the sum of the sizes of the data files from both batches reaches the merged file threshold, and merges if it does. In some specific implementation scenarios each client write is small, and the sum of the data file sizes reaches the merged file threshold only after several writes; in that scenario the merging of data files starts after waiting for several batches.
Through steps 101 to 104, each time data written by the client is received, the sstables generated on each disk are scanned and the total file size is compared with the merged file threshold. Merging is performed only when the received data needs it and is skipped otherwise, which avoids repeated rolling merges and reduces the read-write burden on the disks.
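The trigger check of steps 103 and 104 reduces to a sum over disks followed by one comparison. The sketch below assumes the per-disk sizes have already been collected into a dictionary; the function names and the use of plain integer size units are hypothetical.

```python
def total_pending_size(sizes_by_disk):
    """Sum the sizes reported by every per-disk merging process."""
    return sum(sum(sizes) for sizes in sizes_by_disk.values())


def should_merge(sizes_by_disk, merge_threshold):
    """Return True once the cluster-wide total reaches the merged file threshold."""
    return total_pending_size(sizes_by_disk) >= merge_threshold
```

For example, with sizes of 400 and 300 on one disk and 200 on another, the total of 900 stays below a threshold of 1024, so no merge starts until a later write adds more data.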
In a specific implementation manner of this embodiment, in order to ensure that the merged file size reaches the merged file threshold, as shown in fig. 2, the process of merging the data files may be implemented by the following steps:
Step 201: receiving the data files generated by the database, and generating a merged file list for each disk.
Step 202: the merging process of each disk acquires the merged file list of its disk and the sizes of the data files to be merged in that list.
Step 203: starting a parallel merging process of the database, and calculating the sum of the data file sizes obtained by the merging processes of the disks.
Step 204: judging whether the sum of the sizes of the data files to be merged is smaller than the merged file threshold.
Step 205: if the sum of the sizes of the data files to be merged is smaller than the merged file threshold, waiting to receive the data files generated next by the database.
Step 206: otherwise, the parallel merging process merges the data files to be merged on all disks at one time.
Steps 201 to 203 correspond to steps 101 to 103, and step 206 corresponds to step 104. Through steps 201 to 206, whether to start merging is decided dynamically by comparing the sum of the sizes of the data files to be merged with the merged file threshold. This reduces the number of file merges and guarantees that the merged file reaches the required size, which avoids secondary merging caused by an undersized merged data file.
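The wait-or-merge loop of steps 204 to 206 can be sketched as a small stateful object that accumulates files across writes. The class name `PendingMerge`, the `(name, size)` tuples, and returning the batch instead of performing a real merge are all assumptions made for illustration.

```python
class PendingMerge:
    """Accumulate flushed files until their total size reaches the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = []  # (name, size) pairs waiting to be merged

    def on_flush(self, new_files):
        """Record newly generated files; return the batch to merge, or None."""
        self.pending.extend(new_files)
        total = sum(size for _, size in self.pending)
        if total < self.threshold:
            return None  # step 205: wait for the next batch of data files
        batch, self.pending = self.pending, []
        return batch  # step 206: merge everything pending at one time
```

Each file appears in exactly one returned batch, which mirrors the goal that one copy of data is merged only once.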
Further, if the size of the data written by the client each time is not an integral multiple of the file size threshold, merging the generated sstables according to the file size threshold leaves a file smaller than the file size threshold, which would then need a second merge. To avoid this, if the size of the data files to be merged exceeds the file size threshold, the data files are merged into at least one file according to the file size threshold, and the remaining part smaller than the file size threshold is not merged. In a specific implementation scenario, the sum of the sizes of the data files to be merged obtained by the per-disk merging processes is 2.6G and the file size threshold is 1G. The parallel merging process then merges the data files into 2 files of 1G each; the remaining 0.6G is not merged this time and is merged after the data files generated next by the database are received. This partial merging ensures that every merged file is no smaller than the file size threshold, and it also shortens the time that data files awaiting merging remain on disk, which saves disk space.
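The arithmetic of partial merging can be checked with a short sketch. It assumes sizes are expressed as whole megabytes (so 2.6G is taken as 2662 MB and 1G as 1024 MB) and uses a hypothetical `plan_partial_merge` helper; it only computes how many full-size outputs result and how much data is deferred.

```python
def plan_partial_merge(total_size, file_size_threshold):
    """Return (number of full-size merged files, size left for the next round)."""
    if file_size_threshold <= 0:
        raise ValueError("file size threshold must be positive")
    full_files, remainder = divmod(total_size, file_size_threshold)
    return full_files, remainder


# 2.6G of pending data with a 1G threshold, in MB (assumed unit conversion).
full_files, deferred_mb = plan_partial_merge(2662, 1024)
```

Here `full_files` is 2 and `deferred_mb` is 614, that is, about 0.6G waits for the next batch of data files.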
In order to adapt to different practical use scenes, the threshold value of the merged file can be adjusted to achieve the optimal balance between the number of files and the writing performance. As shown in fig. 3, the specific steps are as follows:
Step 301: judging whether the size of each data file is smaller than a file data volume threshold.
Step 302: if the size is smaller than the file data volume threshold, the merged file threshold uses a first merged file threshold.
Step 303: if the size is larger than the file data volume threshold, the merged file threshold uses a second merged file threshold, wherein the first merged file threshold is greater than the second merged file threshold.
Through steps 301 to 303, the merged file threshold is adjusted according to the file data volume threshold in order to balance the number of files against write performance. When the volume of the received service data is smaller than the file data volume threshold, the larger first merged file threshold is used, so more small files are merged into one large file and the number of merged files is further reduced. When the volume of the received service data is larger than the file data volume threshold, the smaller second merged file threshold is used; although the number of files increases, write performance is given priority, the read and write performance of the disk stays stable, and the database service runs stably. In a specific implementation of this embodiment, each data server processes 4 TB of original data per day, which compresses to 1.3 TB. With the merged file threshold set to 1G, the number of files after merging is 1.3 TB × 1024 GB/TB ÷ 1 GB ≈ 1331 per day; when a service's data files are retained for 15 days, the total number of files is 15 × 1331 = 19,965. The number of merged files stays well within a controllable range, merging completes stably within the set time, and the database runs stably.
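The threshold selection of steps 301 to 303 and the file count estimate can be sketched as two small functions. The function and parameter names are hypothetical, and treating a size exactly equal to the file data volume threshold as the "larger" case is an assumption, since the text does not specify it.

```python
def choose_merge_threshold(file_size, size_cutoff, first_threshold, second_threshold):
    """Pick the larger first threshold for small files, else the second one."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    # Assumption: a file exactly at the cutoff is treated as the large case.
    return first_threshold if file_size < size_cutoff else second_threshold


def merged_file_count(compressed_tb_per_day, threshold_gb, retention_days):
    """Return (files per day, files over the retention period)."""
    per_day = int(compressed_tb_per_day * 1024 / threshold_gb)
    return per_day, per_day * retention_days


# Figures from the text: 1.3 TB per day, 1G threshold, 15 days of retention.
per_day, total = merged_file_count(1.3, 1, 15)
```

With the figures from the text, `per_day` is 1331 and `total` is 19965.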
In order to further ensure the stability of database operation and avoid performance problems such as excessive memory occupation and increased query delay caused by abnormal file merging, the small file merging method in the embodiment further includes a merging performance detection and alarm mechanism.
In a specific implementation scenario of this embodiment, as shown in fig. 4, the steps of performing merging performance detection and alarm are as follows:
Step 401: the parallel merging process measures the merging time.
Step 402: judging whether the merging time exceeds a merging time threshold.
Step 403: raising an alarm for any disk whose merging time exceeds the merging time threshold.
In actual use, under the condition that the system performance is stable, the time length for merging the files is also stable, and if the actual merging time is far longer than the theoretical merging time or the average historical merging time, it is indicated that an exception may occur in the merging process and exception handling is required. In a particular implementation scenario, the merge time threshold may be determined by estimating the time at which theoretical merges are made or calculating the average time of historical merges.
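One way to derive the merging time threshold from history and to find the disks that need an alarm is sketched below. The multiplier `factor` is an assumption (the text only says the actual time is "far longer" than the average), and both function names are hypothetical.

```python
from statistics import mean


def merge_time_threshold(history_seconds, factor=2.0):
    """Threshold = average historical merge time scaled by an assumed factor."""
    if not history_seconds:
        raise ValueError("at least one historical merge time is required")
    return mean(history_seconds) * factor


def disks_to_alarm(merge_seconds_by_disk, threshold):
    """Return the disks whose measured merge time exceeds the threshold."""
    return sorted(
        disk for disk, seconds in merge_seconds_by_disk.items()
        if seconds > threshold
    )
```

For example, historical merges of 10, 12, and 14 seconds give a threshold of 24 seconds with the assumed factor of 2, so a disk that took 30 seconds would be reported.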
In order to timely handle the merging process with the exception, when the parallel merging process finds that the merging time of a certain merging process exceeds the merging time threshold, that is, when the merging is abnormal, the found merging exception needs to be alarmed. In a specific use scenario, the alarm information may be displayed through a management interface of the cassandra database. In order to avoid waiting with other functions of the cassandra database and display the alarm information in time, a special monitoring process can be used for displaying the merging state and displaying the alarm information. In order to enable the manager and the user of the database to obtain more detailed merging exception information, the specific condition of the merging thread with the exception can be output in an exception log mode and the like, and specific exception data can be provided so as to be convenient for the manager or the user of the database to process.
To increase the degree of automation of exception handling, a thread with a merging exception can be handled automatically by a preset exception handling program. Depending on the requirements of the specific implementation scenario, the handler can either process the specific abnormal data in a targeted manner or directly restart the database process. This avoids performance problems such as excessive memory occupation and increased query latency caused by abnormal merging, and improves the running stability of the database.
In the cassandra storage management mechanism, a garbage collection mechanism is used to reduce the number of disk IO operations: data that needs to be deleted is not deleted directly but is only marked as deleted and remains on disk, and the data marked as deleted is physically deleted in one pass when a merging process runs. The small file merging method provided in this embodiment likewise deletes data through a garbage collection mechanism: during merging, the source files of the data files to be merged are marked as deleted, and the files marked as deleted are removed only after all the data files to be merged have been merged at one time. This avoids the repeated disk IO caused by deleting a large number of files individually, reduces disk load, and improves merging efficiency.
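The mark-then-delete behaviour described above can be illustrated with a short sketch. It is only an analogy under stated assumptions: plain byte concatenation stands in for a real SSTable merge, and the function name and file names are hypothetical.

```python
import os
import tempfile


def merge_with_deferred_delete(sources, target):
    """Merge `sources` into `target`, deleting the sources in one final pass.

    Byte concatenation is a stand-in for a real SSTable merge (assumption).
    """
    marked_deleted = []
    with open(target, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                out.write(src.read())
            # Only mark the source as deleted; it stays on disk for now.
            marked_deleted.append(path)
    # All files are merged: remove everything that was marked, in one pass.
    for path in marked_deleted:
        os.remove(path)
    return marked_deleted


with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for i, payload in enumerate([b"a", b"bb", b"ccc"]):
        path = os.path.join(workdir, f"part{i}.db")
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    merged_path = os.path.join(workdir, "merged.db")
    removed = merge_with_deferred_delete(paths, merged_path)
    merged_size = os.path.getsize(merged_path)
    remaining = sorted(os.listdir(workdir))
```

Because no source file is removed until the merged file is fully written, a failure during merging leaves the original files in place.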
The method for merging files based on the cassandra database dynamically monitors the sizes of the received files and merges the files to be merged as soon as their total size reaches the merging size threshold. Small files are thus merged promptly while using fewer merging levels and fewer temporary files. This reduces the number of merges, the disk space occupied by files, the number of disk IO operations, and disk IO contention, which improves file merging performance and the read-write stability of the database.
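The size-triggered merge summarized above can be sketched as a small monitor that accumulates pending files per disk and releases them as one batch once their combined size reaches the threshold. The class name `MergeMonitor`, its methods, and the byte values are assumptions made for illustration; the actual merge of the returned batch is out of scope here.

```python
from collections import defaultdict


class MergeMonitor:
    """Tracks pending data files per disk and triggers a merge by total size."""

    def __init__(self, merge_threshold_bytes):
        self.merge_threshold_bytes = merge_threshold_bytes
        # disk path -> {file name -> size in bytes}
        self.pending = defaultdict(dict)

    def total_pending(self):
        """Sum of the sizes of all files waiting to be merged, over all disks."""
        return sum(
            size for files in self.pending.values() for size in files.values()
        )

    def receive(self, disk, name, size_bytes):
        """Register a new data file; return the batch to merge, or None."""
        self.pending[disk][name] = size_bytes
        if self.total_pending() < self.merge_threshold_bytes:
            # Threshold not reached: wait for the next data file.
            return None
        batch = {d: dict(files) for d, files in self.pending.items()}
        self.pending.clear()
        return batch


monitor = MergeMonitor(merge_threshold_bytes=100)
first = monitor.receive("/data1", "a.db", 40)
second = monitor.receive("/data2", "b.db", 70)
```

Here the first file alone does not trigger a merge, while the second pushes the total past the threshold and releases both files in a single batch.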
Compared with the Size Tiered Compaction Strategy (SizeTieredCompactionStrategy), the small file merging method provided by this embodiment ensures that each merged file reaches the required size, which solves the problem of merged files still being very small. Multiple rounds of rolling merging are not needed, which reduces disk IO contention, disk performance jitter, and the disk read-write burden.
Compared with the Leveled Compaction Strategy (LeveledCompactionStrategy), the small file merging method provided by this embodiment treats files of all levels equally, so SSTables do not pile up at the lowest level, L0. The number of SSTables that a single node needs to manage remains stable, SSTables that meet the merging criterion are merged in time, and the method does not occupy excessive disk space.
Compared with the Time Window Compaction Strategy (TimeWindowCompactionStrategy), the small file merging method provided by this embodiment processes the received data files approximately in sequence, and merging is never abandoned because a time window has ended. All files that need to be merged are therefore processed, and a rapid increase in the number of files is avoided. Furthermore, abnormal merging processes are handled through the exception detection and alarm mechanism, which ensures the stability of the system; the number of merged files is optimized by adjusting the merged file threshold, which improves merging performance; and the garbage collection mechanism further reduces the number of disk IO operations and improves merging efficiency.
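The threshold adjustment mentioned above is stated in claim 3 as a choice between two merged file thresholds, the first being larger than the second. A minimal sketch of that selection follows. The function and parameter names are hypothetical, reading "each data file" as "all files in the batch" is an interpretation, and treating a size equal to the file data volume threshold as "not smaller" is an assumption, because the claim leaves that case open.

```python
def select_merge_threshold(
    file_sizes, file_size_threshold, first_threshold, second_threshold
):
    """Pick the merged file threshold for a batch of data files.

    Small files (all below `file_size_threshold`) use the larger first
    threshold; otherwise the smaller second threshold is used.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    if all(size < file_size_threshold for size in file_sizes):
        return first_threshold
    # Equality is treated as "not smaller" here (assumption).
    return second_threshold


small_batch = select_merge_threshold([1, 2], 10, 100, 50)
mixed_batch = select_merge_threshold([1, 20], 10, 100, 50)
```

With these illustrative numbers, a batch of uniformly small files waits for the larger threshold before merging, while a batch that contains a larger file merges at the smaller threshold.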
Embodiment 2:
On the basis of the method for merging files based on the cassandra database provided in embodiment 1, the present invention further provides a device for merging files based on the cassandra database that is capable of implementing the method. Fig. 5 is a schematic diagram of the device architecture in an embodiment of the present invention. The device of this embodiment includes one or more processors 21 and a memory 22. In fig. 5, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.
The memory 22, as a non-volatile computer-readable storage medium for the file merging method based on the cassandra database, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as those implementing the file merging method based on the cassandra database in embodiment 1. By running the non-volatile software programs, instructions, and modules stored in the memory 22, the processor 21 executes the various functional applications and data processing of the device, that is, implements the method for merging files based on the cassandra database according to embodiment 1.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the method for cassandra database-based file merging in embodiment 1 described above, for example, perform the various steps shown in fig. 1-4 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for merging files based on a cassandra database, characterized by comprising:
receiving data files generated by the database, and generating a merged file list for each disk;
acquiring, by the merging process of each disk, the merged file list of the corresponding disk, and acquiring the size of each data file to be merged in the merged file list of each disk;
starting a parallel merging process of the database, and calculating the sum of the sizes of the data files obtained by the merging process of each disk;
when the sum of the sizes of the data files reaches a merged file threshold, merging, by the parallel merging process, the data files to be merged in all the disks at one time.
2. The method for cassandra database-based file merging as recited in claim 1, wherein: if the total size of the files to be merged does not reach the merged file threshold, no merging is performed this time, and the method waits to receive the data files next generated by the database.
3. The method for cassandra database-based file merging as recited in claim 1, wherein:
determining whether the size of each data file is smaller than a file data volume threshold;
if the size is smaller than the file data volume threshold, the merged file threshold uses a first merged file threshold;
if the size is larger than the file data volume threshold, the merged file threshold uses a second merged file threshold;
wherein the first merged file threshold is greater than the second merged file threshold.
4. The method for cassandra database-based file merging as recited in claim 1, wherein: before the generated data files are received, the built-in merging strategy of the system is disabled.
5. The method for cassandra database-based file merging as recited in claim 1, wherein: if the size of the data files to be merged exceeds the file size threshold, merging the data files to be merged into at least one file according to the file size threshold, and not merging the rest parts smaller than the file size threshold.
6. The method for cassandra database-based file merging as recited in claim 1, wherein: the directories that need to be merged are specified, and a merging process is started to merge files only for the specified directories.
7. The method for cassandra database-based file merging as claimed in claim 1, wherein said merging all data files to be merged at a time further comprises:
recording, by the parallel merging process, the merging time;
determining whether the merging time exceeds a merging time threshold;
raising an alarm for any disk whose merging time exceeds the merging time threshold.
8. The method for cassandra database-based file merging as recited in claim 1, wherein: if the merging time exceeds the merging time threshold, the database process is automatically restarted.
9. The method for cassandra database-based file merging as recited in claim 1, wherein: the source files of the data files to be merged are marked as deleted, and the data files marked as deleted are deleted after all the data files to be merged have been merged at one time.
10. A device for merging files based on a cassandra database, characterized by:
comprising at least one processor and a memory, the at least one processor and the memory being connected by a data bus, the memory storing instructions executable by the at least one processor, the instructions, when executed by the processor, performing the method for cassandra database-based file merging according to any one of claims 1-9.
CN202010576064.1A 2020-06-22 2020-06-22 Method and device for merging files based on cassandra database Active CN111881092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576064.1A CN111881092B (en) 2020-06-22 2020-06-22 Method and device for merging files based on cassandra database

Publications (2)

Publication Number Publication Date
CN111881092A true CN111881092A (en) 2020-11-03
CN111881092B CN111881092B (en) 2024-07-09

Family

ID=73156949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010576064.1A Active CN111881092B (en) 2020-06-22 2020-06-22 Method and device for merging files based on cassandra database

Country Status (1)

Country Link
CN (1) CN111881092B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104360824A (en) * 2014-11-10 2015-02-18 北京奇虎科技有限公司 Data merging method and device
KR101668397B1 (en) * 2015-12-24 2016-10-21 한국과학기술정보연구원 Method and apparatus for the fast analysis of large-scale scientific data files
CN105956183A (en) * 2016-05-30 2016-09-21 广东电网有限责任公司电力调度控制中心 Method and system for multi-stage optimization storage of a lot of small files in distributed database
US20180349095A1 (en) * 2017-06-06 2018-12-06 ScaleFlux, Inc. Log-structured merge tree based data storage architecture
EP3477490A1 (en) * 2017-10-26 2019-05-01 Druva Technologies Pte. Ltd. Deduplicated merged indexed object storage file system
CN108021702A (en) * 2017-12-26 2018-05-11 百度在线网络技术(北京)有限公司 Classification storage method, device, OLAP database system and medium based on LSM-tree
CN109446165A (en) * 2018-10-11 2019-03-08 中盈优创资讯科技有限公司 The file mergences method and device of big data platform
CN110609813A (en) * 2019-08-14 2019-12-24 北京华电天仁电力控制技术有限公司 Data storage system and method
CN110727685A (en) * 2019-10-09 2020-01-24 苏州浪潮智能科技有限公司 Data compression method, equipment and storage medium based on Cassandra database
CN111221922A (en) * 2019-12-31 2020-06-02 苏州浪潮智能科技有限公司 RocksDB database data writing method and RocksDB database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
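Wang Boqian; Yu Qi; Liu Xin; Shen Li; Wang Zhiying; Chen Wei: "An efficient dynamic data management mechanism for the Cassandra database", Computer Science (计算机科学), no. 07 *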

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732433A (en) * 2021-03-30 2021-04-30 骊阳(广东)节能科技股份有限公司 Data processing system capable of carrying out priority allocation
CN113238712A (en) * 2021-04-23 2021-08-10 深圳市智微智能软件开发有限公司 Disk space utilization method, device, terminal and storage medium
CN115981570A (en) * 2023-01-10 2023-04-18 创云融达信息技术(天津)股份有限公司 Distributed object storage method and system based on KV database
CN115981570B (en) * 2023-01-10 2023-12-29 创云融达信息技术(天津)股份有限公司 Distributed object storage method and system based on KV database

Also Published As

Publication number Publication date
CN111881092B (en) 2024-07-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant