CN114168084A - File merging method, file merging device, electronic equipment and storage medium - Google Patents

File merging method, file merging device, electronic equipment and storage medium

Info

Publication number: CN114168084A
Application number: CN202111513482.7A
Authority: CN (China)
Prior art keywords: file, queue, merge, files, merging
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 张祎轶, 邹洁, 宋淑杰, 梁祎
Current and original assignee: China Telecom Corp Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by China Telecom Corp Ltd
Priority claimed to application CN202111513482.7A; published as CN114168084A

Links

Images

Classifications

    • G06F 3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F 16/172: Caching, prefetching or hoarding of files
    • G06F 16/1815: Journaling file systems
    • G06F 16/182: Distributed file systems
    • G06F 16/284: Relational databases
    • G06F 16/35: Clustering; Classification
    • G06F 3/0608: Saving storage space on storage systems
    • G06F 3/0643: Management of files
    • G06F 3/0656: Data buffering arrangements

Abstract

The embodiments of this application disclose a file merging method, a file merging apparatus, an electronic device, and a storage medium. The file merging method includes: obtaining, from a buffer queue, a first target file with the largest occupied space capacity and storing it in a merge queue; obtaining a second target file from the buffer queue and storing it in the merge queue, where the second target file is the file whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue; repeating the step of obtaining a second target file from the buffer queue and storing it in the merge queue until the buffer queue no longer contains a file whose occupied space capacity is smaller than the remaining space capacity of the merge queue; and merging the files in the merge queue to obtain a merged file. The embodiments provided in this application can improve the efficiency of file merging.

Description

File merging method, file merging device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a file merging method, a file merging apparatus, an electronic device, and a storage medium.
Background
With the rapid development of cloud computing and big data, the global data volume is growing exponentially, and traditional storage systems, constrained by factors such as equipment and maintenance costs, are gradually becoming unable to meet storage demands. In addition, as the number of small files keeps increasing, most distributed file storage systems can no longer store and read small files efficiently. How to store and manage massive numbers of small files, and how to improve their storage and access efficiency, are among the biggest challenges at present.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application provide a file merging method, a file merging device, an electronic device, and a storage medium, which can improve merging efficiency of files.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, a file merging method is provided, including: obtaining, from a buffer queue, a first target file with the largest occupied space capacity and storing the first target file in a merge queue; obtaining a second target file from the buffer queue and storing the second target file in the merge queue, where the second target file is the file whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue; repeating the step of obtaining a second target file from the buffer queue and storing it in the merge queue until the buffer queue no longer contains a file whose occupied space capacity is smaller than the remaining space capacity of the merge queue; and merging the files in the merge queue to obtain a merged file.
According to an aspect of an embodiment of the present application, there is provided a file merging apparatus, including: a first acquisition module, configured to obtain, from a buffer queue, a first target file with the largest occupied space capacity and store it in a merge queue; a second acquisition module, configured to obtain a second target file from the buffer queue and store it in the merge queue, where the second target file is the file whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue; a repeated-execution module, configured to repeat the step of obtaining a second target file from the buffer queue and storing it in the merge queue until the buffer queue no longer contains a file whose occupied space capacity is smaller than the remaining space capacity of the merge queue; and a merging module, configured to merge the files in the merge queue to obtain a merged file.
According to an aspect of the embodiments of the present application, there is provided an electronic device, including a processor and a memory, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, implement the file merging method as above.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions, which, when executed by a processor of a computer, cause the computer to execute the file merging method as previously provided.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the file merging method provided in the various alternative embodiments described above.
According to the technical solution provided by the embodiments of this application, the file with the largest occupied space capacity is first obtained from the buffer queue and stored in the merge queue; then files whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue are repeatedly obtained from the buffer queue and stored in the merge queue, until no such file remains in the buffer queue. This fills the merge queue as fully as possible in as few retrievals as possible, which improves the efficiency of file merging.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a flow chart illustrating a file merge method according to an exemplary embodiment of the present application;
FIG. 2 is a flowchart illustrating a file merge method according to an exemplary embodiment based on the embodiment of FIG. 1;
FIG. 3 is a flowchart of step S22 in the embodiment shown in FIG. 2 in an exemplary embodiment;
FIG. 4 illustrates a flow chart of a file storage method according to an exemplary embodiment of the present application;
FIG. 5 is a flowchart of step S400 in the embodiment shown in FIG. 1 in an exemplary embodiment;
FIG. 6 is a diagram illustrating the results of a prior art file merging approach;
FIG. 7 is a diagram illustrating the result of merging files using the file merging method provided herein;
FIG. 8 is a block diagram of a file merge device shown in an exemplary embodiment of the present application;
FIG. 9 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It should also be noted that "a plurality" in this application means two or more. "And/or" describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
Distributed file storage systems such as Hadoop were not designed with today's ever-growing number of mobile applications in mind, nor with the massive numbers of small files users generate while using those applications, nor with the memory waste that storing such files causes. Data with a file size of 1 KB to 10 MB is commonly called a small file, and collections of millions of such files or more constitute a huge volume of data. This gives rise to the problem of massive small files in distributed file storage systems, known as Lots of Small Files (LOSF). LOSF storage has long been a research topic in the industry. As the number of small files grows, the storage space in the system cannot be fully utilized and a large amount of memory is wasted; moreover, when a distributed file storage system processes small-file data, its storage performance and read-write efficiency cannot be maintained at their original level. Under a heavy load of small files, a distributed file storage system becomes unwieldy and slow, or even stops working.
To address the low file storage efficiency caused by LOSF, technical solutions such as Hadoop Archive (HAR) and SequenceFile have been proposed. Their principle is to combine small files, according to certain criteria, into merged files of a certain size before placing them into the distributed file storage system for storage. This can reduce the number of files in the system to a certain extent, reduce the node memory and metadata required to store the files, and improve the performance of the distributed file storage system. However, the current merging criteria may introduce new problems, such as cross-block storage and inefficient merging. It is fair to say that the existing solutions have not succeeded in improving the storage performance of distributed systems.
To address at least the above problems in the prior art, embodiments of the present application provide a file merging method, a file merging apparatus, an electronic device, and a computer-readable storage medium, which are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart illustrating a file merging method according to an exemplary embodiment of the present application, and as shown in fig. 1, the file merging method provided in this embodiment includes steps S100 to S400, and reference is made to the following for detailed description:
Step S100: obtain the first target file, i.e. the file with the largest occupied space capacity, from the buffer queue, and store the first target file in the merge queue.
A queue is a special linear list: like a stack, it is a linear list with restricted operations, but it permits delete operations only at the front end of the list and insert operations only at the back end (rear). The end at which insertions are performed is called the tail of the queue, and the end at which deletions are performed is called the head of the queue. A queue containing no elements is called an empty queue.
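As a minimal illustration of these queue semantics (not part of the patent; Python's `collections.deque` is used here purely as a stand-in), insertions happen only at the tail and deletions only at the head:

```python
from collections import deque

# A queue: insertions only at the tail (rear), deletions only at the head.
q = deque()
for item in ("a", "b", "c"):
    q.append(item)            # insert at the tail

first = q.popleft()           # delete from the head; returns "a"
assert first == "a"
assert list(q) == ["b", "c"]  # an empty queue would have no elements left
```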
For example, at least one buffer queue and at least one merge queue are constructed in advance. In a real application scenario there are often many files to be merged, so in this embodiment the buffer queue and the merge queue together hold the files to be merged: the merge queue merges and outputs the files it holds, while the buffer queue serves as a staging queue. When files to be merged are obtained, they are stored directly in the buffer queue; the buffer queue is then traversed, and files satisfying preset conditions are moved into the merge queue, so that the merge queue can merge and output the one or more files it contains. In this embodiment, the buffer queue buffers data and helps ensure that the sizes of the resulting merged files are distributed as uniformly as possible.
In this embodiment, the number of the buffer queues and the merge queue may be determined according to the number of the files to be merged, and is not specifically limited herein.
For example, since the merged file obtained after merging may subsequently be stored in the distributed file storage system, and in order to avoid the memory waste caused by storing a large number of small files, the total space capacity of the merge queue is set to be less than or equal to the default storage file threshold of a minimum storage unit in the distributed file storage system. For instance, the minimum storage unit in Hadoop is called a Block, and its default storage file threshold is 64 MB, so the total space capacity of the merge queue may be set to be less than or equal to 64 MB. The total space capacity of the buffer queue may be less than or equal to the total space capacity of the merge queue.
For example, if there are multiple merge queues, in order to make the occupied space capacity of the multiple merge files distributed uniformly, the total space capacity of the multiple merge queues is set to be the same, and the total space capacity of the multiple buffer queues may be the same or different, which is not limited herein.
In this embodiment, the occupied space capacity of a file is simply the size of the file. The first target file, i.e. the file with the largest occupied space capacity, is obtained from the buffer queue and stored in the merge queue.
In this embodiment, the file with the largest occupied space capacity in the current buffer queue is stored in the merge queue first, which reduces the number of times files must be read from the buffer queue and thereby speeds up merging. It will be appreciated that if there are multiple buffer queues, all of them are traversed to find the file with the largest occupied space capacity. Illustratively, after the position of the first target file in the current buffer queue is determined, and because of the queue's first-in-first-out property, the files ahead of the first target file are dequeued first, then the first target file is taken out of the current buffer queue and stored in the merge queue, and finally the previously dequeued files are stored in the current buffer queue again.
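The extract-and-re-enqueue procedure described above can be sketched as follows. Representing files as `(name, size)` tuples and re-enqueuing the held files at the tail (the only legal insertion point of a strict queue) are assumptions of this sketch, not details fixed by the text:

```python
from collections import deque

def extract_largest(buffer_queue):
    """Remove and return the file with the largest size from a FIFO queue.

    Files ahead of the target are dequeued and held, the target is taken,
    and the held files are then enqueued again at the tail (assumption:
    files are (name, size) tuples).
    """
    if not buffer_queue:
        return None
    largest = max(buffer_queue, key=lambda f: f[1])
    held = deque()
    while True:
        f = buffer_queue.popleft()
        if f == largest:
            break
        held.append(f)
    while held:                        # re-store the files that were ahead
        buffer_queue.append(held.popleft())
    return largest
```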
For example, after the first target file is obtained, it is determined whether its occupied space capacity exceeds 50% of the total space capacity of the merge queue. If so, the first target file is stored in the merge queue; if not, the first target file with the largest occupied space capacity is periodically re-obtained from the buffer queue within a preset time period, until its occupied space capacity exceeds 50% of the total space capacity of the merge queue, and it is then stored in the merge queue.
Step S200: obtain a second target file from the buffer queue, and store the second target file in the merge queue.
In this embodiment, the second target file is, among the files stored in all buffer queues, the file whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue. Here, the merge queue refers to the merge queue in which the first target file is stored.
In this embodiment, the second target file satisfies the condition that its occupied space capacity is smaller than the remaining space capacity of the merge queue, so the files in the merge queue never overflow its capacity. In addition, because the total space capacity of the merge queue is close to, yet no larger than, the storage threshold of the minimum storage unit of the distributed file storage system, the merged file can later be stored in the distributed file storage system without being split across additional blocks. This reduces the memory load on the name node to a certain extent, and the uniform distribution of file sizes also benefits the parallel computation efficiency of MapReduce (a programming model for parallel operations on large-scale data sets) and improves small-file storage efficiency. Furthermore, the second target file satisfies the condition that its occupied space capacity is closest to the remaining space capacity of the merge queue: on one hand, this improves merging efficiency by avoiding the time cost of fetching files from the buffer queue many times; on the other hand, it fills the total space capacity of the merge queue as fully as possible, avoiding large amounts of unused capacity in the merge queue and reducing memory consumption.
Step S300: repeat the steps of obtaining a second target file from the buffer queue and storing the second target file in the merge queue, until no file whose occupied space capacity is smaller than the remaining space capacity of the merge queue exists in the buffer queue.
In this embodiment, since the remaining space capacity of the merge queue shrinks as files are stored into it, the occupied space capacity of the second target file differs on each acquisition; that is, each later-acquired second target file occupies less space than the ones acquired before it.
Illustratively, suppose the remaining space capacity of the merge queue after storing the first target file is 20 MB. The second target file obtained on the first traversal of the buffer queue occupies 10 MB, leaving 10 MB in the merge queue; the second target file obtained on the second traversal occupies 6 MB, leaving 4 MB; the second target file obtained on the third traversal occupies 3 MB, leaving 1 MB; and the fourth traversal finds no second target file meeting the condition. The merge queue thus holds, in order, the first target file and the files occupying 10 MB, 6 MB, and 3 MB.
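The worked example above can be reproduced by a short best-fit sketch of steps S100 to S300. The function name, the integer MB sizes, and the 44 MB first target file (an assumed value chosen so that 20 MB remains in a 64 MB merge queue) are illustrative assumptions, not details from the text:

```python
def fill_merge_queue(buffer_sizes, total_capacity):
    """Best-fit sketch of steps S100-S300.

    buffer_sizes: occupied space (in MB) of each file in the buffer queue.
    Returns the sizes placed in the merge queue, largest file first.
    """
    pool = sorted(buffer_sizes, reverse=True)
    if not pool:
        return []
    merge = [pool.pop(0)]                  # S100: largest file goes in first
    remaining = total_capacity - merge[0]
    while True:                            # S200/S300: repeat until nothing fits
        fits = [s for s in pool if s < remaining]
        if not fits:
            break
        best = max(fits)                   # closest to the remaining capacity
        merge.append(best)
        pool.remove(best)
        remaining -= best
    return merge

# Mirrors the worked example: 20 MB remain after the (assumed) 44 MB first file.
print(fill_merge_queue([10, 44, 3, 6, 2], 64))   # prints [44, 10, 6, 3]
```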
Step S400: merge the files in the merge queue to obtain a merged file.
In this embodiment, merging the files in the merge queue means packaging the files held in the merge queue and uploading the package to the distributed file storage system, where it waits to be stored.
For convenience of description, the multiple files contained in a merged file are referred to as small files. A MapFile comprises an index part and a data part: the data part stores the small files' data, and the index part serves as the data index of the small files, recording each small file's key and its offset within the merged file. After the merged file is stored in the distributed file storage system in MapFile format, the index part can be loaded into memory when the merged file is accessed, and the position of a given small file can be located quickly through the index mapping, which greatly improves retrieval efficiency and in turn access efficiency. It should be noted that in this embodiment the files in the merge queue may also be merged in other ways, for example in an entry-based or dictionary-based manner, which are not specifically limited here.
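A minimal sketch of this index-plus-data layout follows. A plain dict stands in for MapFile's index part, and the function and parameter names are assumptions; this is not Hadoop's actual MapFile API:

```python
def pack_files(small_files):
    """Build a MapFile-like layout: a data part holding each small file's
    bytes and an index part mapping key -> offset in the merged file.
    """
    index = {}
    data = bytearray()
    for key, payload in small_files:
        index[key] = len(data)          # record this small file's offset
        data.extend(payload)
    return index, bytes(data)

def read_small_file(index, data, key, length):
    """Locate a small file via the index instead of scanning the data part."""
    offset = index[key]
    return data[offset:offset + length]
```

For example, packing two 5-byte payloads yields offsets 0 and 5, and a lookup touches only the indexed slice of the data part.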
The file merging method provided by this embodiment obtains the file with the largest occupied space capacity from the buffer queue and stores it in the merge queue, then repeatedly obtains from the buffer queue the file whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue and stores it in the merge queue, until no such file remains in the buffer queue. In this way the merge queue is filled as fully as possible with as few fetches as possible, which improves merging efficiency.
Illustratively, after merging the files in the merging queue to obtain a merged file, the merged file is stored in the distributed file storage system.
A distributed file storage system stores data dispersed across multiple independent devices. A traditional file storage system uses a centralized storage server to hold all data; that server becomes the bottleneck of system performance as well as the focal point of reliability and security concerns, and cannot meet the needs of large-scale storage applications. A distributed file storage system instead adopts a scalable architecture: multiple storage servers share the storage load, and a location server locates the stored information, which improves the system's reliability, availability, and access efficiency while remaining easy to scale. The merged file may be stored in a distributed file storage system such as the Hadoop distributed file system (HDFS), the Lustre distributed file storage system, or the general parallel file system (GPFS).
In this embodiment, since the total space capacity of the merge queue is set according to the storage threshold of the minimum storage unit of the distributed file storage system, and the multiple files forming the merge queue are the best matches of the files in the current buffer queue, the total space capacity of the merge queue occupied by the files in the merge queue is the largest, and further the total capacity space of the merge queue is fully utilized, so that a large amount of blank space capacity of the merge queue is avoided, and thus, when the merge file is stored in the distributed file storage system, the memory consumption can be reduced.
Illustratively, after the files in the merge queue are merged to obtain a merged file, the merge queue is emptied and the source files of the merged output are deleted.
In this embodiment, the merge queue is emptied to make room for subsequent files to be merged.
In this embodiment, after the merged file is stored in the distributed file storage system, the source files of the small files are deleted to avoid that redundant source files occupy excessive system resources, where the small files refer to multiple files that constitute the merged file.
Referring to fig. 2, fig. 2 is a flowchart illustrating a file merging method according to an exemplary embodiment based on the embodiment of fig. 1, as shown in fig. 2, before step S100, the file merging method provided in this embodiment further includes steps S21-S22, and the detailed description refers to the following:
Step S21: judge whether the space capacity occupied by the file to be processed is smaller than a first preset threshold value.
This embodiment mainly addresses the problem of massive small files in a distributed file storage system. Data with a file size of 1 KB to 10 MB is generally called a small file; a file with a large occupied space capacity does not need to be merged and is uploaded directly to the distributed file storage system for storage. It should be noted that even a large file must not exceed the total space capacity of one data block of the distributed storage system. If a file's occupied space capacity is greater than the total space capacity of a data block, the file must be split into multiple files each smaller than a data block; depending on how the file is segmented, the resulting pieces may be small files that need to be merged, or larger files that do not, which is not elaborated here. For example, the data block of HDFS is much larger than a normal disk-defined data block (typically 512 B); the current default HDFS data block size is 128 MB, so if a file occupies more than 128 MB it must be divided into multiple blocks smaller than 128 MB and stored separately.
Therefore, this embodiment presets a first preset threshold and uses it as the criterion for determining whether a file to be processed is a small file: only if the space occupied by the file to be processed is less than or equal to the first preset threshold is the file treated as a file to be merged and stored in the buffer queue. In this embodiment, the first preset threshold may be 216M or 512M, and may be set according to actual requirements, which is not limited in the embodiments of the present invention.
Step S22: if the file to be processed is judged to be the file to be processed, the file to be processed is stored in the buffer queue.
In this embodiment, if it is determined that the space occupied by the pending file is less than or equal to the first preset threshold, the pending file is stored in the buffer queue to wait for merging. Illustratively, the files to be processed are stored in the buffer queue according to the occupied space capacity of the files to be processed and the residual space capacity of the buffer queue.
Exemplarily, a target buffer queue with the largest remaining space capacity is determined from all buffer queues. If the space capacity occupied by the file is smaller than the remaining space capacity of the target buffer queue, the file is stored in the target buffer queue. If the space occupied by the file is larger than the remaining space capacity of the target buffer queue, the buffer queue does not have enough remaining space to store the file; and since the target buffer queue has the largest remaining space, the other buffer queues are even less likely to be able to store it, so a new buffer queue is additionally constructed to store the file.
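The queue-selection rule just described can be sketched as follows. This is an assumed, simplified data model (each queue is a dict tracking its files and remaining capacity); the function name and queue capacity are illustrative, not taken from the patent.

```python
# Sketch of the buffer-queue selection rule: place the file in the queue
# with the most remaining space, or open a new queue when even that queue
# cannot hold it. Data model and names are illustrative assumptions.
def store_in_buffer(queues: list[dict], file_size: int, queue_capacity: int) -> list[dict]:
    if queues:
        target = max(queues, key=lambda q: q["remaining"])  # largest remaining space
        if file_size <= target["remaining"]:
            target["files"].append(file_size)
            target["remaining"] -= file_size
            return queues
    # No queue has room (or none exists yet): construct a new buffer queue.
    queues.append({"files": [file_size], "remaining": queue_capacity - file_size})
    return queues

queues = []
for size in [30, 50, 90]:
    store_in_buffer(queues, size, queue_capacity=100)
print(len(queues))  # the 90-unit file does not fit the first queue, so a second is opened
```

Here the 30- and 50-unit files share the first queue (20 units left), and the 90-unit file triggers creation of a second queue.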
Referring to fig. 3, fig. 3 is a flowchart of step S22 in the embodiment shown in fig. 2, and as shown in fig. 3, step S22 includes steps S221-S222, which are described in detail as follows:
step S221: and judging whether the file to be processed and the file in the buffer queue belong to the same category.
Generally, there may be associations among multiple different files to be processed; for example, multiple files to be processed may be blocks divided from the same large file, or they may be associated because users have similar access preferences for them.
Based on this, in this embodiment, the relevance between the files to be processed is determined before they are merged, and files with strong relevance are merged together. As a result, the merged file is less likely to be stored across blocks when it is subsequently uploaded to the distributed file storage system: files with strong relevance are merged into the same large file, and the merged file is stored in the same data block of the same DataNode. When a user's file requests are strongly related, that is, when the small files the user accesses in succession are located in the same merged file, the distributed file storage system, following the file access principle, selects a data block on a nearby DataNode to read and thus keeps reading data from the data block of the same DataNode. This avoids jumping between different data nodes when accessing different files, reduces disk addressing overhead, occupies relatively fewer system resources, and greatly improves file reading efficiency.
In this embodiment, whether the file to be processed and the files in the buffer queue belong to the same category may be determined in various ways, which is not specifically limited here. Exemplarily, a clustering algorithm clusters the file to be processed together with the files in the buffer queue, and it is judged whether the file to be processed is an element of the cluster in which the buffered files are located; if so, the file to be processed and the files in the buffer queue belong to the same category. The clustering algorithm is, for example, the K-Means clustering algorithm or the MiniBatch K-Means clustering algorithm. Since this embodiment merges a large number of small files, and the amount of computation involved in clustering all of them could be very large, this embodiment preferably clusters the files to be processed with the MiniBatch K-Means clustering algorithm.
The MiniBatch K-Means clustering algorithm is an optimization of the K-Means clustering algorithm. It reduces computation time by using small batches of data subsets while still attempting to optimize the objective function; a small batch is a data subset randomly drawn each time the algorithm is trained. Training on these randomly generated subsets greatly reduces the computation time and shortens the convergence time of K-Means compared with other algorithms, while the results of mini-batch K-Means are generally only slightly inferior to those of the standard algorithm. The MiniBatch K-Means algorithm uses a method called MiniBatch (batch processing) to calculate the distances between data points. Its benefit is that not all data samples are needed in the calculation; instead, a portion of the samples from each class is taken to represent that class in the calculation. Because the number of calculation samples is small, the running time is correspondingly reduced.
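The category test described above could be realized, for instance, with scikit-learn's MiniBatchKMeans. The sketch below is an assumption about how file feature vectors might be clustered; the two-dimensional feature construction is a placeholder, not the patent's actual feature scheme.

```python
# Illustrative sketch: cluster file feature vectors with MiniBatchKMeans,
# then check whether a pending file falls into the same cluster as a
# buffered file. The synthetic features are placeholder assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Two well-separated groups of 50 "file feature vectors" each.
features = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

model = MiniBatchKMeans(n_clusters=2, batch_size=20, n_init=3, random_state=0)
model.fit(features)

pending = np.array([[7.5, 8.2]])  # feature vector of a pending file
# Same cluster as the last buffered file (which belongs to the second group)?
same_category = model.predict(pending)[0] == model.predict(features[99:100])[0]
print(same_category)
```

If the pending file lands in the same cluster as the buffered files, it is treated as belonging to the same category and stored in that buffer queue.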
In a specific application scenario, referring to fig. 4, fig. 4 is a flowchart of a file storage method according to an exemplary embodiment of the present application, and as shown in fig. 4, the file storage method provided in this embodiment includes the following steps:
s11: and judging whether the space capacity occupied by the file to be processed is smaller than a first preset threshold value, if so, executing the step S12, and if not, executing the step S17.
S12: and judging whether the file to be processed and the file in the buffer queue belong to the same category, and if so, executing step S13.
S13: and storing the file to be processed in the buffer queue.
S14: and acquiring the file with the largest occupied space capacity from the buffer queue and storing the file in the merge queue.
S15: and repeatedly executing the step of obtaining the file with the closest occupied space capacity and smaller than the residual space capacity of the merging queue from the buffer queue and storing the file in the merging queue until the file with the closest occupied space capacity and smaller than the residual space capacity of the merging queue does not exist in the buffer queue.
S16: merging the files in the merge queue to obtain a merged file, and executing step S17.
S17: and uploading and storing the files to be processed or the merged files in a distributed file storage system.
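The fill strategy of steps S14-S15 can be sketched as follows: seed the merge queue with the largest buffered file, then repeatedly move in the file whose size is closest to, but not above, the merge queue's remaining capacity. This is a simplified model (files are represented by their sizes only); names are illustrative.

```python
# Minimal sketch of steps S14-S15: largest file first, then best-fit fill.
def fill_merge_queue(buffer: list[int], capacity: int) -> list[int]:
    merge, remaining = [], capacity
    if buffer:
        first = max(buffer)                 # S14: largest file seeds the merge queue
        buffer.remove(first)
        merge.append(first)
        remaining -= first
    while True:                             # S15: repeat until no file fits
        candidates = [f for f in buffer if f <= remaining]
        if not candidates:
            break
        best = max(candidates)              # closest to the remaining capacity
        buffer.remove(best)
        merge.append(best)
        remaining -= best
    return merge

buffer = [40, 70, 25, 10, 30]
print(fill_merge_queue(buffer, capacity=128))  # [70, 40, 10] fills 120 of 128 units
```

The files left in the buffer (25 and 30 here) wait for the next merge round, so each merge queue is filled as close to capacity as the available sizes allow.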
Illustratively, the correlation between the file to be processed and the files in the buffer queue is calculated, and if the correlation is higher than a preset threshold, the file to be processed and the files in the buffer queue are determined to belong to the same category. For example, the correlation between the file to be processed and each file in the buffer queue is calculated separately to obtain multiple correlations, the average of these correlations is computed, and the average is taken as the correlation between the file to be processed and the files in the buffer queue. As another example, a target file is randomly selected from the buffer queue, the correlation between the target file and the file to be processed is calculated, and this correlation is taken as the correlation between the file to be processed and the files in the buffer queue.
Exemplarily, a machine learning algorithm is used to judge whether the file to be processed and the files in the buffer queue belong to the same category. Specifically, a user access preference classification model is constructed. In this embodiment, the user access preference model is obtained by statistics over user access log records. Specifically, an active user set is obtained by statistics from the user access log records; the files accessed by the active user set are represented by bean objects, where the occupied space capacity of each such file is less than or equal to the first preset threshold; the attributes of a bean object include the ID of the user accessing the file, the name of the file accessed by the user, and the number of times the file is accessed by the user. Using JDBC, the bean objects are persisted to a MySQL database, and the similarity of any two different access behaviors is calculated from the stored data. When the similarity of two different access behaviors is positive, the users of the two access behaviors are determined to be similar users, the IDs of the similar users are recorded, the associated file information accessed by all similar users is stored as an associated file set, and the user access preference model is constructed and trained from the associated file set.
In this embodiment, the active user set is obtained statistically from the user access log records as follows: record lines in the user access log whose access-resource suffix ends in jg are screened out, where each record line includes the visitor IP, the URL of the accessed page, the access start time, the access state, and the access traffic; a log analysis class is written to parse the record lines, and a two-dimensional array stores the visitor IPs and small-file names; the access IPs in the two-dimensional array are traversed, and a HashMap is used to count the access amount of each visitor IP, where the key of the HashMap is the visitor IP and the value is the access amount; the HashMap entries are sorted in descending order by value, the visitor IPs ranked in the top 20% are screened out, and this IP subset is stored in an ArrayList and recorded as the active user set.
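The counting-and-ranking step above can be sketched compactly. The log line format below is an assumption for illustration (the text names the fields but not their layout), and a Python `Counter` stands in for the HashMap described in the text.

```python
# Hypothetical sketch of the active-user-set extraction: count accesses
# per visitor IP, rank descending, keep the top 20% of visitors.
# The log line layout (IP, URL, time, state, traffic) is an assumption.
from collections import Counter

log_lines = [
    "10.0.0.1 /img/a.jpg 2021-12-01T10:00 200 512",
    "10.0.0.1 /img/b.jpg 2021-12-01T10:01 200 256",
    "10.0.0.2 /img/a.jpg 2021-12-01T10:02 200 512",
    "10.0.0.3 /img/c.jpg 2021-12-01T10:03 200 128",
    "10.0.0.1 /img/c.jpg 2021-12-01T10:04 200 128",
]

counts = Counter(line.split()[0] for line in log_lines)  # IP -> access count (the "HashMap")
ranked = [ip for ip, _ in counts.most_common()]          # descending by access count
active = ranked[: max(1, len(ranked) // 5)]              # top 20%, at least one visitor
print(active)
```

With three distinct visitors, the top 20% (rounded down, but at least one) is the single most active IP.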
Step S222: if the determination is yes, the file to be processed is stored in the buffer queue.
In this embodiment, if all the buffer queues are empty queues, the pending file is directly stored in the buffer queue.
In this embodiment, if all the buffer queues are empty queues and the number of files to be processed is greater than a preset threshold, the multiple files to be processed are classified to obtain multiple classes, and the files of each class are merged by the file merging method provided in the first embodiment of the present application. The preset threshold here is a relatively large value; when the number of files to be processed is large, classifying them directly in batch, rather than classifying each file individually as it arrives, avoids wasted time, saves processing time, and further improves merging efficiency.
In this embodiment, before the file to be processed is stored in the buffer queue, it is determined whether the file to be processed and the files in the buffer queue belong to the same category; if the determination is yes, the file to be processed is stored in the buffer queue. This method merges files based on the relevance between them, reduces the possibility of cross-block storage when the merged file is uploaded to the distributed file storage system, and can improve the efficiency of reading files from the distributed file storage system.
According to this embodiment, files are merged based on their relevance and the resulting merged files are stored in the distributed file storage system, which can effectively increase the file reading speed. Illustratively, in the process of reading a file, a pre-reading mechanism may be used to pre-read the relevant files and return the requested file together with the prefetched files. For example, when the requested file is read, the correlation between the requested file and the other files in the merged file containing it is calculated, and the correlation is compared with a set correlation threshold; when a file's correlation is greater than the correlation threshold, that small file can be read in advance and stored in the client cache. In this embodiment, the correlation between files may be calculated by various methods, for example by computing the cosine value or the Euclidean distance between files as their similarity.
Preferably, in view of the limitation of the client cache space, in the file prefetching process, when the number of files with correlation degrees larger than the correlation threshold in the merged file is larger than a given maximum prefetching number, only the file with the top correlation degree is stored in the client cache together with the request file.
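The prefetch selection described in the two paragraphs above can be sketched as follows: score each sibling of the requested file by cosine similarity, keep only those above the threshold, and cap the result at a maximum prefetch count. The feature vectors, threshold, and cap values are illustrative assumptions.

```python
# Sketch of correlation-based prefetch: cosine similarity against each
# sibling in the merged file, threshold filter, then keep at most
# `max_prefetch` of the highest-scoring files (client cache is limited).
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_prefetch(request, siblings, threshold=0.8, max_prefetch=2):
    scored = [(name, cosine(request, vec)) for name, vec in siblings.items()]
    hits = [(n, s) for n, s in scored if s > threshold]  # correlation above threshold
    hits.sort(key=lambda t: t[1], reverse=True)          # most correlated first
    return [n for n, _ in hits[:max_prefetch]]           # cap by cache budget

siblings = {"f1": [1.0, 0.9], "f2": [0.1, 1.0], "f3": [1.0, 1.1], "f4": [0.9, 1.0]}
chosen = select_prefetch([1.0, 1.0], siblings)
print(chosen)
```

Here f2 falls below the threshold, and of the three remaining candidates only the two with the highest correlation are prefetched alongside the requested file.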
Referring to fig. 5, fig. 5 is a flowchart of step S400 in the embodiment shown in fig. 1, and as shown in fig. 5, step S400 includes steps S410-S440, which are described in detail as follows:
step S410: and judging whether the total space capacity of the files in the merge queue is larger than a second preset threshold value.
In this embodiment, the space occupied in the merge queue by the first target file and the second target files it contains may still be too small, e.g., only 40%, 50%, or 60% of the total space capacity of the merge queue. If the files in the merge queue were merged at this point, the space capacity of the merge queue would not be fully utilized, so the gain in merging efficiency would not reach its optimum; in addition, the resulting merged file would occupy too little space, and the saving of storage space when the merged file is subsequently uploaded to the distributed file storage system would not be optimal either. Therefore, a second preset threshold is used to control the merging condition of the merge queue: the files in the merge queue are merged only when their total space capacity is greater than the second preset threshold.
Step S420: if not, updating the buffer queue.
In this embodiment, if the total space capacity of the files in the merge queue is less than or equal to the second preset threshold, the buffer queue is updated.
Illustratively, in response to a file storage request containing a file to be processed sent by a client, the file to be processed is stored in the buffer queue to update the buffer queue.
Illustratively, if the buffer queue is not updated within a preset time period, the files in the merge queue are merged. For example, if no file storage request containing a file to be processed is received from a client, the client has no need to store files in the distributed file storage system, and the buffer queue therefore cannot be updated. This avoids the situation in which the files in the merge queue can never be merged because the buffer queue has not been updated for a long time, thereby improving merging efficiency while making full use of the space capacity of the merge queue.
Step S430: and repeating the steps of obtaining a second target file from the buffer queue and storing the second target file in the merge queue until the total space capacity of the files in the merge queue is greater than a second preset threshold value.
In this embodiment, since the updated buffer queue contains more files of different sizes, there may now be a file whose occupied space capacity is closest to and smaller than the remaining space capacity of the merge queue. Therefore, in this embodiment, second target files are obtained from the updated buffer queue and stored in the merge queue until the total space capacity of the files in the merge queue is greater than the second preset threshold.
Step S440: and merging the files in the merging queue.
In this embodiment, if, within a preset time period starting after the buffer queue is updated, enough second target files cannot be acquired and stored in the merge queue to make the total space capacity of the files in the merge queue exceed the second preset threshold, the files in the merge queue are merged anyway to obtain the merged file. For example, after the buffer queue is updated, two second target files are obtained from the buffer queue and stored in the merge queue, but the total file capacity of the merge queue including these two files is still smaller than the second preset threshold. In this case, to take merging efficiency into account, the files in the merge queue are merged at this point; because the merge queue has additionally incorporated the two second target files, the utilization of the merge queue's space capacity is still improved. This avoids the situation in which the files in the merge queue can never be merged, thereby improving merging efficiency while making full use of the space capacity of the merge queue.
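The merge trigger of steps S410-S440 can be sketched as a simple decision function: merge when the capacity target is reached, or fall back to merging when the buffer queue has gone too long without an update. Timing is simulated with an idle-round counter rather than a real clock; all values are illustrative.

```python
# Sketch of the merge trigger (S410-S440): capacity threshold with a
# timeout fallback. `idle_rounds` simulates elapsed time without a
# buffer-queue update; thresholds and names are illustrative assumptions.
def should_merge(total_size: int, threshold: int, idle_rounds: int, max_idle: int) -> bool:
    if total_size > threshold:
        return True                  # S410: second preset threshold exceeded
    return idle_rounds >= max_idle   # fallback: buffer queue not updated in time

print(should_merge(total_size=120, threshold=100, idle_rounds=0, max_idle=3))  # True
print(should_merge(total_size=80, threshold=100, idle_rounds=1, max_idle=3))   # False: keep waiting
print(should_merge(total_size=80, threshold=100, idle_rounds=3, max_idle=3))   # True: timeout merge
```

The timeout branch is what prevents a half-full merge queue from waiting forever when clients stop sending files.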
Referring to fig. 6 and 7, fig. 6 is a schematic diagram illustrating a result of a file merging method in the prior art, and fig. 7 is a schematic diagram illustrating a result of merging files by using a file merging method provided in the present application.
The core idea of currently popular file merging algorithms is to merge the small files in the system in sequence: once the storage threshold of the smallest unit of the distributed file storage system is exceeded, merging is suspended and the portion that does not exceed the threshold is packed into a merged file; the next files are then merged in sequence until all files are merged. Suppose there are 26 files to be merged, A to Z, with different occupied spaces. Merging them according to the above algorithm gives the result shown in fig. 6: the 26 files are finally merged into 10 file sets. The database threshold in fig. 6 and 7 refers to the storage threshold of the smallest unit of the distributed file storage system.
As shown in fig. 7, after the 26 files to be merged are merged by the file merging method provided in the embodiment of the present application, the number of merged file sets is reduced from 10 to 8; except for the last block of storage space, the volumes of the remaining file sets are substantially the same and each almost fills an entire data block.
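The difference between the two strategies can be reproduced on toy data. The sketch below contrasts the prior-art sequential packing with a best-fit packing in the spirit of steps S14-S15 (largest file first, then the closest-fitting file); the file sizes and capacity are illustrative, not the A-Z example from the figures.

```python
# Illustrative comparison: sequential packing (prior art) vs. best-fit
# packing (largest file first, then closest fit). Sizes are toy values.
def sequential_pack(files, capacity):
    blocks, current = [], []
    for f in files:                       # prior art: arrival order only
        if sum(current) + f > capacity:
            blocks.append(current)        # threshold exceeded: close this set
            current = []
        current.append(f)
    if current:
        blocks.append(current)
    return blocks

def best_fit_pack(files, capacity):
    pending, blocks = sorted(files, reverse=True), []
    while pending:
        block = [pending.pop(0)]          # seed with the largest remaining file
        remaining = capacity - block[0]
        fits = [f for f in pending if f <= remaining]
        while fits:                       # best fit: closest to remaining capacity
            best = max(fits)
            pending.remove(best)
            block.append(best)
            remaining -= best
            fits = [f for f in pending if f <= remaining]
        blocks.append(block)
    return blocks

files = [100, 30, 100, 30, 100, 30]
print(len(sequential_pack(files, 128)), len(best_fit_pack(files, 128)))
```

On this input, sequential packing wastes the tail of every block and produces 6 file sets, while best-fit packing groups the three 30-unit files together and needs only 4, mirroring the 10-versus-8 reduction reported for fig. 6 and 7.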
Referring to fig. 8, fig. 8 is a block diagram of a file merging device according to an exemplary embodiment of the present application, and as shown in fig. 8, the file merging device 30 includes a first obtaining module 31, a second obtaining module 32, a repeat executing module 33, and a merging module 34.
The first obtaining module 31 is configured to obtain a first target file with a largest occupied space capacity from the buffer queue, and store the first target file in the merge queue; the second obtaining module 32 is configured to obtain a second target file from the buffer queue, and store the second target file in the merge queue, where the second target file is a file whose occupied space capacity is closest to and smaller than the remaining space capacity of the merge queue; the repeated execution module 33 is configured to repeatedly execute the steps of obtaining the second target file from the buffer queue and storing the second target file in the merge queue until there is no file in the buffer queue whose occupied space capacity is closest to and smaller than the remaining space capacity of the merge queue; the merging module 34 is configured to merge the files in the merge queue to obtain a merged file.
In another exemplary embodiment, the file merging device 30 further includes a determining module and a storing module, where the determining module is configured to determine whether the space occupied by the file to be processed is smaller than a first preset threshold; and the storage module is used for storing the file to be processed in the buffer queue under the condition that the judgment is yes.
In another exemplary embodiment, the storage module comprises a first judging unit and a storage unit, wherein the first judging unit is used for judging whether the file to be processed and the file in the buffer queue belong to the same category; and the storage unit is used for storing the file to be processed in the buffer queue under the condition that the judgment result is yes.
In another exemplary embodiment, the merge module 34 includes a second determining unit, an updating unit, a repeating unit, and a merging unit, where the second determining unit is configured to determine whether the total space capacity of the files in the merge queue is greater than a second preset threshold; the updating unit is configured to update the buffer queue if the determination is no; the repeating unit is configured to repeatedly execute the steps of obtaining a second target file from the buffer queue and storing the second target file in the merge queue until the total space capacity of the files in the merge queue is greater than the second preset threshold; and the merging unit is configured to merge the files in the merge queue.
In another exemplary embodiment, the file merging device 30 further comprises a storage module for storing the merged file in the distributed file storage system.
In another exemplary embodiment, the file merging apparatus 30 further includes a clearing module for clearing the merge queue and deleting the source file of the file output after merging.
It should be noted that the apparatus provided in the foregoing embodiment and the method provided in the foregoing embodiment belong to the same concept, and the specific manner in which each module and unit execute operations has been described in detail in the method embodiment, and is not described again here.
In another exemplary embodiment, the present application provides an electronic device comprising a processor and a memory, wherein the memory has stored thereon computer readable instructions which, when executed by the processor, implement the file merging method as before.
FIG. 9 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system 1000 of the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 9, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes, such as performing the file merging method in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage portion 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The driver 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted into the storage section 1008 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. When the computer program is executed by the Central Processing Unit (CPU) 1001, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
Yet another aspect of the present application provides a computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the file merging method as in any one of the preceding embodiments.
Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the file merging method provided in the above embodiments.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of these units do not, in any case, constitute a limitation on the units themselves.
The above description is only a preferred exemplary embodiment of the present application and is not intended to limit the embodiments of the present application. Those skilled in the art can readily make various changes and modifications within the main concept and spirit of the present application; therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for merging files, comprising:
acquiring, from a buffer queue, a first target file that occupies the largest space capacity, and storing the first target file in a merge queue;
acquiring a second target file from the buffer queue, and storing the second target file in the merge queue, wherein the second target file is a file whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue;
repeatedly executing the step of acquiring a second target file from the buffer queue and storing the second target file in the merge queue, until the buffer queue contains no file whose occupied space capacity is smaller than the remaining space capacity of the merge queue; and
merging the files in the merge queue to obtain a merged file.
2. The method of claim 1, wherein before acquiring the first target file that occupies the largest space capacity from the buffer queue and storing the first target file in the merge queue, the method further comprises:
judging whether the space capacity occupied by a file to be processed is smaller than a first preset threshold; and
if so, storing the file to be processed in the buffer queue.
3. The method of claim 2, wherein storing the file to be processed in the buffer queue comprises:
judging whether the file to be processed and the files in the buffer queue belong to the same category; and
if so, storing the file to be processed in the buffer queue.
4. The method of claim 1, wherein merging the files in the merge queue to obtain a merged file comprises:
judging whether the total space capacity of the files in the merge queue is greater than a second preset threshold;
if not, updating the buffer queue;
repeatedly executing the step of acquiring a second target file from the buffer queue and storing the second target file in the merge queue, until the total space capacity of the files in the merge queue is greater than the second preset threshold; and
merging the files in the merge queue.
5. The method of claim 4, further comprising:
if, within a preset time period starting after the buffer queue is updated, no second target file can be acquired that would make the total space capacity of the files in the merge queue greater than the second preset threshold, merging the files in the merge queue to obtain a merged file.
6. The method of claim 1, wherein after said merging the files in the merge queue to obtain a merged file, the method further comprises:
storing the merged file in a distributed file storage system.
7. The method of claim 1, wherein after said merging the files in the merge queue to obtain a merged file, the method further comprises:
emptying the merge queue, and deleting the source files of the files that have been merged and output.
8. A file merging apparatus, comprising:
a first acquisition module, configured to acquire, from a buffer queue, a first target file that occupies the largest space capacity, and store the first target file in a merge queue;
a second acquisition module, configured to acquire a second target file from the buffer queue and store the second target file in the merge queue, wherein the second target file is a file whose occupied space capacity is closest to, and smaller than, the remaining space capacity of the merge queue;
a repeated execution module, configured to repeatedly execute the step of acquiring a second target file from the buffer queue and storing the second target file in the merge queue, until the buffer queue contains no file whose occupied space capacity is smaller than the remaining space capacity of the merge queue; and
a merging module, configured to merge the files in the merge queue to obtain a merged file.
9. An electronic device, comprising:
a memory storing computer readable instructions;
a processor configured to read the computer readable instructions stored in the memory, to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1-7.
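Claims 1 and 4 together describe a best-fit batching strategy: seed the merge batch with the largest buffered file, then repeatedly add the buffered file whose size comes closest to (without exceeding) the merge queue's remaining capacity. The following is a minimal, hypothetical sketch of that selection step in Python; the names (`select_merge_batch`, `merge_capacity`) and the plain-list representation of the buffer queue are illustrative, not taken from the patent text:

```python
def select_merge_batch(buffer_queue, merge_capacity):
    """Pick files for one merge pass, per the strategy in claim 1:
    largest file first, then repeatedly the largest file that still
    fits in the merge queue's remaining capacity (best fit)."""
    pool = list(buffer_queue)
    if not pool:
        return []
    # First target: the file occupying the largest space capacity.
    first = max(pool, key=lambda f: f["size"])
    pool.remove(first)
    batch = [first]
    remaining = merge_capacity - first["size"]
    while True:
        # Second target: file whose size is closest to, and not larger
        # than, the remaining capacity. (The claim says "smaller than";
        # <= is used here so an exact fit is also accepted.)
        candidates = [f for f in pool if f["size"] <= remaining]
        if not candidates:
            break  # claim 1's termination condition
        best = max(candidates, key=lambda f: f["size"])
        pool.remove(best)
        batch.append(best)
        remaining -= best["size"]
    # Remove the selected files from the buffer queue.
    for f in batch:
        buffer_queue.remove(f)
    return batch

files = [{"name": "a", "size": 60}, {"name": "b", "size": 50},
         {"name": "c", "size": 30}, {"name": "d", "size": 25},
         {"name": "e", "size": 10}]
batch = select_merge_batch(files, merge_capacity=100)
print([f["name"] for f in batch])  # ['a', 'c', 'e'] — 60 + 30 + 10 = 100
print([f["name"] for f in files])  # ['b', 'd'] remain buffered
```

In a full implementation following claim 4, this selection would repeat (with the buffer queue being replenished) until the batch's total size exceeds the second preset threshold, and claim 5's timeout would force a merge when no further fitting file arrives in time.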
CN202111513482.7A 2021-12-10 2021-12-10 File merging method, file merging device, electronic equipment and storage medium Pending CN114168084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111513482.7A CN114168084A (en) 2021-12-10 2021-12-10 File merging method, file merging device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114168084A true CN114168084A (en) 2022-03-11

Family

ID=80485765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111513482.7A Pending CN114168084A (en) 2021-12-10 2021-12-10 File merging method, file merging device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114168084A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117632039A (en) * 2024-01-25 2024-03-01 合肥兆芯电子有限公司 Memory management method, memory storage device and memory control circuit unit
CN117632039B (en) * 2024-01-25 2024-05-03 合肥兆芯电子有限公司 Memory management method, memory storage device and memory control circuit unit

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160067289A (en) * 2014-12-03 2016-06-14 충북대학교 산학협력단 Cache Management System for Enhancing the Accessibility of Small Files in Distributed File System
CN105912675A (en) * 2016-04-13 2016-08-31 中国科学院计算技术研究所 Batch delete/query method and apparatus for merging small files
CN106294603A (en) * 2016-07-29 2017-01-04 北京奇虎科技有限公司 File memory method and device
CN107391280A (en) * 2017-07-31 2017-11-24 郑州云海信息技术有限公司 A kind of reception of small documents and storage method and device
CN110647497A (en) * 2019-07-19 2020-01-03 广东工业大学 HDFS-based high-performance file storage and management system
CN112860641A (en) * 2021-01-29 2021-05-28 西藏宁算科技集团有限公司 Small file storage method and device based on HADOOP

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG, Lihua et al., "A small-file merging strategy oriented to access tasks in unstructured data storage", Telecom World (通讯世界), no. 12, 25 June 2016 (2016-06-25), pages 295-296 *

Similar Documents

Publication Publication Date Title
US11960726B2 (en) Method and apparatus for SSD storage access
US9020892B2 (en) Efficient metadata storage
CN108710639B (en) Ceph-based access optimization method for mass small files
WO2017097231A1 (en) Topic processing method and device
CN109766318B (en) File reading method and device
CN110413776B (en) High-performance calculation method for LDA (text-based extension) of text topic model based on CPU-GPU (Central processing Unit-graphics processing Unit) collaborative parallel
JP2020528617A (en) How to do cognitive data filtering for storage environments, computer programs and systems
JP2020528614A (en) Methods for cognitive file and object management for distributed storage environments, computer programs and systems
Zhu et al. Massive Files Prefetching Model Based on LSTM Neural Network with Cache Transaction Strategy.
CN110018997B (en) Mass small file storage optimization method based on HDFS
US11281704B2 (en) Merging search indexes of a search service
CN111083933B (en) Data storage and acquisition method and device
CN115470157A (en) Prefetching method, electronic device, storage medium, and program product
Elmeiligy et al. An efficient parallel indexing structure for multi-dimensional big data using spark
CN105426119A (en) Storage apparatus and data processing method
CN114077690A (en) Vector data processing method, device, equipment and storage medium
US11144538B2 (en) Predictive database index modification
CN109634746B (en) Web cluster cache utilization system and optimization method
CN114168084A (en) File merging method, file merging device, electronic equipment and storage medium
CN112860641A (en) Small file storage method and device based on HADOOP
Khan et al. Towards Cloud Storage Tier Optimization with Rule-Based Classification
CN109634914B (en) Optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files
CN115203133A (en) Data processing method and device, reduction server and mapping server
US20230185457A1 (en) Optimizing Data Placement Based on Data Temperature and Lifetime Prediction
CN111581441B (en) Accelerator for cluster computation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination