CN115481086A - Mass small file reading and writing method and system, electronic device and storage medium - Google Patents

Mass small file reading and writing method and system, electronic device and storage medium

Publication number
CN115481086A
Authority
CN
China
Prior art keywords
file
small
files
aggregation
caching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211035779.1A
Other languages
Chinese (zh)
Inventor
罗心
李冬伟
江文龙
周明伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202211035779.1A
Publication of CN115481086A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system, an electronic device and a storage medium for reading and writing massive numbers of small files. The method comprises the following steps: acquiring a small file; determining a list of files to be merged based on the small file; caching the data information of the small file to an aggregation file according to the list of files to be merged, where the aggregation file is a cloud storage file created in a cloud storage system in large file form; setting a key-value pair based on the aggregation file, and caching the key-value pair to a plug-in (external) storage system; and reading and downloading the small file based on the key-value pair. Under this scheme, once small files are acquired they are stored in the form of large files, the corresponding key-value pairs are set in the plug-in storage system, and the small files are read and written according to those key-value pairs, so that massive numbers of small files can be read and written quickly and the efficiency of reading and writing them is improved.

Description

Mass small file reading and writing method and system, electronic device and storage medium
Technical Field
The present application relates to the field of distributed storage technologies, and in particular to a method, a system, an electronic device, and a storage medium for reading and writing massive numbers of small files.
Background
With the explosive growth of data volume, storage requirements receive more and more attention, while the storage approaches of the prior art are mainly designed for large files.
In the course of research and practice on the prior art, the inventors of the present application found that, when storing massive numbers of small files, prior-art distributed storage cannot effectively distinguish large files from small files, adapts poorly, uses disk space inefficiently, and reads and writes massive numbers of small files inefficiently.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a method, a system, an electronic device and a storage medium for reading and writing massive small files, in which small files are stored in large file form and then located and read through key-value pairs, so that massive small files can be read and written and the efficiency of reading and writing small files is improved.
In order to solve the technical problem, the application adopts a technical scheme that: a method for reading and writing massive small files is provided, and the method comprises the following steps: acquiring a small file; determining a list of files to be merged based on the small file; caching the data information of the small file to an aggregation file according to the list of files to be merged, where the aggregation file is a cloud storage file created in a cloud storage system in large file form; setting a key-value pair based on the aggregation file, and caching the key-value pair to a plug-in storage system; and reading and downloading the small file based on the key-value pair.
In an embodiment of the application, the caching the data information of the small file to an aggregation file according to the list of files to be merged includes: retrieving the list of the files to be merged, and caching the retrieved small files into a first cache container; creating an aggregated file in the cloud storage system in a large file form; and caching the data information of the small files in the first cache container to the aggregation file.
In an embodiment of the present application, the caching the data information of the small file in the first cache container to an aggregate file includes: if the data information of the small file in the first cache container is successfully written into the aggregation file, caching the small file name corresponding to the small file in a second cache container; and if the data information of the small file in the first cache container is not written into the aggregation file, caching the small file name corresponding to the small file in a third cache container.
In an embodiment of the present application, after completing caching of the aggregated file, the method further includes: and deleting the file list to be merged and the corresponding metadata information according to the cache information of the second cache container.
In an embodiment of the present application, after caching the data information of the small file in the first cache container to an aggregate file, the method further includes: when the size of the aggregation file reaches a preset threshold value, closing the aggregation file; and establishing an aggregation file in the cloud storage system in a large file form, and caching the data information of the rest small files by using the established aggregation file.
In an embodiment of the application, setting a key-value pair based on the aggregation file and caching the key-value pair to the plug-in storage system includes: setting the storage directory of the aggregation file as the Key, and setting the index information of the small file within the aggregation file as the Value; and caching the Key and the Value to the plug-in storage system.
In an embodiment of the application, reading and downloading the small file based on the key-value pair includes: in response to a user retrieval request, searching the database to obtain the corresponding database file list, and searching the plug-in storage system according to the Key to obtain the corresponding small file list; merging the database file list and the small file list, removing the aggregation files from the result, and returning the result to the user; and reading, by the user, the data information of the small file from the cloud storage system based on the Value.
In an embodiment of the present application, when a cached small file needs to be deleted, the method further includes: acquiring the proportion of deleted files in the aggregation file; and if the proportion reaches a proportion threshold, regenerating a new aggregation file from the small files in the aggregation file that have not been deleted, and deleting the original aggregation file.
In an embodiment of the application, if the proportion reaches the proportion threshold, regenerating a new aggregation file from the small files that have not been deleted and deleting the original aggregation file includes: acquiring the cumulative proportion of deleted files; if the proportion reaches the proportion threshold, searching the plug-in storage system to obtain the small files in the original aggregation file that have not been deleted, and generating an aggregation subfile from those small files; and deleting the original aggregation file.
In an embodiment of the application, the obtaining the small file includes: caching the written file to a cloud storage system in a large file form; acquiring the size of the written file; and if the written file is smaller than a preset file size threshold value, recording as a small file.
In an embodiment of the present application, the determining a list of files to be merged based on the small files includes: acquiring a small file name corresponding to the small file; and determining a file list to be merged based on the small file name.
In order to solve the above technical problem, another technical solution of the present application is: providing a mass small file reading and writing system, wherein the system comprises: an acquisition module for acquiring small files; a determining module for determining a list of files to be merged based on the small files; a first caching module for caching the data information of the small files into an aggregation file according to the list of files to be merged, where the aggregation file is a cloud storage file created in a cloud storage system in large file form; a second caching module for setting a key-value pair based on the aggregation file and caching the key-value pair to the plug-in storage system; and a reading module for reading and downloading the small files based on the key-value pair.
In order to solve the above technical problem, another technical solution of the present application is: there is provided an electronic device including a processor and a memory, wherein the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the above method for reading and writing massive small files.
In order to solve the above technical problem, another technical solution of the present application is: there is provided a computer-readable storage medium, wherein the computer-readable storage medium stores at least one program, and when the at least one program is loaded and executed by a processor, it implements the above method for reading and writing massive small files.
Different from the prior art, the method for reading and writing massive small files provided by the application comprises the following steps: acquiring a small file; determining a list of files to be merged based on the small file; caching the data information of the small file into an aggregation file according to the list of files to be merged, where the aggregation file is a cloud storage file created in a cloud storage system in large file form; setting a key-value pair based on the aggregation file, and caching the key-value pair to a plug-in storage system; and reading and downloading the small file based on the key-value pair. In this method, small files are acquired automatically, the list of files to be merged is determined, the data information of the small files is stored in the cloud storage system in large file form according to that list, key-value pairs derived from the file stored in large file form are saved in the plug-in storage system, and the small files are read and downloaded through the key-value pairs. Massive numbers of small files can thus be located, read and written quickly, and the efficiency of reading and writing small files is improved.
Drawings
FIG. 1 is a schematic flow chart diagram of an embodiment of a method for reading and writing a large number of small files according to the present invention;
FIG. 2 is a schematic flow chart of one embodiment of step S1 of the present invention;
FIG. 3 is a flowchart illustrating an embodiment of step S2 of the present invention;
FIG. 4 is a flowchart illustrating an embodiment of step S3 of the present invention;
FIG. 5 is a flowchart illustrating an embodiment of step S33;
FIG. 6 is a flowchart illustrating an embodiment of the present invention after step S3;
FIG. 7 is a flowchart illustrating an embodiment of step S4 of the present invention;
FIG. 8 is a flowchart illustrating an embodiment of step S5 of the present invention;
FIG. 9 is a flowchart illustrating an embodiment of deleting cached small files;
FIG. 10 is a schematic flow chart of an embodiment of step A2 of the present invention;
FIG. 11 is a schematic structural diagram of an embodiment of a system for reading and writing a large number of small files according to the present invention;
FIG. 12 is a schematic structural diagram of an embodiment of an electronic device according to the invention;
FIG. 13 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be noted that the following examples are only illustrative of the present invention, and do not limit the scope of the present invention. Likewise, the following examples are only some examples, not all examples, and all other examples obtained by those skilled in the art without any inventive work are within the scope of the present invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Traditional file storage methods, especially when used for massive numbers of small files, have several drawbacks: some target a data-classification storage scenario and rely on replication for fault tolerance, so disk space utilization is low and the approach is not general; some have an overly long storage structure and a large metadata volume, and cannot determine how to distinguish large files from small files; and some require the client to generate and send the related information, which results in a poor user experience. As a result, replication-based, Redis-based or tiered storage approaches read and write massive small files poorly, particularly in a Fuse-based distributed cloud storage system, so the efficiency of reading and writing massive small files is low.
In research, the applicant found that this inefficiency can be addressed as follows: when small files are read and written, especially massive numbers of small files produced by bursts of data growth, large and small files are distinguished automatically; small files with similar life cycles are written into a large file for storage by asynchronous append writes; file index information is carried in the file name; and an additional KV system is provided to hold the metadata of the small files. In this way massive numbers of small files are read and written efficiently, the read-write performance for massive small files in a distributed storage system is effectively improved, and in particular the problem of inefficient reading and writing of massive small files managed by a Fuse-based distributed cloud storage system is solved.
Therefore, a method for reading and writing massive small files is provided, which is applied to a Fuse-based distributed cloud storage system and comprises: acquiring a small file; determining a list of files to be merged based on the small file; caching the data information of the small file to an aggregation file according to the list of files to be merged, where the aggregation file is a cloud storage file created in a cloud storage system in large file form; setting a key-value pair based on the aggregation file, and caching the key-value pair to the plug-in storage system; and reading and downloading the small file based on the key-value pair.
The Fuse-based distributed cloud storage system comprises a client (Client), a user space file system (Fuse), a distributed cloud storage system, a database (DataBase) and a distributed KV storage system (KV). The client sends a file storage request to the user space file system and writes a file; the user space file system receives the file data and stores the written file in the distributed cloud storage system in large file form. When the file is closed, its size is acquired; if it is a small file, its metadata is recorded in the database, and a list of files to be merged is maintained in the database to record the file names of the small files. The list of files to be merged is searched at regular intervals to obtain the small files to be merged, which are written asynchronously into an aggregation file created in large file form, and the metadata of the aggregation file is recorded in the database. After a small file is successfully written into the aggregation file, the storage directory of the aggregation file is used as the Key, and the start offset and the size of the small file within the aggregation file are spliced into a character string that is recorded as the corresponding Value in the distributed KV storage system. The metadata of the small files is thereafter managed through the distributed KV storage system, and operations such as file retrieval, file downloading or file deletion are carried out using the small file list held in the distributed KV storage system together with the file list held in the database.
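The key-value layout described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the colon separator in the Value, the helper names encode_value, decode_value and read_small_file, and the choice to form the Key from the aggregation file's storage directory plus the small file name are all assumptions made for illustration, since the text only states that the storage directory serves as the Key and that the start offset and size are spliced into a string. A plain dict stands in for the distributed KV storage system and a bytes object stands in for the aggregation file.

```python
from typing import Dict, Tuple


def encode_value(offset: int, size: int) -> str:
    """Splice the start offset and size into one string (separator is an assumption)."""
    return f"{offset}:{size}"


def decode_value(value: str) -> Tuple[int, int]:
    """Recover (offset, size) from a Value produced by encode_value."""
    offset, size = value.split(":")
    return int(offset), int(size)


def read_small_file(kv: Dict[str, str], key: str, aggregate: bytes) -> bytes:
    """Locate a small file through its Value and slice it out of the aggregation file."""
    offset, size = decode_value(kv[key])
    return aggregate[offset:offset + size]


# Hypothetical aggregation file holding three small files back to back.
aggregate_bytes = b"AAAA" + b"hello" + b"ZZ"
# Hypothetical Key: aggregation file directory plus small file name.
kv_store = {"/bucket/agg-0001/b.txt": encode_value(4, 5)}
small = read_small_file(kv_store, "/bucket/agg-0001/b.txt", aggregate_bytes)
```

Because the Value carries both the offset and the size, a reader needs only one KV lookup followed by one ranged read of the aggregation file.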
Referring to FIG. 1, FIG. 1 is a schematic flow chart of an embodiment of a method for reading and writing massive small files according to the present invention. It should be noted that, provided substantially the same result is obtained, the method of the present invention is not limited to the flow sequence shown in FIG. 1. As shown in FIG. 1, the method includes the following steps:
S1, acquiring a small file;
Here, a small file generally refers to a file smaller than 1MB, and "massive" generally refers to a file count in the millions or more. Most of these files are unstructured data, including videos, pictures, documents, and the like, and with the rapid development of science and technology their number reaches the order of hundreds of billions.
Referring to FIG. 2, FIG. 2 is a schematic flow chart of an embodiment of step S1 of the present invention. Step S1 includes:
S11, caching the written file to a cloud storage system in large file form;
The written files may include large files and small files; at this point they are not distinguished by size, so all files that need to be stored are collectively referred to as written files.
Specifically, all files are treated as large files, and every file that needs to be stored is cached in the cloud storage system in large file form.
In some embodiments, because the size of a file cannot be determined while it is being written, all files that need to be stored are uniformly regarded as large files and cached in the cloud storage system; for example, they are stored in a distributed cloud storage system based on the Fuse framework, where Fuse is the user space file system.
In some embodiments, after a file is stored in the cloud storage system, the metadata of the file is recorded in a database (DataBase); the metadata comprises the packet name, directory name, file creation time, file size, and the like.
S12, acquiring the size of the written file;
The size of the written file cannot be obtained before writing.
Specifically, after all files are stored in the cloud storage system in large file form, the size of a written file is acquired when the file is closed.
In some embodiments, for example when the Fuse framework is combined with the distributed cloud storage system, the size of the written file cannot be determined before or during writing, so the size is obtained when the file is closed.
In some embodiments, the size of the written file may instead be obtained after the file is written or while the file is being written.
S13, if the written file is smaller than a preset file size threshold, recording it as a small file.
The preset file size threshold is used to distinguish large files from small files and can be set according to actual conditions.
Specifically, a file size threshold is preset; after the size of the written file is obtained, it is compared with the preset file size threshold, and a written file smaller than the threshold is recorded as a small file.
In some embodiments, written files may be divided into large files and small files. A large file is a file of 1MB or more, and its file format is not limited. A small file is a file smaller than 1MB. It may be unstructured data, that is, data that is not structured according to a predefined data model or organized in a predefined manner; such data may be human-generated or machine-generated and includes documents, books, images, audio, video, files, e-mail messages, web pages and the like. It may also be structured data, that is, data structured according to a predefined model or organized in a predefined manner, typically stored in a relational database management system.
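Steps S11 to S13 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the WrittenFile class and the is_small_file helper are hypothetical names, the 1MB threshold is the example boundary mentioned in the text, and the file simply counts the bytes written so that its size is known only at close time.

```python
SMALL_FILE_THRESHOLD = 1 * 1024 * 1024  # 1MB, the example boundary used in the text


def is_small_file(size_bytes: int, threshold: int = SMALL_FILE_THRESHOLD) -> bool:
    """S13: a written file strictly smaller than the threshold is recorded as small."""
    return size_bytes < threshold


class WrittenFile:
    """Hypothetical stand-in for a file written through the user space file system."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.size = 0
        self.closed = False

    def write(self, data: bytes) -> None:
        # S11: every file is stored in large file form; only the byte count is tracked here.
        self.size += len(data)

    def close(self) -> bool:
        # S12: the size is only known once the file is closed.
        self.closed = True
        return is_small_file(self.size)


photo = WrittenFile("photo.jpg")
photo.write(b"x" * 2048)
photo_is_small = photo.close()
```

Deferring the classification to close() mirrors the constraint that the size cannot be determined before or during writing.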
S2, determining a list of files to be merged based on the small files;
The list of files to be merged is a data table maintained by the database (DataBase) and is used to record data related to the small files.
Referring to FIG. 3, FIG. 3 is a schematic flow chart of an embodiment of step S2 of the present invention. Step S2 includes:
S21, acquiring the small file name corresponding to the small file;
Each file has a corresponding file name: a large file has a large file name, and a small file has a small file name.
Specifically, after the small files and the large files are distinguished, the relevant information of the small files is extracted to obtain the small file names corresponding to the small files.
S22, determining the list of files to be merged based on the small file names.
The list of files to be merged may be used to record data related to the small files; here it records the small file names corresponding to the small files.
Specifically, after the relevant information of the small files is extracted, the small file names corresponding to the small files are obtained and recorded in the list of files to be merged; this fixes the amount of data in the list and thereby determines the list of files to be merged.
In some embodiments, the list of files to be merged may also be determined from other information that identifies the small files, such as file numbers or file extensions. For example, after the relevant information of the small files is extracted, the file numbers corresponding to the small files are obtained and recorded in the list of files to be merged; this fixes the amount of data in the list and thereby determines the list of files to be merged.
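The list of files to be merged can be pictured as a small database table. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the table name to_merge, its single file_name column, and the helper functions are assumptions, as the text does not specify the database product or schema.

```python
import sqlite3
from typing import List


def open_db() -> sqlite3.Connection:
    """Create an in-memory database holding a hypothetical to-be-merged table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE to_merge (file_name TEXT PRIMARY KEY)")
    return conn


def record_small_file(conn: sqlite3.Connection, file_name: str) -> None:
    """S21/S22: record the small file name; duplicates are ignored."""
    conn.execute("INSERT OR IGNORE INTO to_merge (file_name) VALUES (?)", (file_name,))
    conn.commit()


def pending_files(conn: sqlite3.Connection) -> List[str]:
    """Return the names currently waiting to be merged, in a stable order."""
    rows = conn.execute("SELECT file_name FROM to_merge ORDER BY file_name").fetchall()
    return [row[0] for row in rows]


db = open_db()
for name in ("b.jpg", "a.jpg", "a.jpg"):
    record_small_file(db, name)
waiting = pending_files(db)
```

Using the file name as the primary key is one way to keep a small file from being queued twice; a file number would serve equally well as the identifying column.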
S3, caching the data information of the small files into the aggregation file according to the list of files to be merged;
The aggregation file is a cloud storage file created in a cloud storage system in large file form and is used to store the data information of the files recorded in the list of files to be merged.
Referring to FIG. 4, FIG. 4 is a schematic flowchart illustrating an embodiment of step S3 of the present invention. Step S3 includes:
S31, retrieving the list of files to be merged, and caching the retrieved small files into a first cache container;
The first cache container is a local cache container for files to be aggregated; that is, the first cache container is used to cache the small files to be aggregated.
Specifically, the list of files to be merged maintained in the database (DataBase) is retrieved at regular intervals, all small files currently waiting to be merged are obtained from the contents recorded in the list, and they are stored in the local cache container for files to be merged, that is, the first cache container.
In some embodiments, a timed task may be set, by which the list of files to be merged maintained in the database (DataBase) is checked at regular intervals.
S32, creating an aggregation file in the cloud storage system in large file form;
The large file form refers to the way the file is stored, namely as a large file, and does not refer to a particular file type.
Specifically, a new cloud storage file is created in the cloud storage system in large file form and is named the aggregation file.
In some embodiments, the large file form does not define an upper limit on the size of the file, but the file does not exceed the size of the database (DataBase); that is, the aggregation file is a file larger than 1MB and smaller than the size of the database.
S33, caching the data information of the small files in the first cache container to the aggregation file.
The data information of the small files in the first cache container is the data information of the small files held in the to-be-merged file container in the database (DataBase).
Specifically, according to the information cached in the to-be-merged file container that identifies the small files, the metadata information of the files to be merged is first acquired from the cloud storage system and kept locally, that is, stored in the database (DataBase); the data of the small files is then read and written into the aggregation file, and information describing the aggregation file, such as its current offset and total length, is recorded and updated locally.
In some embodiments, the metadata information of the files to be merged may be acquired from the cloud storage system and kept locally, that is, stored in the database (DataBase), according to the small file name, the file number, or other identifying information cached in the to-be-merged file container; the corresponding small file data is then read from the cloud storage system and appended asynchronously to the aggregation file in the cloud storage system, and information such as the current offset and total length of the aggregation file is recorded and updated in the database (DataBase).
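The append-and-track behaviour of step S33 can be sketched as below. This is an illustrative, synchronous stand-in rather than the asynchronous cloud write the text describes: an io.BytesIO buffer plays the role of the aggregation file, and the AggregateFile class with its append, read and total_length members is a hypothetical name chosen for this example.

```python
import io
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class AggregateFile:
    """Hypothetical in-memory aggregation file that tracks its own offsets."""

    buffer: io.BytesIO = field(default_factory=io.BytesIO)
    current_offset: int = 0
    index: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def append(self, name: str, data: bytes) -> Tuple[int, int]:
        """Append one small file and record its (start offset, size)."""
        start = self.current_offset
        self.buffer.seek(start)
        self.buffer.write(data)
        self.current_offset = start + len(data)
        self.index[name] = (start, len(data))
        return self.index[name]

    @property
    def total_length(self) -> int:
        """Total length of the aggregation file after the appends so far."""
        return self.current_offset

    def read(self, name: str) -> bytes:
        """Read one small file back using its recorded offset and size."""
        start, size = self.index[name]
        return self.buffer.getvalue()[start:start + size]


agg = AggregateFile()
first = agg.append("a.txt", b"abc")
second = agg.append("b.txt", b"de")
```

Each append returns the offset and size that would later be spliced into the Value stored in the KV system.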
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating an embodiment of step S33 according to the present invention, where step S33 includes:
s331, if the data information of the small file in the first cache container is successfully written into the aggregation file, caching the small file name corresponding to the small file in a second cache container;
the second cache container is a local cache container for caching the aggregation-completed file, that is, the second cache container is used for caching the small aggregated file.
Specifically, the data information of the small files in the local cache container for caching the files to be aggregated is successfully written into the aggregated file of the cloud storage system, and then the aggregated information which can represent the small files is cached in the local cache container for caching the aggregated files.
In some embodiments, in the local cache container for caching the aggregation completion file, the information that can represent the small file after the aggregation completion can be a small file name or a small file number, which can represent the content of the small file.
And S332, if the data information of the small file in the first cache container is not written into the aggregation file, caching the small file name corresponding to the small file in a third cache container.
The third cache container is a local cache container for caching the aggregation failure file, that is, the third cache container is used for caching the aggregation failure small file.
Specifically, if the data information of the small files in the local cache container for caching the files to be aggregated is not successfully written into the aggregation file of the cloud storage system, the information which is used for caching the aggregation failure file and can represent the small files is cached in the local storage container for caching the aggregation failure file.
In some embodiments, in the local cache container for caching the aggregation failure file, the information that the aggregation failure can represent the small file may be a small file name or a small file number, which can represent the content of the small file.
In some embodiments, after the caching to the aggregation file is completed, the method further includes: deleting the file list to be merged and its corresponding metadata information according to the cache information of the second cache container, so as to free the corresponding space.
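The bookkeeping of steps S331 and S332 can be illustrated with a minimal sketch. It assumes the aggregation file behaves like a writable byte stream and that the cache containers are plain Python lists of small file names; the function names `aggregate` and `prune_pending` are hypothetical and do not appear in the application.

```python
import io


def aggregate(first_container, aggregation_file):
    """Write each small file into the aggregation file and sort names by outcome."""
    second_container = []  # names of small files whose aggregation completed
    third_container = []   # names of small files whose aggregation failed
    for name, data in first_container.items():
        try:
            aggregation_file.write(data)
        except (OSError, ValueError):
            # The write did not succeed: remember the name for a later retry.
            third_container.append(name)
        else:
            second_container.append(name)
    return second_container, third_container


def prune_pending(pending, second_container):
    """Drop successfully aggregated names from the file list to be merged."""
    done = set(second_container)
    return [name for name in pending if name not in done]


# Hypothetical usage with an in-memory stand-in for the cloud aggregation file.
aggregation_file = io.BytesIO()
done, failed = aggregate({"a.jpg": b"AAAA", "b.jpg": b"BB"}, aggregation_file)
remaining = prune_pending(["a.jpg", "b.jpg", "c.jpg"], done)
```

Keeping failed names in a separate container means a later pass can retry them without rescanning the whole list to be merged.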
Referring to fig. 6, fig. 6 is a schematic flow chart of an embodiment of the steps following step S33 of the present invention, which includes:
S34, when the size of the aggregation file reaches a preset threshold, closing the aggregation file.
The preset threshold limits the size of the aggregation file; that is, once the size of the aggregation file reaches the preset threshold, the aggregation file is no longer used for storing information.
Specifically, a preset threshold is set for the aggregation file; when the size of the aggregation file reaches the preset threshold, the aggregation file is closed and no new information is stored in it.
In some embodiments, if the aggregation file still has some remaining space but the file to be stored is larger than that remaining space, the aggregation file is likewise regarded as having reached the preset threshold and may be closed.
S35, creating a new aggregation file in the cloud storage system in the form of a large file, and caching the data information of the remaining small files using the newly created aggregation file.
The newly created aggregation file is a cloud storage file newly created in the cloud storage system in the form of a large file.
Specifically, after the original aggregation file reaches the preset threshold and is closed, a new cloud storage file is created in the cloud storage system in the form of a large file to serve as a new aggregation file, and the data information of the remaining small files is cached in the new aggregation file.
In some embodiments, one or more new aggregation files may be created, so long as the storage requirement for the data information of the successfully aggregated small files is met.
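The rollover described in steps S34 and S35 can be sketched as follows. This is only an illustration under stated assumptions: aggregation files are modeled as in-memory byte buffers rather than cloud storage files, and the class name `AggregationWriter` and its fields are hypothetical.

```python
class AggregationWriter:
    """Appends small files to an aggregation file and rolls over at a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold  # preset size limit of one aggregation file
        self.closed_files = []      # aggregation files that have been closed
        self.current = bytearray()  # the aggregation file currently open

    def append(self, data):
        """Store one small file; return (aggregation file number, start offset)."""
        # Close the current file when the small file does not fit in the
        # remaining space, then continue in a newly created aggregation file.
        if self.current and len(self.current) + len(data) > self.threshold:
            self.closed_files.append(bytes(self.current))
            self.current = bytearray()
        offset = len(self.current)
        self.current += data
        return len(self.closed_files), offset


# Hypothetical usage with a 10-byte threshold.
writer = AggregationWriter(threshold=10)
first = writer.append(b"aaaaaa")   # fits into aggregation file 0
second = writer.append(b"bbbbbb")  # does not fit: file 0 is closed, file 1 opened
third = writer.append(b"cc")       # fits into aggregation file 1
```

The returned pair is the kind of information an index entry needs later: which aggregation file holds the small file and where it starts.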
S4, setting key-value pairs based on the aggregation file, and caching the key-value pairs to a plug-in storage system.
A key-value pair consists of a key and its corresponding value, and the value can be obtained through the key; the plug-in storage system may be an independent storage subsystem connected with the cloud storage system and the database.
In some embodiments, the cloud data of the small files may be managed in the plug-in storage system.
Referring to fig. 7, fig. 7 is a schematic flowchart of an embodiment of step S4 of the present invention, where step S4 includes:
S41, setting a storage directory of the aggregation file as a Key value, and setting index information of the small files in the aggregation file as a Value.
After the aggregation file has stored the successfully aggregated small files, a storage directory of the aggregation file is generated according to the information that can represent the small files, and the corresponding small file index information is generated according to the successfully aggregated small files.
Specifically, after all the successfully aggregated small files are stored in the aggregation file, the information representing the small files is acquired to generate the storage directory of the aggregation file, which is used as the Key value; and the small file index information is generated according to the successfully aggregated small files and used as the Value.
In some embodiments, the storage directory of the aggregation file may be a small file name, a small file number, or other information that can represent a small file; the small file index information may be a character string formed by concatenating the start offset and the size of the small file within the aggregation file, a character string formed by concatenating the number and the size of the small file, a character string formed by concatenating the start position and the size of the small file, or any other information that can represent the small file index.
S42, caching the Key value and the Value into the plug-in storage system.
Specifically, the Key value and the Value are recorded in the plug-in storage system.
In some embodiments, the plug-in storage system may be a standalone distributed KV storage system or another storage module that implements a pure storage function.
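One possible encoding of the key-value pairs of steps S41 and S42 is sketched below. The layout is an assumption made for illustration: the Key is taken to be the storage directory joined with the small file name, the Value is a colon-separated string that also carries the path of the aggregation file, and a plain dictionary stands in for the plug-in storage system. The function names are hypothetical.

```python
def make_key(directory, small_file_name):
    """Build the Key from the storage directory and the small file name."""
    return directory.rstrip("/") + "/" + small_file_name


def make_value(aggregation_path, offset, size):
    """Concatenate the index information into one string (assumed layout)."""
    return f"{aggregation_path}:{offset}:{size}"


def parse_value(value):
    """Recover (aggregation path, start offset, size) from a Value string."""
    # Split from the right so that a colon inside the path is preserved.
    aggregation_path, offset, size = value.rsplit(":", 2)
    return aggregation_path, int(offset), int(size)


# Hypothetical usage: a dict plays the role of the plug-in KV storage system.
kv_store = {}
key = make_key("/cam1/", "a.jpg")
kv_store[key] = make_value("/cam1/agg_0001", 4096, 512)
index = parse_value(kv_store[key])
```

Because the Value is a short string, one entry costs far less than a full metadata record per small file in the database.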
S5, reading and downloading the small file based on the key-value pair.
In a key-value pair, each piece of data is addressed by a unique Key value, through which the Value holding the actually stored content can be obtained; the small file can therefore be read and downloaded through the key-value pair.
Referring to fig. 8, fig. 8 is a schematic flowchart of an embodiment of step S5 of the present invention, where step S5 includes:
S51, in response to a user retrieval request, retrieving a database to obtain a corresponding database file list, and retrieving the plug-in storage system according to the Key value to obtain a corresponding small file list.
The database file list is a list of the files stored in a database (DataBase), and the small file list is a list of the corresponding small files in the plug-in storage system.
Specifically, when file retrieval is needed, the database (DataBase) is retrieved according to the retrieval conditions in response to the user's retrieval request, and the database file list under the specified directory is acquired from the database and cached; the plug-in storage system is then retrieved according to the Key value representing the storage directory of the aggregation file, to acquire the small file list under the same directory.
In some embodiments, the file list under the specified directory obtained from the database (DataBase) may contain ordinary large files, aggregation files in the form of large files, or both.
In some embodiments, the plug-in storage system may be retrieved by fuzzy matching or, if necessary, by exact matching.
S52, integrating the database file list and the small file list, removing the aggregation files, and returning the result to the user.
The database file list contains ordinary large files and aggregation files, and the small file list contains the small files stored in the aggregation files; after the two lists are integrated, the aggregation files therefore duplicate the small files they contain and need to be removed, and a file list without duplicate data is returned to the user.
Specifically, the file list under the specified directory acquired from the database (DataBase) and the small file list under the same directory acquired from the plug-in storage system are integrated, and after the duplicate aggregation files are deleted, the file list without duplicate data is returned to the user.
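The list integration of steps S51 and S52 can be sketched as follows. The sketch assumes that keys in the plug-in storage system have the form directory plus small file name, that a simple prefix match stands in for the retrieval, and that the set of aggregation file names is known to the caller; `list_directory` and its parameters are hypothetical names.

```python
def list_directory(db_files, kv_store, directory, aggregation_files):
    """Merge the database list with the small file list for one directory."""
    prefix = directory.rstrip("/") + "/"
    # Small files come from the plug-in storage system, matched by key prefix.
    small = [key[len(prefix):] for key in kv_store if key.startswith(prefix)]
    # Aggregation files are internal containers and are hidden from the user.
    large = [name for name in db_files if name not in aggregation_files]
    return sorted(large + small)


# Hypothetical usage.
db_files = ["video.mp4", "agg_0001"]
kv_store = {
    "/cam1/a.jpg": "/cam1/agg_0001:0:4",
    "/cam1/b.jpg": "/cam1/agg_0001:4:4",
    "/cam2/c.jpg": "/cam2/agg_0007:0:4",
}
listing = list_directory(db_files, kv_store, "/cam1", {"agg_0001"})
```

The user sees ordinary large files and small files side by side, while the aggregation file that physically holds the small files never appears in the result.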
S53, reading, by the user, the data information of the small file from the cloud storage system based on the Value.
The Value contains the index information of the small file in the aggregation file, so the small file name can be parsed from the index information in order to access the cloud storage system and read the related data.
Specifically, the small file name is parsed based on the index information in the aggregation file corresponding to the Value, and the cloud storage system is accessed according to the small file name to read and download the corresponding data information, so that the small file is located quickly and small file read performance is improved.
In some embodiments, the small file name may be the Value recorded by the plug-in storage system, which contains the index information of the small file in the aggregation file, such as the start offset, the file size, and other information related to the small file.
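Once the start offset and the file size have been recovered from the Value, reading a small file reduces to a ranged read of the aggregation file. The sketch below assumes the aggregation file is available as a seekable byte stream; a real cloud storage client would issue an equivalent ranged request. The function name `read_small_file` is hypothetical.

```python
import io


def read_small_file(aggregation_file, offset, size):
    """Read exactly `size` bytes starting at `offset` of the aggregation file."""
    aggregation_file.seek(offset)
    data = aggregation_file.read(size)
    if len(data) != size:
        # A short read means the index entry does not match the file contents.
        raise ValueError("index points past the end of the aggregation file")
    return data


# Hypothetical usage: two small files packed back to back.
aggregation_file = io.BytesIO(b"helloworld")
first = read_small_file(aggregation_file, 0, 5)
second = read_small_file(aggregation_file, 5, 5)
```

Only the requested byte range is touched, so locating one small file does not require scanning the rest of the aggregation file.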
Referring to fig. 9, fig. 9 is a schematic flow chart of the present invention when a cached small file needs to be deleted, including:
A1, acquiring the proportion of deleted files in the aggregation file.
A small file is deleted mainly by deleting the corresponding small file metadata in the plug-in storage system.
Specifically, when the client requests deletion of a small file, only the metadata record of the small file in the plug-in storage system is deleted. Meanwhile, the detailed information of the aggregation file to which the small file belongs is maintained. For example, the aggregation file to which the small file requested to be deleted belongs is denoted aggregation file 1, and the small file requested to be deleted is denoted small file 1.
The proportion of invalid content in the aggregation file is recorded or updated: if the small file requested to be deleted is the first small file deleted from the aggregation file, the ratio of the size of the small file to the size of the aggregation file is calculated and recorded; otherwise, the calculated ratio is added to the previously recorded ratio and the sum is stored. For example, if small file 1 is the first small file deleted from aggregation file 1, the ratio of the size of small file 1 to the size of aggregation file 1 is calculated and recorded; if small file 1 is not the first small file deleted from aggregation file 1, the ratio of the size of small file 1 to the size of aggregation file 1 is added to the previously recorded ratio and the sum is stored.
In some embodiments, when deletion of a small file is requested, only the small file metadata record in the distributed KV system is deleted.
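The accumulation of the invalid-content proportion in step A1 can be sketched as follows. It assumes the per-aggregation-file proportion is kept in a dictionary keyed by aggregation file name and that sizes are given in bytes; `record_deletion` is a hypothetical name.

```python
def record_deletion(invalid_ratio, aggregation_name, small_size, aggregation_size):
    """Add the deleted small file's share to the aggregation file's invalid ratio."""
    ratio = small_size / aggregation_size
    # A first deletion starts from zero; later deletions add to the stored value.
    invalid_ratio[aggregation_name] = invalid_ratio.get(aggregation_name, 0.0) + ratio
    return invalid_ratio[aggregation_name]


# Hypothetical usage: two deletions from a 1024-byte aggregation file.
invalid_ratio = {}
after_first = record_deletion(invalid_ratio, "aggregation file 1", 256, 1024)
after_second = record_deletion(invalid_ratio, "aggregation file 1", 512, 1024)
```

Comparing the returned value with the proportion threshold after each deletion is enough to decide whether the aggregation file should be regenerated.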
A2, if the proportion reaches a proportion threshold, regenerating a new aggregation file from the small files that are not deleted in the aggregation file, and deleting the original aggregation file.
Here, the proportion is the proportion of the aggregation file occupied by the small files requested to be deleted; the proportion threshold is used for managing the size of the aggregation file, that is, if many files in the aggregation file have been deleted, the original aggregation file needs to be deleted to free the corresponding space and mitigate severe space fragmentation.
Specifically, it is judged whether the proportion of deleted files in aggregation file 1 reaches the proportion threshold; if so, the plug-in storage system is retrieved to obtain all the small files in aggregation file 1 that are currently not deleted, the corresponding small file names are parsed to obtain the start offset and the length of each undeleted small file in aggregation file 1, and the detailed information of the small files is stored in the database (DataBase) in order of start offset; aggregation file 2 is then regenerated from the undeleted small files, the detailed information of the undeleted small files is transferred into aggregation file 2, and the original aggregation file is deleted.
Referring to fig. 10, fig. 10 is a schematic flow chart of an embodiment of step A2 of the present invention, where step A2 includes:
A21, acquiring the cumulative proportion of deleted files.
The deleted files are the small files that have been deleted from the aggregation file.
Specifically, the sizes of the deleted small files in the aggregation file are acquired, and the proportion of the aggregation file that they occupy is calculated.
In some embodiments, one small file or a plurality of small files may have been deleted, so the proportion is the cumulative proportion of deleted files.
A22, if the proportion reaches the proportion threshold, retrieving the plug-in storage system to obtain the small files that are not deleted in the original aggregation file, and generating an aggregation subfile.
Specifically, if the proportion of deleted files in aggregation file 1 reaches the proportion threshold, then in order to free space the plug-in storage system is retrieved to obtain all the small files in aggregation file 1 that are currently not deleted, the corresponding small file names are parsed to obtain the actual offset and the file length within the aggregation file, and the detailed information of the small files is stored in the database (DataBase) in order of start offset; the corresponding aggregation file 2 is then generated from the undeleted small files, and the small file data is transferred to aggregation file 2 according to the detailed information of the small files.
A23, deleting the original aggregation file.
The deleted original aggregation file is the aggregation file whose data has been transferred.
Specifically, the data of aggregation file 1, whose data has been transferred, duplicates that of the newly generated aggregation file 2, so aggregation file 1 is deleted, thereby releasing the space fragments and freeing the corresponding space.
In some embodiments, each time the re-aggregation of a small file in aggregation file 1 is completed, the cloud data information of that small file in the plug-in storage system is modified; that is, the Key value is modified to the storage directory of aggregation file 2, and the corresponding Value is modified to the index information of the small file in aggregation file 2.
In some embodiments, the small file index information corresponding to the modified Value may be a character string formed by concatenating the start offset and the size of the small file within the aggregation file, a character string formed by concatenating the number and the size of the small file, a character string formed by concatenating the start position and the size of the small file, or any other information that can represent the small file index.
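The re-aggregation of steps A21 to A23 can be sketched as a compaction pass. The sketch assumes the original aggregation file is available as a bytes object and that the index of undeleted small files maps each name to a (start offset, size) pair; `compact` is a hypothetical name and the in-memory model is only illustrative.

```python
def compact(old_data, live_index):
    """Copy undeleted small files, in offset order, into a new aggregation file."""
    new_data = bytearray()
    new_index = {}
    # Process the surviving small files in order of their original start offset.
    for name, (offset, size) in sorted(live_index.items(), key=lambda item: item[1][0]):
        new_index[name] = (len(new_data), size)  # index entry in the new file
        new_data += old_data[offset:offset + size]
    return bytes(new_data), new_index


# Hypothetical usage: the small file stored at offset 4 was deleted earlier.
aggregation_file_1 = b"AAAABBBBCCCC"
live_index = {"c.jpg": (8, 4), "a.jpg": (0, 4)}
aggregation_file_2, new_index = compact(aggregation_file_1, live_index)
```

After the copy, the entries in `new_index` are what would be written back to the plug-in storage system before the original aggregation file is deleted.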
Different from the prior art, in this embodiment small files are acquired; a file list to be merged is determined based on the small files; the data information of the small files is cached to an aggregation file according to the file list to be merged, the aggregation file being a cloud storage file created in a cloud storage system in the form of a large file; key-value pairs are set based on the aggregation file and cached to the plug-in storage system; and the small files are read and downloaded based on the key-value pairs. In the present application, after the small files are acquired they are stored in the form of large files, the corresponding key-value pairs are set in the plug-in storage system, and the small files are read and written according to the key-value pairs, so that mass small files can be read and written quickly and the read-write efficiency of mass small files is improved.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an embodiment of the system for reading and writing the mass small files according to the present invention, where the system can execute the steps of the method for reading and writing the mass small files, and related contents refer to detailed descriptions in the method, which are not described herein again.
The mass small file reading and writing system 200 includes: an acquisition module 210, a determining module 220, a first caching module 230, a second caching module 240, and a reading module 250. The acquisition module 210 is configured to acquire small files; the determining module 220 is configured to determine a file list to be merged based on the small files; the first caching module 230 is configured to cache the data information of the small files into an aggregation file according to the file list to be merged, the aggregation file being a cloud storage file created in a cloud storage system in the form of a large file; the second caching module 240 is configured to set key-value pairs based on the aggregation file and cache the key-value pairs to the plug-in storage system; and the reading module 250 is configured to read and download the small files based on the key-value pairs.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the invention. The electronic equipment can execute the steps in the method for reading and writing the mass small files. The electronic device 300 includes: the memory 310 and the processor 320 coupled to the memory, wherein the memory 310 stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the method for reading and writing the mass small files.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the invention. The computer-readable storage medium 400 stores at least one program 410, and the at least one program 410 is loaded and executed by a processor to implement the method for reading and writing the mass small files.
According to this scheme, the small files are aggregated in the form of large files, so that large files and small files are automatically distinguished and stored, and the small files are adapted to prior-art cloud storage systems that emphasize the storage of large files; with the corresponding key-value pairs, the amount of cloud data is greatly reduced and the pressure on the database is relieved, so that the scheme can be applied to storage requirements on the order of hundreds of billions, meets the corresponding business storage requirements, and improves the read-write performance of mass small files.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules or units is only a logical division, and other divisions are possible in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (14)

1. A method for reading and writing mass small files, characterized in that the method is applied to a Fuse-based distributed cloud storage system, wherein the Fuse-based distributed cloud storage system at least comprises a distributed cloud storage system and a plug-in storage system; the method comprises the following steps:
acquiring a small file;
determining a list of files to be merged based on the small files;
caching the data information of the small files to an aggregation file according to the file list to be merged; the aggregation file is a cloud storage file which is created in a cloud storage system in a large file form;
setting a key value pair based on the aggregation file, and caching the key value pair to a plug-in storage system;
and reading and downloading the small file based on the key-value pair.
2. The method of claim 1,
the caching the data information of the small files to the aggregation file according to the file list to be merged comprises the following steps:
retrieving the list of the files to be merged, and caching the retrieved small files into a first cache container;
creating an aggregated file in the cloud storage system in a large file form;
and caching the data information of the small files in the first cache container to the aggregation file.
3. The method of claim 2,
the caching the data information of the small files in the first cache container to an aggregate file includes:
if the data information of the small file in the first cache container is successfully written into the aggregation file, caching the small file name corresponding to the small file in a second cache container;
and if the data information of the small file in the first cache container is not written into the aggregation file, caching the small file name corresponding to the small file in a third cache container.
4. The method of claim 3,
after the caching of the aggregation file is completed, the method further comprises the following steps: and deleting the file list to be merged and the corresponding metadata information according to the cache information of the second cache container.
5. The method of claim 2,
after caching the data information of the small files in the first cache container to an aggregation file, the method further comprises the following steps:
when the size of the aggregation file reaches a preset threshold value, closing the aggregation file;
and establishing an aggregation file in the cloud storage system in a large file form, and caching the data information of the rest small files by using the established aggregation file.
6. The method of claim 1,
the setting of the key-value pair based on the aggregation file and the caching of the key-value pair to the plug-in storage system comprise the following steps:
setting a storage directory of the aggregated file as a Key Value, and setting index information of the small file in the aggregated file as a Value;
and caching the Key Value and the Value into a plug-in storage system.
7. The method of claim 6,
the reading and downloading of the small file based on the key-value pair includes:
responding to a user retrieval request, retrieving a database to obtain a corresponding database file list and retrieving a corresponding small file list of the plug-in storage system according to the Key value;
integrating the database file list and the small file list, removing the aggregate file, and returning to the user;
and reading the data information of the small file from the cloud storage system by the user based on the Value.
8. The method of claim 1,
when the cached small file needs to be deleted, the method further comprises the following steps:
acquiring the proportion of the deleted file in the aggregated file;
and if the ratio reaches a ratio threshold, regenerating a new aggregation file for the small files which are not deleted in the aggregation file, and deleting the original aggregation file.
9. The method of claim 8,
if the ratio reaches the ratio threshold, regenerating a new aggregation file for the small files which are not deleted in the aggregation file, and deleting the original aggregation file, wherein the method comprises the following steps:
acquiring the cumulative proportion of deleted files;
if the proportion reaches the proportion threshold, retrieving the plug-in storage system to obtain small files which are not deleted in the original aggregation files, and generating aggregation subfiles according to the small files;
and deleting the original aggregation file.
10. The method of claim 1,
the acquiring of the small file comprises the following steps:
caching the written file to a cloud storage system in a large file form;
acquiring the size of the written file;
and if the written file is smaller than a preset file size threshold value, recording as a small file.
11. The method of claim 1,
the determining a list of files to be merged based on the small files comprises:
acquiring a small file name corresponding to the small file;
and determining a file list to be merged based on the small file name.
12. A system for reading and writing a mass of small files, the system comprising:
the acquisition module is used for acquiring the small files;
the determining module is used for determining a file list to be merged based on the small files;
the first caching module is used for caching the data information of the small files into an aggregation file according to the file list to be merged; the aggregation file is a cloud storage file which is created in a cloud storage system in a large file form;
the second caching module is used for setting a key value pair based on the aggregation file and caching the key value pair to the plug-in storage system;
and the reading module reads and downloads the small file based on the key value pair.
13. An electronic device, characterized in that the electronic device comprises:
a memory and a processor coupled to the memory, the memory storing at least one computer program that, when loaded and executed by the processor, is adapted to carry out the method of any of claims 1-11.
14. A computer-readable storage medium, characterized in that it stores at least one program which, when loaded and executed by a processor, is adapted to carry out the method of any one of claims 1-11.
CN202211035779.1A 2022-08-26 2022-08-26 Mass small file reading and writing method and system, electronic device and storage medium Pending CN115481086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211035779.1A CN115481086A (en) 2022-08-26 2022-08-26 Mass small file reading and writing method and system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211035779.1A CN115481086A (en) 2022-08-26 2022-08-26 Mass small file reading and writing method and system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115481086A true CN115481086A (en) 2022-12-16

Family

ID=84421972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211035779.1A Pending CN115481086A (en) 2022-08-26 2022-08-26 Mass small file reading and writing method and system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115481086A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493284A (en) * 2023-10-30 2024-02-02 安徽鼎甲计算机科技有限公司 File storage method, file reading method, file storage and reading system

Similar Documents

Publication Publication Date Title
US8650164B2 (en) Efficient storage and retrieval for large number of data objects
US8683228B2 (en) System and method for WORM data storage
US8843454B2 (en) Elimination of duplicate objects in storage clusters
US8560786B2 (en) Efficient use of memory and accessing of stored records
CN107911461B (en) Object processing method in cloud storage system, storage server and cloud storage system
US7577808B1 (en) Efficient backup data retrieval
US8095678B2 (en) Data processing
US20100106696A1 (en) File management method
CN109710185A (en) Data processing method and device
CN111104377B (en) File management method, electronic device and computer readable storage medium
CN112714359A (en) Video recommendation method and device, computer equipment and storage medium
CN112416880A (en) Method and device for optimizing storage performance of mass small files based on real-time merging
CN115481086A (en) Mass small file reading and writing method and system, electronic device and storage medium
CN113448946B (en) Data migration method and device and electronic equipment
CN102346783A (en) Data retrieval method and device
CN109710194A (en) The storage method and device of upper transmitting file
CN106371770B (en) Method for writing data and device
CN109521957A (en) A kind of data processing method and device
US8886656B2 (en) Data processing
CN115576956B (en) Data processing method, system, equipment and storage medium
CN112181918B (en) Quick pre-allocation method for video file of camera for embedded system
CN117540176B (en) Data recovery analysis method and system based on solid state disk
US8290993B2 (en) Data processing
CN113220211A (en) Data storage system, data access method and related device
CN114461572A (en) Metadata collection method and device for distributed file system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination