CN109284273B - Massive small file query method and system adopting suffix array index

Info

Publication number: CN109284273B
Application number: CN201811133108.2A
Authority: CN
Inventors: 赵鑫; 孙茜; 农革
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2018-09-27
Filing date: 2018-09-27
Publication date: 2021-09-21
Anticipated expiration: 2038-09-27
Also published as: CN109284273A

Abstract

The invention discloses a massive small file query method adopting a suffix array index. Small files are merged before being stored on a distributed file system, which improves space utilization; a suffix array index is established for each small file to record its storage information and attribute information; an effective small file update method is provided; and small files can be queried in multiple modes. This avoids the single-mode, inefficient querying of massive small files found in traditional approaches and ensures real-time, accurate, and efficient queries. The invention thereby solves the problems that arise in the prior art when small files are simply merged: a single query mode, low read efficiency, difficulty updating small files, and poor real-time query performance.

Description

Massive small file query method and system adopting suffix array index

Technical Field

The invention relates to the field of big data management, in particular to a massive small file query method and a massive small file query system adopting suffix array index.

Background

In the current big data era, modern information applications generate large volumes of data, which puts pressure on storage and management. Many common distributed file systems, represented by HDFS, are designed primarily for storing large files. When small files are stored directly, each small file occupies a complete storage unit, which wastes space. In addition, creating metadata for each small file consumes a large amount of server memory, and storage and retrieval slow down once the number of small files reaches a certain scale.

The usual solution is to merge small files before storing them in the distributed file system. In the prior art, however, the index is mainly built directly on the offsets of the small files within the large file, for example a hash index, which amounts to simple merging. This approach leads to problems such as a single query mode for small files, low read efficiency, difficulty updating small files, and no guarantee of real-time query performance.

Disclosure of Invention

The invention aims to solve the problems caused by simple merging of small files in the prior art, namely a single query mode, low read efficiency, difficulty updating small files, and poor real-time query performance, and provides a massive small file query method adopting a suffix array index.

To achieve this purpose, the following technical solution is adopted:

A massive small file query method adopting a suffix array index comprises the following steps:

Small file storage step:

a client submits a file upload request;

acquiring the size of each file and judging whether it is a small file; if a file is judged not to be a small file, establishing a suffix array index for it and uploading the file to the distributed file system; if a file is judged to be a small file, placing it into a merge queue to be merged, establishing a suffix array index for each small file, and uploading the merged file to the distributed file system.

Small file query step:

acquiring and analyzing a query request;

determining a query type;

determining the designated field to be queried and the query condition;

searching the designated field in the suffix array index according to the query condition to obtain the suffix array index records that meet the condition;

acquiring the position information of the small files in the distributed file system from the suffix array index records, and acquiring the corresponding small files from the distributed file system.

In this scheme, a suffix array index is established for each small file to record its storage information and attribute information, and the small files are then merged and stored in a distributed file system. The method supports querying small files in multiple modes, avoids the single-mode, inefficient querying of massive small files found in traditional approaches, and ensures real-time, accurate, and efficient queries.

Preferably, the judgment in the storage step is made as follows: the size of the default storage unit of the distributed file system is defined as threshold b, and threshold a is defined as a value smaller than threshold b; files smaller than threshold a are small files, and files greater than or equal to threshold a are non-small files.

Preferably, the suffix array index in the storage step comprises five fields: the small file name, the small file size, the name of the file in the distributed system in which the small file is stored, the offset of the small file within that file, and the creation time; each field comprises metadata, a suffix array, and a field information structure, wherein the metadata records the specific content of that field for each file;

wherein the field information structure comprises the number of files stored in the field, the size of the field's metadata, and the file information structure FileInfo of each file in the field;

the FileInfo comprises an index deletion marker, the size of the file's attribute content in the field's metadata, the offset of that content within the field's metadata, and the file ID.
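As a reading aid, the field layout described above can be sketched as Python dataclasses. The class and attribute names (`FileInfo`, `FieldInfo`, `Field`, `add_value`) are illustrative assumptions, as is the NUL byte used to separate entries in the metadata; the text does not prescribe a concrete encoding. The suffix array is rebuilt by naive sorting purely for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileInfo:
    deleted: int      # index deletion marker: 0 = not deleted, 1 = deleted
    meta_size: int    # size of this file's attribute content in the field metadata
    meta_offset: int  # offset of that content within the field metadata
    file_id: int      # file ID, shared by all fields of the same small file


@dataclass
class FieldInfo:
    file_count: int = 0      # number of files stored in the field
    metadata_size: int = 0   # size of the field's metadata
    files: List[FileInfo] = field(default_factory=list)


@dataclass
class Field:
    name: str
    metadata: bytes = b""
    suffix_array: List[int] = field(default_factory=list)
    info: FieldInfo = field(default_factory=FieldInfo)

    def add_value(self, file_id: int, value: str) -> None:
        """Append one file's attribute value to this field (hypothetical helper)."""
        encoded = value.encode("utf-8")
        self.info.files.append(
            FileInfo(deleted=0, meta_size=len(encoded),
                     meta_offset=len(self.metadata), file_id=file_id)
        )
        # Assumption: entries are separated by a NUL byte.
        self.metadata += encoded + b"\x00"
        self.info.file_count += 1
        self.info.metadata_size = len(self.metadata)
        # Naive suffix array rebuild; a real index would use a faster construction.
        self.suffix_array = sorted(
            range(len(self.metadata)), key=lambda i: self.metadata[i:]
        )
```

With this layout, one `Field` instance would exist per attribute (file name, size, and so on), and the same `file_id` ties together the entries of one small file across all five fields.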

Preferably, the storage step further includes: when the total size of the files in the merge queue reaches threshold b, merging the files in binary form, establishing a suffix array index for each file, and then uploading the merged file; after uploading is finished, the merge queue is emptied and reclaimed.

Preferably, the query types in the query step include an exact query and a fuzzy query.

Preferably, searching the designated field in the query step specifically comprises: querying the metadata and the suffix array to find the offset of the matching item in the metadata recorded by the suffix array index, and finding the corresponding file ID in the FileInfo according to that offset.
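The lookup described here, from query string to metadata offset to file ID, can be illustrated with a short sketch. It assumes the field metadata is a byte string in which entries are separated by NUL bytes, and that the start offset and file ID of each entry are available as two parallel lists; the function names are hypothetical. The suffix array is built by naive sorting, which is adequate only for illustration.

```python
from bisect import bisect_right
from typing import List


def build_suffix_array(text: bytes) -> List[int]:
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: bytes, sa: List[int], pattern: bytes) -> List[int]:
    """Return the offsets of all occurrences of pattern, via binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-byte prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-byte prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def offset_to_file_id(starts: List[int], ids: List[int], offset: int) -> int:
    """Map a metadata offset to the file whose entry contains it.

    starts holds the ascending entry offsets (as in FileInfo), ids the file IDs.
    """
    return ids[bisect_right(starts, offset) - 1]


# Example: two file names stored back to back in the filename field.
metadata = b"picture\x00book.txt\x00"
sa = build_suffix_array(metadata)
hits = find_occurrences(metadata, sa, b"book")
file_ids = [offset_to_file_id([0, 8], [1, 2], h) for h in hits]
```

An exact query would additionally check that a hit starts at an entry offset and spans the whole entry; a substring hit like the one above corresponds to a fuzzy match.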

Preferably, the method further comprises the following step:

Small file update step:

acquiring a small file to be updated;

searching the suffix array index, and marking the small file to be updated as deleted;

uploading the updated small file;

physically reorganizing, on the distributed file system, the merged files that contain old versions of small files and meet the reorganization condition; wherein the reorganization condition is defined as follows: the sum of the sizes of the small files in a merged file that have not been updated is defined as its effective utilization space; a threshold is set for the effective utilization space of each merged file on the distributed file system; and the reorganization condition is met when the number of merged files whose effective utilization space is smaller than the threshold reaches a specified number.

Preferably, the effective utilization space in the update step is calculated as follows: querying the suffix array index records of the small files whose deletion marker is 0 to obtain the merged files in which those small files are located; calculating the effective utilization space of each merged file in the distributed file system; and judging whether the effective utilization space reaches the threshold; if it does not, the merged file is marked as a merged file whose effective utilization space is smaller than the threshold.
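The calculation of effective utilization space and the reorganization condition can be sketched as follows. The record layout (merged file name, small file size, deletion marker) and the function names are assumptions made for illustration. Unlike the text, which queries only records whose deletion marker is 0, this sketch also iterates over deleted records so that a merged file with no live small files is reported with an effective space of 0.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (merged file name, small file size in bytes, deletion marker 0 or 1)
Record = Tuple[str, int, int]


def effective_space(records: Iterable[Record]) -> Dict[str, int]:
    """Sum the sizes of non-deleted small files per merged file."""
    space: Dict[str, int] = defaultdict(int)
    for merged_name, size, deleted in records:
        space[merged_name] += 0 if deleted else size
    return dict(space)


def files_below_threshold(space: Dict[str, int], threshold: int) -> List[str]:
    """Merged files whose effective utilization space is below the threshold."""
    return sorted(name for name, used in space.items() if used < threshold)


def should_reorganize(space: Dict[str, int], threshold: int, min_count: int) -> bool:
    """Reorganization condition: enough merged files fall below the threshold."""
    return len(files_below_threshold(space, threshold)) >= min_count
```

Both `threshold` and `min_count` are tuning parameters; the text leaves their values open.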

Meanwhile, the invention also provides a system applying the method, comprising:

the file size judging module is used for judging whether each file to be uploaded is a small file;

the merging module is used for merging the small files;

the file uploading module is used for uploading the merged file or uploading a non-small file;

the index module is used for creating a suffix array index for each file;

the query module is used for providing a plurality of query types for querying the massive small files;

the query file acquisition module is used for acquiring the queried small files from the distributed file system according to the suffix array index records;

the file updating module is used for updating the small files;

the merged file reorganization module is used for reorganizing the merged files on the distributed file system after small files are updated, deleting the old small files, regenerating new merged files, and storing the new merged files on the distributed file system.

Preferably, the query types provided by the query module comprise exact query and fuzzy query.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The massive small file query method adopting a suffix array index establishes a suffix array index for each small file to record its storage information and attribute information, and then merges the small files for storage on a distributed file system. The invention also provides an effective small file update method: compared with the conventional approach of direct physical deletion and reconstruction, it uses logical deletion followed by physical reorganization, which greatly reduces I/O overhead. The method supports querying small files in multiple modes, avoids the single-mode, inefficient querying of massive small files found in traditional approaches, and solves the problems caused by simple merging of small files in the prior art: a single query mode, low read efficiency, difficulty updating small files, and poor real-time query performance.

Drawings

Fig. 1 is a flowchart of a method for storing a small file according to an embodiment of the present invention.

Fig. 2 is a flowchart of a method for querying a small file according to an embodiment of the present invention.

Fig. 3 is a flowchart of a method for updating a small file according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of the module connections of the system of the present invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;

it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

This embodiment applies the present invention to a Hadoop Distributed File System (HDFS).

The attribute data of the two small files in this embodiment are shown in Table 1. There are five attributes, that is, five fields: the small file name filename, the small file size filesize, the name uninfileme of the file in the distributed system in which the small file is stored, the offset fileoffset of the small file within that file, and the creation time date.

TABLE 1

The suffix array indexes corresponding to the two files are shown in Table 2. Each field comprises metadata, which records the specific content of that field for each file, a suffix array, and a field information structure.

TABLE 2

The field information structure of the filename field for the two files is shown in Table 3. It includes the number of files stored in the field, filenamesum, the size of the field's metadata, and the file information structure FileInfo of each file in the field. FileInfo records, for each file in the field, an index deletion marker delete (0 means not deleted, 1 means deleted), the size of the file's attribute content in the field's metadata, the offset of that content within the field's metadata, and the file ID fileID. Since the files corresponding to the index records are not deleted in this embodiment, delete is 0.

TABLE 3

As shown in fig. 1, the small file storing step includes:

A1. submitting a file upload request and acquiring massive small files of any type;

A2. judging, one by one, the sizes of the files to be stored in the HDFS;

Since the default block size of Hadoop 2.x is 128 MB, threshold b is set to 128 MB, and threshold a is set to 32 MB according to specific requirements.

If the file size is smaller than threshold a, jump to step A3; otherwise, the file is put into a new upload queue, which meets the upload condition, and the process jumps to step A5;

A3. putting a file judged to be a small file into the merge queue, with a second judgment made before it is put in: whether the sum of the sizes of all files in the merge queue is greater than threshold b. If so, the current merge queue meets the upload condition, so the small file is placed into a new merge queue and the process enters step A4; otherwise, the process returns to step A2 to judge the size of the next file;

A4. binary-merging the files in the merge queue that meets the upload condition of A3, forming an upload queue that meets the upload condition;

A5. establishing suffix array indexes for all files in the upload queue that meets the upload condition (for the small files before merging, if merging took place), recording each file's own information and its position information on the HDFS; the index information is stored on a server that maintains the index, and the suffix array index comprises the small file name, the small file size, the name of the file on the HDFS in which the small file is stored, the offset of the small file within that file, and the creation time;

A6. uploading the merged file in binary form, or the single large file (converted into binary form), to the HDFS, then emptying and reclaiming the upload queue of A5;

A7. judging whether any files remain to be uploaded; if not, the upload request is finished; otherwise, jump to step A2.

When the files in the merge queue are merged, the files in the queue are retained; a new merged file is created on the HDFS, and the queued files are copied into it. Therefore, when the file index is established, the information of each file before merging can be obtained from the files in the queue.
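The grouping logic of steps A2 to A4 can be sketched as follows. The sketch only plans upload units and records the offset each small file would have inside its merged file; it does not talk to HDFS. The function name `plan_uploads` is hypothetical, and the flush rule is one interpretation of step A3: the queue is closed when adding the next small file would push its total size above threshold b.

```python
from typing import Iterable, List, Tuple

THRESHOLD_B = 128 * 1024 * 1024  # default block size in the embodiment
THRESHOLD_A = 32 * 1024 * 1024   # small file threshold in the embodiment

# One entry of an upload unit: (file name, offset in the uploaded file, size)
Entry = Tuple[str, int, int]


def plan_uploads(
    files: Iterable[Tuple[str, int]],
    a: int = THRESHOLD_A,
    b: int = THRESHOLD_B,
) -> List[List[Entry]]:
    """Group (name, size) pairs into upload units in the order they complete."""
    units: List[List[Entry]] = []
    queue: List[Entry] = []
    queued = 0  # total bytes currently in the merge queue
    for name, size in files:
        if size >= a:
            # Non-small file: uploaded on its own (A2 -> A5).
            units.append([(name, 0, size)])
            continue
        if queue and queued + size > b:
            # Current merge queue meets the upload condition (A3 -> A4).
            units.append(queue)
            queue, queued = [], 0
        # The offset of a small file is the number of bytes merged before it.
        queue.append((name, queued, size))
        queued += size
    if queue:
        units.append(queue)  # flush the last, partially filled queue
    return units
```

Each `(name, offset, size)` entry carries exactly the position information that the suffix array index needs to record in step A5.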

In this embodiment, the small file whose small file name filename in Table 1 is picture is queried. As shown in fig. 2, the small file query process includes:

B1. acquiring a query request;

B2. analyzing the query request;

B3. determining whether the query request is an exact query or a fuzzy query;

B4. determining the designated field to be queried and the query condition from the parsed query request; in this embodiment, the designated field is the filename field, and the query condition is picture;

B5. searching the filename field in the suffix array index: querying the metadata and the suffix array to find the offset of picture in the metadata, and then finding the corresponding file ID in the FileInfo according to that offset;

B6. acquiring the corresponding data in the other fields of the suffix array index according to the file ID to obtain the complete small file information, which comprises the small file name, the small file size, the name of the file on the HDFS in which it is stored, the offset within that HDFS file, and the creation time;

B7. acquiring the corresponding storage file on the HDFS, and extracting the queried small file according to its offset and size.
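Step B7 amounts to a positioned read inside the merged file. The sketch below uses an in-memory stream as a stand-in for the merged file on the HDFS; the function name `read_small_file` and the sample bytes are illustrative assumptions, and a real deployment would open the stream through an HDFS client instead.

```python
import io
from typing import BinaryIO


def read_small_file(merged: BinaryIO, offset: int, size: int) -> bytes:
    """Extract one small file from a merged file using its index record."""
    merged.seek(offset)
    data = merged.read(size)
    if len(data) != size:
        # The index record does not fit inside the merged file.
        raise ValueError("index record points past the end of the merged file")
    return data


# Stand-in for a merged file: 4 bytes of another file, then the queried file.
merged_file = io.BytesIO(b"AAAA" + b"picture-bytes" + b"ZZ")
content = read_small_file(merged_file, offset=4, size=13)
```

Because only `size` bytes are read starting at `offset`, the cost of retrieving one small file does not depend on how many other files share the merged file.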

An exact query is used to find a complete value, for example the small file book.txt. A fuzzy query requires the format and rules of its wildcards to be specified, such as a wildcard matching any characters, _ matching a single character, and [abcd] matching any single character in the string abcd. For example, searching for a small file named b_e.text can return several small files that meet the fuzzy query condition, such as bce.text and bde.text.
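The wildcard rules above can be illustrated by translating a fuzzy pattern into a regular expression. This sketch matches against a plain list of names with Python's `re` module rather than against the suffix array, so it shows the wildcard semantics only, not the indexed search. The symbol for the "any characters" wildcard is not legible in the text; `%` is used here purely as an assumption and can be changed through the `any_chars` parameter.

```python
import re
from typing import Iterable, List


def wildcard_to_regex(pattern: str, any_chars: str = "%") -> "re.Pattern[str]":
    """Translate _ , [set] and the any-characters wildcard into a regex."""
    parts: List[str] = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "_":
            parts.append(".")            # exactly one character
        elif ch == any_chars:
            parts.append(".*")           # any run of characters (assumed symbol)
        elif ch == "[":
            end = pattern.find("]", i + 2)  # require a non-empty set
            if end != -1:
                parts.append("[" + re.escape(pattern[i + 1:end]) + "]")
                i = end
            else:
                parts.append(re.escape(ch))  # unmatched bracket: literal
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def fuzzy_match(pattern: str, names: Iterable[str], any_chars: str = "%") -> List[str]:
    """Return the names that match the whole fuzzy pattern."""
    rx = wildcard_to_regex(pattern, any_chars)
    return [n for n in names if rx.fullmatch(n)]


matches = fuzzy_match("b_e.text", ["bce.text", "bde.text", "bcde.text"])
```

Here `matches` contains `bce.text` and `bde.text`, the two names given in the example above, while `bcde.text` is rejected because `_` stands for exactly one character.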

As shown in fig. 3, the small file updating step includes:

C1. acquiring a small file to be updated;

C2. searching suffix array indexes, and acquiring fileIDs of corresponding indexes according to information such as small file names;

C3. finding out corresponding index information in all domains according to the fileID, and changing the delete identification into 1;

C4. uploading the updated small files;

C5. judging whether the reorganization condition for merged files on the HDFS is met;

C6. if the condition is met, performing the reorganization.
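Steps C1 to C4 describe an update as a logical deletion followed by a fresh upload. The sketch below keeps index records in a flat list rather than in per-field suffix arrays, and looks records up by a linear scan, so it illustrates only the marking logic; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexRecord:
    file_id: int
    filename: str
    merged_name: str  # file on the distributed file system holding the bytes
    offset: int
    size: int
    deleted: int = 0  # 0 = not deleted, 1 = deleted


class SmallFileIndex:
    def __init__(self) -> None:
        self.records: List[IndexRecord] = []
        self._next_id = 1

    def add(self, filename: str, merged_name: str, offset: int, size: int) -> int:
        """Register a newly uploaded small file and return its file ID."""
        rec = IndexRecord(self._next_id, filename, merged_name, offset, size)
        self.records.append(rec)
        self._next_id += 1
        return rec.file_id

    def update(self, filename: str, merged_name: str, offset: int, size: int) -> int:
        """Mark the old version as deleted (C2-C3), then index the new one (C4)."""
        for rec in self.records:
            if rec.filename == filename and rec.deleted == 0:
                rec.deleted = 1  # logical deletion only; bytes stay in place
        return self.add(filename, merged_name, offset, size)

    def live(self, filename: str) -> List[IndexRecord]:
        """Records for filename that have not been logically deleted."""
        return [r for r in self.records if r.filename == filename and r.deleted == 0]
```

The old bytes remain in their merged file until the reorganization of steps C5 and C6 runs, which is what keeps an individual update cheap in terms of I/O.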

As shown in fig. 4, the query system applied by the method of the present invention includes:

the file acquisition module 1 is used for acquiring massive files to be uploaded, and the file type can support files in any format;

the file size judging module 2 is used for judging whether each file to be uploaded is a small file;

the merging module 3 is used for merging a plurality of small files;

the file uploading module 4 is used for uploading the combined file or uploading a non-small file;

an indexing module 5, configured to create a suffix array index for each file;

the query module 6 is used for providing a plurality of query types for querying the massive small files, such as exact query and fuzzy query;

the query file acquisition module 7 is used for acquiring the queried small files from the distributed file system according to the suffix array index records;

the file updating module 8 is used for updating the small files;

the merged file reorganization module 9 is used for reorganizing the merged files on the distributed file system after small files are updated, deleting the old small files, regenerating new merged files, and storing the new merged files on the distributed file system.

The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A massive small file query method adopting suffix array index is characterized by comprising the following steps:

small file storage step:

a client submits a file uploading request;

acquiring the size of each file and judging whether it is a small file; if a file is judged not to be a small file, establishing a suffix array index for it and uploading the file to the distributed file system; if a file is judged to be a small file, placing it into a merge queue to be merged, establishing a suffix array index for each small file, and uploading the merged file to the distributed file system;

small file query step:

acquiring and analyzing a query request;

determining a query type;

determining a designated domain to be queried and a query condition;

2. The method for querying the mass small files by using suffix array index as claimed in claim 1, wherein the specific process of the judgment in the storage step is as follows: the size of a default storage unit on a distributed file system is defined as a threshold b, the threshold a is defined as a value smaller than the threshold b, files smaller than the threshold a are small files, and files larger than or equal to the threshold a are non-small files.

3. The method for querying the massive small files by using the suffix array index as claimed in claim 1, wherein the suffix array index in the storing step comprises five fields including a small file name, a small file size, a file name of the small file corresponding to the small file stored on the distributed system, an offset of the small file corresponding to the file stored on the distributed system, and creation time; each domain comprises metadata, a suffix array and a domain information structure, wherein the metadata is used for recording the specific content of the file corresponding to the domain;

4. The method for querying the mass small files by using the suffix array index as claimed in claim 2, wherein the storing step further comprises merging the files in a binary form and establishing the suffix array index for each file when the size of the file in the merging queue reaches a threshold b, and then uploading the merged files; and emptying and recycling the files in the merging queue after uploading is finished.

5. The method for querying mass small files by using suffix array index as claimed in claim 1, wherein the query types in the query step include exact query and fuzzy query.

6. The method for querying the mass small files by using suffix array index as claimed in claim 1, wherein searching the designated field in the query step specifically comprises: querying the metadata and the suffix array to find the offset of the matching item in the metadata recorded by the suffix array index, and finding the corresponding file ID in the FileInfo according to that offset.

7. The method for querying mass small files by using suffix array index as claimed in claim 6, further comprising:

updating the small file:

acquiring a small file to be updated;

uploading the updated small files;

8. The method for querying mass small files by using suffix array index as claimed in claim 7, wherein the effective utilization space in the update step is calculated as follows: querying the suffix array index records of the small files whose deletion marker is 0 to obtain the merged files in which those small files are located; calculating the effective utilization space of each merged file in the distributed file system; and judging whether the effective utilization space reaches the threshold; if it does not, the merged file is marked as a merged file whose effective utilization space is smaller than the threshold.

9. A massive small file query system adopting suffix array index is characterized by comprising:

the file acquisition module is used for acquiring massive files to be uploaded, and the file type can support files in any format;

the merging module is used for merging the small files;

the index module is used for creating suffix array indexes for each file;

the query module is used for providing a plurality of query types for querying the mass small files;

the query file acquisition module is used for acquiring the queried small files from the distributed file system according to the suffix array index records;

the file updating module is used for updating the small files;

10. The mass small file query system adopting suffix array index as claimed in claim 9, wherein the query types provided by the query module include exact query and fuzzy query.