US20200311035A1 - Hybrid file system architecture, file storage, dynamic migration, and application thereof - Google Patents

Info

Publication number
US20200311035A1
Authority
US
United States
Prior art keywords
file
file system
distributed
distributed file
migration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US16/831,964
Other versions
US10810169B1 (en)
Inventor
Yeh-Ching Chung
Lidong Zhang
Yongwei Wu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute Tsinghua University
Original Assignee
Shenzhen Research Institute Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute Tsinghua University
Assigned to RESEARCH INSTITUTE OF TSINGHUA UNIVERSITY IN SHENZHEN. Assignment of assignors' interest (see document for details). Assignors: CHUNG, YEH-CHING; WU, YONGWEI; ZHANG, LIDONG
Publication of US20200311035A1
Application granted
Publication of US10810169B1
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/185 Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/119 Details of migration of file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • the present disclosure relates to a technical field of distributed file systems, and more particularly, to a hybrid file system architecture having a plurality of distributed file systems hybridized therein, file storage, dynamic migration, and application thereof.
  • HDFS has high read and write performance with respect to large files.
  • Experimental analysis shows that HDFS has better read and write performance when files are larger than 8M, while GlusterFS has better I/O performance with respect to files smaller than 8M; and so on.
  • One of the technical problems to be solved by the present disclosure is: in a case where a variety of high-performance file systems coexist, how to make full use of performance advantages of various file systems, integrate a variety of file systems, make full use of their respective advantages, improve storage efficiency, improve overall performance, and comprehensively process various situations to achieve optimal overall performance of the file systems.
  • a file storage processing method applied in a hybrid file system architecture including a plurality of different types of distributed file systems, for determining in which distributed file system a file to be stored is stored, the file storage processing method comprising: acquiring storage attributes of the file to be stored, wherein the storage attributes at least include a size of the file; determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored; and storing the file to be stored in the determined distributed file system.
  • the storage rule is an intelligent storage model obtained through learning by using an artificial intelligence learning algorithm based on a training sample set; and features of each training sample of the training sample set include the storage attributes of the file and a label of the file system to which the file has been determined to be assigned.
  • the storage attributes of the file further include: access mode type, access permission level, and associated users of the file, wherein, the access mode type is selected from one of: read-only, write-only, read-write, and executable.
  • the hybrid file system architecture includes a metadata manage server, wherein the storage rule is stored in a non-volatile storage medium and is also maintained in the memory of the metadata manage server; and the storage rule is dynamically updated, wherein the determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored includes: reading the storage rule from the metadata manage server, and determining, according to the read storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored.
  • the storage rule is further maintained in a remote standby node.
  • the artificial intelligence learning algorithm is a decision tree, and the intelligent storage model is a decision tree model constructed based on training data.
  • optimization processing including pruning and cross-validation is performed in construction of the decision tree model.
  • the file storage processing method further comprises: receiving, by the metadata manage server, from a client a request to read a file from the hybrid file system architecture or update a file therein; acquiring, by the metadata manage server, path information of the file to be read or updated, to further obtain storage location information of the file; returning, by the metadata manage server, the storage location of the file to be read or updated to the client; and communicating, by the client, with a corresponding distributed file system according to the returned storage location, to perform actual read operation or update operation.
  • I/O performance of the file on each of the distributed file systems is determined experimentally as follows: acquiring a read throughput rate $F_{irt}$ and a write throughput rate $F_{iwt}$ of the file on each distributed file system through experiments, the read throughput rate $F_{irt}$ being a data size of the file read per second, and the write throughput rate $F_{iwt}$ being a data size of the file written per second; and calculating a sum of the read throughput rate $F_{irt}$ and the write throughput rate $F_{iwt}$ of the file in each distributed file system as the I/O performance of the file on each of the distributed file systems.
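Written as an equation, the definition above might look as follows. The symbol $P_i$ is an illustrative name introduced here for the I/O performance of the file on distributed file system $i$; it does not appear in the source.

```latex
P_i = F_{irt} + F_{iwt}
```

For instance, with hypothetical measurements of 120 MB/s read throughput and 80 MB/s write throughput on one file system, the I/O performance of the file on that system would be 200 MB/s.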
  • the file storage processing method further comprises: determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • the determining a distributed file system that needs file migration includes: calculating a difference in usage rate between any two distributed file systems; and determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
  • the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes: calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
  • the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems includes: referring to the distributed file system that needs file migration as a distributed file system i, referring to any one of the other distributed file systems as a distributed file system j, and referring to the file on the distributed file system i as a file x; obtaining read throughput and write throughput of the file x on the distributed file system i, and predicting read throughput and write throughput of the file x on the distributed file system j; obtaining a read frequency and a write frequency of the file x on the distributed file system i; and calculating a migration gain of migrating the file x from the distributed file system i to the distributed file system j, at least based on the size of the file x, the read frequency and the write frequency of the file x on the distributed file system i, the read throughput and the write throughput of the file x on the distributed file system i, as well as the read through
  • the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on a formula below:
  • $DFS_i$ and $DFS_j$ represent the distributed file systems i, j;
  • $F_{xrt}(DFS_i)$ and $F_{xrt}(DFS_j)$ are respectively read throughput rates of the file x in the distributed file systems i, j;
  • $F_{xwt}(DFS_i)$ and $F_{xwt}(DFS_j)$ are respectively write throughput rates of the file x in the distributed file systems i, j;
  • a throughput rate is the amount of file data read or written per second; the read throughput rate and the write throughput rate are functions of the file size;
  • $F_{xrf}$ and $F_{xwf}$ are respectively the read frequency and the write frequency of the file x in the distributed file system i; and
  • $s_x$ is a size of the file x to be migrated in the file system.
  • the predicting read throughput and write throughput of the file x on the distributed file system j includes: predicting by using a predetermined regression model, the regression model being selected from one of:
  • the predetermined regression model is determined through a fitting process and a selecting process below: inputting file training data to different types of regression models; calculating unknown parameters by using a least square method; fitting to obtain the different types of regression models after the fitting; and selecting a regression model with a best fitting effect from the different types of regression models after the fitting as the predetermined regression model.
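The fitting and selecting process described above can be sketched as follows. This is a minimal illustration, not the source's implementation: the two candidate model families (linear and logarithmic in file size), the function names, and the synthetic data are all assumptions, since the list of candidate regression models is not reproduced in this text. Each candidate is fitted by ordinary least squares, and the one with the smallest residual sum of squares is taken as having the best fitting effect.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Candidate model families (assumed for illustration only):
# each maps a file size to the single feature used by the linear fit.
CANDIDATES = {
    "linear": lambda s: s,
    "logarithmic": lambda s: math.log(s),
}

def select_model(sizes, throughputs):
    """Fit every candidate; return (name, a, b, rss) with the smallest
    residual sum of squares (rss)."""
    best = None
    for name, feature in CANDIDATES.items():
        xs = [feature(s) for s in sizes]
        a, b = fit_line(xs, throughputs)
        rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, throughputs))
        if best is None or rss < best[3]:
            best = (name, a, b, rss)
    return best

def predict(model, size):
    """Predict throughput for a file size with the selected model."""
    name, a, b, _ = model
    return a + b * CANDIDATES[name](size)

# Synthetic training data that follows a logarithmic curve exactly.
sizes = [1, 2, 4, 8, 16, 32]
throughputs = [10 + 20 * math.log(s) for s in sizes]
model = select_model(sizes, throughputs)
print(model[0], predict(model, 64))
```

Because both candidates are linear in their parameters, the same closed-form least squares routine serves both; a real system would likely fit one such model per distributed file system and per operation (read, write).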
  • the obtaining a read frequency and a write frequency of the file x on the distributed file system i includes: obtaining the read frequency and the write frequency of the file x on the distributed file system i by querying the metadata manage server.
  • a file dynamic migration method applied in a hybrid file system architecture including a plurality of different types of distributed file systems comprising: determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • the determining a distributed file system that needs file migration includes: calculating a difference in usage rate between any two distributed file systems; and determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
  • the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes: calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
  • the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems includes: referring to the distributed file system that needs file migration as a distributed file system i, referring to any one of the other distributed file systems as a distributed file system j, and referring to the file on the distributed file system i as a file x; obtaining read throughput and write throughput of the file x on the distributed file system i, and predicting read throughput and write throughput of the file x on the distributed file system j; obtaining a read frequency and a write frequency of the file x on the distributed file system i; and calculating a migration gain of migrating the file x from the distributed file system i to the distributed file system j, at least based on the size of the file x, the read frequency and the write frequency of the file x on the distributed file system i, the read throughput and the write throughput of the file x on the distributed file system i, as well as the read through
  • the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on a formula below:
  • $DFS_i$ and $DFS_j$ represent the distributed file systems i, j;
  • $F_{xrt}(DFS_i)$ and $F_{xrt}(DFS_j)$ are respectively read throughput rates of the file x in the distributed file systems i, j;
  • $F_{xwt}(DFS_i)$ and $F_{xwt}(DFS_j)$ are respectively write throughput rates of the file x in the distributed file systems i, j;
  • a throughput rate is the amount of file data read or written per second; the read throughput rate and the write throughput rate are functions of the file size;
  • $F_{xrf}$ and $F_{xwf}$ are respectively the read frequency and the write frequency of the file x in the distributed file system i; and
  • $s_x$ is a size of the file x to be migrated in the file system.
  • the predicting read throughput and write throughput of the file x on the distributed file system j includes: predicting by using a predetermined regression model, the regression model being selected from one of:
  • the predetermined regression model is determined through a fitting process below: inputting file training data to different regression models; calculating unknown parameters by using a least square method; and obtaining a curve with a best fitting effect as the predetermined regression model.
  • the obtaining a read frequency and a write frequency of the file x on the distributed file system i includes: obtaining the read frequency and the write frequency of the file x on the distributed file system i by querying the metadata manage server.
  • a file storage processing device comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • a file migration processing system comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • a computer-readable storage medium having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • a computer-readable storage medium having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • a metadata manage server in a hybrid file system architecture system, which interacts with a client and a plurality of distributed file systems, the metadata manage server maintaining a pre-configured storage rule below, and being configured to perform a method below: acquiring storage attributes of a file to be stored, wherein, the storage attributes at least include a size of the file; determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored; determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • a hybrid file system architecture system comprising a metadata manage server and a plurality of different types of distributed file systems.
  • the file intelligent storage policy according to the embodiment of the present disclosure is adopted to make full use of storage features of a variety of file systems, integrate a variety of file systems, and intelligently select the file underlying storage policy according to the file feature attributes, to optimize file read and write performances.
  • the intelligent storage policy is the decision tree model; the training data is acquired through previous experiments, then the decision tree model is obtained by training, subsequently the stored file attributes are used as input of the decision tree model, and output thereof is just the file storage location, so as to make the file read and write characteristics the best.
  • file dynamic migration policy is adopted.
  • file system load equalization is used as an evaluation index of the file system, and it is decided whether to migrate the file and to which file system the file is migrated, according to storage space usage rates of different underlying file systems, read and write I/O of different files in different file systems, as well as different read and write frequencies of different files, so as to satisfy usage equalization of different file systems and also minimize performance degradation.
  • the high-performance hybrid file system architecture, the file storage processing method, the file dynamic migration method, and the metadata manage server make comprehensive use of the performance advantages of a variety of distributed file systems to handle various file storage problems; aiming at a universal high-performance file system, they can cope with storage of files of various types under various complex environments while maintaining high performance.
  • FIG. 1 shows a structural schematic diagram of a hybrid file system architecture according to an embodiment of the present disclosure
  • FIG. 2 shows a flow chart of a file storage processing method applied in a hybrid file system architecture according to an embodiment of the present disclosure
  • FIGS. 3A to 3E show schematic diagrams of an exemplary process of constructing an intelligent storage policy decision tree
  • FIG. 4 shows a sequence chart of writing a file in a hybrid file system architecture according to an embodiment of the present disclosure
  • FIG. 5 shows a sequence chart of corresponding operations caused by a file read request or update request from a client after a file has been stored in a hybrid file system architecture
  • FIG. 6 shows an overall flow chart of a file dynamic migration method according to an embodiment of the present disclosure.
  • FIG. 7 shows a schematic diagram of comparison between a throughput fit curve and an actual curve of respective distributed file systems obtained through experiments according to an embodiment of the present disclosure.
  • FIG. 1 shows a structural schematic diagram of a hybrid file system architecture 1000 according to an embodiment of the present disclosure, mainly comprising three parts: an underlying storage system 1100, a metadata manage server 1200, and a client 1300.
  • the diagram shows that the underlying storage system 1100 includes various types of distributed file systems DFS-1, DFS-2, ..., DFS-n, such as Ceph, HDFS, GlusterFS, etc.
  • the client 1300 is for users to read and write data, and provides a variety of frequently-used file system universal interfaces
  • the metadata manage server 1200 is a core module of the hybrid file system architecture; according to one embodiment, the metadata manage server 1200 stores an intelligent storage decision policy 1210 and a dynamic migration policy 1230, and at the same time may store a part of metadata 1220; the metadata manage server 1200, in response to the client's file write request, determines a file storage location according to the file intelligent storage decision policy 1210, and feeds back the same to the client; and the metadata manage server 1200 monitors usage situation of respective distributed file systems DFS-1, ..., DFS-n, and performs file migration between distributed file systems according to the file dynamic migration policy when severe dis
  • FIG. 2 shows a flow chart of a file storage processing method 200 applied in a hybrid file system architecture according to an embodiment of the present disclosure.
  • Step S210: acquiring storage attributes of a file to be stored, wherein the storage attributes at least include a size of the file.
  • the storage attributes of the file further include: access mode type, access permission level, and associated users of the file, wherein, the access mode type is selected from one of: read-only, write-only, read-write, and executable.
  • a metadata manage server obtains the storage attributes of the file to be stored from a client, stores and maintains the same as metadata in its own memory, as shown in FIG. 1 .
  • Step S220: determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored.
  • the storage rule is an intelligent storage model obtained through learning by using an artificial intelligence learning algorithm based on a training sample set; and features of each training sample of the training sample set include the storage attributes of the file and a label of the file system to which the file has been determined to be assigned.
  • the label of the file system to which the file has been determined to be assigned is determined based on experimentally determined I/O performance of the file on each of the distributed file systems, and the I/O performance includes a read throughput rate and/or a write throughput rate.
  • the storage rule may, for example, be stored in a non-volatile storage medium such as a hard disk, while the decision tree model is also maintained in the memory.
  • the storage rule is simultaneously sent to a remote standby node.
  • the storage rule is dynamically updated, for example, according to a certain period; through learning by using the artificial intelligence learning algorithm again, a newly learned storage rule is updated to the metadata manage server; and the storage rule stored in the hard disk and/or the remote node is updated synchronously.
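One way to keep a rule both in memory and on a non-volatile medium, and to swap in a newly learned rule, is sketched below. Everything here is an assumption made for illustration: the class name `RuleStore`, the JSON encoding of the rule, and the write-then-rename update order are not described in the source, and replication to the remote standby node is left out.

```python
import json
import os
import tempfile

class RuleStore:
    """Hypothetical holder for the storage rule: one copy on disk, one in memory."""

    def __init__(self, path):
        self.path = path
        self.rule = None  # in-memory copy consulted on every storage decision

    def update(self, rule):
        """Persist the newly learned rule first, then replace the in-memory copy."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(rule, handle)
        # os.replace renames atomically when source and target share a file system,
        # so a reader never sees a half-written rule file.
        os.replace(tmp_path, self.path)
        self.rule = rule

    def load(self):
        """Reload the rule from disk, for example after a restart."""
        with open(self.path) as handle:
            self.rule = json.load(handle)
        return self.rule

with tempfile.TemporaryDirectory() as workdir:
    store = RuleStore(os.path.join(workdir, "storage_rule.json"))
    store.update({"size_threshold_mb": 8})          # toy rule, illustration only
    restored = RuleStore(store.path).load()
    print(restored)
```

A periodic retraining job could call `update` with each newly learned rule; the same call would be the natural place to push the rule to a standby node.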
  • the determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored includes: reading the storage rule from the metadata manage server, and determining, according to the read storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored.
  • the artificial intelligence learning algorithm is a decision tree, and the intelligent storage model is a decision tree model constructed based on training data. Subsequently, an example of a process of constructing the decision tree model will be described in detail with reference to the drawings.
  • the metadata manage server 1200 determines in which distributed file system the file is stored, by using the intelligent storage model 1210, based on the storage attributes of the file obtained from the client, and returns the same to the client 1300.
  • Step S230: storing the file to be stored in the determined distributed file system.
  • the client 1300 communicates directly with the underlying storage system 1100, and the file is stored in the determined specific distributed file system.
  • the specific distributed file system is selected according to the attributes of the file based on the predetermined storage rule, so as to, for example, improve storage performance and efficiency, and solve the technical problem of how to use different file systems for storage to improve storage efficiency.
  • a variety of distributed file systems are integrated, and system performance is comprehensively improved, by acquiring performance characteristics of various types of distributed file systems for various files through, for example, machine learning in advance, and by comprehensively utilizing advantages of different distributed file systems in a file access process.
  • processing attributes of files with different attributes when stored on these distributed file systems are obtained in advance, for example, I/O performances of files of different sizes on different distributed file systems may be obtained; rules may be established according to the knowledge obtained in advance; and these rules are used when a file is stored subsequently.
  • files of different sizes are selected as experimental data, and tested and assessed in a variety of distributed file systems, to acquire a read throughput rate $F_{irt}$ and a write throughput rate $F_{iwt}$ of different files in different distributed file systems; and then the distributed file system with a maximum result is selected as a training data label according to a formula below.
  • the storage attributes of the file are extracted, including file size, access mode, access permission, and owner; a training data label of each file determined through the above-described experiment is obtained; and data shown in Table 1 is acquired as the training data.
  • a simplified training data form is used, to acquire a 3-tuple dataset including size, permission, and target DFS; each sample consists of size, permission, and target DFS; and the training dataset is as shown in FIG. 3A.
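The labelling step can be sketched in a few lines. This is a minimal illustration that assumes the label is the file system with the largest sum of read and write throughput, which matches the I/O performance definition given earlier in this document; the function names and all measurement values are hypothetical.

```python
def label_for(measurements):
    """measurements: DFS name -> (read throughput, write throughput).
    Returns the DFS name with the largest read + write throughput."""
    return max(measurements, key=lambda name: sum(measurements[name]))

def build_training_set(files):
    """Turn (size, permission, measurements) records into
    (size, permission, target DFS) 3-tuples."""
    return [(size, perm, label_for(meas)) for size, perm, meas in files]

# Hypothetical experiment results: throughput in MB/s for two test files.
files = [
    (4, "rw", {"HDFS": (60.0, 40.0), "GlusterFS": (90.0, 70.0)}),
    (64, "r", {"HDFS": (150.0, 110.0), "GlusterFS": (100.0, 80.0)}),
]
training_set = build_training_set(files)
print(training_set)
```

Each resulting row has the simplified form described above, so it can be fed directly to whichever learning algorithm builds the storage rule.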
  • optimization processing including pruning and cross-validation, etc. is performed in construction of the decision tree model.
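To make the idea of learning a storage rule from training rows concrete, the sketch below fits a single-split tree (a decision stump) on file size alone. It is deliberately much simpler than the decision tree model described here: it uses one feature, minimises plain training error, and performs no pruning or cross-validation. The function names and training rows are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows):
    """rows: list of (size, label). Returns (threshold, left_label, right_label)
    with the fewest training errors for the rule: size < threshold -> left_label."""
    sizes = sorted({size for size, _ in rows})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct sizes.
    for low, high in zip(sizes, sizes[1:]):
        threshold = (low + high) / 2
        left = [label for size, label in rows if size < threshold]
        right = [label for size, label in rows if size >= threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct file sizes")
    return best[1:]

def predict(stump, size):
    """Apply the learned single-split rule to a file size."""
    threshold, left_label, right_label = stump
    return left_label if size < threshold else right_label

# Hypothetical training rows: (file size in MB, target DFS).
rows = [(1, "GlusterFS"), (2, "GlusterFS"), (4, "GlusterFS"),
        (16, "HDFS"), (64, "HDFS"), (256, "HDFS")]
stump = fit_stump(rows)
print(stump, predict(stump, 3), predict(stump, 100))
```

A full decision tree would apply the same split search recursively over all attributes, then prune branches that do not improve accuracy on held-out folds.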
  • the decision tree is provided as a preferred example, not as a limitation; other artificial intelligence learning algorithms may also be selected, for example, a deep neural network, a support vector machine, nearest neighbor learning, etc.
  • File operations on the file system include initial storage operation (write operation), and subsequent read and possible update operations.
  • FIG. 4 shows a sequence chart of writing a file in a hybrid file system architecture according to an embodiment of the present disclosure.
  • Step S410: a client sends a file write access request to a metadata manage server.
  • Step S420: the metadata manage server acquires file attribute information.
  • Step S430: the metadata manage server acquires a decision tree model maintained by the metadata manage server.
  • Step S440: the metadata manage server obtains a storage location of the file to be written, based on the file storage attribute information and the decision tree model.
  • Step S450: the metadata manage server returns the storage location of the file to the client.
  • Step S460: the client communicates with a corresponding distributed file system according to the returned storage location, to perform an actual file write operation.
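The write sequence S410 to S460 can be mimicked with in-memory stand-ins. The sketch below is illustrative only: the class and function names are hypothetical, dictionaries stand in for the distributed file systems, a toy size threshold replaces the decision tree model, and recording the chosen location at write time is an assumption about how the later lookup is made possible.

```python
class MetadataServer:
    """In-memory stand-in for the metadata manage server (hypothetical API)."""

    def __init__(self, storage_rule):
        self.storage_rule = storage_rule  # callable: attributes -> DFS name (S430)
        self.locations = {}               # file path -> DFS name

    def handle_write(self, path, attributes):
        # S420-S440: take the file attributes and apply the storage rule.
        target = self.storage_rule(attributes)
        # Assumption: the location is recorded so later reads can find the file.
        self.locations[path] = target
        return target                     # S450: location returned to the client

def size_rule(attributes):
    # Toy rule for illustration only: an 8 MB size threshold.
    return "HDFS" if attributes["size_mb"] > 8 else "GlusterFS"

def client_write(server, backends, path, data, attributes):
    target = server.handle_write(path, attributes)  # S410: request, S450: reply
    backends[target][path] = data                   # S460: write directly to that DFS
    return target

backends = {"HDFS": {}, "GlusterFS": {}}
server = MetadataServer(size_rule)
chosen = client_write(server, backends, "/logs/a.bin", b"x" * 10, {"size_mb": 64})
print(chosen)
```

Note that the file data never passes through the metadata server in this flow; the server only answers the placement question.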
  • FIG. 5 shows a sequence chart of corresponding operations caused by a file read request or update request from a client after a file has been stored in a hybrid file system architecture.
  • Step S510: the client sends a file read request or update request to a metadata manage server.
  • Step S520: the metadata manage server acquires a file path from the read request or the update request.
  • Step S530: the metadata manage server queries a metadata database, to acquire a storage location of the file to be read or updated.
  • Step S540: the metadata manage server feeds back the storage location of the file to the client.
  • Step S550: the client communicates with a corresponding distributed file system according to the returned storage location, and performs actual file read or update operations.
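The read sequence S510 to S550 reduces to a path lookup followed by a direct read. In the sketch below a dictionary stands in for the metadata database and nested dictionaries stand in for the distributed file systems; all names and data are hypothetical.

```python
def locate(metadata_db, path):
    """S520-S540: resolve a file path to the name of the DFS that holds it."""
    try:
        return metadata_db[path]
    except KeyError:
        raise FileNotFoundError(path) from None

def client_read(metadata_db, backends, path):
    dfs_name = locate(metadata_db, path)  # S510-S540: ask where the file lives
    return backends[dfs_name][path]       # S550: read directly from that DFS

metadata_db = {"/logs/a.bin": "HDFS"}
backends = {"HDFS": {"/logs/a.bin": b"payload"}, "GlusterFS": {}}
data = client_read(metadata_db, backends, "/logs/a.bin")
print(data)
```

An update request would follow the same lookup and then write to, rather than read from, the returned location.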
  • file migration may also be performed, that is, a file stored in one distributed file system is migrated to another distributed file system, so that storage capacity of the system may be further improved through migration, to promote load equalization between respective distributed file systems.
  • Step S610: determining a distributed file system that needs file migration.
  • usage situation of respective distributed file systems may also be continuously monitored, to judge whether file migration is needed.
  • Usage rates of the respective distributed file systems may be examined to determine the state of load equalization, that is, usage equalization, between the respective distributed file systems; and in a case where severe disequilibrium in usage rate occurs, file migration, specifically migration of files out, is performed on a distributed file system with an excessively high usage rate.
  • the determining a distributed file system that needs file migration includes: calculating a difference in usage rate between any two distributed file systems; and determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
  • For example, if the usage rate of a distributed file system A is 90% while the usage rate of a distributed file system B is only 10%, severe load disequilibrium has clearly occurred, and a file migration operation may be performed on the distributed file system A.
  • Here, the usage rate of a distributed file system is the ratio of the capacity of the file system actually in use to its original capacity.
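  • The determination of step S 610 can be sketched as below. The function names and the dictionary-based input are assumptions made for illustration, and the value of the predetermined threshold is left to the caller because the text does not fix it.

```python
from itertools import combinations


def usage_rate(used_capacity, total_capacity):
    """Usage rate: capacity actually in use divided by original capacity."""
    return used_capacity / total_capacity


def find_overloaded(usage_rates, threshold):
    """Return the file systems that need file emigration.

    usage_rates: dict mapping a file system name to its usage rate in [0, 1].
    threshold:   predetermined threshold on the pairwise usage-rate difference
                 (hypothetical value supplied by the caller).
    """
    overloaded = set()
    for a, b in combinations(usage_rates, 2):
        if abs(usage_rates[a] - usage_rates[b]) > threshold:
            # Of the two, the system with the higher usage rate migrates files out.
            overloaded.add(a if usage_rates[a] > usage_rates[b] else b)
    return overloaded
```

  • With the figures from the example above, `find_overloaded({"A": 0.9, "B": 0.1}, 0.5)` selects file system A, assuming a threshold of 0.5.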
  • Step S 620 determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration.
  • the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes: calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
  • the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems may be performed as follows:
  • the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on Formula (1) below:
  • $\mathrm{diff}_x(DFS_i, DFS_j) = \left(\frac{s_x}{F_{xrt}(DFS_i)} - \frac{s_x}{F_{xrt}(DFS_j)}\right) F_{xrf} + \left(\frac{s_x}{F_{xwt}(DFS_i)} - \frac{s_x}{F_{xwt}(DFS_j)}\right) F_{xwf} \quad (1)$
  • $DFS_i$ and $DFS_j$ represent the distributed file systems i, j;
  • $F_{xrt}(DFS_i)$ and $F_{xrt}(DFS_j)$ are respectively the read throughput rates of the file x in the distributed file systems i, j;
  • $F_{xwt}(DFS_i)$ and $F_{xwt}(DFS_j)$ are respectively the write throughput rates of the file x in the distributed file systems i, j;
  • a throughput rate is the amount of file data read or written per second; the read throughput rate and the write throughput rate are functions of the file size;
  • $F_{xrf}$ and $F_{xwf}$ are respectively the read frequency and the write frequency of the file x in the distributed file system i; and
  • $s_x$ is the size of the file x to be migrated in the file system.
  • the first term of the sum on the right side of the equal sign represents the overall read-side improvement made by migrating the file x from the distributed file system i to the distributed file system j, that is, a comprehensive migration gain that takes into account file size (a factor of the file system usage rate level), read throughput rate, and read frequency; the second term represents the corresponding write-side improvement, that is, a comprehensive migration gain that takes into account file size, write throughput rate, and write frequency.
  • Formula (1) indicates that the larger the file size, the higher the read and write frequencies, and the greater the throughput rate of the file on the distributed file system j relative to the distributed file system i, the higher the migration gain of migrating the file to the distributed file system j.
  • the read frequency and the write frequency of the file x in the distributed file system i may be obtained by querying the metadata manage server.
  • Formula (1) is a preferred example of calculating a migration gain of a file, but it is not a limitation; and other calculation formulas may also be designed according to needs.
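  • Formula (1) translates directly into code. The sketch below is illustrative only; the function and parameter names are assumptions, and the units (for example MB and MB/s) merely need to be consistent.

```python
def migration_gain(size, read_tp_i, read_tp_j, write_tp_i, write_tp_j,
                   read_freq, write_freq):
    """Migration gain of moving a file from file system i to file system j.

    size:        file size s_x
    read_tp_*:   read throughput rate of the file on i and on j
    write_tp_*:  write throughput rate of the file on i and on j
    read_freq:   read frequency of the file on i
    write_freq:  write frequency of the file on i
    """
    # Time saved per read, weighted by how often the file is read.
    read_gain = (size / read_tp_i - size / read_tp_j) * read_freq
    # Time saved per write, weighted by how often the file is written.
    write_gain = (size / write_tp_i - size / write_tp_j) * write_freq
    return read_gain + write_gain
```

  • For instance, with hypothetical values of a 100 MB file, read throughput rising from 50 MB/s to 100 MB/s, write throughput rising from 25 MB/s to 50 MB/s, a read frequency of 10 and a write frequency of 5, the gain is (2 - 1) * 10 + (4 - 2) * 5 = 20. A negative result means the file would perform worse on the destination.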
  • the read throughput and the write throughput of the file x on the distributed file system i may be obtained by, for example, actual observation, or may also be obtained by prediction; while the read throughput and the write throughput of the file x on the distributed file system j may only be obtained by prediction.
  • predicting the read throughput and the write throughput of the file x on a distributed file system may be performed, for example, by using a predetermined regression model, and the regression model is selected from one of:
  • the predetermined regression model may be determined through a fitting process and a selecting process below: inputting file training data to different types of regression model formulas; calculating unknown parameters by using a least square method; fitting to obtain the different types of regression models after the fitting; and selecting a regression model with a best fitting effect from the different types of regression models after the fitting as the predetermined regression model.
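  • As an illustration of the fitting step, the sketch below fits only the first-order model y(k) = a0 + a1·e^(−pk) by least squares. It assumes a simplification that is not in the disclosure: the decay rate p is scanned over a caller-supplied list of candidate values, and for each candidate the linear parameters a0 and a1 are solved in closed form from the normal equations. The same procedure would be repeated for the higher-order models, and the model with the smallest overall error kept.

```python
import math


def fit_first_order(ks, ys, candidate_rates):
    """Least-squares fit of y(k) = a0 + a1 * exp(-p * k).

    ks, ys:           training data (file sizes and measured throughput rates)
    candidate_rates:  positive candidate values for p (hypothetical search grid)
    Returns a dict with the best p, a0, a1 and the sum of squared errors.
    """
    n = len(ks)
    best = None
    for p in candidate_rates:
        e = [math.exp(-p * k) for k in ks]
        # Normal equations for the two linear unknowns a0 and a1.
        s_e = sum(e)
        s_ee = sum(v * v for v in e)
        s_y = sum(ys)
        s_ey = sum(v * y for v, y in zip(e, ys))
        det = n * s_ee - s_e * s_e
        a0 = (s_y * s_ee - s_e * s_ey) / det
        a1 = (n * s_ey - s_e * s_y) / det
        sse = sum((a0 + a1 * v - y) ** 2 for v, y in zip(e, ys))
        if best is None or sse < best["sse"]:
            best = {"p": p, "a0": a0, "a1": a1, "sse": sse}
    return best


def predict(model, k):
    """Predicted throughput rate for file size k under a fitted first-order model."""
    return model["a0"] + model["a1"] * math.exp(-model["p"] * k)
```

  • The fitted model can then be used to predict the throughput of a file on a destination file system where it has never been stored.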
  • FIG. 7 shows a schematic diagram of comparison between a throughput fit curve and an actual curve of respective distributed file systems obtained through experiments according to an embodiment of the present disclosure.
  • an abscissa represents different file sizes
  • an ordinate represents throughput rates.
  • The target distributed file systems used as experimental objects are Ceph, HDFS and GlusterFS, respectively.
  • the file sizes are substituted into the respective regression model formulas shown in Table 2, and the error is calculated by using a least squares method; the curve fitting effect is optimal when the overall error is minimal. The read and write curves of each type of distributed file system are fitted separately, and it can be seen from FIG. 7 that first-order fitting alone achieves an optimal effect for HDFS write, Ceph write, and Ceph read, while the other types require higher-order fitting.
  • Table 3 shows throughput rate fit curves of different distributed file systems based on experiments and fitting calculations.
  • the target file systems are Ceph, HDFS and GlusterFS, respectively; and it is found through experiments that HDFS write, Ceph write, and Ceph read achieve optimal effects with only first-order fitting, while the other types require higher-order fitting.
  • GlusterFS write curve: $y(k) = 8.43731 + 0.10894\,e^{-0.04518k}\cos(-38.07854k) - 1.89347\,e^{-0.04518k}\sin(-38.07854k) + 1.49443\,e^{-0.61613k}\cos(33.75146k) - 0.05625\,e^{-0.61613k}\sin(33.75146k)$
  • Table 4 is an example physical environment configuration for a high-performance hybrid file system architecture experiment. To meet the architecture requirements, the physical environment of the experiment is divided into one client node, six underlying storage server nodes, and one metadata manage server node; the underlying physical storage nodes may be expanded and are hidden from the client, and every node runs Ubuntu 14.04 with 1 TB of capacity.
  • a file to be migrated may be determined; the migration gain is an expected gain of migrating the file from the file system where it is located to a certain distributed file system, and thus, a destination distributed file system to which the file is to be migrated is also determined.
  • Step S 630 migrating the file that has been determined to be migrated.
  • file migration can be performed in order from a file with a largest migration gain, until a usage rate difference between file systems meets requirements, and the migration is complete.
  • the migration process is a C-D process, that is, copying and then deleting, wherein, mandatory locks are added in a file operation process.
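  • A copy-then-delete (C-D) operation can be sketched on a local file system as follows. This is only an approximation under stated assumptions: an in-process `threading.Lock` per source path stands in for the mandatory locks mentioned above, the size comparison before deletion is an added safety check that the text does not describe, and a real deployment would use the client interfaces of the distributed file systems involved.

```python
import os
import shutil
import threading

_file_locks = {}                      # source path -> lock (hypothetical registry)
_registry_lock = threading.Lock()


def _lock_for(path):
    """Return the lock guarding a given source path, creating it on first use."""
    with _registry_lock:
        return _file_locks.setdefault(path, threading.Lock())


def copy_then_delete(src, dst):
    """Migrate a file: copy it to the destination, then delete the original."""
    with _lock_for(src):              # block concurrent migrations of the same file
        shutil.copy2(src, dst)        # C: copy data and metadata to the target
        if os.path.getsize(dst) != os.path.getsize(src):
            os.remove(dst)
            raise OSError("copy verification failed for " + src)
        os.remove(src)                # D: delete only after the copy succeeded
    return dst
```

  • Deleting only after the copy has completed means that an interrupted migration leaves the original file in place.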
  • a first “for” loop is to determine a difference in usage rate between any two file systems; when there is a difference in usage rate between two file systems that is greater than p0, that is, when load disequilibrium occurs to the file system architecture, a migration procedure is enabled; line 14 is to calculate a migration degree of all files of a file system that needs migration and other file systems; and line 15 is to sort according to the calculated migration degree.
  • Lines 16 to 23 are to migrate: firstly copy the file to the target file system, and then delete the file from the original file system, until the difference in usage rate between file systems meets conditions.
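  • The overall migration procedure described above (detect disequilibrium, rank files by migration gain, then migrate until the usage rates are balanced) can be sketched as the planning function below. It is a simplified illustration, not the pseudocode referred to in the text: the function name, the dictionary inputs, and the `gain` callback are assumptions, only the most heavily used file system is treated as the source, and the copy-then-delete step is represented by capacity bookkeeping alone.

```python
def plan_migration(used, capacity, files, gain, threshold):
    """Return a list of (file name, source, destination) moves.

    used, capacity: dicts mapping file system name -> used / total capacity
    files:          dict mapping file system name -> {file name: size}
    gain:           callable (file name, source, destination) -> migration gain
    threshold:      maximum tolerated usage-rate difference (called p0 in the text)
    """
    used = dict(used)  # work on a copy; the caller's bookkeeping is untouched

    def rate(fs):
        return used[fs] / capacity[fs]

    src = max(used, key=rate)                    # most heavily used file system
    others = [fs for fs in used if fs != src]

    def balanced():
        return all(rate(src) - rate(fs) <= threshold for fs in others)

    # Rank every (file, destination) pair by migration gain, largest first.
    ranked = sorted(
        ((gain(name, src, dst), name, dst) for name in files[src] for dst in others),
        reverse=True,
    )
    moves, moved = [], set()
    for _, name, dst in ranked:
        if balanced():
            break                                # usage difference meets the condition
        if name in moved:
            continue                             # each file is migrated at most once
        size = files[src][name]
        used[src] -= size                        # delete from the source ...
        used[dst] += size                        # ... after copying to the destination
        moved.add(name)
        moves.append((name, src, dst))
    return moves
```

  • Because the candidates are sorted in descending order of gain, the files whose migration is expected to help performance most are moved first, and the loop stops as soon as the usage-rate condition is met.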
  • a file storage processing system comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • a file migration processing system comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • a computer-readable storage medium having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • a computer-readable storage medium having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • a metadata manage server in a hybrid file system architecture system, which interacts with a client and a plurality of distributed file systems, the metadata manage server maintaining a pre-configured storage rule below, and being configured to perform a method below: acquiring storage attributes of a file to be stored, wherein, the storage attributes at least include a size of the file; determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored; determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • a hybrid file system architecture system comprising the above-described metadata manage server and a plurality of different types of distributed file systems.
  • There may be one or more of the above-described processors, which may be concentrated at one physical location or distributed across a plurality of physical locations.
  • Each of the one or more processors may be a device that can execute machine-readable and executable instructions, for example, a computer, a microprocessor, a microcontroller, an integrated circuit, a microchip, or any other computing device.
  • the one or more processors may be coupled to a communication path that provides signal interconnection between different devices, components and/or modules.
  • the communication path may cause any number of processors to be communicatively coupled to each other, and may allow modules coupled to the communication path to operate in a distributed computing environment. Specifically, each module may be operated as a node that can send and/or receive data.
  • “being communicatively coupled” refers to that mutually coupled components may exchange data with each other, for example, in a form of electrical signals, electromagnetic signals, and optical signals.
  • the above-described memory may include one or more memory modules.
  • the memory module may be configured to include a volatile memory, for example, a Static Random Access Memory (S-RAM) and a Dynamic Random Access Memory (D-RAM), as well as a non-volatile memory, for example, a flash memory, a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) and an Electrically Erasable Programmable Read-Only Memory (EEPROM).
  • the machine-readable and executable instructions may be logic or algorithms written in any programming language, for example, a machine language that can be directly executed by a processor, an assembly language that can be assembled into machine-readable instructions and stored in the memory module, an Object-Oriented Programming (OOP) language, the JavaScript language, microcode, etc.
  • the machine-readable and executable instructions may also be written in a hardware description language, for example, logics implemented by a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), etc.
  • the high-performance hybrid file system architecture, the file storage processing method, the file dynamic migration method and the metadata manage server make comprehensive use of the performance advantages of a variety of distributed file systems to handle various file storage problems; aiming at a universal high-performance file system, they can cope with the storage of files of various types in various complex environments while maintaining high performance.

Abstract

Provided are a hybrid distributed file system architecture structure, an applied file storage processing method, a dynamic migration method, and application thereof. The file storage processing method comprises: acquiring storage attributes of a file to be stored, wherein the storage attributes at least include a size of the file; determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored; and storing the file to be stored in the determined distributed file system. The method further comprises migrating, according to a predetermined policy, a file that has been stored in a predetermined storage location. The device intelligently selects a file underlying storage policy according to file feature attributes to decide whether to migrate the file and to which file system the file is migrated so as to satisfy usage equalization of different file systems and also minimize performance degradation. By means of experimental comparison, it is concluded that the present disclosure can greatly improve comprehensive file performances such as I/O performance and the usage equalization of the file system.

Description

    TECHNICAL FIELD
  • The present disclosure relates to a technical field of distributed file systems, and more particularly, to a hybrid file system architecture having a plurality of distributed file systems hybridized therein, file storage, dynamic migration, and application thereof.
  • BACKGROUND
  • In the research field of distributed file systems, research institutes, enterprises and institutions design distributed file systems of different architectures to meet the specific needs of different fields and application scenarios. For example, the Taobao File System (TFS) meets users' storage needs while optimizing Taobao's massive picture storage; HDFS is mainly applied to distributed computing and has good processing performance for large data streams; GlusterFS adopts a design without a metadata server to optimize small file storage and operations involving large amounts of metadata; Facebook has mainly improved HDFS according to the size range of stored files and content requirements; and Ceph is committed to providing a highly available distributed file system and uses a plurality of metadata servers to improve metadata performance. Because these file systems have different design objectives, their universality is relatively poor. For example, HDFS has high read and write performance for large files, and experimental analysis shows that it has better read and write performance when files are larger than 8 MB, while GlusterFS has better I/O performance for files smaller than 8 MB; and so on.
  • In the prior art, there is no related solution for how to use different file systems for storage to improve storage efficiency.
  • SUMMARY
  • One of the technical problems to be solved by the present disclosure is how, in a case where a variety of high-performance file systems coexist, to integrate these file systems and make full use of their respective performance advantages, so as to improve storage efficiency and comprehensively handle various situations to achieve optimal overall performance of the file systems.
  • In this regard, the present disclosure is proposed.
  • According to one aspect of the present disclosure, there is provided a file storage processing method applied in a hybrid file system architecture including a plurality of different types of distributed file systems, for determining in which distributed file system a file to be stored is stored, the file storage processing method comprising: acquiring storage attributes of the file to be stored, wherein, the storage attributes at least include a size of the file; determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored; and storing the file to be stored in the determined distributed file system.
  • Optionally, the storage rule is an intelligent storage model obtained through learning by using an artificial intelligence learning algorithm based on a training sample set; and features of each training sample of the training sample set include the storage attributes of the file and a label of the file system to which the file has been determined to be assigned.
  • Optionally, the storage attributes of the file further include: access mode type, access permission level, and associated users of the file, wherein, the access mode type is selected from one of: read-only, write-only, read-write, and executable.
  • Optionally, the hybrid file system architecture includes a metadata manage server, wherein, the storage rule is stored in a non-volatile storage medium, and meanwhile maintained in a metadata manage server memory; and the storage rule is dynamically updated, wherein, the determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored includes: reading the storage rule from the metadata manage server, and determining, according to the read storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored.
  • Optionally, the storage rule is further maintained in a remote standby node.
  • Optionally, the artificial intelligence learning algorithm is a decision tree, and the intelligent storage model is a decision tree model constructed based on training data.
  • Optionally, optimization processing including pruning and cross-validation is performed in construction of the decision tree model.
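  • To illustrate what a learned storage rule of this kind looks like, the sketch below trains a one-level decision tree (a decision stump) on file size alone. This is a deliberately reduced stand-in for the decision tree model described above: it uses a single attribute, performs no pruning or cross-validation, and the function names and sample layout are assumptions made for the example.

```python
def train_size_stump(samples):
    """Learn a size threshold that best separates two placement labels.

    samples: list of (file size, file system label) training pairs,
             containing at least two distinct sizes.
    Returns (threshold, label below threshold, label at or above threshold).
    """
    sizes = sorted({size for size, _ in samples})
    labels = sorted({label for _, label in samples})
    best = None
    # Candidate thresholds are midpoints between neighbouring observed sizes.
    for lo, hi in zip(sizes, sizes[1:]):
        threshold = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    (below if size < threshold else above) != label
                    for size, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above


def choose_file_system(rule, size):
    """Apply the learned rule to a file to be stored."""
    threshold, below, above = rule
    return below if size < threshold else above
```

  • Trained on toy samples (made up for illustration, not experimental data) in which files of 1, 4 and 6 MB are labeled GlusterFS and files of 10, 16 and 64 MB are labeled HDFS, the stump learns a threshold of 8 MB. A full decision tree would split on further storage attributes such as access mode type and access permission level.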
  • Optionally, the file storage processing method further comprises: receiving, by the metadata manage server, from a client a request to read a file from the hybrid file system architecture or update a file therein; acquiring, by the metadata manage server, path information of the file to be read or updated, to further obtain storage location information of the file; returning, by the metadata manage server, the storage location of the file to be read or updated to the client; and communicating, by the client, with a corresponding distributed file system according to the returned storage location, to perform actual read operation or update operation.
  • Optionally, I/O performance of the file on each of the distributed file systems is determined experimentally as follows: acquiring a read throughput rate $F_{irt}$ and a write throughput rate $F_{iwt}$ of the file on each distributed file system through experiments, the read throughput rate $F_{irt}$ being a data size of the file read per second, and the write throughput rate $F_{iwt}$ being a data size of the file written per second; and calculating a sum of the read throughput rate $F_{irt}$ and the write throughput rate $F_{iwt}$ of the file in each distributed file system as the I/O performance of the file on each of the distributed file systems.
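  • The I/O performance measure above is a simple sum, sketched below. Using it to pick the label of a training sample, that is, labeling each file with the file system on which its measured I/O performance is highest, is an assumption made for this example; the function names are likewise hypothetical.

```python
def io_performance(read_throughput, write_throughput):
    """I/O performance of a file on one file system: read plus write throughput rate."""
    return read_throughput + write_throughput


def label_sample(measurements):
    """Pick the file system with the highest measured I/O performance.

    measurements: dict mapping file system name -> (read throughput, write throughput)
    """
    return max(measurements, key=lambda fs: io_performance(*measurements[fs]))
```

  • For example, with hypothetical measurements of (120, 80) on HDFS, (90, 60) on GlusterFS and (100, 95) on Ceph, the sums are 200, 150 and 195, so the sample would be labeled HDFS.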
  • Optionally, the file storage processing method further comprises: determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • Optionally, the determining a distributed file system that needs file migration includes: calculating a difference in usage rate between any two distributed file systems; and determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
  • Optionally, the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes: calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
  • Optionally, the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems includes: referring to the distributed file system that needs file migration as a distributed file system i, referring to any one of the other distributed file systems as a distributed file system j, and referring to the file on the distributed file system i as a file x; obtaining read throughput and write throughput of the file x on the distributed file system i, and predicting read throughput and write throughput of the file x on the distributed file system j; obtaining a read frequency and a write frequency of the file x on the distributed file system i; and calculating a migration gain of migrating the file x from the distributed file system i to the distributed file system j, at least based on the size of the file x, the read frequency and the write frequency of the file x on the distributed file system i, the read throughput and the write throughput of the file x on the distributed file system i, as well as the read throughput and the write throughput of the file x on the distributed file system j.
  • Optionally, the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on a formula below:

  • $\mathrm{diff}_x(DFS_i, DFS_j) = \left(\frac{s_x}{F_{xrt}(DFS_i)} - \frac{s_x}{F_{xrt}(DFS_j)}\right) F_{xrf} + \left(\frac{s_x}{F_{xwt}(DFS_i)} - \frac{s_x}{F_{xwt}(DFS_j)}\right) F_{xwf} \quad (1)$
  • $DFS_i$ and $DFS_j$ represent the distributed file systems i, j; $F_{xrt}(DFS_i)$ and $F_{xrt}(DFS_j)$ are respectively the read throughput rates of the file x in the distributed file systems i, j; $F_{xwt}(DFS_i)$ and $F_{xwt}(DFS_j)$ are respectively the write throughput rates of the file x in the distributed file systems i, j; a throughput rate is the amount of file data read or written per second; the read throughput rate and the write throughput rate are functions of the file size; $F_{xrf}$ and $F_{xwf}$ are respectively the read frequency and the write frequency of the file x in the distributed file system i; and $s_x$ is the size of the file x to be migrated in the file system.
  • Optionally, the predicting read throughput and write throughput of the file x on the distributed file system j includes: predicting by using a predetermined regression model, the regression model being selected from one of:
  • Model | Regression equation
    First-order model | $y(k) = a_0 + a_1 e^{-pk}$
    Second-order model | $y(k) = a_0 + a_1 e^{-pk} + a_2 e^{-p_2 k}$
    Third-order model | $y(k) = a_0 + a_1 e^{-pk} + b\,e^{-\delta w k}\cos\left(w\sqrt{1-\delta^2}\,k\right) + c\,e^{-\delta w k}\sin\left(w\sqrt{1-\delta^2}\,k\right)$
    Fourth-order model | $y(k) = a_0 + b_1 e^{-\delta_1 w_1 k}\cos\left(w_1\sqrt{1-\delta_1^2}\,k\right) + c_1 e^{-\delta_1 w_1 k}\sin\left(w_1\sqrt{1-\delta_1^2}\,k\right) + b_2 e^{-\delta_2 w_2 k}\cos\left(w_2\sqrt{1-\delta_2^2}\,k\right) + c_2 e^{-\delta_2 w_2 k}\sin\left(w_2\sqrt{1-\delta_2^2}\,k\right)$
  • The predetermined regression model is determined through a fitting process and a selecting process below: inputting file training data to different types of regression models; calculating unknown parameters by using a least square method; fitting to obtain the different types of regression models after the fitting; and selecting a regression model with a best fitting effect from the different types of regression models after the fitting as the predetermined regression model.
  • Optionally, the obtaining a read frequency and a write frequency of the file x on the distributed file system i includes: obtaining the read frequency and the write frequency of the file x on the distributed file system i by querying the metadata manage server.
  • According to another aspect of the present disclosure, there is provided a file dynamic migration method applied in a hybrid file system architecture including a plurality of different types of distributed file systems, comprising: determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • Optionally, the determining a distributed file system that needs file migration includes: calculating a difference in usage rate between any two distributed file systems; and determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
  • Optionally, the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes: calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
  • Optionally, the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems includes: referring to the distributed file system that needs file migration as a distributed file system i, referring to any one of the other distributed file systems as a distributed file system j, and referring to the file on the distributed file system i as a file x; obtaining read throughput and write throughput of the file x on the distributed file system i, and predicting read throughput and write throughput of the file x on the distributed file system j; obtaining a read frequency and a write frequency of the file x on the distributed file system i; and calculating a migration gain of migrating the file x from the distributed file system i to the distributed file system j, at least based on the size of the file x, the read frequency and the write frequency of the file x on the distributed file system i, the read throughput and the write throughput of the file x on the distributed file system i, as well as the read throughput and the write throughput of the file x on the distributed file system j.
  • Optionally, the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on a formula below:

  • $\mathrm{diff}_x(DFS_i, DFS_j) = \left(\frac{s_x}{F_{xrt}(DFS_i)} - \frac{s_x}{F_{xrt}(DFS_j)}\right) F_{xrf} + \left(\frac{s_x}{F_{xwt}(DFS_i)} - \frac{s_x}{F_{xwt}(DFS_j)}\right) F_{xwf} \quad (1)$
  • $DFS_i$ and $DFS_j$ represent the distributed file systems i, j; $F_{xrt}(DFS_i)$ and $F_{xrt}(DFS_j)$ are respectively the read throughput rates of the file x in the distributed file systems i, j; $F_{xwt}(DFS_i)$ and $F_{xwt}(DFS_j)$ are respectively the write throughput rates of the file x in the distributed file systems i, j; a throughput rate is the amount of file data read or written per second; the read throughput rate and the write throughput rate are functions of the file size; $F_{xrf}$ and $F_{xwf}$ are respectively the read frequency and the write frequency of the file x in the distributed file system i; and $s_x$ is the size of the file x to be migrated in the file system.
  • Optionally, the predicting read throughput and write throughput of the file x on the distributed file system j includes:
  • predicting by using a predetermined regression model, the regression model being selected from one of:
  • Model | Regression equation
    First-order model | $y(k) = a_0 + a_1 e^{-pk}$
    Second-order model | $y(k) = a_0 + a_1 e^{-pk} + a_2 e^{-p_2 k}$
    Third-order model | $y(k) = a_0 + a_1 e^{-pk} + b\,e^{-\delta w k}\cos\left(w\sqrt{1-\delta^2}\,k\right) + c\,e^{-\delta w k}\sin\left(w\sqrt{1-\delta^2}\,k\right)$
    Fourth-order model | $y(k) = a_0 + b_1 e^{-\delta_1 w_1 k}\cos\left(w_1\sqrt{1-\delta_1^2}\,k\right) + c_1 e^{-\delta_1 w_1 k}\sin\left(w_1\sqrt{1-\delta_1^2}\,k\right) + b_2 e^{-\delta_2 w_2 k}\cos\left(w_2\sqrt{1-\delta_2^2}\,k\right) + c_2 e^{-\delta_2 w_2 k}\sin\left(w_2\sqrt{1-\delta_2^2}\,k\right)$
  • The predetermined regression model is determined through a fitting process below: inputting file training data to different regression models; calculating unknown parameters by using a least square method; and obtaining a curve with a best fitting effect as the predetermined regression model.
  • Optionally, the obtaining a read frequency and a write frequency of the file x on the distributed file system i includes: obtaining the read frequency and the write frequency of the file x on the distributed file system i by querying the metadata manage server.
  • According to another aspect of the present disclosure, there is provided a file storage processing device, comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • According to another aspect of the present disclosure, there is provided a file migration processing system, comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • According to another aspect of the present disclosure, there is provided a computer-readable storage medium, having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • According to another aspect of the present disclosure, there is provided a computer-readable storage medium, having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • According to another aspect of the present disclosure, there is provided a metadata manage server in a hybrid file system architecture system, which interacts with a client and a plurality of distributed file systems, the metadata manage server maintaining a pre-configured storage rule below, and being configured to perform a method below: acquiring storage attributes of a file to be stored, wherein, the storage attributes at least include a size of the file; determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored; determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • According to another aspect of the present disclosure, there is provided a hybrid file system architecture system, comprising a metadata manage server and a plurality of different types of distributed file systems.
  • The file intelligent storage policy according to the embodiment of the present disclosure is adopted to make full use of storage features of a variety of file systems, integrate a variety of file systems, and intelligently select the file underlying storage policy according to the file feature attributes, to optimize file read and write performances.
  • Preferably, the intelligent storage policy is the decision tree model; the training data is acquired through previous experiments, then the decision tree model is obtained by training, subsequently the stored file attributes are used as input of the decision tree model, and output thereof is just the file storage location, so as to make the file read and write characteristics the best.
  • Further, a file dynamic migration policy is adopted. Preferably, file system load equalization is used as an evaluation index of the file system, and it is decided whether to migrate the file and to which file system the file is migrated, according to storage space usage rates of different underlying file systems, read and write I/O of different files in different file systems, as well as different read and write frequencies of different files, so as to satisfy usage equalization of different file systems and also minimize performance degradation.
  • By means of experimental comparison, it is concluded that the present disclosure can greatly improve performances of different underlying files.
  • The high-performance hybrid file system architecture structure, the file storage processing method, the file dynamic migration method and the metadata manage server according to the embodiments of the present disclosure, make comprehensive use of the performance advantages of a variety of distributed file systems to process various file storage problems, which, committed to improving a universal high-performance file system, can cope with storage problems of files of various types under various complex environments, and all have high performance.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows a structural schematic diagram of a hybrid file system architecture according to an embodiment of the present disclosure;
  • FIG. 2 shows a flow chart of an applied file storage processing method in a hybrid file system architecture according to an embodiment of the present disclosure;
  • FIGS. 3A to 3E show schematic diagrams of an exemplary process of constructing an intelligent storage policy decision tree;
  • FIG. 4 shows a sequence chart of writing a file in a hybrid file system architecture according to an embodiment of the present disclosure;
  • FIG. 5 shows a sequence chart of corresponding operations caused by a file read request or update request from a client after a file has been stored in a hybrid file system architecture;
  • FIG. 6 shows an overall flow chart of a file dynamic migration method according to an embodiment of the present disclosure; and
  • FIG. 7 shows a schematic diagram of comparison between a throughput fit curve and an actual curve of respective distributed file systems obtained through experiments according to an embodiment of the present disclosure.
  • DETAILED DESCRIPTION
  • The following is to disclose the present disclosure so as to enable those skilled in the art to implement the present disclosure. Preferred embodiments as described below are merely exemplary, and those skilled in the art may conceive of other obvious modifications. Basic principles of the present disclosure defined in the following description may be used in other embodiments, modifications, improvements, equivalents, and other technical solutions without departing from the spirit and the scope of the present disclosure.
  • The terms and words used in the following description and claims are not limited to literal meanings, but are merely used by the inventor to enable a clear and consistent understanding of the present disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of various embodiments of the present disclosure is provided for illustration only, rather than limiting the present disclosure as defined by the appended claims and their equivalents.
  • The terminology used herein is for describing various embodiments only and is not intended to limit the same. As used herein, a singular form is intended to include a plural form as well, unless otherwise clearly indicated by the context. It will be further understood that the terms “including” and/or “having”, as used in the specification, specify presence of features, numbers, steps, operations, components, items or combinations thereof as described, and do not exclude presence or addition of one or more features, numbers, steps, operations, components, items or combinations thereof.
  • The technical terms or scientific terms here should be of general meaning as understood by those ordinarily skilled in the art, as long as the terms are not defined differently. It should be understood that the terms defined in commonly used dictionaries have meanings that are consistent with the meanings of terms in the prior art.
  • Hereinafter, the present disclosure will be further described in detail in conjunction with the accompanying drawings and specific embodiments.
  • FIG. 1 shows a structural schematic diagram of a hybrid file system architecture 1000 according to an embodiment of the present disclosure, mainly comprising three parts: an underlying storage system 1100, a metadata manage server 1200, and a client 1300. The underlying storage system 1100 includes various types of distributed file systems DFS-1, DFS-2 . . . DFS-n, such as Ceph, HDFS, GlusterFS, etc., which actually store the data and are transparent to users, that is, the users do not know in which distributed file system the data they care about is stored. The client 1300 is used by users to read and write data, and provides a variety of frequently-used universal file system interfaces. The metadata manage server 1200 is the core module of the hybrid file system architecture. According to one embodiment, the metadata manage server 1200 stores an intelligent storage decision policy 1210 and a dynamic migration policy 1230, and may at the same time store a part of metadata 1220. The metadata manage server 1200, in response to a file write request from the client, determines a file storage location according to the intelligent storage decision policy 1210 and feeds the location back to the client. The metadata manage server 1200 also monitors the usage of the respective distributed file systems DFS-1, . . . , DFS-n, and, when severe disequilibrium in usage rate occurs between file systems, performs file migration between distributed file systems according to the dynamic migration policy 1230, so as to maintain relative equalization in usage rate between the distributed file systems.
  • FIG. 2 shows a flow chart of an applied file storage processing method 200 in a hybrid file system architecture according to an embodiment of the present disclosure.
  • As shown in FIG. 2, step S210: acquiring storage attributes of a file to be stored, wherein, the storage attributes at least include a size of the file.
  • In one example, the storage attributes of the file further include: access mode type, access permission level, and associated users of the file, wherein, the access mode type is selected from one of: read-only, write-only, read-write, and executable.
  • In one example, a metadata manage server obtains the storage attributes of the file to be stored from a client, stores and maintains the same as metadata in its own memory, as shown in FIG. 1.
  • Step S220: determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored.
  • In one example, the storage rule is an intelligent storage model obtained through learning by using an artificial intelligence learning algorithm based on a training sample set; and features of each training sample of the training sample set include the storage attributes of the file and a label of the file system to which the file has been determined to be assigned.
  • In one example, the label of the file system to which the file has been determined to be assigned is determined based on experimentally determined I/O performance of the file on each of the distributed file systems, and the I/O performance includes a read throughput rate and/or a write throughput rate.
  • In one example, in consideration of problems of metadata server node failure and memory data loss, the storage rule, for example, may be stored in a non-volatile storage medium such as a hard disk while the decision tree model is maintained and stored in the memory. In another example, for more security reasons, the storage rule is simultaneously sent to a remote standby node.
  • In one example, the storage rule is dynamically updated, for example, according to a certain period; through learning by using the artificial intelligence learning algorithm again, a newly learned storage rule is updated to the metadata manage server; and the storage rule stored in the hard disk and/or the remote node is updated synchronously.
  • In one example, the determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored includes: reading the storage rule from the metadata manage server, and determining, according to the read storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored.
  • In one example, the artificial intelligence learning algorithm is a decision tree, and the intelligent storage model is a decision tree model constructed based on training data. Subsequently, an example of a process of constructing the decision tree model will be described in detail with reference to the drawings.
  • For example, in conjunction with the hybrid file system architecture of FIG. 1, the metadata manage server 1200 determines in which distributed file system the file is stored, by using the intelligent storage model 1210, based on the storage attributes of the file obtained from the client, and returns the same to the client 1300.
  • Step S230: storing the file to be stored in the determined distributed file system.
  • Specifically, for example, the client 1300 directly communicates with the underlying storage system 1100, and the file is stored in the determined specific distributed file system.
  • By using the file storage processing method according to the embodiment of the present disclosure, the specific distributed file system is selected according to the attributes of the file based on the predetermined storage rule, so as to, for example, improve storage performance and efficiency, and solve the technical problem of how to use different file systems for storage to improve storage efficiency. In order to improve universality of the file systems, a variety of distributed file systems are integrated, and system performance is comprehensively improved, by acquiring performance characteristics of various types of distributed file systems for various files through, for example, machine learning in advance, and by comprehensively utilizing advantages of different distributed file systems in a file access process. Specifically, for example, for different distributed file systems, processing attributes of files with different attributes when stored on these distributed file systems are obtained in advance, for example, I/O performances of files of different sizes on different distributed file systems may be obtained; rules may be established according to the knowledge obtained in advance; and these rules are used when a file is stored subsequently.
  • Hereinafter, the construction method of the decision tree model will be described in conjunction with one embodiment.
  • Before the construction method of the decision tree model is described, it is firstly explained how to obtain the training sample dataset.
  • In one example, files of different sizes are selected as experimental data and tested in a variety of distributed file systems, to acquire the read throughput rate Firt and the write throughput rate Fiwt of each file in each distributed file system i; the file system with the maximum result is then selected as the training data label according to the formula below.

  • $\mathrm{dfs} = \max_{i}\,(F_{irt} + F_{iwt}),\quad i = 1, 2, \ldots, m$ (m file systems)
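  • The labelling rule above can be sketched as follows. This is a minimal illustration, assuming the measured throughput rates are available as a mapping from file system name to a (read, write) pair; the function name `pick_label` and the numbers are hypothetical and are not taken from the experiments.

```python
def pick_label(throughput):
    """Return the file system whose read + write throughput sum is largest.

    `throughput` maps a file system name to a (read_rate, write_rate) pair
    measured for one test file; the names and numbers below are made up.
    """
    return max(throughput, key=lambda name: sum(throughput[name]))


measured = {
    "DFS1": (9.0, 4.0),   # hypothetical read/write rates for one test file
    "DFS2": (7.5, 6.5),
    "DFS3": (8.0, 3.0),
}
label = pick_label(measured)  # file system used as the training label
```

  • Each test file would be labelled this way once, and the resulting (attributes, label) pairs form the training set.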
  • In a specific embodiment, the storage attributes of the file are extracted, including file size, access mode, access permission, and owner; a training data label of each file determined through the above-described experiment is obtained; and data shown in Table 1 is acquired as the training data.
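  • TABLE 1
    Training data
    | File Size | Access Mode | Access Permission | Owner | Label (R + W)/2 |
    |-----------|-------------|-------------------|-------|-----------------|
    | 5K        | Read-only   | 0777              | Root  | DFS1            |
    | 50K       | Read-only   | 0777              | User1 | DFS2            |
    | 500K      | Read-only   | 0777              | User2 | DFS3            |
    | 5M        | Write-only  | 0777              | User1 | DFS1            |
    | 5M        | Write-only  | 077               | User2 | DFS1            |
    | 5M        | Read-write  | 0777              | Root  | DFS2            |
    | 10M       | Exec        | 0777              | User1 | DFS3            |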
  • An example of a simplified decision tree construction process is given below with reference to FIGS. 3A to 3E.
  • In the example of FIGS. 3A to 3E, a simplified form of training data is used: a 3-tuple dataset in which each sample consists of size, permission, and target DFS; the training dataset is as shown in FIG. 3A.
  • Then, on the principle of maximum information entropy, the size, which has the greatest impact on classification, is selected as a classification node to construct the decision tree in FIG. 3C, and the training data is divided into m groups according to the size (the file sizes are divided into m categories), where m is an integer greater than or equal to 2. In the example, m=3, so the data is divided into 3 groups, respectively 1M, 5M and 9M, as shown in FIG. 3B. On the principle of maximum information entropy again, permission is then selected as a classification node, and the groups are further divided into 3 branches as shown in FIG. 3D. At this time, all data has been classified. Finally, some of the leaf nodes are combined to obtain FIG. 3E, and thus the decision tree is constructed.
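  • The node selection step can be illustrated with a short sketch. It assumes that the "maximum information entropy" principle corresponds to the usual information gain criterion of ID3-style decision trees, which is an interpretation rather than something the disclosure states; the sample values are hypothetical and do not reproduce FIG. 3A.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy reduction obtained by splitting `rows` on `feature`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical 3-tuple samples (size, permission, target DFS).
samples = [
    {"size": "1M", "permission": "r", "dfs": "DFS1"},
    {"size": "1M", "permission": "w", "dfs": "DFS1"},
    {"size": "5M", "permission": "r", "dfs": "DFS2"},
    {"size": "5M", "permission": "w", "dfs": "DFS3"},
    {"size": "9M", "permission": "r", "dfs": "DFS3"},
    {"size": "9M", "permission": "w", "dfs": "DFS3"},
]
gains = {f: information_gain(samples, f, "dfs") for f in ("size", "permission")}
root_feature = max(gains, key=gains.get)  # feature chosen for the root node
```

  • With these made-up samples the size feature yields the larger gain, so it would be placed at the root and permission would be considered for the next level.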
  • In one example, optimization processing including pruning and cross-validation, etc. is performed in construction of the decision tree model.
  • It should be noted that, in the disclosure, as the artificial intelligence learning algorithm for determining the distributed file system in which the file should be stored according to the file attributes, the decision tree is provided as a preferred example, not as a limitation; on the contrary, other artificial intelligence learning algorithms may also be selected, for example, a deep neural network, a support vector machine, nearest neighbor learning, etc.
  • File operations on the file system include initial storage operation (write operation), and subsequent read and possible update operations.
  • FIG. 4 shows a sequence chart of writing a file in a hybrid file system architecture according to an embodiment of the present disclosure.
  • As shown in FIG. 4, in step S410, a client sends a file write access request to a metadata manage server.
  • In step S420, the metadata manage server acquires file attribute information.
  • In step S430, the metadata manage server acquires a decision tree model maintained by the metadata manage server.
  • In step S440, the metadata manage server obtains a storage location of the file to be written, based on the file storage attribute information and the decision tree model.
  • In step S450, the metadata manage server returns the storage location of the file to the client.
  • In step S460, the client communicates with a corresponding distributed file system according to the returned storage location, to perform an actual file write operation.
  • FIG. 5 shows a sequence chart of corresponding operations caused by a file read request or update request from a client after a file has been stored in a hybrid file system architecture.
  • In step S510, the client sends a file read request or update request to a metadata manage server.
  • In step S520, the metadata manage server acquires a file path from the read request or the update request.
  • In step S530, the metadata manage server queries a metadata database, to acquire a storage location of the file to be read or updated.
  • In step S540, the metadata manage server feeds back the storage location of the file to the client.
  • In step S550, the client communicates with a corresponding distributed file system according to the returned storage location, and performs actual file read or update operations.
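  • The two sequences of FIG. 4 and FIG. 5 can be summarized in a small sketch of the metadata manage server's role. The class and method names are hypothetical, the storage rule is a placeholder for the trained decision tree, and recording the location at write time is an assumption made so that the later lookup of step S530 has something to query.

```python
class MetadataServer:
    """Toy stand-in for the metadata manage server (names are hypothetical)."""

    def __init__(self, storage_rule):
        self.storage_rule = storage_rule   # callable: attributes -> DFS name
        self.locations = {}                # file path -> DFS name

    def handle_write(self, path, attributes):
        # S420-S450: apply the storage rule, record and return the location.
        location = self.storage_rule(attributes)
        self.locations[path] = location
        return location

    def handle_read_or_update(self, path):
        # S520-S540: look the path up in the metadata and return the location.
        return self.locations[path]


def size_rule(attributes):
    """Placeholder rule; the disclosure uses a trained decision tree instead."""
    return "DFS1" if attributes["size"] < 1_000_000 else "DFS2"


server = MetadataServer(size_rule)
write_target = server.handle_write("/data/a.log", {"size": 5_000})
read_target = server.handle_read_or_update("/data/a.log")
```

  • In both flows the server only returns a location; the client then talks to the corresponding distributed file system directly (steps S460 and S550).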
  • In the storage process, with increase of file storage, storage efficiency of storage space of some distributed file systems will decrease; in order to solve the problem, in an optional implementation mode, file migration may also be performed, that is, a file stored in one distributed file system is migrated to another distributed file system, so that storage capacity of the system may be further improved through migration, to promote load equalization between respective distributed file systems.
  • Hereinafter, an embodiment of a method 600 for migrating the file between distributed file systems will be described in conjunction with FIG. 6.
  • Step S610: determining a distributed file system that needs file migration.
  • In one example, it is determined every preset period whether there is a distributed file system that needs file migration.
  • Alternatively, usage situation of respective distributed file systems may also be continuously monitored, to judge whether file migration is needed.
  • Usage rates of the respective distributed file systems may be investigated, to determine a situation of load equalization, or say, usage equalization between the respective distributed file systems; and in a case where severe disequilibrium in usage rate occurs, file migration, specifically, file emigration, is performed on a distributed file system with an excessively high usage rate.
  • Specifically, in one example, the determining a distributed file system that needs file migration includes: calculating a difference in usage rate between any two distributed file systems; and determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
  • For example, if a usage rate of a distributed file system A is 90% while a usage rate of a distributed file system B is only 10%, it is obvious that severe load disequilibrium occurs, then a file migration operation may be performed on the distributed file system A.
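  • A minimal sketch of this check is given below, assuming usage rates are held as fractions in a dictionary. The function name is hypothetical, and picking the pair with the largest gap is an illustrative choice; the disclosure only requires that the gap exceed the threshold, and the migration destination is decided later in step S620.

```python
from itertools import combinations


def find_overloaded(usage, threshold):
    """Return (higher, lower) for the pair with the largest usage gap above
    `threshold`, or None when every pairwise gap is within the threshold."""
    worst = None
    for a, b in combinations(usage, 2):
        gap = abs(usage[a] - usage[b])
        if gap > threshold and (worst is None or gap > worst[0]):
            high, low = (a, b) if usage[a] > usage[b] else (b, a)
            worst = (gap, high, low)
    return None if worst is None else (worst[1], worst[2])


usage = {"A": 0.90, "B": 0.10, "C": 0.55}
pair = find_overloaded(usage, threshold=0.5)  # A is the system needing emigration
```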
  • In the disclosure, the usage rate of a distributed file system is the ratio of the capacity of the file system actually in use to its original capacity.
  • Step S620: determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration.
  • In one example, the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes: calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
  • In one example, the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems may be performed as follows:
  • For convenience of description, referring to the distributed file system that needs file migration as a distributed file system i, referring to any one of the other distributed file systems as a distributed file system j, and referring to the file on the distributed file system i as a file x;
  • Obtaining read throughput and write throughput of the file x on the distributed file system i, and predicting read throughput and write throughput of the file x on the distributed file system j;
  • Obtaining a read frequency and a write frequency of the file x on the distributed file system i; and
  • Calculating a migration gain of migrating the file x from the distributed file system i to the distributed file system j, at least based on the size of the file x, the read frequency and the write frequency of the file x on the distributed file system i, the read throughput and the write throughput of the file x on the distributed file system i, as well as the read throughput and the write throughput of the file x on the distributed file system j.
  • In a preferred example, the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on a formula below:

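  • $$\mathrm{diff}_x(\mathrm{DFS}_i,\mathrm{DFS}_j)=\left(\frac{s_x}{F_{xrt}(\mathrm{DFS}_i)}-\frac{s_x}{F_{xrt}(\mathrm{DFS}_j)}\right)F_{xrf}+\left(\frac{s_x}{F_{xwt}(\mathrm{DFS}_i)}-\frac{s_x}{F_{xwt}(\mathrm{DFS}_j)}\right)F_{xwf}\tag{1}$$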
  • DFSi and DFSj represent the distributed file systems i, j; Fxrt(DFSi) and Fxrt(DFSj) are respectively read throughput rates of the file x in the distributed file systems i, j; Fxwt(DFSi) and Fxwt(DFSj) are write throughput rates of the file x in the distributed file systems i, j; a throughput rate is a size of a file read and written per second; the read throughput rate and the write throughput rate are functions of the file size; Fxrf and Fxwf are respectively the read frequency and the write frequency of the file x in the distributed file system i; and sx is a size of the file x to be migrated in the file system.
  • In the above-described Formula (1), a first part of the summation on the right side of the equal sign represents an overall performance improvement made by migrating the file x from the distributed file system i to the distributed file system j, or say, a comprehensive migration gain in file size and read performance, in consideration of file size (a factor of file system usage rate level), read performance throughput rate, and read frequency; and a second part of the summation represents an overall performance improvement made by migrating the file x from the distributed file system i to the distributed file system j, or say, a comprehensive migration gain in file size and write performance, in consideration of file size, write performance throughput rate, and write frequency.
  • Formula (1) indicates that the larger the file size, the higher the read and write frequencies, and the greater the throughput rate of the file on the distributed file system j relative to that on the distributed file system i, the higher the migration gain of migrating the file to the distributed file system j.
  • In one example, in the above-described Formula (1), the read frequency and the write frequency of the file x in the distributed file system i may be obtained by querying the metadata manage server.
  • It should be noted that, Formula (1) is a preferred example of calculating a migration gain of a file, but it is not a limitation; and other calculation formulas may also be designed according to needs.
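  • Formula (1) translates directly into code. The sketch below is illustrative only: the function and parameter names are hypothetical, and the numeric values are invented to make the arithmetic easy to follow. Since a throughput rate is a size per second, each quotient is a transfer time, so the result can be read as access time saved per unit period.

```python
def migration_gain(size, read_freq, write_freq, src_rates, dst_rates):
    """Formula (1) for one file.

    `src_rates` and `dst_rates` are (read_throughput, write_throughput)
    pairs on the source system i and the candidate destination j.
    """
    src_read, src_write = src_rates
    dst_read, dst_write = dst_rates
    read_part = (size / src_read - size / dst_read) * read_freq
    write_part = (size / src_write - size / dst_write) * write_freq
    return read_part + write_part


# Hypothetical file: size 100, read 10 times and written 2 times per period.
gain = migration_gain(size=100.0, read_freq=10, write_freq=2,
                      src_rates=(5.0, 4.0), dst_rates=(10.0, 5.0))
# read part: (20 - 10) * 10 = 100; write part: (25 - 20) * 2 = 10
```

  • A negative result would mean the file is served faster where it already is, so such a file would rank last among migration candidates.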
  • Here, the read throughput and the write throughput of the file x on the distributed file system i may be obtained by, for example, actual observation, or may also be obtained by prediction; while the read throughput and the write throughput of the file x on the distributed file system j may only be obtained by prediction.
  • In one example, predicting the read throughput and the write throughput of the file x on a distributed file system may be performed, for example, by using a predetermined regression model, and the regression model is selected from one of:
  • | Model | Regression equation |
    |-------|---------------------|
    | First-order model | $y(k) = a_0 + a_1 e^{-pk}$ |
    | Second-order model | $y(k) = a_0 + a_1 e^{-pk} + a_2 e^{-p_2 k}$ |
    | Third-order model | $y(k) = a_0 + a_1 e^{-pk} + b e^{-\delta w k}\cos\left(w\sqrt{1-\delta^2}\,k\right) + c e^{-\delta w k}\sin\left(w\sqrt{1-\delta^2}\,k\right)$ |
    | Fourth-order model | $y(k) = a_0 + b_1 e^{-\delta_1 w_1 k}\cos\left(w_1\sqrt{1-\delta_1^2}\,k\right) + c_1 e^{-\delta_1 w_1 k}\sin\left(w_1\sqrt{1-\delta_1^2}\,k\right) + b_2 e^{-\delta_2 w_2 k}\cos\left(w_2\sqrt{1-\delta_2^2}\,k\right) + c_2 e^{-\delta_2 w_2 k}\sin\left(w_2\sqrt{1-\delta_2^2}\,k\right)$ |
  • Table 2 Formula Expressions of Respective Regression Models
  • As an example, the predetermined regression model may be determined through a fitting process and a selecting process below: inputting file training data to different types of regression model formulas; calculating unknown parameters by using a least square method; fitting to obtain the different types of regression models after the fitting; and selecting a regression model with a best fitting effect from the different types of regression models after the fitting as the predetermined regression model.
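  • As a rough illustration of the fitting step, the sketch below fits the first-order model by least squares. The disclosure does not say how the least square problem is solved; here, as an assumption, the decay rate p is searched over a grid and, for each candidate p, the linear coefficients a0 and a1 are obtained in closed form. The data is synthetic, generated from known parameters, and is not experimental data.

```python
import math


def fit_first_order(ks, ys, p_grid):
    """Fit y(k) = a0 + a1 * exp(-p * k); return (sse, a0, a1, p) with the
    smallest sum of squared errors over the candidate p values."""
    best = None
    n = len(ks)
    for p in p_grid:
        xs = [math.exp(-p * k) for k in ks]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue  # degenerate candidate, e.g. p == 0
        # Closed-form linear least squares for a0 and a1 at this fixed p.
        a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        a0 = mean_y - a1 * mean_x
        sse = sum((a0 + a1 * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a0, a1, p)
    return best


# Synthetic samples from y(k) = 10 - 6 * exp(-0.5 * k).
ks = [1, 2, 4, 8, 16, 32]
ys = [10.0 - 6.0 * math.exp(-0.5 * k) for k in ks]
grid = [i / 100 for i in range(1, 201)]      # candidate p values 0.01 .. 2.00
sse, a0, a1, p = fit_first_order(ks, ys, grid)
```

  • Repeating the same procedure for each model type and keeping the one with the smallest error would correspond to selecting the regression model with the best fitting effect.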
  • FIG. 7 shows a schematic diagram of comparison between a throughput fit curve and an actual curve of respective distributed file systems obtained through experiments according to an embodiment of the present disclosure. In FIG. 7, an abscissa represents different file sizes, and an ordinate represents throughput rates.
  • Target distributed file systems as experimental objects are respectively Ceph, HDFS and GlusterFS. According to actual running results, the file sizes are substituted into the respective regression model formulas shown in Table 2, and the error is calculated by using a least square method; the curve fitting effect is optimal when the overall error is minimal. The read and write curves of the several types of distributed file systems are fitted respectively, and it can be seen from FIG. 7 that first-order fitting is sufficient to achieve an optimal effect for HDFS write, Ceph write and Ceph read, while the other curves require higher-order fitting.
  • Table 3 shows throughput rate fit curves of different distributed file systems based on experiments and fitting calculations. As described above, the target file systems are respectively Ceph, HDFS and GlusterFS, and HDFS write, Ceph write, and Ceph read achieve optimal effects with only the first-order fitting, while the other curves require higher-order fitting.
  • TABLE 3
    Fitting parameters of target file systems
    | Curves | Fitting results |
    |--------|-----------------|
    | HDFS write curve | $y(k) = 10.39065 - 6.38257e^{-0.54163k}$ |
    | Ceph write curve | $y(k) = 8.79252 - 4.65085e^{-0.06894k}$ |
    | GlusterFS write curve | $y(k) = 8.43731 + 0.10894e^{-0.04518k}\cos(-38.07854k) - 1.89347e^{-0.04518k}\sin(-38.07854k) + 1.49443e^{-0.61613k}\cos(33.75146k) - 0.05625e^{-0.61613k}\sin(33.75146k)$ |
    | HDFS read curve | $y(k) = 11.0027 - 49.0537e^{-97.8321k}$ [operator missing in source] $5.3826e^{-2.9596k}\cos(25.1327k) - 42.3298e^{-2.9596k}\sin(25.1327k)$ |
    | Ceph read curve | $y(k) = 11.128770 - 1.063236e^{-0.718258k}$ |
    | GlusterFS read curve | $y(k) = -0.0433 + 0.1108e^{0.00013k}$ [operator missing in source] $6.2434e^{-4.3548k}\cos(0.000019k) + 17.2060e^{-4.3548k}\sin(0.000019k)$ |
  • Table 4 shows an example physical environment configuration for an experiment on the high-performance hybrid file system architecture. As shown below, in order to meet the architecture requirements, the physical environment of the experiment is divided into one node for a client, 6 nodes for underlying storage servers, and one metadata manage server node, wherein the underlying physical storage nodes may be expanded and are hidden from the client; all nodes run Ubuntu 14.04, and each has 1 TB of capacity.
  • TABLE 4
    Physical environment for experiment
    | Node number | File system | Host name  | Usage               | Notes        |
    |-------------|-------------|------------|---------------------|--------------|
    | Node 1      | MMS         | Master     | Metadata management | 1TB capacity |
    | Node 2      | HDFS        | HDFS1      | Name node           | 1TB capacity |
    | Node 3      | HDFS        | HDFS2      | Datanode            | 1TB capacity |
    | Node 4      | Ceph        | Ceph1      | mds, mon, osd       | 1TB capacity |
    | Node 5      | Ceph        | Ceph2      | osd                 | 1TB capacity |
    | Node 6      | GlusterFS   | GlusterFS1 | GlusterFS server1   | 1TB capacity |
    | Node 7      | GlusterFS   | GlusterFS2 | GlusterFS server2   | 1TB capacity |
    | Node 8      | Client      | Client     | Client              | 1TB capacity |
  • By using the curve of relationship between the throughput rate of the respective distributed file systems and the file size obtained by fitting in this way, throughput rates of the file on different distributed file systems may be predicted, in a case where file sizes of different files are known.
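  • For the three first-order curves, the prediction amounts to evaluating the fitted expression at the file size of interest. The sketch below copies the coefficients of the HDFS write, Ceph write and Ceph read curves from Table 3; the function name is hypothetical, and the units of k and of the result are those of FIG. 7, which are not restated in this passage.

```python
import math

# First-order curves from Table 3: y(k) = a0 + a1 * exp(-p * k).
FIRST_ORDER_CURVES = {
    "HDFS write": (10.39065, -6.38257, 0.54163),
    "Ceph write": (8.79252, -4.65085, 0.06894),
    "Ceph read": (11.128770, -1.063236, 0.718258),
}


def predict_throughput(curve, k):
    """Evaluate a fitted first-order curve at file size k."""
    a0, a1, p = FIRST_ORDER_CURVES[curve]
    return a0 + a1 * math.exp(-p * k)


hdfs_small = predict_throughput("HDFS write", 0)    # a0 + a1 at k = 0
hdfs_large = predict_throughput("HDFS write", 100)  # approaches a0 for large k
```

  • The predicted values for the source and destination systems are what the migration gain calculation consumes as throughput rates.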
  • After migration gains are sorted, a file to be migrated may be determined; the migration gain is an expected gain of migrating the file from the file system where it is located to a certain distributed file system, and thus, a destination distributed file system to which the file is to be migrated is also determined.
  • Step S630: migrating the file that has been determined to be migrated.
  • For the respective files sorted according to the migration gains, file migration can be performed in order from a file with a largest migration gain, until a usage rate difference between file systems meets requirements, and the migration is complete. The migration process is a C-D process, that is, copying and then deleting, wherein, mandatory locks are added in a file operation process.
  • A pseudo code example that implements the dynamic migration process is given below.
  • Algorithm 1 The Dynamic File Migration Function
    Input: p0, DFSs
    Output: null
     1: for i = 0 to DFSs.size( ) do
     2:   for j = 0 to DFSs.size( ) do
     3:     if (DFSs[i].usage − DFSs[j].usage) > p0 then
     4:       originalLoc = i
     5:       destinationLoc = j
     6:       stop
     7:     end if
     8:   end for
     9: end for
    10: if i = j then
    11:   return null
    12: end if
    13: files[ ] = DFSs[originalLoc].files
    14: Throughput[ ] = CalculateThroughputDegrade(files[ ], DFSs[originalLoc], DFSs[destinationLoc])
    15: migrateList[ ] = sort(Throughput[ ])
    16: for i = 0 to migrateList.size( ) do
    17:   data = readFile(migrateList[i], DFSs[originalLoc])
    18:   writeFile(data, DFSs[destinationLoc])
    19:   deleteFile(migrateList[i], DFSs[originalLoc])
    20:   if (DFSs[originalLoc].usage − DFSs[destinationLoc].usage) < p0 then
    21:     return null
    22:   end if
    23: end for
  • In the above-described pseudo code, a first “for” loop is to determine a difference in usage rate between any two file systems; when there is a difference in usage rate between two file systems that is greater than p0, that is, when load disequilibrium occurs to the file system architecture, a migration procedure is enabled; line 14 is to calculate a migration degree of all files of a file system that needs migration and other file systems; and line 15 is to sort according to the calculated migration degree. Lines 16 to 23 are to migrate: firstly copy the file to the target file system, and then delete the file from the original file system, until the difference in usage rate between file systems meets conditions.
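  • A runnable sketch of the same idea is given below. It is a simplification under stated assumptions, not the patented procedure: file systems are modelled as in-memory dictionaries, all systems are assumed to have equal capacity so that a file's share moves one-for-one between usage rates, the source and destination are taken as the most and least used systems rather than the first pair found, and the `gain` callback stands in for the migration gain calculation. All names are hypothetical.

```python
def migrate(usage, files, p0, gain):
    """Move files from the most loaded to the least loaded system until the
    usage gap drops below p0. Returns the names of the files moved.

    usage: {dfs: used fraction}; files: {dfs: {file name: capacity share}};
    gain(name, src, dst) -> migration gain used for ranking.
    """
    src = max(usage, key=usage.get)
    dst = min(usage, key=usage.get)
    if usage[src] - usage[dst] <= p0:
        return []                                   # no disequilibrium
    ranked = sorted(files[src], key=lambda f: gain(f, src, dst), reverse=True)
    moved = []
    for name in ranked:
        share = files[src][name]
        files[dst][name] = share                    # copy to the destination ...
        del files[src][name]                        # ... then delete at the source
        usage[src] -= share
        usage[dst] += share
        moved.append(name)
        if usage[src] - usage[dst] < p0:
            break
    return moved


usage = {"A": 0.9, "B": 0.1}
files = {"A": {"f1": 0.3, "f2": 0.2, "f3": 0.4}, "B": {"g": 0.1}}
gains = {"f1": 5.0, "f2": 9.0, "f3": 1.0}           # made-up migration gains
moved = migrate(usage, files, 0.3, lambda f, s, d: gains[f])
```

  • In this example f2 is moved first because it has the largest gain, and migration stops after f1 because the usage gap has then fallen below the threshold. A real implementation would also need the file locking mentioned for the copy-then-delete process.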
  • Experiments validate that, for the hybrid file system, dynamic file migration can equalize usage across the different file systems and yields better overall performance, with better read and write throughput rates.
  • According to another embodiment of the present disclosure, there is provided a file storage processing system, comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • According to another embodiment of the present disclosure, there is provided a file migration processing system, comprising a memory and a processor, the memory having computer-executable instructions stored thereon, and when executed by a controller, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • According to another embodiment of the present disclosure, there is provided a computer-readable storage medium, having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file storage processing method.
  • According to another embodiment of the present disclosure, there is provided a computer-readable storage medium, having computer-executable instructions stored thereon, and when executed by a computing device, the computer-executable instructions being operable to execute the above-described file dynamic migration method.
  • According to another embodiment of the present disclosure, there is provided a metadata manage server in a hybrid file system architecture system, which interacts with a client and a plurality of distributed file systems, the metadata manage server maintaining a pre-configured storage rule below, and being configured to perform a method below: acquiring storage attributes of a file to be stored, wherein, the storage attributes at least include a size of the file; determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored; determining a distributed file system that needs file migration; determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and migrating the file that has been determined to be migrated.
  • According to another embodiment of the present disclosure, there is provided a hybrid file system architecture system, comprising the above-described metadata manage server and a plurality of different types of distributed file systems.
  • There may be one or more of the above-described processors, which may be concentrated on one physical address or distributed on a plurality of physical addresses. Each of the one or more processors may be a device that can execute machine-readable and executable instructions, for example, a computer, a microprocessor, a microcontroller, an integrated circuit, a microchip, or any other computing device. The one or more processors may be coupled to a communication path that provides signal interconnection between different devices, components and/or modules. The communication path may cause any number of processors to be communicatively coupled to each other, and may allow modules coupled to the communication path to operate in a distributed computing environment. Specifically, each module may be operated as a node that can send and/or receive data. In addition, “being communicatively coupled” refers to that mutually coupled components may exchange data with each other, for example, in a form of electrical signals, electromagnetic signals, and optical signals.
  • In addition, the above-described memory may include one or more memory modules. The memory module may be configured to include a volatile memory, for example, a Static Random Access Memory (S-RAM) and a Dynamic Random Access Memory (D-RAM), as well as a non-volatile memory, for example, a flash memory, a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) and an Electrically Erasable Programmable Read-Only Memory (EEPROM). In the memory module, any form of machine-readable and executable instruction is stored for accessing by a processor. The machine-readable and executable instructions may be logics or algorithms written in any programming language, for example, a machine language that can be directly executed by a processor, or an assembly language that can be compiled or assembled into machine-readable instructions and stored in the memory module, an Object-Oriented Programming (OOP) language, Javascript language, a microcode, etc. Alternatively, the machine-readable and executable instructions may also be written in a hardware description language, for example, logics implemented by a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), etc.
  • The high-performance hybrid file system architecture, the file storage processing method, the file dynamic migration method and the metadata manage server according to the embodiments of the present disclosure make comprehensive use of the performance advantages of a variety of distributed file systems to handle various file storage problems. They are directed at providing a universal high-performance file system that can cope with the storage of files of various types under various complex environments while maintaining high performance.
  • It should be understood by those skilled in the art that the embodiments of the present disclosure as described above and shown in the drawings are only examples and do not limit the present disclosure. The objective of the present disclosure has been fully and effectively achieved. The functional and structural principles of the present disclosure have been shown and described in the embodiments; and any transformation or modification may be made to the implementing modes of the present disclosure without departing from the principles.

Claims (24)

1. A file storage processing method applied in a hybrid file system architecture including a plurality of different types of distributed file systems, for determining in which distributed file system a file to be stored is stored, the file storage processing method comprising:
acquiring storage attributes of the file to be stored, wherein, the storage attributes at least include a size of the file;
determining, according to a pre-configured storage rule and the storage attributes of the file to be stored, in which distributed file system the file to be stored is stored; and
storing the file to be stored in the determined distributed file system,
wherein, the storage rule is an intelligent storage model obtained through learning by using an artificial intelligence learning algorithm based on a training sample set; and features of each training sample of the training sample set include storage attributes of a file and a label of the file system to which the file has been determined to be assigned.
2. (canceled)
3. The file storage processing method according to claim 1, wherein, the storage attributes of the file further include:
access mode, access permission, and associated owner of the file,
an access mode type is selected from one of: read-only, write-only, read-write, and executable.
4. The file storage processing method according to claim 1, the hybrid file system architecture including a metadata manage server,
wherein, the storage rule is stored in a non-volatile storage medium, and meanwhile maintained in a metadata manage server memory; and
the storage rule is dynamically updated,
wherein, the determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored includes: reading the storage rule from the metadata manage server, and determining, according to the read storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored.
5. The file storage processing method according to claim 4, wherein, the storage rule is further maintained in a remote standby node.
6. The file storage processing method according to claim 1, wherein, the artificial intelligence learning algorithm is a decision tree, and the intelligent storage model is a decision tree model constructed based on training data.
7. The file storage processing method according to claim 5, wherein, optimization processing including pruning and cross-validation is performed in construction of the decision tree model.
8. The file storage processing method according to claim 6, further comprising:
receiving, by the metadata manage server, from a client a request to read a file from the hybrid file system architecture or update a file therein;
acquiring, by the metadata manage server, path information of the file to be read or updated, to further obtain storage location information of the file;
returning, by the metadata manage server, the storage location of the file to be read or updated to the client; and
communicating, by the client, with a corresponding distributed file system according to the returned storage location, to perform actual read operation or update operation.
9. The file storage processing method according to claim 5, wherein, the label of the file system to which the file has been determined to be assigned is determined based on I/O performance of the file on each of the distributed file systems, and the I/O performance of the file on each of the distributed file systems is determined experimentally as follows:
acquiring a read throughput rate Firt and a write throughput rate Fiwt of the file on each distributed file system through experiments, the read throughput rate Firt being a data size of the file read per second, and the write throughput rate Fiwt being a data size of the file written per second; and
calculating a sum of the read throughput rate Firt and the write throughput rate Fiwt of the file in each distributed file system as the I/O performance of the file on each of the distributed file systems.
10. The file storage processing method according to claim 1, further comprising:
determining a distributed file system that needs file migration;
determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and
migrating the file that has been determined to be migrated.
11. The file storage processing method according to claim 10, wherein, the determining a distributed file system that needs file migration includes:
calculating a difference in usage rate between any two distributed file systems; and
determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
12. The file storage processing method according to claim 10, wherein, the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes:
calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and
determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
13. The file storage processing method according to claim 12, wherein, the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems includes:
referring to the distributed file system that needs file migration as a distributed file system i, referring to any one of the other distributed file systems as a distributed file system j, and referring to a file on the distributed file system i as a file x;
obtaining read throughput and write throughput of the file x on the distributed file system i, and predicting read throughput and write throughput of the file x on the distributed file system j;
obtaining a read frequency and a write frequency of the file x on the distributed file system i; and
calculating a migration gain of migrating the file x from the distributed file system i to the distributed file system j, at least based on the size of the file x, the read frequency and the write frequency of the file x on the distributed file system i, the read throughput and the write throughput of the file x on the distributed file system i, as well as the read throughput and the write throughput of the file x on the distributed file system j.
14. The file storage processing method according to claim 13, wherein, the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on a formula below:

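$$\mathrm{diff}_x(\mathrm{DFS}_i,\mathrm{DFS}_j)=\left(\frac{s_x}{F_{xrt}(\mathrm{DFS}_i)}-\frac{s_x}{F_{xrt}(\mathrm{DFS}_j)}\right)F_{xrf}+\left(\frac{s_x}{F_{xwt}(\mathrm{DFS}_i)}-\frac{s_x}{F_{xwt}(\mathrm{DFS}_j)}\right)F_{xwf}\qquad(1)$$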
DFSi and DFSj represent the distributed file systems i, j; Fxrt(DFSi) and Fxrt(DFSj) are respectively read throughput rates of the file x in the distributed file systems i, j; Fxwt(DFSi) and Fxwt(DFSj) are write throughput rates of the file x in the distributed file systems i, j; a throughput rate is a size of a file read and written per second; the read throughput rate and the write throughput rate are functions of the file size; Fxrf and Fxwf are respectively the read frequency and the write frequency of the file x in the distributed file system i; and sx is a size of the file x to be migrated in the file system.
15. The file storage processing method according to claim 13, wherein, the predicting read throughput and write throughput of the file x on the distributed file system j includes:
predicting by using a predetermined regression model, the regression model being selected from one of:
model: regression equation
first-order model: $y(k) = a_0 + a_1 e^{-pk}$
second-order model: $y(k) = a_0 + a_1 e^{-pk} + a_2 e^{-p_2 k}$
third-order model: $y(k) = a_0 + a_1 e^{-pk} + b\,e^{\text{[?]}wk}\cos\left(w\sqrt{1-\delta^2}\,k\right) + c\,e^{\text{[?]}wk}\sin\left(w\sqrt{1-\delta^2}\,k\right)$
fourth-order model: $y(k) = a_0 + \text{[?]}\cos\left(w_1\sqrt{1-\delta_1^2}\,k\right) + \text{[?]}\sin\left(w_1\sqrt{1-\delta_1^2}\,k\right) + b_2 e^{\text{[?]}w_2 k}\cos\left(w_2\sqrt{1-\delta_2^2}\,k\right) + c_2 e^{\text{[?]}w_2 k}\sin\left(w_2\sqrt{1-\delta_2^2}\,k\right)$
[?] indicates data missing or illegible when filed
the predetermined regression model is determined through a fitting process and a selecting process below: inputting file training data to different types of regression models;
calculating unknown parameters by using a least square method; fitting to obtain the different types of regression models after the fitting; and selecting a regression model with a best fitting effect from the different types of regression models after the fitting as the predetermined regression model.
16. The file storage processing method according to claim 13, wherein, the obtaining a read frequency and a write frequency of the file x on the distributed file system i includes:
obtaining the read frequency and the write frequency of the file x on the distributed file system i by querying the metadata manage server.
17. A file dynamic migration method applied in a hybrid file system architecture including a plurality of different types of distributed file systems, comprising:
determining a distributed file system that needs file migration;
determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and
migrating the file that has been determined to be migrated,
wherein, the determining a distributed file system that needs file migration includes:
calculating a difference in usage rate between any two distributed file systems; and
determining that a distributed file system with a higher usage rate needs file migration, when the difference in usage rate is greater than a predetermined threshold.
18. (canceled)
19. The file dynamic migration method according to claim 17, wherein, the determining a file to be migrated on the distributed file system, for the distributed file system that needs file migration includes:
calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems; and
determining the file to be migrated and the migration destination of the file based on sorting of migration gains of migrating respective files to other distributed file systems.
20. The file dynamic migration method according to claim 19, wherein, the calculating a migration gain of migrating each file in the distributed file system that needs file migration to any one of other distributed file systems includes:
referring to the distributed file system that needs file migration as a distributed file system i, referring to any one of the other distributed file systems as a distributed file system j, and referring to a file on the distributed file system i as a file x;
obtaining read throughput and write throughput of the file x on the distributed file system i, and predicting read throughput and write throughput of the file x on the distributed file system j;
obtaining a read frequency and a write frequency of the file x on the distributed file system i; and
calculating a migration gain of migrating the file x from the distributed file system i to the distributed file system j, at least based on the size of the file x, the read frequency and the write frequency of the file x on the distributed file system i, the read throughput and the write throughput of the file x on the distributed file system i, as well as the read throughput and the write throughput of the file x on the distributed file system j.
21. The file dynamic migration method according to claim 20, wherein, the migration gain of migrating the file x from the distributed file system i to the distributed file system j is calculated based on a formula below:

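$$\mathrm{diff}_x(\mathrm{DFS}_i,\mathrm{DFS}_j)=\left(\frac{s_x}{F_{xrt}(\mathrm{DFS}_i)}-\frac{s_x}{F_{xrt}(\mathrm{DFS}_j)}\right)F_{xrf}+\left(\frac{s_x}{F_{xwt}(\mathrm{DFS}_i)}-\frac{s_x}{F_{xwt}(\mathrm{DFS}_j)}\right)F_{xwf}\qquad(1)$$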
DFSi and DFSj represent the distributed file systems i, j; Fxrt(DFSi) and Fxrt(DFSj) are respectively read throughput rates of the file x in the distributed file systems i, j; Fxwt(DFSi) and Fxwt(DFSj) are write throughput rates of the file x in the distributed file systems i, j; a throughput rate is a size of a file read and written per second; the read throughput rate and the write throughput rate are functions of the file size; Fxrf and Fxwf are respectively the read frequency and the write frequency of the file x in the distributed file system i; and sx is a size of the file x to be migrated in the file system.
22-27. (canceled)
28. A metadata manage server in a hybrid file system architecture system, which interacts with a client and a plurality of distributed file systems, the metadata manage server maintaining a pre-configured storage rule below, and being configured to perform a method below:
acquiring storage attributes of a file to be stored, wherein, the storage attributes at least include a size of the file;
determining, according to a pre-configured storage rule and the attributes of the file to be stored, in which distributed file system the file to be stored is stored;
determining a distributed file system that needs file migration;
determining a file to be migrated on the distributed file system and a migration destination, for the distributed file system that needs file migration; and
migrating the file that has been determined to be migrated,
wherein, the storage rule is an intelligent storage model obtained through learning by using an artificial intelligence learning algorithm based on a training sample set; and features of each training sample of the training sample set include storage attributes of a file and a label of the file system to which the file has been determined to be assigned.
29. (canceled)
US16/831,964 2017-09-28 2020-03-27 Hybrid file system architecture, file storage, dynamic migration, and application thereof Active US10810169B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/103907 WO2019061132A1 (en) 2017-09-28 2017-09-28 Hybrid file system architecture, file storage, dynamic migration, and application thereof

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/103907 Continuation WO2019061132A1 (en) 2017-09-28 2017-09-28 Hybrid file system architecture, file storage, dynamic migration, and application thereof

Publications (2)

Publication Number Publication Date
US20200311035A1 (en) 2020-10-01
US10810169B1 (en) 2020-10-20

Family

ID=65902527

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/831,964 Active US10810169B1 (en) 2017-09-28 2020-03-27 Hybrid file system architecture, file storage, dynamic migration, and application thereof

Country Status (3)

Country Link
US (1) US10810169B1 (en)
CN (1) CN111095233B (en)
WO (1) WO2019061132A1 (en)

Also Published As

Publication number Publication date
CN111095233A (en) 2020-05-01
US10810169B1 (en) 2020-10-20
CN111095233B (en) 2023-09-26
WO2019061132A1 (en) 2019-04-04

Legal Events

Date Code Title Description
AS Assignment

Owner name: RESEARCH INSTITUTE OF TSINGHUA UNIVERSITY IN SHENZHEN, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHUNG, YEH-CHING;ZHANG, LIDONG;WU, YONGWEI;REEL/FRAME:052241/0973

Effective date: 20200310

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY