WO2018205689A1 - Method for merging files, storage apparatus, storage device, and storage medium - Google Patents

Method for merging files, storage apparatus, storage device, and storage medium

Info

Publication number
WO2018205689A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
file
data blocks
blocks
data block
Prior art date
Application number
PCT/CN2018/074288
Other languages
English (en)
French (fr)
Inventor
杰恩布丕·库马尔
辛格阿施施
张勇
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2018205689A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608: Saving storage space on storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638: Organizing or formatting or addressing of data
    • G06F3/064: Management of blocks
    • G06F3/0643: Management of files

Definitions

  • The present application relates to the field of data processing, and more particularly to a method, a storage apparatus, a storage device, and a storage medium for merging files.
  • When files are merged, the data in each file can be read sequentially from the data blocks storing the files, and the read data can be written, in a certain order, into newly created data blocks.
  • With this approach, data is read and written frequently and there is a large amount of input and output, which seriously affects the efficiency of file storage.
  • Because the data of the merged file is stored in new data blocks, storage space is also consumed.
  • The present application provides a method, a storage apparatus, a storage device, and a storage medium for merging files, which can improve the efficiency of file storage and save storage space.
  • A method for merging files, comprising: determining a data storage range of N1 data blocks of a first file and a data storage range of N2 data blocks of a second file, wherein N1 and N2 are positive integers;
  • if the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, adding the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is a file obtained by merging the first file and the second file;
  • if the data storage ranges of the N1 data blocks and the N2 data blocks overlap, adding to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and writing the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
  • The merged file is thus obtained based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of those files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and avoids creating a large number of new data blocks for the data of the merged file, which saves storage space.
  • The data storage range of a data block, as used herein, is the data interval formed by all the data that the data block can include. For example, when the data storage range of a data block is the value range of the Key corresponding to the data in the data block, if the Key values corresponding to the data in data block B1 range from 40 to 70 and the Key values corresponding to the data in data block B2 range from 30 to 100, then both data block B1 and data block B2 can include data whose Key value lies in the interval 40-70, so the data storage ranges of data block B1 and data block B2 overlap.
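  • The overlap test in this example can be sketched as follows. This is a minimal illustration rather than anything specified by the application: the `KeyRange` type and both function names are hypothetical, and the ranges are assumed to be half-open (lower bound included, upper bound excluded), so that adjacent blocks such as 0-20 and 20-40 do not count as overlapping.

```python
from typing import NamedTuple, Optional


class KeyRange(NamedTuple):
    lo: int  # lowest Key the block may hold (inclusive)
    hi: int  # upper bound of the block's Keys (exclusive, by assumption)


def ranges_overlap(a: KeyRange, b: KeyRange) -> bool:
    """Return True when the two ranges share at least one Key."""
    return a.lo < b.hi and b.lo < a.hi


def shared_interval(a: KeyRange, b: KeyRange) -> Optional[KeyRange]:
    """Return the interval common to both ranges, or None if there is none."""
    if not ranges_overlap(a, b):
        return None
    return KeyRange(max(a.lo, b.lo), min(a.hi, b.hi))


b1 = KeyRange(40, 70)   # data block B1 from the example
b2 = KeyRange(30, 100)  # data block B2 from the example

print(ranges_overlap(b1, b2))   # True
print(shared_interval(b1, b2))  # KeyRange(lo=40, hi=70)
print(ranges_overlap(KeyRange(0, 20), KeyRange(20, 40)))  # False
```

  • Whether a shared boundary value counts as an overlap depends on how block ranges are defined in a given file system; the half-open convention is used here only because the examples treat ranges such as 10-20 and 20-30 as non-overlapping.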
  • The data storage range is a value range of the keyword (Key) corresponding to the data in a data block, or a value range of the data in a data block, or a range of identifiers of the data in a data block.
  • Writing the data of the N1 data blocks and the N2 data blocks other than the M data blocks into the P data blocks of the target file includes: rearranging the data of the N1 data blocks and the N2 data blocks other than the M data blocks according to the size of the Key corresponding to the data, the size of the data value, or the size of the data identifier; and writing the rearranged data into the P data blocks.
  • the method is applied to a file system that uses data blocks as a unit of data storage.
  • The data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server (DataNode) where the data block is located, and the file to which the data block belongs.
  • The method for merging files described in the present application can also be applied to merging more than two files. For example, when k files are merged (k is a positive integer and k > 2), if t data blocks in the i-th file (i ≤ k, with i traversing the k files) do not overlap with the data blocks of the other files, the data block IDs of those t data blocks can be written directly into the data block list of the target file, while the data in the remaining data blocks of the i-th file (the data blocks other than the t data blocks) and the data in the remaining data blocks of the other files need to be rearranged and written into new data blocks.
  • the values of t in different files may be the same or different.
  • A storage device is provided that can be used to perform the processes of the method for merging files described in the first aspect and its various implementations.
  • The storage device includes: a determining unit, configured to determine a data storage range of N1 data blocks of a first file and a data storage range of N2 data blocks of a second file, where N1 and N2 are positive integers;
  • a merging unit, configured to: if the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is a file obtained by merging the first file and the second file;
  • the merging unit is further configured to: if the data storage ranges of the N1 data blocks and the N2 data blocks overlap, add to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
  • The storage device of the embodiments of the present application obtains the merged file based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of those files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and removes the need to generate a large number of new data blocks, which saves storage space.
  • The data storage range is a value range of the keyword (Key) corresponding to the data in a data block, or a value range of the data in a data block, or a range of identifiers of the data in a data block.
  • The merging unit is specifically configured to: rearrange the data of the N1 data blocks and the N2 data blocks other than the M data blocks according to the size of the Key corresponding to the data, the size of the data value, or the size of the data identifier; and write the rearranged data into the P data blocks.
  • the storage device is applied to a file system that uses data blocks as a unit of data storage.
  • The data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server (DataNode) where the data block is located, and the file to which the data block belongs.
  • In a third aspect, a storage device is provided, which includes a transceiver, a processor, and a memory.
  • The memory stores a program, and the processor executes the program to perform the processes of the method for merging files described in the first aspect and its various implementations.
  • The processor is specifically configured to:
  • if the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to the data block list of the target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is a file obtained by merging the first file and the second file;
  • if the data storage ranges of the N1 data blocks and the N2 data blocks overlap, add to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
  • The storage device of the embodiments of the present application obtains the merged file based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of those files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and removes the need to generate a large number of new data blocks, which saves storage space.
  • The data storage range is a value range of the keyword (Key) corresponding to the data in a data block, or a value range of the data in a data block, or a range of identifiers of the data in a data block.
  • The processor is specifically configured to: rearrange the data of the N1 data blocks and the N2 data blocks other than the M data blocks according to the size of the Key corresponding to the data, the size of the data value, or the size of the data identifier; and write the rearranged data into the P data blocks.
  • the storage device is applied to a file system in which data blocks are data storage units.
  • The data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server (DataNode) where the data block is located, and the file to which the data block belongs.
  • A computer-readable storage medium storing a program, where the program causes a device to perform the method for merging files of the first aspect or any of its implementations.
  • A chip comprising an input interface, an output interface, a processor, and a memory, where the processor is configured to execute instructions stored in the memory, and when the instructions are executed, the processor can implement the method of the first aspect or any of its implementations.
  • FIG. 1 is a schematic architectural diagram of a distributed file system.
  • FIG. 2 is a schematic diagram of file merging in the prior art.
  • FIG. 3 is a schematic flowchart of a method for merging files according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a storage device of an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a storage device according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a system chip according to an embodiment of the present application.
  • Distributed File System (DFS)
  • The architecture of the distributed file system shown in FIG. 1 includes a master server, also called a central server (NameNode), and data servers, also called data nodes (DataNode).
  • The NameNode is the brain of the entire file system: it stores the metadata of files, provides the directory information of the entire file system, and manages each DataNode.
  • Each file in the distributed file system is divided into several data blocks, each of which holds a contiguous piece of the file content.
  • The data block is the basic unit of data storage. Data blocks are stored on data servers, and such a server is called a DataNode.
  • When reading data, the client can obtain from the NameNode the block information of the data block in which the data is stored, such as the DataNode where the data block is located, the size of the data block, and the file to which the data block belongs, and then read the data block from the corresponding DataNode. When writing data, the client can obtain a data block allocated by the NameNode and write the data into it. Each DataNode in FIG. 1 may include several data blocks. In the distributed file system, the keyword (Key) of the key-value pair (Key, Value) corresponding to the data can be searched to quickly determine the value associated with the Key, which makes large-scale, real-time service processing possible.
  • The data in one file is divided into several data blocks, and each data block corresponds to a certain data storage range. In Table 1, the data storage range of each data block is the Key range of the data in that data block.
  • The file includes a plurality of data blocks: the range of the Key corresponding to the data in data block B1 is 0-20, the range for data block B2 is 20-40, and the range for data block B3 is 40-70.
  • For brevity, the data block whose data block identifier (Identity, ID) is B1 is referred to as data block B1, the data block whose data block ID is B2 is referred to as data block B2, and the data block whose data block ID is B3 is referred to as data block B3.
  • Data block identifiers correspond one-to-one to data blocks, and the identifier of each data block indicates information about the data block, such as the DataNode where the data block is located, the size of the data block, and the file to which the data block belongs.
  • Data merging is performed by sequentially scanning the data in the files to be merged and writing the data into a new, larger file.
  • When data is merged, each piece of data is first read from the data blocks of the files to be merged (file 1, file 2, and file 3); the data is then sorted, for example by Key in ascending order, and written into new data blocks allocated for the merged file.
  • On the one hand, reading the old files means reading their data blocks from the hard disk; on the other hand, the data read from these data blocks must also be sorted by Key and written into new data blocks. These read and write operations cause many problems, such as heavy input and output, which seriously affects file storage efficiency and consumes a large amount of storage space.
  • For example, file 1 shown in Table 2 is merged with file 2 shown in Table 3 to generate merged file 3.
  • The data of file 1 is stored in data blocks B1, B2, and B3. The range of the keyword corresponding to the data is 0-20 in data block B1, 20-40 in data block B2, and 40-70 in data block B3.
  • The data of file 2 is stored in data blocks B10, B11, B12, and B13. The range of the keyword corresponding to the data is 80-100 in data block B10, 100-140 in data block B11, 140-160 in data block B12, and 160-200 in data block B13.
  • To form the merged file 3, the data of file 1 needs to be read from data blocks B1, B2, and B3, and the data of file 2 needs to be read from data blocks B10, B11, B12, and B13.
  • The read data must then be written into the new data blocks B20, B21, B22, B23, B24, B25, and B26. The data is therefore read and written frequently, with a large amount of input and output, which affects the efficiency of file storage.
  • The data of the merged file 3 is stored in new data blocks, which consumes storage space.
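  • The cost of this read-sort-rewrite approach can be illustrated with a small sketch. It is only an illustration under assumptions: each data block is modelled as a plain list of Keys, the Keys themselves are made up to fall inside the ranges listed above, `naive_merge` is a hypothetical name, and the block capacity of two Keys is chosen so that seven new blocks result, matching B20 to B26.

```python
from typing import List, Tuple

Block = List[int]  # a data block modelled as the sorted Keys it holds


def naive_merge(files: List[List[Block]], capacity: int) -> Tuple[List[Block], int, int]:
    """Read every block, sort all Keys, and write them into brand-new blocks."""
    block_reads = 0
    keys: List[int] = []
    for blocks in files:
        for block in blocks:
            block_reads += 1      # every existing block is read in full
            keys.extend(block)
    keys.sort()
    # Every Key is written again, into newly allocated blocks.
    new_blocks = [keys[i:i + capacity] for i in range(0, len(keys), capacity)]
    return new_blocks, block_reads, len(new_blocks)


# Hypothetical Keys inside the ranges of B1-B3 (file 1) and B10-B13 (file 2).
file1 = [[0, 10], [20, 30], [40, 60]]
file2 = [[80, 90], [100, 120], [140, 150], [160, 180]]

new_blocks, block_reads, block_writes = naive_merge([file1, file2], capacity=2)
print(block_reads, block_writes)  # 7 7
```

  • Even though no Key range of file 1 overlaps a Key range of file 2 in this example, all seven blocks are read and seven new blocks are written.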
  • As another example, file 2 includes data blocks B10, B11, B12, and B13, and the Key value range corresponding to data block B10 is changed from 80-100 to 30-100.
  • The Key value range of file 1 is 0-70 and the Key value range of file 2 is 30-200, so the two ranges overlap. The overlapping Key values range from 30 to 70 and correspond to data blocks B2 and B3 of file 1 and data block B10 of file 2.
  • For the data blocks whose Key value ranges do not overlap, the data in each data block can be read sequentially from B1, B11, B12, and B13, and new data blocks are formed based on that data, for example data blocks B30, B33, B34, and B35 in Table 6.
  • For the data blocks with overlapping Key values, that is, B2 and B3 of file 1 and data block B10 of file 2, the data read from B2, B3, and B10 is rearranged according to the Key size of the data, and the rearranged data is then written into new data blocks, such as data blocks B31 and B32 in Table 6.
  • Table 6 shows the data storage of the merged file 3.
  • To form the merged file 3, the data of file 1 needs to be read from data blocks B1, B2, and B3, and the data of file 2 needs to be read from data blocks B10, B11, B12, and B13.
  • The read data also needs to be written into the new data blocks B30, B31, B32, B33, B34, and B35.
  • When the target file is read, its data is read from data blocks B30, B31, B32, B33, B34, and B35. The data is therefore read and written frequently, with a large amount of input and output, which affects the efficiency of file storage, and the new data blocks consume a large amount of storage space.
  • the merged file is obtained based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files, thereby avoiding a large number of read and write operations. This improves the efficiency of file storage and eliminates the need to generate a large number of new data blocks, saving storage space.
  • FIG. 3 is a schematic flowchart of a method for merging files according to an embodiment of the present application.
  • the method can be performed by a storage device. As shown in FIG. 3, the method 300 includes:
  • A data storage range of N1 data blocks of the first file and a data storage range of N2 data blocks of the second file are determined.
  • The first file and the second file are existing files. The data in the first file forms the N1 data blocks and the data in the second file forms the N2 data blocks, and each data block has its own data storage range.
  • Each data block includes data located within its data storage range.
  • N1 and N2 are positive integers.
  • The data storage range may be a value range of the keyword (Key) corresponding to the data in a data block, or a value range of the data in a data block, or a range of identifiers of the data in a data block.
  • For example, the data in the distributed file system can be represented by the Key corresponding to the data, in which case the data storage range corresponding to each data block is the range of the Keys corresponding to the data in the data block. When the data structure is simple or the amount of data is small, the data storage range corresponding to each data block can also be expressed directly by the value range of the data in the data block.
  • As another example, the title, the trailer, and each fragment of a movie may each be identified using an identifier, and the range of the identifiers is used to represent the data storage range corresponding to the data blocks of the file.
  • In the following, the description uses as an example the case where the data storage range is the value range of the keyword (Key) corresponding to the data in the data block (referred to as the Key value range of the data block).
  • If the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, 220 is performed. If the data storage ranges of the N1 data blocks and the N2 data blocks overlap, 230 is performed.
  • The data storage range of a data block, as used herein, is the data interval formed by all the data that the data block can include. For example, when the data storage range of a data block is the value range of the Key corresponding to the data in the data block, if the Key values corresponding to the data in data block B1 range from 40 to 70 and the Key values corresponding to the data in data block B2 range from 30 to 100, then both data block B1 and data block B2 can include data whose Key value lies in the interval 40-70, so the data storage ranges of data block B1 and data block B2 overlap.
  • In 220, the identifiers of the N1 data blocks and the identifiers of the N2 data blocks are added to the data block list of the target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks. The target file is a file obtained by merging the first file and the second file.
  • In 230, the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap are added to the data block list of the target file, and the data of the N1 data blocks and the N2 data blocks other than the M data blocks is written into P data blocks of the target file, where M and P are positive integers.
  • Specifically, the first file and the second file to be merged are files that already exist in the file system, the target file is the file obtained by merging these two files, and the merged target file includes all the data of the two files.
  • The data blocks of the target file may be determined from the data blocks of the two files; that is, the data blocks of the target file may include at least some of the data blocks of the two files.
  • If the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, the identifiers of the N1 data blocks and the identifiers of the N2 data blocks may be added to the data block list of the target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, and no new data block needs to be formed for the target file.
  • If the data storage ranges of the N1 data blocks and the N2 data blocks overlap, the identifiers of the M data blocks whose data storage ranges do not overlap may be added to the data block list of the target file, and the data of the N1 data blocks and the N2 data blocks other than the M data blocks is written into P data blocks of the target file. In this way, only the data outside the M data blocks needs to be written into new data blocks; the data of the target file that belongs to the M data blocks does not.
  • The P data blocks are new data blocks allocated by the NameNode for the target file, rather than existing data blocks. Optionally, writing the data of the N1 data blocks and the N2 data blocks other than the M data blocks into the P data blocks of the target file includes: rearranging that data according to the size of the Key corresponding to the data, the size of the data value, or the size of the data identifier, and writing the rearranged data into the P data blocks.
  • The data block identifiers (Identity, ID) corresponding to the M data blocks are written into the data block list, which is located in the metadata of the target file, so that the data of the target file in the corresponding range can be read from the data blocks indicated by those data block IDs.
  • For example, if the ID B1 of data block B1 is written into the metadata of the target file, the corresponding data of the target file can be read from the data block indicated by B1; if the ID B2 of data block B2 is written into the metadata of the target file, the corresponding data of the target file can be read from the data block indicated by B2.
  • When data is read, the data block information corresponding to a data block ID in the data block list may be looked up in the file system, and the data is read from the data block on the DataNode according to that data block information.
  • The data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server (DataNode) where the data block is located, and the file to which the data block belongs.
  • The metadata of the file system stores the data block information of a plurality of data blocks, including the N1 data blocks, the N2 data blocks, and the P data blocks. The data block information of each of these data blocks includes the data block identifier and, corresponding to that identifier, information such as the size of the data block, the location of the DataNode where the data block is located, and the file to which the data block belongs.
  • The data block identifier of a data block may be written into the data block list of the target file, or a link to the data block may be written into the data block list and the data block information obtained through the link. This is not limited herein, and other ways of obtaining the data block information also fall within the protection scope of the embodiments of the present application.
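  • The read path through the data block list can be sketched as below. Everything in the sketch is a stand-in chosen for illustration: the `BlockInfo` fields mirror the data block information listed above, while the dictionary-based metadata table, the DataNode names, the stored Keys, and the `read_file` helper are hypothetical and do not correspond to the API of any particular file system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BlockInfo:
    size: int        # size of the data block (number of records here)
    datanode: str    # DataNode where the data block is located
    owner_file: str  # file to which the data block belongs


# File-system metadata: data block ID -> data block information.
block_table: Dict[str, BlockInfo] = {
    "B1": BlockInfo(size=3, datanode="datanode-1", owner_file="file 1"),
    "B2": BlockInfo(size=2, datanode="datanode-2", owner_file="file 2"),
}

# Stand-in for what the DataNodes store: (DataNode, block ID) -> Keys.
datanode_contents: Dict[Tuple[str, str], List[int]] = {
    ("datanode-1", "B1"): [1, 4, 9],
    ("datanode-2", "B2"): [12, 18],
}

# Data block list kept in the metadata of the target file.
target_block_list = ["B1", "B2"]


def read_file(block_list: List[str]) -> List[int]:
    """Resolve each block ID to its block information, then read the block."""
    data: List[int] = []
    for block_id in block_list:
        info = block_table[block_id]
        data.extend(datanode_contents[(info.datanode, block_id)])
    return data


target_data = read_file(target_block_list)
print(target_data)  # [1, 4, 9, 12, 18]
```

  • Note that B1 and B2 still belong to file 1 and file 2 in the metadata; the target file only references them, which is why no data has to be copied.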
  • In addition to the M data blocks, the N1 data blocks of the first file and the N2 data blocks of the second file include N1+N2-M other data blocks.
  • The data storage ranges of these N1+N2-M data blocks at least partially overlap; that is, each of the N1+N2-M data blocks has a data storage range that overlaps with that of at least one of the other data blocks.
  • For example, the data blocks of file 1 are data block B1 (Key value range 1-10) and data block B2 (Key value range 10-20), and the data blocks of file 2 are data block B3 (Key value range 20-30) and data block B4 (Key value range 30-40). The Key value range of each of data blocks B1, B2, B3, and B4 does not overlap with the Key value range of any other data block.
  • The data of the target file can therefore be read from data blocks B1, B2, B3, and B4. In other words, the target file shares data blocks B1 and B2 with file 1 and shares data blocks B3 and B4 with file 2, and no new physical block needs to be formed for the target file.
  • As another example, file 1 includes data block B1 (Key value range 1-10) and data block B2 (Key value range 10-20), and file 2 includes data block B3 (Key value range 15-25) and data block B4 (Key value range 25-40). The Key value range of data block B2 partially overlaps the Key value range of data block B3 (the overlapping Key values range from 15 to 20).
  • The target file includes the data in data blocks B1, B2, B3, and B4. Because only the Key value ranges of data block B1 and data block B4 do not overlap with the Key value ranges of other data blocks, the data block IDs of B1 and B4 can be written directly into the data block list of the target file, so that the corresponding data of the target file can be read from data block B1 and data block B4.
  • Because the target file also needs to include the data in data block B2 and data block B3, the data in B2 and B3 can be read out, rearranged according to the size of the Key value, and written into a new data block, namely data block B5 (Key value range 10-25). Data block B5 is not an existing data block but a new data block formed for the target file, and it includes the data of file 1 and file 2 whose Key values lie in 10-25.
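  • The decision made in this example, which data blocks are referenced by ID and which are rewritten, can be sketched as follows. The sketch assumes half-open Key ranges (so that 1-10 and 10-20 do not overlap, as in the example), and the `Block` type and `plan_merge` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    block_id: str
    lo: int  # lowest Key in the block's range (inclusive)
    hi: int  # upper bound of the range (exclusive, by assumption)


def overlaps(a: Block, b: Block) -> bool:
    """True when the Key ranges of the two blocks share at least one Key."""
    return a.lo < b.hi and b.lo < a.hi


def plan_merge(first: List[Block], second: List[Block]) -> Tuple[List[Block], List[Block]]:
    """Split all blocks into those reusable by ID and those to be rewritten."""
    blocks = list(first) + list(second)
    reuse: List[Block] = []
    rewrite: List[Block] = []
    for block in blocks:
        clashes = any(other is not block and overlaps(block, other) for other in blocks)
        (rewrite if clashes else reuse).append(block)
    return reuse, rewrite


file1 = [Block("B1", 1, 10), Block("B2", 10, 20)]
file2 = [Block("B3", 15, 25), Block("B4", 25, 40)]

reuse, rewrite = plan_merge(file1, file2)
new_range = (min(b.lo for b in rewrite), max(b.hi for b in rewrite))

print([b.block_id for b in reuse])    # ['B1', 'B4']  -> M = 2, IDs go into the list
print([b.block_id for b in rewrite])  # ['B2', 'B3']  -> data is rearranged and rewritten
print(new_range)                      # (10, 25)      -> Key range of the new block B5
```

  • The pairwise comparison is quadratic in the number of blocks, which is acceptable for a sketch; a sorted sweep over the ranges would scale better.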
  • the reason why the data in the data block B2 and the data block B3 is read out and rearranged according to the size of the Key value, and written in the new data block B5 is because the Key value is located in the data block B2 and the data block B3.
  • the data within the overlapping range is not necessarily identical.
  • For example, the Key value range of data block B2 is 10-20 and that of data block B3 is 15-25, so the overlapping Key value range is 15-20.
  • Assume that the data of data block B2 lying in 15-20 is 15, 17, 19 and the data of data block B3 lying in 15-20 is 16, 18, 20; the data to be included in the target file is then 15, 16, 17, 18, 19, 20.
  • If the data block IDs of data block B2 and data block B3 were written directly into the data block list of the target file, the data whose Key values lie in 15-20 could not be stored in Key value order, which would complicate data retrieval when data is later read from the target file.
  • The method of merging files described in the present application can also be applied to merging more than two files. For example, when k files are merged (k is a positive integer and k>2), if t data blocks in the i-th file (i≤k, with i traversing the k files) do not overlap with the data blocks of any other file, the data block IDs of these t data blocks can be written directly into the data block list of the target file.
  • The data in the remaining data blocks of the i-th file (the data blocks other than the t data blocks), together with the data in the remaining data blocks of the other files, needs to be rearranged and written into new data blocks.
  • The value of t may be the same or different for different files.
  • file 1 includes data block B1 (Key value range 0-30) and data block B2 (Key value range 30-60)
  • file 2 includes data block B3 (Key value range 30-60) and data block B4 (Key value range 60-90)
  • file 3 includes data block B5 (Key value range 60-90) and data block B6 (Key value range 90-120).
  • the data block IDs of data block B1 and data block B6 may be written directly into the data block list of the target file, while the data in data block B2, data block B3, data block B4, and data block B5 is rearranged by Key value and then written into new data blocks.
  • file 1 shown in Table 2 is merged with file 2 shown in Table 3 to generate the merged target file. The data block information of file 1 and file 2 to be merged is obtained first.
  • file 1 includes data block B1, data block B2, and data block B3.
  • the key value range corresponding to the data block B1 is 0-20
  • the key value range corresponding to the data block B2 is 20-40
  • the key value range corresponding to the data block B3 is 40-70.
  • the file 2 includes the data block B10, the data block B11, the data block B12, and the data block B13.
  • the key value range corresponding to the data block B10 is 80-100
  • the key value range corresponding to the data block B11 is 100-140
  • the key value range corresponding to the data block B12 is 140-160
  • the key value range corresponding to the data block B13 is 160-200.
  • the merged target file should include all the data in files 1 and 2.
  • the Key value range of each of these data blocks differs from the Key value ranges of the other data blocks; that is, data block B1, data block B2, data block B3, data block B10, data block B11, data block B12, and data block B13 hold no data with the same Key value, and the Key value range of each data block does not overlap with that of any other data block. The data in the merged target file can therefore be read from data block B1, data block B2, and data block B3 of file 1 and from data blocks B10, B11, B12, and B13 of file 2.
  • the data block IDs of data block B1, data block B2, and data block B3 of file 1, and of data block B10, data block B11, data block B12, and data block B13 of file 2, can be written into the data block list of the target file, so that the data in the target file can be read from the data blocks indicated by these data block IDs.
  • Table 7 shows the data storage of the merged target file. In the process of generating the target file, the data block IDs of data block B1, data block B2, data block B3, data block B10, data block B11, data block B12, and data block B13 are written into the data block list of the target file, so that the corresponding data in the target file can be read from these data blocks. This avoids a large amount of IO, improves the efficiency of file storage, and saves storage space because no new data blocks are generated for the target file.
  • file 1 shown in Table 2 is merged with file 2 shown in Table 5 to generate the merged target file. The data block information of file 1 and file 2 to be merged is obtained first.
  • file 1 includes data block B1, data block B2, and data block B3.
  • the key value range corresponding to the data block B1 is 0-20
  • the key value range corresponding to the data block B2 is 20-40
  • the key value range corresponding to the data block B3 is 40-70.
  • the file 2 includes a data block B10, a data block B11, a data block B12, and a data block B13.
  • the key value range corresponding to the data block B10 is 30-100
  • the key value range corresponding to the data block B11 is 100-140
  • the key value range corresponding to the data block B12 is 140-160
  • the key value range corresponding to the data block B13 is 160-200.
  • the data of the merged target file includes all the data in file 1 and file 2.
  • the data block ID of data block B1 of file 1 and the data block IDs of data block B11, data block B12, and data block B13 of file 2 are written into the data block list of the target file, so that the data whose Key value lies in 0-20 can be read directly from data block B1, the data whose Key value lies in 100-140 from data block B11, the data whose Key value lies in 140-160 from data block B12, and the data whose Key value lies in 160-200 from data block B13.
  • Therefore, the data in the target file whose Key value lies in 0-20 or 100-200 does not have to be written into new data blocks as in the prior art, which avoids a large number of read and write operations and saves storage space.
  • for data block B2 (Key value range 20-40) and data block B3 (Key value range 40-70) of file 1 and data block B10 (Key value range 30-100) of file 2, the Key value range of data block B2 partially overlaps that of data block B10, and the Key value range of data block B3 lies entirely within that of data block B10. Because the Key value ranges of these three data blocks overlap, two new data blocks, data block B40 and data block B41, can be generated for the target file: the data is read from data block B2, data block B3, and data block B10 and sorted by Key value, and the rearranged data is written into data block B40 and data block B41.
  • Table 8 shows the data storage of the merged target file. The merged target file includes data block B1 of file 1 and data block B11, data block B12, and data block B13 of file 2, so only two new data blocks, data block B40 and data block B41, are generated. This avoids a large amount of IO, improves the efficiency of file storage, and saves storage space.
  • in the file merging process, the merged file is obtained based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and saves storage space because a large number of new data blocks need not be generated.
  • the magnitude of the sequence numbers of the foregoing processes does not imply an order of execution; the order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • a storage apparatus according to an embodiment of the present application is described below with reference to FIG. 4; the technical features described in the method embodiments also apply to the following apparatus embodiments.
  • FIG. 4 is a schematic block diagram of a storage apparatus 400 in accordance with an embodiment of the present application.
  • the storage apparatus 400 includes a determining unit 410 and a merging unit 420, where:
  • the determining unit 410 is configured to determine a data storage range of the N1 data blocks of the first file, and a data storage range of the N2 data blocks of the second file, where N1 and N2 are positive integers;
  • the merging unit 420 is configured to: in the case that the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to the data block list of the target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is the file obtained by merging the first file and the second file;
  • the merging unit 420 is further configured to: in the case that the data storage ranges of the N1 data blocks and the N2 data blocks overlap, add to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
  • in the file merging process, the merged file is obtained based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and saves storage space because a large number of new data blocks need not be generated.
  • the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
  • the merging unit 420 is specifically configured to: rearrange the data of the N1 data blocks and the N2 data blocks other than the M data blocks in order of the Key corresponding to the data, the value of the data, or the identifier of the data; and write the rearranged data into the P data blocks.
  • the storage apparatus is applied to a file system in which a data block is the data storage unit.
  • in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server DataNode where the data block is located, and the file to which the data block belongs.
  • FIG. 5 is a schematic block diagram of a storage device 500 in accordance with an embodiment of the present application.
  • the storage device 500 can include the storage apparatus 400 shown in FIG. 4, and may be, for example, a computer or the like.
  • the storage device 500 includes a processor 510, a transceiver 520, and a memory 530, wherein the processor 510, the transceiver 520, and the memory 530 communicate with each other through an internal connection path.
  • the memory 530 is used to store data and instructions in the file, and the processor 510 is configured to execute instructions stored in the memory 530 to control the transceiver 520 to receive signals or transmit signals.
  • the processor 510 is configured to: determine a data storage range of the N1 data blocks of the first file, and a data storage range of the N2 data blocks of the second file, where N1 and N2 are positive integers;
  • in the case that the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to the data block list of the target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is the file obtained by merging the first file and the second file;
  • in the case that the data storage ranges of the N1 data blocks and the N2 data blocks overlap, add to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
  • the storage device of the embodiment of the present application obtains the merged file based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and saves storage space because a large number of new data blocks need not be generated.
  • the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
  • the processor 510 is specifically configured to: rearrange the data of the N1 data blocks and the N2 data blocks other than the M data blocks in order of the Key corresponding to the data, the value of the data, or the identifier of the data; and write the rearranged data into the P data blocks.
  • the storage device is applied to a file system in which a data block is the data storage unit.
  • in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server DataNode where the data block is located, and the file to which the data block belongs.
  • the processor 510 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the memory 530 can include read only memory and random access memory and provides instructions and data to the processor 510. A portion of the memory 530 may also include a non-volatile random access memory.
  • each step of the above method may be completed by an integrated logic circuit of hardware in the processor 510 or an instruction in a form of software.
  • the steps of the method disclosed in the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor 510.
  • the software module can be located in a conventional storage medium such as random access memory, flash memory, read only memory, programmable read only memory or electrically erasable programmable memory, registers, and the like.
  • the storage medium is located in memory 530, and processor 510 reads the information in memory 530 and, in conjunction with its hardware, performs the steps of the above method. To avoid repetition, it will not be described in detail here.
  • the storage device 500 according to the embodiment of the present application may correspond to the storage device that performs the method 300 described above and to the storage apparatus 400 according to the embodiment of the present application, and each unit or module in the storage device 500 is used to perform the operations or processes performed by the storage device in the method 300.
  • to avoid repetition, a detailed description thereof is omitted.
  • FIG. 6 is a schematic structural diagram of a chip of an embodiment of the present application.
  • the chip 600 of FIG. 6 includes an input interface 601, an output interface 602, at least one processor 603, and a memory 604.
  • the input interface 601, the output interface 602, the processor 603, and the memory 604 are connected to each other through an internal connection path.
  • the processor 603 is configured to execute code in the memory 604.
  • the processor 603 can implement the method 300 performed by the storage device in a method embodiment. For the sake of brevity, it will not be repeated here.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a division by logical function.
  • in actual implementation there may be other ways of division; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of a software functional unit and sold or used as a standalone product, they may be stored in a computer readable storage medium.
  • on this basis, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions.
  • the instructions are used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

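The present application provides a method for merging files and a storage apparatus. The method includes: determining the data storage ranges of N1 data blocks of a first file and the data storage ranges of N2 data blocks of a second file; when the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, adding the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the N2 data blocks; and when the data storage ranges of the N1 data blocks and the N2 data blocks overlap, adding to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and writing the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file. The method can improve the efficiency of file storage and save storage space.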

Description

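Method for Merging Files, Storage Apparatus, Storage Device, and Storage Medium
This application claims priority to Chinese Patent Application No. 201710326321.4, filed with the Chinese Patent Office on May 10, 2017 and entitled "Method for Merging Files, Storage Apparatus, Storage Device, and Storage Medium", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of data processing, and more specifically, to a method for merging files, a storage apparatus, a storage device, and a storage medium.
Background
When multiple files are merged in a file system, the following approach may be used: the data of each file is read in turn from the data blocks that store these files, and the data read out is written, in a certain order, into newly created data blocks. During this merge, data must be read and written frequently, which produces a large amount of input and output and seriously affects the efficiency of file storage. In addition, because all the data of the merged file has to be stored in new data blocks, the approach also consumes considerable storage space.
Summary
The present application provides a method for merging files, a storage apparatus, a storage device, and a storage medium, which can improve the efficiency of file storage and save storage space.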
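According to a first aspect, a method for merging files is provided. The method includes: determining the data storage ranges of N1 data blocks of a first file and the data storage ranges of N2 data blocks of a second file, where N1 and N2 are positive integers;
when the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, adding the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is the file obtained by merging the first file and the second file; and
when the data storage ranges of the N1 data blocks and the N2 data blocks overlap, adding to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and writing the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
Therefore, in the embodiments of the present application, the merged file is obtained based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files. This avoids a large number of read and write operations and thus improves the efficiency of file storage; moreover, a large number of new data blocks need not be created to store the data of the merged file, which saves storage space.
The data storage range of a data block here refers to the data interval formed by all the data that the data block can contain. For example, assume that the data storage range of a data block is the range of the Key values corresponding to the data in the data block. If the Key values of the data in data block B1 lie in the interval 40-70 and the Key values of the data in data block B2 lie in the interval 30-100, both data block B1 and data block B2 include data whose Key values lie in the interval 40-70, so the data storage ranges of data block B1 and data block B2 overlap.
In one implementation, the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
In one implementation, writing the data of the N1 data blocks and the N2 data blocks other than the M data blocks into the P data blocks of the target file includes: rearranging the data of the N1 data blocks and the N2 data blocks other than the M data blocks in order of the Key corresponding to the data, the value of the data, or the identifier of the data; and writing the rearranged data into the P data blocks.
In one implementation, the method is applied to a file system in which the data block is the data storage unit.
In one implementation, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server DataNode where the data block is located, and the file to which the data block belongs.
It should be understood that the method for merging files described in the present application can also be applied to merging more than two files. For example, when k files are merged (k is a positive integer and k>2), if t data blocks in the i-th file (i≤k, with i traversing the k files) do not overlap with the data blocks of any other file, the data block IDs of these t data blocks can be written directly into the data block list of the target file, while the data in the remaining data blocks of the i-th file (the data blocks other than the t data blocks), together with the data in the remaining data blocks of the other files, needs to be rearranged and written into new data blocks. The value of t may be the same or different for different files.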
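According to a second aspect, a storage apparatus is provided. The storage apparatus can be used to perform the processes of the method for merging files described in the first aspect and its implementations. The storage apparatus includes: a determining unit, configured to determine the data storage ranges of N1 data blocks of a first file and the data storage ranges of N2 data blocks of a second file, where N1 and N2 are positive integers;
a merging unit, configured to: when the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is the file obtained by merging the first file and the second file;
the merging unit is further configured to: when the data storage ranges of the N1 data blocks and the N2 data blocks overlap, add to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
Therefore, the storage apparatus of the embodiments of the present application obtains the merged file based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files. This avoids a large number of read and write operations and thus improves the efficiency of file storage; moreover, a large number of new data blocks need not be generated, which saves storage space.
In one implementation, the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
In one implementation, the merging unit is specifically configured to: rearrange the data of the N1 data blocks and the N2 data blocks other than the M data blocks in order of the Key corresponding to the data, the value of the data, or the identifier of the data; and write the rearranged data into the P data blocks.
In one implementation, the storage apparatus is applied to a file system in which the data block is the data storage unit.
In one implementation, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server DataNode where the data block is located, and the file to which the data block belongs.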
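According to a third aspect, a storage device is provided. The storage device includes a transceiver, a processor, and a memory. The memory stores a program, and the processor executes the program to perform the processes of the method for merging files described in the first aspect and its implementations. The processor is specifically configured to:
determine the data storage ranges of N1 data blocks of a first file and the data storage ranges of N2 data blocks of a second file, where N1 and N2 are positive integers;
when the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is the file obtained by merging the first file and the second file; and
when the data storage ranges of the N1 data blocks and the N2 data blocks overlap, add to the data block list of the target file the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data of the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
Therefore, the storage device of the embodiments of the present application obtains the merged file based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files. This avoids a large number of read and write operations and thus improves the efficiency of file storage; moreover, a large number of new data blocks need not be generated, which saves storage space.
In one implementation, the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
In one implementation, the processor is specifically configured to: rearrange the data of the N1 data blocks and the N2 data blocks other than the M data blocks in order of the Key corresponding to the data, the value of the data, or the identifier of the data; and write the rearranged data into the P data blocks.
In one implementation, the storage device is applied to a file system in which the data block is the data storage unit.
In one implementation, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server DataNode where the data block is located, and the file to which the data block belongs.
According to a fourth aspect, a computer readable storage medium is provided. The computer readable storage medium stores a program, and the program causes the foregoing apparatus to perform the method for merging files according to the first aspect or any of its implementations.
According to a fifth aspect, a chip is provided. The chip includes an input interface, an output interface, a processor, and a memory. The processor is configured to execute instructions stored in the memory, and when the instructions are executed, the processor can implement the method according to the first aspect or any of its implementations.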
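Brief Description of the Drawings
FIG. 1 is a schematic architecture diagram of a distributed file system.
FIG. 2 is a schematic diagram of file merging in the prior art.
FIG. 3 is a schematic flowchart of a method for merging files according to an embodiment of the present application.
FIG. 4 is a schematic block diagram of a storage apparatus according to an embodiment of the present application.
FIG. 5 is a schematic structural diagram of a storage device according to an embodiment of the present application.
FIG. 6 is a schematic structural diagram of a system chip according to an embodiment of the present application.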
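Detailed Description
The technical solutions of the present application are described below with reference to the accompanying drawings.
It should be understood that the method for merging files described in the embodiments of the present application can be applied to a distributed file system (DFS), and can also be applied to other file systems that store files in the form of blocks. The embodiments of the present application are described using a DFS only as an example, but the present application is not limited thereto.
FIG. 1 shows the architecture of a distributed file system, which includes a master server, also called a central server (NameNode), and data servers, also called data nodes (DataNode). The NameNode is the brain of the whole file system: it stores the metadata of files, provides the directory information of the whole file system, and manages the DataNodes. Each file in the distributed file system is split into several data blocks, each of which holds a contiguous segment of the file content. The data block is the basic unit of data storage, and the data blocks are stored on different servers, which are called DataNodes.
When reading data, a client may obtain from the NameNode the block information of the data block that stores the data, for example the DataNode where the data block is located, the size of the data block, and the file (File) to which the data block belongs, and then read the data block from the corresponding DataNode. When writing data, the client may obtain from the NameNode the data block that the NameNode allocates to it and write the data. Each DataNode in FIG. 1 may include several data blocks. In a distributed file system, the value (Value) corresponding to a Key can be determined quickly by looking up the keyword Key in the key-value pair (Key, Value) of the data, which provides the capability for large-scale real-time processing services.
As shown in Table 1, in a distributed file system the data of a file is split into several data blocks, and each data block corresponds to a certain data storage range; in Table 1, the data storage range of each data block is the range of the Keys of the data in that block. The file includes multiple data blocks: the Keys of the data in data block B1 range over 0-20, those in data block B2 over 20-40, and those in data block B3 over 40-70. Here, the data block whose data block identifier (Identity, ID) is B1 is referred to as data block B1 for short, the data block whose ID is B2 as data block B2, the data block whose ID is B3 as data block B3, and so on. Data block identifiers correspond one-to-one to data blocks, and the identifier of each data block indicates the information of that data block, for example the DataNode where it is located, its size, and the file to which it belongs.
Table 1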
Figure PCTCN2018074288-appb-000001
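The Key ranges of Table 1 can be used to find the data block that holds a given Key. The following is only a minimal sketch in Python and not part of the application: the names `BLOCKS` and `find_block` are hypothetical, and the ranges are assumed to be half-open (a Key equal to a shared boundary such as 20 is assigned to the block that starts there), which the description does not specify.

```python
from bisect import bisect_right

# Block list taken from Table 1: (block ID, lowest Key, highest Key).
# Assumption: each range is half-open, [low, high).
BLOCKS = [("B1", 0, 20), ("B2", 20, 40), ("B3", 40, 70)]
STARTS = [low for _, low, _ in BLOCKS]

def find_block(key):
    """Return the ID of the block whose Key range contains `key`, or None."""
    i = bisect_right(STARTS, key) - 1   # last block starting at or before key
    if i < 0:
        return None
    block_id, low, high = BLOCKS[i]
    return block_id if low <= key < high else None

print(find_block(25))  # B2
print(find_block(70))  # None
```

Because the block ranges of one file are sorted and disjoint, a binary search over the range starts is enough; no block data has to be read to locate a Key.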
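In a distributed file system, data is merged by sequentially scanning the data in the multiple files to be merged and writing the data into one new large file. As shown in FIG. 2, when data is merged, each record is first read row by row from the data blocks of the files to be merged (file 1, file 2, and file 3); the Keys of these records are then compared and the records are arranged in ascending order of Key, after which they are written into the new data blocks allocated to the merged file. On the one hand, reading the old files, whether in whole or in part, means reading data blocks on the hard disk; on the other hand, the data to be merged that is read from these data blocks has to be sorted by Key and written into new data blocks. These read and write operations therefore cause problems such as very high input/output, which seriously affects the efficiency of file storage and consumes a large amount of storage space.
For example, file 1 shown in Table 2 is merged with file 2 shown in Table 3 to generate a merged file 3. As shown in Table 2, the data of file 1 is stored in data block B1, data block B2, and data block B3. The keywords of the data in data block B1 range over 0-20, those in data block B2 over 20-40, and those in data block B3 over 40-70.
Table 2
Figure PCTCN2018074288-appb-000002
As shown in Table 3, the data of file 2 is stored in data block B10, data block B11, data block B12, and data block B13. The keywords of the data in data block B10 range over 80-100, those in data block B11 over 100-140, those in data block B12 over 140-160, and those in data block B13 over 160-200.
Table 3
Figure PCTCN2018074288-appb-000003
The data in each data block is read in turn from data blocks B1, B2, and B3, which store file 1, and from data blocks B10, B11, B12, and B13, which store file 2, and new data blocks B20, B21, B22, B23, B24, B25, and B26 are generated from this data. In forming file 3, the data read from B1, B2, B3 and B10, B11, B12, B13 is sorted by Key value and forms the new data blocks B20, B21, B22, B23, B24, B25, and B26; Table 4 shows the data storage of the merged file 3. Throughout the merge, on the one hand, the data of file 1 has to be read from data blocks B1, B2, and B3 and the data of file 2 from data blocks B10, B11, B12, and B13; on the other hand, the data read out also has to be written into the new data blocks B20, B21, B22, B23, B24, B25, and B26. Data is therefore read and written frequently and there is a large amount of input and output, which affects the efficiency of file storage; and because the data of the merged file 3 resides in new data blocks, storage space is also consumed.
Table 4
Figure PCTCN2018074288-appb-000004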
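As another example, suppose that the data blocks of file 1 and file 2 include data blocks whose data storage ranges overlap, for example whose Key values overlap. Assume that the data of file 1 is still as shown in Table 2 and the data of file 2 is as shown in Table 5; the data storage of the merged file 3 may then be as shown in Table 6. In Table 5, file 2 includes data blocks B10, B11, B12, and B13. Compared with Table 3, the Key value range of data block B10 changes from 80-100 to 30-100. The Key value range 0-70 of file 1 therefore overlaps the Key value range 30-200 of file 2; the overlapping Key value range is 30-70, which corresponds to data blocks B2 and B3 of file 1 and data block B10 of file 2.
Table 5
Figure PCTCN2018074288-appb-000005
For data block B1 of file 1 and data blocks B11, B12, and B13 of file 2, the data in each data block can be read in turn directly from B1, B11, B12, and B13, and new data blocks are formed from this data, for example data blocks B30, B33, B34, and B35 in Table 6. For the data blocks whose Key value ranges overlap, namely B2 and B3 of file 1 and data block B10 of file 2, the data read from B2, B3, and B10 is rearranged by keyword, and the rearranged data is then written into new data blocks, for example data blocks B31 and B32 in Table 6. Table 6 shows the data storage of the merged file 3.
Likewise, throughout the merge, the data of file 1 has to be read from data blocks B1, B2, and B3 and the data of file 2 from data blocks B10, B11, B12, and B13; to form the merged target file, the data read out has to be written into the new data blocks B30, B31, B32, B33, B34, and B35. When the target file is read later, its data is read from data blocks B30, B31, B32, B33, B34, and B35. Data is therefore read and written frequently and there is a large amount of input and output, which affects the efficiency of file storage, and the new data blocks consume a large amount of storage space.
Table 6
Figure PCTCN2018074288-appb-000006
In the embodiments of the present application, the merged file is obtained based on the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of the at least two files. This avoids a large number of read and write operations and thus improves the efficiency of file storage; moreover, a large number of new data blocks need not be generated, which saves storage space.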
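FIG. 3 is a schematic flowchart of a method for merging files according to an embodiment of the present application. The method may be performed by a storage device. As shown in FIG. 3, the method 300 includes:
In 310, the data storage ranges of N1 data blocks of a first file and the data storage ranges of N2 data blocks of a second file are determined.
The first file and the second file are existing files. Among the N1 data blocks formed by the data of the first file and the N2 data blocks formed by the data of the second file, each data block has its own data storage range and includes the data lying within that range. N1 and N2 are positive integers.
Optionally, the data storage range may be the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
Generally, data in a distributed file system can be represented by its Key value, and the data storage range of each data block is then the range of the Keys of the data in that block. When the data structure is simple or the amount of data is small, however, the data storage range of each data block may also be expressed directly as the range of the values of the data in that block. In addition, for some files such as movie files, identifiers may be used to mark the opening, the ending, and each segment of the movie, and the range of these identifiers may be used to express the data storage range of the data blocks of the file. The following description takes as an example the case in which the data storage range is the value range of the keyword Key corresponding to the data in a data block (the Key value range of the data block for short).
After the data storage ranges of the N1 data blocks of the first file and of the N2 data blocks of the second file are determined, it can be judged whether the data storage ranges of the N1 data blocks and the N2 data blocks overlap. If they do not overlap, 320 is performed; if they overlap, 330 is performed.
The data storage range of a data block here refers to the data interval formed by all the data that the data block can contain. For example, assume that the data storage range of a data block is the range of the Key values corresponding to the data in the data block. If the Key values of the data in data block B1 lie in the interval 40-70 and the Key values of the data in data block B2 lie in the interval 30-100, both data block B1 and data block B2 include data whose Key values lie in the interval 40-70, so the data storage ranges of data block B1 and data block B2 overlap.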
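The overlap test used in this judgment can be sketched as an interval intersection. The sketch below is illustrative only: the function name `overlap` is hypothetical, and it assumes that two ranges that merely share a boundary value (such as 0-20 and 20-40 in Table 2) do not overlap, which matches how the examples in this description treat adjacent data blocks.

```python
def overlap(a, b):
    """Return the shared (low, high) Key interval of two ranges, or None.

    Assumption: ranges that only touch at one boundary value do not overlap.
    """
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low < high else None

print(overlap((40, 70), (30, 100)))  # (40, 70)
print(overlap((0, 20), (20, 40)))    # None
```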
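In 320, when the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap, the identifiers of the N1 data blocks and the identifiers of the N2 data blocks are added to the data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, where the target file is the file obtained by merging the first file and the second file.
In 330, when the data storage ranges of the N1 data blocks and the N2 data blocks overlap, the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap are added to the data block list of the target file, and the data of the N1 data blocks and the N2 data blocks other than the M data blocks is written into P data blocks of the target file, where M and P are positive integers.
Specifically, the first file and the second file to be merged are existing files in the file system, and the target file is the file obtained by merging them; the merged target file includes all the data of the two files. To merge the two files into the target file, an empty target file is created first; data blocks then have to be determined for the target file, and the data block identifiers of these data blocks are written into the data block list of the target file, so that the data in these data blocks is added to the target file. In the embodiments of the present application, the data blocks of the target file can be determined from the data blocks of the two files; that is, the data blocks of the target file may include at least some of the data blocks of the two files.
Assume that the first file has N1 data blocks and the second file has N2 data blocks, and that the data storage ranges of the N1 data blocks and the N2 data blocks do not overlap. The identifiers of the N1 data blocks and the identifiers of the N2 data blocks can then be added to the data block list of the target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, and no new data block has to be formed for the target file.
If the data storage ranges of the N1 data blocks and the N2 data blocks overlap, the identifiers of the M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap can be added to the data block list of the target file, and the data of the N1 data blocks and the N2 data blocks other than the M data blocks is written into P data blocks of the target file. Only the data outside the M data blocks has to be written into new data blocks; the data of the target file that belongs to the M data blocks does not have to be written into new data blocks.
The P data blocks are new data blocks that the NameNode allocates to the target file, not existing data blocks. Optionally, writing the data of the N1 data blocks and the N2 data blocks other than the M data blocks into the P data blocks of the target file includes: rearranging the data of the N1 data blocks and the N2 data blocks other than the M data blocks in order of the Key corresponding to the data, the value of the data, or the identifier of the data, and writing the rearranged data into the P data blocks.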
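The rearrange-and-write step can be sketched as sorting the records of the overlapping data blocks and packing them into new blocks. This is a minimal sketch under stated assumptions, not the implementation of the application: the function `write_new_blocks`, the fixed `capacity` per block, and the way new block IDs are numbered are all hypothetical, since in the description the new data blocks are allocated by the NameNode.

```python
def write_new_blocks(records, capacity, first_id):
    """Sort (key, value) records by Key and pack them into new blocks.

    Hypothetical parameters: `capacity` is the number of records per new
    block; `first_id` is the number used for the first new block ID.
    """
    ordered = sorted(records, key=lambda kv: kv[0])
    blocks = {}
    for n, start in enumerate(range(0, len(ordered), capacity)):
        blocks[f"B{first_id + n}"] = ordered[start:start + capacity]
    return blocks

# Illustrative records read from the overlapping data blocks.
records = [(35, "c"), (22, "a"), (90, "f"), (41, "d"), (60, "e"), (28, "b")]
new_blocks = write_new_blocks(records, capacity=3, first_id=40)
for block_id, content in new_blocks.items():
    print(block_id, [key for key, _ in content])
# B40 [22, 28, 35]
# B41 [41, 60, 90]
```

Here P equals the number of entries in `new_blocks` (2 in this run); it depends on how many records the overlapping blocks hold and on the assumed block capacity.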
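That is, after the M data blocks are determined, the data block identifiers (Identity, ID) of the M data blocks are written into the data block list, which is located in the metadata of the target file; the data of the target file lying in the corresponding range can then be read from the data blocks indicated by these data block IDs. For example, writing the ID of data block B1 into the metadata of the target file indicates that the corresponding data of the target file can be read from the data block indicated by B1; writing the identifier B2 of data block B2 into the metadata of the target file indicates that the corresponding data of the target file can be read from the data block indicated by B2.
When data is read from the data block indicated by a data block ID, the data block information corresponding to the data block ID can be looked up in the file system according to the data block ID in the data block list, and the data is read from the data block on the DataNode according to that data block information.
Optionally, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs is at least partially different, and the data block information includes at least one of the following: the size of the data block, the data server DataNode where the data block is located, and the file to which the data block belongs.
Specifically, the metadata of the file system stores the data block information of multiple data blocks, which include the N1 data blocks, the N2 data blocks, and the P data blocks. The data block information of each of these data blocks includes the data block identifier of the data block and, corresponding to that identifier, information such as the size of the data block, the location of the data server DataNode where the data block resides, and the file to which the data block belongs.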
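The relationship between a data block list and the data block information in the metadata can be sketched as a lookup table keyed by data block ID. Everything concrete below is assumed for illustration: the class `BlockInfo`, the table `METADATA`, the function `locate`, the sizes, and the DataNode names are hypothetical and do not come from the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockInfo:
    block_id: str
    size: int         # size of the data block (illustrative unit: MB)
    datanode: str     # DataNode where the data block is located
    owner_file: str   # file to which the data block belongs

# Hypothetical file-system metadata: data block ID -> data block information.
METADATA = {
    "B1": BlockInfo("B1", 64, "datanode-1", "file 1"),
    "B11": BlockInfo("B11", 64, "datanode-2", "file 2"),
}

# Data block list of the target file: only IDs are stored, no data is copied.
TARGET_BLOCK_LIST = ["B1", "B11"]

def locate(block_list, metadata):
    """Resolve each data block ID in a block list to (ID, DataNode)."""
    return [(bid, metadata[bid].datanode) for bid in block_list]

print(locate(TARGET_BLOCK_LIST, METADATA))
# [('B1', 'datanode-1'), ('B11', 'datanode-2')]
```

The sketch shows why reusing a block is cheap: the target file and the original file both refer to the same entry, so adding a block to the target file changes only the list of IDs.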
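It should be understood that in the embodiments of the present application the data block identifier of a data block may be written into the data block list of the target file, or a link to the data block may be written into the data block list, through which the data block information of the data block can be obtained. This is not limited here, and other ways of obtaining the data block information also fall within the protection scope of the embodiments of the present application.
Specifically, besides the M data blocks whose data storage ranges do not overlap, the N1 data blocks of the first file and the N2 data blocks of the second file include another N1+N2-M data blocks whose data storage ranges at least partially overlap; that is, each of these N1+N2-M data blocks has an overlapping data storage range with at least one of the other data blocks. The data in these N1+N2-M data blocks then has to be rearranged by Key value, and P new data blocks are generated from the rearranged data.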
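For example, suppose that existing file 1 and file 2 need to be merged to generate a new target file. The data blocks of file 1 are data block B1 (Key value range 1-10) and data block B2 (Key value range 10-20), and the data blocks of file 2 are data block B3 (Key value range 20-30) and data block B4 (Key value range 30-40). The Key value range of each of data block B1, data block B2, data block B3, and data block B4 does not overlap with the Key value ranges of the other data blocks. The data of the target file can then be read from data block B1, data block B2, data block B3, and data block B4; it can be understood that the target file shares data block B1 and data block B2 with file 1 and shares data block B3 and data block B4 with file 2, so no new physical block needs to be formed for the target file.
As another example, suppose that existing file 1 and file 2 need to be merged to obtain a merged target file. File 1 includes data block B1 (Key value range 1-10) and data block B2 (Key value range 10-20), and file 2 includes data block B3 (Key value range 15-25) and data block B4 (Key value range 25-40). The Key value range of data block B2 partially overlaps that of data block B3 (the overlapping Key value range is 15-20). The target file includes the data in data block B1, data block B2, data block B3, and data block B4, but because only the Key value ranges of data block B1 and data block B4 do not overlap with those of other data blocks, the data block IDs of data block B1 and data block B4 can be written directly into the data block list of the target file, so that the corresponding data of the target file can be read from data block B1 and data block B4.
Because the target file also needs to include the data in data block B2 and data block B3, the data in data block B2 and data block B3 can be read out, rearranged by Key value, and written into a new data block, data block B5 (Key value range 10-25). Data block B5 is not an existing data block; it is a new data block formed for the target file, and it holds the data of file 1 and file 2 whose Key values lie in 10-25.
The data in data block B2 and data block B3 is read out, rearranged by Key value, and written into the new data block B5 because the data whose Key values lie in the overlapping range is not necessarily identical in the two data blocks. For example, the Key value range of data block B2 is 10-20 and that of data block B3 is 15-25, so the overlapping Key value range is 15-20. Assume that the data of data block B2 lying in 15-20 is 15, 17, 19 and the data of data block B3 lying in 15-20 is 16, 18, 20; the data to be included in the target file is then 15, 16, 17, 18, 19, 20. If the data block IDs of data block B2 and data block B3 were written directly into the data block list of the target file, the data whose Key values lie in 15-20 could not be stored in Key value order, which would complicate data retrieval when data is later read from the target file.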
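The difference between listing both overlapping blocks and rewriting them can be seen with the values from this example. The sketch below uses Python's standard `heapq.merge`; the variable names are illustrative, and treating "writing both block IDs" as a plain concatenation of the two blocks is a simplifying assumption.

```python
import heapq

b2_keys = [15, 17, 19]   # data of data block B2 lying in 15-20
b3_keys = [16, 18, 20]   # data of data block B3 lying in 15-20

# Listing B2 and then B3 in the block list keeps each block's own order only.
concatenated = b2_keys + b3_keys

# Rewriting into a new block merges the two sorted runs into one sorted run.
merged = list(heapq.merge(b2_keys, b3_keys))

print(concatenated)  # [15, 17, 19, 16, 18, 20]
print(merged)        # [15, 16, 17, 18, 19, 20]
```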
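It should be understood that the above description takes the merging of two files as an example, but the method for merging files described in the present application can also be applied to merging more than two files. For example, when k files are merged (k is a positive integer and k>2), if t data blocks in the i-th file (i≤k, with i traversing the k files) do not overlap with the data blocks of any other file, the data block IDs of these t data blocks can be written directly into the data block list of the target file, while the data in the remaining data blocks of the i-th file (the data blocks other than the t data blocks), together with the data in the remaining data blocks of the other files, needs to be rearranged and written into new data blocks. The value of t may be the same or different for different files.
For example, file 1 includes data block B1 (Key value range 0-30) and data block B2 (Key value range 30-60), file 2 includes data block B3 (Key value range 30-60) and data block B4 (Key value range 60-90), and file 3 includes data block B5 (Key value range 60-90) and data block B6 (Key value range 90-120). The data block IDs of data block B1 and data block B6 can be written directly into the data block list of the target file, while the data in data block B2, data block B3, data block B4, and data block B5 is rearranged by Key value and then written into new data blocks.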
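The classification of data blocks into those that can be reused and those that must be rewritten can be sketched for the three-file example above. This is one possible reading, not the implementation of the application: the function `plan_merge` and its data layout are hypothetical, ranges that only touch at a boundary (such as 0-30 and 30-60) are assumed not to overlap, and each block is compared only with blocks of other files.

```python
def plan_merge(files):
    """Split block IDs into (reuse, rewrite) lists.

    `files` maps a file name to a list of (block_id, low, high) tuples.
    A block is reusable when its Key range overlaps no block of another file.
    """
    tagged = [(name, blk) for name, blocks in files.items() for blk in blocks]
    reuse, rewrite = [], []
    for name, (block_id, low, high) in tagged:
        clash = any(
            other != name and max(low, o_low) < min(high, o_high)
            for other, (_, o_low, o_high) in tagged
        )
        (rewrite if clash else reuse).append(block_id)
    return reuse, rewrite

files = {
    "file 1": [("B1", 0, 30), ("B2", 30, 60)],
    "file 2": [("B3", 30, 60), ("B4", 60, 90)],
    "file 3": [("B5", 60, 90), ("B6", 90, 120)],
}
reuse, rewrite = plan_merge(files)
print(reuse)    # ['B1', 'B6']
print(rewrite)  # ['B2', 'B3', 'B4', 'B5']
```

The pairwise comparison is quadratic in the number of blocks, which keeps the sketch short; a sweep over blocks sorted by range start would do the same work for larger block lists.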
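Two detailed examples of the file merging method of the embodiments of this application are described below with reference to Table 7 and Table 8.
Case 1
Take the merging of file 1 shown in Table 2 and file 2 shown in Table 3 as an example: file 1 shown in Table 2 is merged with file 2 shown in Table 3 to generate the merged target file. First, the data block information of file 1 and file 2 to be merged is obtained. As shown in Table 2, file 1 includes data blocks B1, B2 and B3, where the Key range of data block B1 is 0-20, that of data block B2 is 20-40, and that of data block B3 is 40-70. As shown in Table 3, file 2 includes data blocks B10, B11, B12 and B13, where the Key range of data block B10 is 80-100, that of data block B11 is 100-140, that of data block B12 is 140-160, and that of data block B13 is 160-200. The merged target file should include all the data in file 1 and file 2.
As can be seen, the Key range of each of these data blocks does not overlap the Key range of any other data block; that is, no data with the same Key value exists among data blocks B1, B2, B3, B10, B11, B12 and B13. The data of the merged target file can therefore be read from data blocks B1, B2 and B3 of file 1 and data blocks B10, B11, B12 and B13 of file 2. For example, the data block IDs of data blocks B1, B2 and B3 of file 1 and the data block IDs of data blocks B10, B11, B12 and B13 of file 2 can be written into the data block list of the target file, so that the data of the target file can be read from the data blocks indicated by these data block IDs.
Table 7 shows how the data of the merged target file is stored. While the target file is generated, the data block IDs of data blocks B1, B2, B3, B10, B11, B12 and B13 are written into the data block list of the target file, so that the corresponding data of the target file can be read from these data blocks. This avoids a large amount of IO, improves the efficiency of file storage, and saves storage space because no new data block has to be generated for the target file.
Table 7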
Figure PCTCN2018074288-appb-000007
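Case 1 is a metadata-only merge, and the resulting read path can be sketched in Python as below. The tuple layout, the name `locate`, and the half-open treatment of Key ranges are assumptions of this sketch; the application does not specify how a reader resolves a Key to a data block.

```python
import bisect

# (block ID, range start, range end) taken from Table 2 and Table 3 as described.
file1 = [("B1", 0, 20), ("B2", 20, 40), ("B3", 40, 70)]
file2 = [("B10", 80, 100), ("B11", 100, 140), ("B12", 140, 160), ("B13", 160, 200)]

# No data is copied: the target file's block list is just the existing
# entries, ordered by the start of their Key range.
target_block_list = sorted(file1 + file2, key=lambda b: b[1])
starts = [lo for _, lo, _ in target_block_list]


def locate(key):
    """Return the ID of the block whose range holds key, or None."""
    i = bisect.bisect_right(starts, key) - 1
    if i < 0:
        return None
    block_id, lo, hi = target_block_list[i]
    # Ranges are treated as half-open [lo, hi), an assumption of this sketch.
    return block_id if lo <= key < hi else None


print([b[0] for b in target_block_list])
print(locate(150), locate(75))
```

A Key of 150 resolves to B12, while a Key of 75 falls in the gap between B3 and B10 and resolves to no block.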
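Case 2
Take the merging of file 1 shown in Table 2 and file 2 shown in Table 5 as an example: file 1 shown in Table 2 is merged with file 2 shown in Table 5 to generate the merged target file. First, the data block information of file 1 and file 2 to be merged is obtained. As shown in Table 2, file 1 includes data blocks B1, B2 and B3, where the Key range of data block B1 is 0-20, that of data block B2 is 20-40, and that of data block B3 is 40-70. As shown in Table 5, file 2 includes data blocks B10, B11, B12 and B13, where the Key range of data block B10 is 30-100, that of data block B11 is 100-140, that of data block B12 is 140-160, and that of data block B13 is 160-200. The data of the merged target file includes all the data in file 1 and file 2.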
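As can be seen, no data with overlapping Key values exists among data block B1 of file 1 and data blocks B11, B12 and B13 of file 2; that is, the Key range of each of data blocks B1, B11, B12 and B13 does not overlap the Key range of any other data block. The data block ID of data block B1 of file 1 and the data block IDs of data blocks B11, B12 and B13 of file 2 can therefore be written into the data block list of the target file, so that the data of the target file whose Key values lie in 0-20 can be read directly from data block B1, the data whose Key values lie in 100-140 directly from data block B11, the data whose Key values lie in 140-160 directly from data block B12, and the data whose Key values lie in 160-200 directly from data block B13. For the data of the target file whose Key values lie in 0-20 and 100-200, there is thus no need to write the data into new data blocks as in the prior art, which avoids a large number of read and write operations and saves storage space.
As for data block B2 (Key range 20-40) and data block B3 (Key range 40-70) of file 1 and data block B10 (Key range 30-100) of file 2, the Key range of data block B2 partially overlaps the Key range of data block B10, and the Key range of data block B3 lies entirely within the Key range of data block B10. Because the Key ranges of these three data blocks overlap, two new data blocks, B40 and B41, can be generated for the target file: the data is read from data blocks B2, B3 and B10, sorted by Key value, and the rearranged data is finally written into data blocks B40 and B41.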
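Table 8 shows how the data of the merged target file is stored. As can be seen, the merged target file includes data block B1 of file 1 and data blocks B11, B12 and B13 of file 2, so only two new data blocks, B40 and B41, are generated. This avoids a large amount of IO, improves the efficiency of file storage, and saves storage space.
Table 8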
Figure PCTCN2018074288-appb-000008
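The rewrite step of case 2 can be sketched as follows. The keys stored in B2, B3 and B10 are hypothetical, and so is the rule for cutting the rearranged data into new blocks: a fixed capacity of five records per block is assumed here only so that the sketch yields two blocks, since the text does not state how the boundary between B40 and B41 is chosen.

```python
import heapq
from itertools import islice

# Hypothetical sorted keys held by the three overlapping blocks of case 2.
overlapping = {
    "B2": [20, 25, 35],    # Key range 20-40
    "B3": [40, 55, 65],    # Key range 40-70
    "B10": [30, 50, 90],   # Key range 30-100
}
# Blocks reused by ID, given as (block ID, start of Key range).
reused = [("B1", 0), ("B11", 100), ("B12", 140), ("B13", 160)]


def rewrite(blocks, first_new_id, capacity):
    """Merge sorted blocks by Key and cut the result into new blocks."""
    merged = heapq.merge(*blocks.values())
    out = []
    while True:
        chunk = list(islice(merged, capacity))
        if not chunk:
            return out
        out.append((f"B{first_new_id + len(out)}", chunk))


new_blocks = rewrite(overlapping, first_new_id=40, capacity=5)

# The target block list mixes reused IDs with the IDs of the new blocks.
entries = reused + [(block_id, keys[0]) for block_id, keys in new_blocks]
target_block_list = [block_id for block_id, _ in sorted(entries, key=lambda e: e[1])]

print(new_blocks)
print(target_block_list)
```

Only the nine records of the three overlapping blocks are read and rewritten; B1, B11, B12 and B13 are referenced without any data movement.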
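Therefore, during file merging, the merged file is obtained on the basis of the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of those files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and it removes the need to generate a large number of new data blocks, which saves storage space.
It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of this application in any way.
A storage apparatus according to an embodiment of this application is described below with reference to FIG. 4. The technical features described in the method embodiments also apply to the following apparatus embodiments.
FIG. 4 is a schematic block diagram of a storage apparatus 400 according to an embodiment of this application. As shown in FIG. 4, the storage apparatus 400 includes a determining unit 410 and a merging unit 420, where:
the determining unit 410 is configured to determine the data storage ranges of N1 data blocks of a first file and the data storage ranges of N2 data blocks of a second file, where N1 and N2 are positive integers;
the merging unit 420 is configured to, when the data storage ranges of the N1 data blocks do not overlap those of the N2 data blocks, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to the data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, the target file being the file obtained by merging the first file and the second file; and
the merging unit 420 is further configured to, when the data storage ranges of the N1 data blocks overlap those of the N2 data blocks, add to the data block list of the target file the identifiers of M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data in the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
Therefore, during file merging, the merged file is obtained on the basis of the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of those files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and it removes the need to generate a large number of new data blocks, which saves storage space.
Optionally, the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
Optionally, the merging unit 420 is specifically configured to: rearrange the data in the N1 data blocks and the N2 data blocks other than the M data blocks according to the Key corresponding to the data, the value of the data, or the identifier of the data; and write the rearranged data into the P data blocks.
Optionally, the storage apparatus is applied to a file system that uses the data block as its unit of data storage.
Optionally, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs differs at least in part, the data block information including at least one of the following: the size of the data block, the data server (DataNode) on which the data block is located, and the file to which the data block belongs.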
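The three optional definitions of the data storage range, and the matching rearrangement criteria, differ only in which field of a record is compared. A minimal Python sketch under an assumed record layout of `(key, value, record_id)` follows; the layout and the names `storage_range` and `rearrange` are hypothetical.

```python
from operator import itemgetter

# Hypothetical record layout: (key, value, record_id).
BY_KEY, BY_VALUE, BY_ID = itemgetter(0), itemgetter(1), itemgetter(2)


def storage_range(records, by):
    """Return the (min, max) of the chosen field over a block's records."""
    fields = [by(r) for r in records]
    return min(fields), max(fields)


def rearrange(blocks, by):
    """Collect the records of several blocks and sort them by the chosen field."""
    return sorted((r for blk in blocks for r in blk), key=by)


block_a = [(1, "pear", 103), (4, "apple", 101)]
block_b = [(2, "fig", 102)]

print(storage_range(block_a, BY_KEY))
print(storage_range(block_a, BY_VALUE))
print(rearrange([block_a, block_b], BY_ID))
```

Whichever field defines the range should also drive the rearrangement, so that the new data blocks end up with non-overlapping ranges under the same definition.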
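FIG. 5 is a schematic block diagram of a storage device 500 according to an embodiment of this application. The storage device 500 may include the storage apparatus 400 shown in FIG. 4 and is, for example, a computer. As shown in FIG. 5, the storage device 500 includes a processor 510, a transceiver 520 and a memory 530, which communicate with one another through an internal connection path. The memory 530 is configured to store the data of files as well as instructions, and the processor 510 is configured to execute the instructions stored in the memory 530 so as to control the transceiver 520 to receive or send signals.
The processor 510 is configured to: determine the data storage ranges of N1 data blocks of a first file and the data storage ranges of N2 data blocks of a second file, where N1 and N2 are positive integers;
when the data storage ranges of the N1 data blocks do not overlap those of the N2 data blocks, add the identifiers of the N1 data blocks and the identifiers of the N2 data blocks to the data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, the target file being the file obtained by merging the first file and the second file; and
when the data storage ranges of the N1 data blocks overlap those of the N2 data blocks, add to the data block list of the target file the identifiers of M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data in the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, where M and P are positive integers.
Therefore, the storage device of the embodiments of this application obtains the merged file on the basis of the data blocks of the at least two files to be merged, so that the merged file can be read from the data blocks of those files. This avoids a large number of read and write operations, which improves the efficiency of file storage, and it removes the need to generate a large number of new data blocks, which saves storage space.
Optionally, the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
Optionally, the processor 510 is specifically configured to: rearrange the data in the N1 data blocks and the N2 data blocks other than the M data blocks according to the Key corresponding to the data, the value of the data, or the identifier of the data; and write the rearranged data into the P data blocks.
Optionally, the storage device is applied to a file system that uses the data block as its unit of data storage.
Optionally, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs differs at least in part, the data block information including at least one of the following: the size of the data block, the data server (DataNode) on which the data block is located, and the file to which the data block belongs.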
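It should be understood that, in the embodiments of this application, the processor 510 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.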
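The memory 530 may include read-only memory and random access memory, and provides instructions and data to the processor 510. A part of the memory 530 may also include non-volatile random access memory.
During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 510 or by instructions in the form of software. The steps of the file merging method disclosed in the embodiments of this application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor 510. The software module may be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 530; the processor 510 reads the information in the memory 530 and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here again.
The storage device 500 according to the embodiments of this application may correspond to the storage device that performs method 300 above and to the storage apparatus 400 according to the embodiments of this application, and the units or modules of the storage device 500 are respectively configured to perform the actions or processes performed by the storage device in method 300. To avoid redundancy, their detailed description is omitted here.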
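FIG. 6 is a schematic structural diagram of a chip according to an embodiment of this application. The chip 600 of FIG. 6 includes an input interface 601, an output interface 602, at least one processor 603 and a memory 604, which are connected to one another through an internal connection path. The processor 603 is configured to execute the code in the memory 604.
Optionally, when the code is executed, the processor 603 can implement method 300 performed by the storage device in the method embodiments. For brevity, details are not described here again.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or by software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered to go beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely specific implementations of this application, but the scope of protection of this application is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.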

Claims (16)

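  1. A method for merging files, characterized in that the method comprises:
    determining data storage ranges of N1 data blocks of a first file and data storage ranges of N2 data blocks of a second file, N1 and N2 being positive integers;
    when the data storage ranges of the N1 data blocks and of the N2 data blocks do not overlap, adding identifiers of the N1 data blocks and identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, the target file being a file obtained by merging the first file and the second file;
    when the data storage ranges of the N1 data blocks and of the N2 data blocks overlap, adding to the data block list of the target file identifiers of M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and writing the data in the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, M and P being positive integers.
  2. The method according to claim 1, characterized in that the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
  3. The method according to claim 1 or 2, characterized in that writing the data in the N1 data blocks and the N2 data blocks other than the M data blocks into the P data blocks of the target file comprises:
    rearranging the data in the N1 data blocks and the N2 data blocks other than the M data blocks according to the Key corresponding to the data, the value of the data, or the identifier of the data;
    writing the rearranged data into the P data blocks.
  4. The method according to any one of claims 1 to 3, characterized in that the method is applied to a file system that uses the data block as its unit of data storage.
  5. The method according to any one of claims 1 to 4, characterized in that, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs differs at least in part, the data block information including at least one of the following:
    the size of the data block, the data server (DataNode) on which the data block is located, and the file to which the data block belongs.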
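  6. A storage apparatus, characterized in that the storage apparatus comprises:
    a determining unit, configured to determine data storage ranges of N1 data blocks of a first file and data storage ranges of N2 data blocks of a second file, N1 and N2 being positive integers;
    a merging unit, configured to, when the data storage ranges of the N1 data blocks and of the N2 data blocks do not overlap, add identifiers of the N1 data blocks and identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, the target file being a file obtained by merging the first file and the second file;
    the merging unit being further configured to, when the data storage ranges of the N1 data blocks and of the N2 data blocks overlap, add to the data block list of the target file identifiers of M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and write the data in the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, M and P being positive integers.
  7. The storage apparatus according to claim 6, characterized in that the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
  8. The storage apparatus according to claim 6 or 7, characterized in that the merging unit is specifically configured to:
    rearrange the data in the N1 data blocks and the N2 data blocks other than the M data blocks according to the Key corresponding to the data, the value of the data, or the identifier of the data;
    write the rearranged data into the P data blocks.
  9. The storage apparatus according to any one of claims 6 to 8, characterized in that the storage apparatus is applied to a file system that uses the data block as its unit of data storage.
  10. The storage apparatus according to any one of claims 6 to 9, characterized in that, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs differs at least in part, the data block information including at least one of the following:
    the size of the data block, the data server (DataNode) on which the data block is located, and the file to which the data block belongs.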
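  11. A storage device, characterized in that the storage device comprises a transceiver, a memory and a processor, wherein the memory is configured to store instructions, and the processor is connected to the memory and the transceiver and is configured to execute the instructions stored in the memory so as to perform the following steps when executing the instructions:
    determining data storage ranges of N1 data blocks of a first file and data storage ranges of N2 data blocks of a second file, N1 and N2 being positive integers;
    when the data storage ranges of the N1 data blocks and of the N2 data blocks do not overlap, adding identifiers of the N1 data blocks and identifiers of the N2 data blocks to a data block list of a target file, so that the target file includes the data in the N1 data blocks and the data in the N2 data blocks, the target file being a file obtained by merging the first file and the second file;
    when the data storage ranges of the N1 data blocks and of the N2 data blocks overlap, adding to the data block list of the target file identifiers of M data blocks, among the N1 data blocks and the N2 data blocks, whose data storage ranges do not overlap, and writing the data in the N1 data blocks and the N2 data blocks other than the M data blocks into P data blocks of the target file, M and P being positive integers.
  12. The storage device according to claim 11, characterized in that the data storage range is the value range of the keyword Key corresponding to the data in a data block, the value range of the data in a data block, or the range of the identifiers of the data in a data block.
  13. The storage device according to claim 11 or 12, characterized in that the processor is specifically configured to:
    rearrange the data in the N1 data blocks and the N2 data blocks other than the M data blocks according to the Key corresponding to the data, the value of the data, or the identifier of the data;
    write the rearranged data into the P data blocks.
  14. The storage device according to any one of claims 11 to 13, characterized in that the storage device is applied to a file system that uses the data block as its unit of data storage.
  15. The storage device according to any one of claims 11 to 14, characterized in that, in the data block list of the target file, the data block information of the data blocks indicated by different data block IDs differs at least in part, the data block information including at least one of the following:
    the size of the data block, the data server (DataNode) on which the data block is located, and the file to which the data block belongs.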
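  16. A computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 5.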
PCT/CN2018/074288 2017-05-10 2018-01-26 合并文件的方法、存储装置、存储设备和存储介质 WO2018205689A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710326321.4 2017-05-10
CN201710326321.4A CN108874297A (zh) 2017-05-10 2017-05-10 合并文件的方法、存储装置、存储设备和存储介质

Publications (1)

Publication Number Publication Date
WO2018205689A1 true WO2018205689A1 (zh) 2018-11-15

Family

ID=64104277

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/074288 WO2018205689A1 (zh) 2017-05-10 2018-01-26 合并文件的方法、存储装置、存储设备和存储介质

Country Status (2)

Country Link
CN (1) CN108874297A (zh)
WO (1) WO2018205689A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108958659A (zh) * 2018-06-29 2018-12-07 郑州云海信息技术有限公司 一种分布式存储系统的小文件聚合方法、装置及介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988696B (zh) * 2019-12-18 2022-08-23 浙江宇视科技有限公司 文件整理方法、装置及相关设备
CN113032340B (zh) * 2019-12-24 2024-05-14 阿里巴巴集团控股有限公司 数据文件的合并方法、装置、存储介质及处理器
CN115421649B (zh) * 2022-08-02 2023-10-20 佳源科技股份有限公司 一种可索引、可扩展的参数文件分片存储系统及方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102508880A (zh) * 2011-10-18 2012-06-20 广东威创视讯科技股份有限公司 一种文件合并方法及分解方法
CN103914522A (zh) * 2014-03-20 2014-07-09 电子科技大学 一种应用于云存储重复数据删除的数据块合并方法
CN106294585A (zh) * 2016-07-28 2017-01-04 四川新环佳科技发展有限公司 一种云计算平台下的存储方法
CN106528763A (zh) * 2016-10-28 2017-03-22 北京海誉动想科技股份有限公司 两路及多路文件块合并的方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100347705C (zh) * 2004-12-24 2007-11-07 北京中星微电子有限公司 一种合并文件的方法
CN105243027A (zh) * 2015-09-24 2016-01-13 华为技术有限公司 在存储设备中存储数据的方法和存储控制器

Also Published As

Publication number Publication date
CN108874297A (zh) 2018-11-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18797927

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18797927

Country of ref document: EP

Kind code of ref document: A1