WO2020125630A1 - File reading - Google Patents

Publication number
WO2020125630A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
target
association relationship
files
association
Prior art date
Application number
PCT/CN2019/126003
Other languages
French (fr)
Chinese (zh)
Inventor
王勇
Original Assignee
新华三大数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 新华三大数据技术有限公司
Publication of WO2020125630A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Definitions

  • Hadoop, an open-source distributed system infrastructure, is generally adopted as the storage technology.
  • Each file stored in the Hadoop Distributed File System (HDFS) needs to correspond to a block, and the master node (NameNode) in HDFS establishes a mapping relationship between each file and its corresponding block.
  • FIG. 1-1 shows a flowchart of a file reading method according to an embodiment of the present disclosure
  • Figure 1-2 shows a schematic diagram of a possible application system architecture according to Embodiment 1 of the present disclosure
  • FIG. 2 shows a flowchart of a file reading method according to an embodiment of the present disclosure
  • FIG. 3 shows a flowchart of determining a first association relationship according to an embodiment of the present disclosure
  • FIG. 4 shows a flowchart of an associated file acquisition method according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a process for acquiring a file association relationship according to an embodiment of the present disclosure
  • FIG. 6 shows a block diagram of a file reading device according to an embodiment of the present disclosure
  • FIG. 7 shows a block diagram of a file reading device according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic diagram of a second determination module according to an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of a first determination module according to an embodiment of the present disclosure.
  • FIG. 10 shows a structural block diagram of a server according to an embodiment of the present disclosure.
  • HDFS is better suited to storing files with a large amount of data (for example, files larger than 64 MB or 128 MB), which can make full use of the storage resources of HDFS. If HDFS stores a large number of files smaller than 64 MB (such as pictures and documents of only 10 KB to 10 MB), these files are much smaller than the HDFS block size, and the many small files occupy more storage blocks, which reduces the storage resource utilization of HDFS. In addition, the more files are stored in HDFS, the more mapping relationships must be established and the more memory of the master node is occupied, which greatly reduces the efficiency of data access in HDFS.
  • Metadata information, which is used to describe data attributes, is a type of electronic directory. Metadata such as the tree-like directory structure, file attributes, and the mapping relationship between files and data blocks is usually stored in the NameNode, which causes a memory bottleneck in the NameNode.
  • reading a large number of files with a small amount of data will cause the client to frequently communicate with the NameNode node, which in turn will reduce the I/O performance of the NameNode.
  • the present disclosure proposes a file reading method to improve the efficiency of reading files through HDFS.
  • the file may be a file with a small amount of data or a file with a large amount of data, and the disclosure is not particularly limited.
  • FIG. 1-1 shows a flowchart of a file reading method according to an embodiment of the present disclosure.
  • FIG. 1-2 shows a schematic diagram of a system architecture of a file reading method according to an embodiment of the present disclosure.
  • the method shown in FIG. 1-1 can be applied to server 1 to enable server 1 to read files from HDFS 2.
  • the system may include server 1 and HDFS 2.
  • the server 1 may be a client server, and a user accesses the server 1 through the client, so that the server 1 uses the file reading method of the embodiment of the present disclosure to read files from the HDFS 2.
  • the system may include server 1, server 3, and HDFS 2.
  • the method can also be applied to other servers.
  • the user can call the resources of the server 3 through the server 1 to execute the method, thereby obtaining the target file and the associated file.
  • the method described in the present disclosure can also be applied to other processing devices (such as terminals) that can perform calculations, and the system architecture shown in FIGS. 1-2 is not intended to limit the present disclosure.
  • the method includes steps S110-S150, and the method is applied to a server as an example.
  • the description of each step is as follows.
  • Step S110 Receive a file reading request, where the file reading request includes the identifier of the target file to be read.
  • the file reading request may be a file reading instruction sent by the user through the client in the terminal.
  • the user may operate the client so that the client sends the file reading request, and the server receives the file reading request sent by the client.
  • the identification of the target file may be the unique identification information of the target file, which is used to uniquely determine the target file.
  • the unique identification information may be a hash value obtained by hashing information such as the name of the target file.
  • the identification of the target file may also be other information that is different from the unique identification information, which is used to identify a certain type of file or a certain range of files, for example, information such as date and category.
  • when the identification of the target file is such information, the reading of the file is a fuzzy reading.
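  • A minimal sketch of deriving a unique identifier by hashing the file name, as described above. The choice of SHA-256 and the function name `file_identifier` are assumptions made for illustration; the disclosure does not specify a hash algorithm.

```python
import hashlib


def file_identifier(file_name: str) -> str:
    """Return a hex digest used as the unique identifier of a file.

    SHA-256 is an assumed choice; the source only says "a hash value".
    """
    return hashlib.sha256(file_name.encode("utf-8")).hexdigest()


# The same name always maps to the same identifier.
photo_id = file_identifier("photo_0001.jpg")
```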
  • Step S120 Search, according to the target file identifier, the mapping relationship between sub-file identifiers and merged file identifiers included in the locally stored first index information for the target sub-file identifier that matches the target file identifier and for the corresponding target merged file identifier.
  • the merged file is stored in HDFS, and the subfiles in the merged file have an association relationship, that is, the merged file is formed by merging a plurality of subfiles with an association relationship.
  • the association relationship may be an access association relationship. For example, if file 2 is the next file accessed after file 1 is accessed, then file 2 and file 1 have an association relationship, and file 1 and file 2 may be merged into a merged file that is stored in HDFS.
  • the server may store the first index information in advance, and the process of creating the first index information will be described later.
  • the first index information may include the mapping relationship between the sub-file and the merged file.
  • the relationship between the sub-file and the merged file is called: the first mapping relationship.
  • the first mapping relationship may be expressed as a correspondence between the sub-file identification and the merged file identification. Through the first mapping relationship, the corresponding merged file may be found using the target file identification.
  • the first index information may also include the offset of the subfile in the merged file, and the size of the subfile.
  • the size of the sub-file may be the length or proportion that the sub-file occupies in the merged file, and the offset may be the starting position of the sub-file in the merged file.
  • Step S130 Search, according to the target merged file identifier, the mapping relationship between merged file identifiers and HDFS storage block identifiers included in the locally stored second index information for the target storage block identifier corresponding to the target merged file identifier.
  • the server may also store the second index information in advance, and the process of creating the second index information will be described later.
  • the second index information may include the mapping relationship between the merged file and the HDFS storage block.
  • the mapping relationship between the merged file and the storage block of the HDFS is called: a second mapping relationship.
  • the second mapping relationship may be expressed as a correspondence between the merged file identifier and the storage block identifier of HDFS.
  • through the second mapping relationship, the target merged file identifier can be used to look up the target storage block identifier of the target merged file.
  • the HDFS storage block identifier may include HDFS block address information.
  • the second mapping relationship may also be expressed as a correspondence between the identification of the merged file and the storage location of the merged file in HDFS, and the storage location of the merged file in HDFS may be found according to the second mapping relationship.
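  • The two lookups of steps S120 and S130 can be sketched as two in-memory dictionaries. This is only an illustration: the class `SubFileEntry`, the function `locate`, and all identifiers and values below are hypothetical, and the disclosure does not prescribe how the index information is stored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SubFileEntry:
    merged_file_id: str  # merged file that contains the sub-file
    offset: int          # starting position of the sub-file in the merged file
    size: int            # length of the sub-file in bytes


# First index information: sub-file identifier -> merged file, offset, size.
first_index = {
    "sub-a": SubFileEntry("merged-1", 0, 4096),
    "sub-b": SubFileEntry("merged-1", 4096, 10240),
}

# Second index information: merged file identifier -> HDFS storage block identifier.
second_index = {"merged-1": "block-0001"}


def locate(target_file_id: str) -> Optional[Tuple[str, str, int, int]]:
    """Resolve a target file to (block id, merged file id, offset, size)."""
    entry = first_index.get(target_file_id)            # step S120
    if entry is None:
        return None
    block_id = second_index.get(entry.merged_file_id)  # step S130
    if block_id is None:
        return None
    return block_id, entry.merged_file_id, entry.offset, entry.size
```

  • Both lookups are local, so no request reaches the NameNode until the file acquisition request of step S140 is sent.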
  • Step S140 Determine, according to a preset acquisition condition, the number of sub-files to be acquired that are associated with the target file, and send a file acquisition request to HDFS. The file acquisition request includes the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that HDFS finds the target merged file corresponding to the target merged file identifier in the target storage block corresponding to the target storage block identifier, and finds in the target merged file the target file and as many associated files as the number of sub-files.
  • After receiving the file acquisition request, HDFS acquires the target file and as many associated files as the number of sub-files, according to the target storage block identifier, target sub-file identifier, target merged file identifier, and number of sub-files included in the request. After HDFS finds the target file and the associated files, it sends them to the server.
  • HDFS can acquire, from the target merged file, as many sub-files (that is, associated files) as the number of sub-files, choosing those stored close to the target sub-file.
  • HDFS queries, through the NameNode, the metadata information corresponding to the target sub-file (that is, the target file), the target merged file, and the target storage block. After determining the target sub-file, HDFS determines through the NameNode the metadata information of the sub-files adjacent to the target sub-file in the target merged file, as many as the number of sub-files, and then obtains the target file and those associated files from the DataNode and sends them to the server.
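  • A small sketch of selecting the target sub-file together with its neighbours inside a merged file. It assumes that "adjacent" means the sub-files stored immediately after the target, which fits the access-order layout of the merged file; the function name and the list representation are hypothetical.

```python
from typing import List


def select_with_neighbors(sub_file_ids: List[str], target_id: str, count: int) -> List[str]:
    """Return the target sub-file followed by up to `count` sub-files stored after it.

    `sub_file_ids` lists the sub-files of one merged file in storage order.
    """
    start = sub_file_ids.index(target_id)
    # Slicing never runs past the end, so fewer files are returned near the tail.
    return sub_file_ids[start:start + 1 + count]


stored_order = ["file1", "file7", "file5", "file3"]
selection = select_with_neighbors(stored_order, "file7", 2)
```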
  • before step S140, multiple target merged file identifiers may be matched according to the first index information, and multiple target storage block identifiers may then be matched according to the second index information. In this case, one target merged file identifier and its corresponding target storage block identifier can be selected from them.
  • step S140 is then executed, that is, a file acquisition request is sent to obtain the target file and as many associated files as the number of sub-files.
  • Step S150 Receive and cache the target file and associated file returned by HDFS.
  • the target file and associated files returned by HDFS can be cached in the server's cache space or other storage space.
  • the server can directly obtain the file from the cache, thereby reducing the interaction between the server and HDFS, saving HDFS resources, and improving HDFS access efficiency.
  • the embodiment of the present disclosure stores the merged file in HDFS, and records the mapping relationship between the merged file and each sub-file in the first index information, and the mapping relationship between the merged file and the storage block of HDFS in the second index information.
  • the method described in the embodiments of the present disclosure can be used to quickly obtain the target file and the associated file by using the target file identifier, the first index information, and the second index information, and store it in the cache.
  • the method provided by the embodiment of the present disclosure can obtain the related files that may be accessed at the next moment while acquiring the target files, and store the target files and related files in the cache.
  • these associated files stored in the cache can be queried first and hit with a high probability, which reduces the interaction with HDFS, reduces the resource usage of HDFS, and improves both the access efficiency of HDFS and the efficiency with which HDFS processes a large number of files.
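  • The caching behaviour of step S150 can be sketched as a read-through cache: a miss fetches the target file together with its associated files, and later reads of those files are served locally. The class `ReadThroughCache`, the `fetch` callback, and the stand-in `fake_fetch` are hypothetical and do not represent a real HDFS client API.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve reads from a local cache and fall back to a remote fetch on a miss."""

    def __init__(self, fetch: Callable[[str], Dict[str, bytes]]) -> None:
        self._fetch = fetch            # returns the target file plus associated files
        self._cache: Dict[str, bytes] = {}
        self.remote_calls = 0          # number of requests that reached the backend

    def read(self, file_id: str) -> bytes:
        if file_id in self._cache:     # cache hit: no interaction with the backend
            return self._cache[file_id]
        self.remote_calls += 1
        self._cache.update(self._fetch(file_id))
        return self._cache[file_id]


def fake_fetch(file_id: str) -> Dict[str, bytes]:
    """Stand-in for the HDFS request; always returns one merged group of files."""
    return {"file1": b"one", "file7": b"seven", "file5": b"five"}


cache = ReadThroughCache(fake_fetch)
first = cache.read("file1")    # miss: one remote call, three files cached
second = cache.read("file7")   # hit: served from the cache
```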
  • the files stored in HDFS are composed of multiple files with access association, so the advantages of HDFS sequential file access can be used.
  • the preset acquisition condition may include:
  • $M \times t_1 \le t_m - t_h$, where $M$ is the number of sub-files, $t_1$ is the time it takes to read one sub-file, $t_m$ is the user's maximum waiting time, and $t_h$ is the return time for obtaining HDFS data.
  • the optimal number of sub-files can be determined from the user's maximum waiting time, the HDFS data return time, and the time spent reading one sub-file, which keeps the response within the user's maximum waiting time while improving read efficiency.
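  • A minimal sketch of solving the acquisition condition for the largest number of sub-files, assuming the condition reads M × t1 ≤ tm − th. The function name and the use of integer milliseconds are assumptions for illustration.

```python
def max_sub_files(t1_ms: int, tm_ms: int, th_ms: int) -> int:
    """Largest M with M * t1 <= tm - th, using integer milliseconds.

    t1_ms: time to read one sub-file
    tm_ms: maximum time the user is willing to wait
    th_ms: time HDFS needs to return data
    """
    if t1_ms <= 0:
        raise ValueError("t1_ms must be positive")
    # Integer division avoids floating-point rounding; a negative budget yields 0.
    return max(0, (tm_ms - th_ms) // t1_ms)


budgeted = max_sub_files(t1_ms=100, tm_ms=2000, th_ms=500)
```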
  • FIG. 2 shows a flowchart of a file reading method according to an embodiment of the present disclosure, where steps S210-S260 mainly describe a file merging process, which may be performed before the foregoing S110.
  • Step S210 Acquire historical access logs of multiple files.
  • the historical access log includes the access time and access times of multiple files.
  • the acquisition time of the historical access log may be limited, for example, the historical access log may be acquired within a certain period of time.
  • the format of the historical access log may be as shown in Table 1 below.
  • the acquired historical access logs include the accessed time and the number of accessed times of files 1, 2, and 3.
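  • A sketch of extracting the counts used by the later steps from a historical access log. It assumes the log can be reduced to a time-ordered sequence of accessed file names; that format, the function name, and the sample sequence are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_accesses(access_sequence: List[str]) -> Tuple[Dict[str, int], Dict[Tuple[str, str], int]]:
    """Count how often each file is accessed and how often one file directly follows another."""
    access_counts = Counter(access_sequence)
    # Pair every access with the access that immediately follows it.
    follow_counts = Counter(zip(access_sequence, access_sequence[1:]))
    return dict(access_counts), dict(follow_counts)


log = ["file1", "file2", "file3", "file1", "file2"]
access_counts, follow_counts = count_accesses(log)
```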
  • Step S220 For each file in the plurality of files, determine, according to the access times and access counts of the plurality of files, at least one file among the other files that is accessed after the file is accessed and is therefore associated with the file, and determine the multiple first association relationships of the file.
  • the first association relationship is used to indicate that the file is associated with any file in at least one file.
  • the first association relationship is expressed in the form (File A, File B), which indicates that File B is accessed after File A is accessed, that is, File B is the next file the user accesses after accessing File A.
  • the first association relationships of file 1 may be (file 1, file 2) and (file 1, file 3), the first association relationship of file 2 may be (file 2, file 3), and the first association relationship of file 3 may be (file 3, file 1).
  • Step S230 According to the first association relationships of each file in the plurality of files, acquire the first file, that is, the file with the largest number of first association relationships, and according to the multiple first association relationships of the first file, determine among the plurality of files at least one associated file that is accessed in sequence after the first file is accessed.
  • comparing the numbers of first association relationships of files 1 to 3 shows that the first file, the one with the largest number of first association relationships, is file 1. It can then be determined that the files accessed in sequence after file 1 is accessed are file 2 and file 3.
  • Step S240 Store the first file and at least one associated file in the first merged file.
  • the first file and at least one associated file may be combined to obtain a combined file.
  • the first file and the at least one associated file may be stored sequentially and merged into the first merged file in the order of being accessed.
  • sequential succession means that the storage location of each file is consecutive.
  • the first file and the at least one associated file may be sequentially and successively stored in the first merged file in the order of being accessed.
  • a storage space may be opened in advance as the storage space of the first merged file.
  • the space indicated by the addresses 0000H to FFFFH may be used as the storage space of the first merged file, and then files 1-3 are stored to 0000H to 0FFFH, 1000H to EFFFH, and F000H to FFFFH, respectively.
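  • A sketch of laying out sub-files back to back inside a merged file space. The sizes used here (0x1000, 0xE000, and 0x1000 bytes) are hypothetical values chosen so that three files exactly fill the range 0000H to FFFFH; the function name is also an assumption.

```python
from typing import List, Tuple


def layout(files: List[Tuple[str, int]], base: int = 0) -> List[Tuple[str, int, int]]:
    """Place files consecutively and return (name, first address, last address)."""
    placed = []
    offset = base
    for name, size in files:
        placed.append((name, offset, offset + size - 1))
        offset += size  # the next file starts right after this one
    return placed


ranges = layout([("file1", 0x1000), ("file2", 0xE000), ("file3", 0x1000)])
```

  • The starting address and size of each entry are the same offset and size values that the first index information records for a sub-file.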
  • Step S250 in the first association relationship of each file in the plurality of files, delete the first association relationship applied when determining at least one associated file, and obtain the remaining first association relationship; according to the remaining first association relationship, Obtain the new first file with the largest number of first associations.
  • for example, if the first association relationships applied are (file 1, file 2) and (file 2, file 3), the remaining first association relationships are (file 1, file 3) and (file 3, file 1), and the acquisition of the new first file with the largest number of first association relationships is performed on them. Since file 1, with (file 1, file 3), and file 3, with (file 3, file 1), have the same number of first association relationships, either file can be selected arbitrarily as the new first file, for example file 3.
  • Step S260 Among the plurality of files, repeat the process of determining, according to the multiple first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and that at least one associated file in a new first merged file, until no remaining first association relationship can be obtained.
  • after file 3 and file 1, and then file 1 and file 3, are merged through step S260, no first association relationship remains, and the process ends.
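  • The merging loop of steps S230 to S260 can be sketched as a greedy procedure over (predecessor, successor) pairs. This is one possible reading, not the definitive procedure: the function name is hypothetical, ties are broken by first occurrence (the text allows an arbitrary choice), and a file already in the current chain is not appended again.

```python
from collections import Counter
from typing import List, Tuple


def merge_by_association(relations: List[Tuple[str, str]]) -> List[List[str]]:
    """Group files into merged-file chains by repeatedly following association pairs."""
    # Ignore self-pairs so that every round consumes at least one relation.
    remaining = [(a, b) for a, b in relations if a != b]
    merged_files = []
    while remaining:
        counts = Counter(pred for pred, _ in remaining)
        first = max(counts, key=counts.get)   # file with the most remaining relations
        chain = [first]
        current = first
        while True:
            step = next((r for r in remaining
                         if r[0] == current and r[1] not in chain), None)
            if step is None:
                break
            remaining.remove(step)            # this relation has now been applied
            chain.append(step[1])
            current = step[1]
        merged_files.append(chain)
    return merged_files


example = [("file1", "file2"), ("file1", "file3"), ("file2", "file3"), ("file3", "file1")]
chains = merge_by_association(example)
```

  • With these four relations the first chain is file 1, file 2, file 3, and the two leftover relations each produce a two-file chain; which of the tied files starts first depends only on the tie-breaking rule.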
  • the embodiments provided by the present disclosure can merge files that have relevance among multiple files into one merged file, and the merged file includes multiple sub-files, and each sub-file in the merged file has relevance.
  • the association relationship may include identification information of the associated file.
  • if sub-file A and sub-file B have a file association relationship, the association relationship may be (sub-file A, sub-file B); if sub-file A, sub-file B, sub-file C, ..., sub-file N have a file association relationship, the association relationship may be (sub-file A, sub-file B, sub-file C, ..., sub-file N).
  • other forms may be used to record the association relationship of multiple files, which is not limited herein.
  • the method for determining the first association relationship is taken as an example to introduce the method for determining the association relationship.
  • FIG. 3 illustrates a flowchart of determining a first association relationship according to an embodiment of the present disclosure.
  • the first association relationship of the file may be determined in the following manner.
  • Step S310 according to the number of times the second file is accessed and the number of times the third file is accessed after the second file is accessed, obtain a first probability that the third file is accessed after the second file is accessed.
  • the second file and the third file are any two different files among the multiple files.
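  • the first probability can be obtained by the following formula: $P(B \mid A) = N_{AB}/N_A$, where $P(B \mid A)$ is the first probability, $N_{AB}$ is the number of times the third file B is accessed after the second file A is accessed, and $N_A$ is the number of times the second file A is accessed.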
  • Step S320 according to the number of times the third file is accessed after the second file is accessed and the total number of times all files in the historical access log are accessed, obtain a second probability that both the second file and the third file are accessed.
  • Step S330 Acquire the second file according to the total number of times all files in the historical access log are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed and the number of times the third file is accessed The influence value of being accessed on the third file being accessed.
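  • the influence value is obtained by the following formula: $I(B \mid A) = (N \times N_{AB})/(N_A \times N_B)$, where $I(B \mid A)$ is the influence value of the second file A being accessed on the third file B being accessed, $N$ is the total number of times all files in the historical access log are accessed, $N_{AB}$ is the number of times the third file B is accessed after the second file A is accessed, $N_A$ is the number of times the second file A is accessed, and $N_B$ is the number of times the third file B is accessed.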
  • Step S340 When the first probability is greater than the first probability threshold, the second probability is greater than the second probability threshold, and the influence value is greater than the influence threshold, it is determined that the second file and the third file have a first association relationship.
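  • that is, $(A, B)$ holds when $P(B \mid A) > \mathrm{min\_P}(B \mid A)$, $P(AB) > \mathrm{min\_P}(AB)$, and $I(B \mid A) > \mathrm{min\_I}(B \mid A)$, where $\mathrm{min\_P}(B \mid A)$ is the first probability threshold, $\mathrm{min\_P}(AB)$ is the second probability threshold, $\mathrm{min\_I}(B \mid A)$ is the influence threshold, and $(A, B)$ is the first association relationship between the second file A and the third file B.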
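  • The test of steps S310 to S340 can be sketched as follows. The first probability N_AB / N_A and the influence value (N × N_AB) / (N_A × N_B) follow the steps above; the second probability is assumed here to be N_AB / N, since the text names only its two inputs. Function names, counts, and thresholds are hypothetical.

```python
from typing import Tuple


def association_metrics(n_total: int, n_a: int, n_b: int, n_ab: int) -> Tuple[float, float, float]:
    """Return (first probability, second probability, influence value)."""
    first_probability = n_ab / n_a                  # B is accessed after A
    second_probability = n_ab / n_total             # assumed form: A then B among all accesses
    influence = (n_total * n_ab) / (n_a * n_b)      # how much A raises the chance of B
    return first_probability, second_probability, influence


def has_first_association(n_total: int, n_a: int, n_b: int, n_ab: int,
                          min_first: float, min_second: float, min_influence: float) -> bool:
    """Step S340: all three values must exceed their thresholds."""
    p_first, p_second, influence = association_metrics(n_total, n_a, n_b, n_ab)
    return p_first > min_first and p_second > min_second and influence > min_influence


metrics = association_metrics(n_total=100, n_a=20, n_b=25, n_ab=10)
```

  • An influence value above 1 means that B is accessed after A more often than B's overall access frequency alone would suggest.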
  • the file set at this time includes A, B, C, D, F.
  • the file set includes A, B, C, D, and F
  • the file set includes A, B, C, F.
  • the influence value of a certain file and other files in the file set meeting the second probability threshold is greater than the influence threshold.
  • for example, when the file set includes A, B, C, and F, if the influence value of file A on file C and the influence value of file C on file F are greater than the influence threshold, it can be determined that file A and file C, and file C and file F, have a first association relationship. The first association relationship set at this time may include (file A, file C) and (file C, file F); correspondingly, the file set at this time includes the three files A, C, and F.
  • the above process of obtaining the first association relationship set and the file collection that meets the association relationship in the first association relationship set is exemplary, and the number of files in the example is not used to limit the present disclosure.
  • the first association relationship represents the association between two files. If only the two files of a first association relationship are merged, then since file sizes may be between 10 KB and 10 MB, the merged file will still be smaller than the HDFS block size (for example, 64 MB) and the number of merged files will still be huge, which does not minimize the number of interactions with HDFS or the memory usage of the master node in HDFS. Therefore, the association among as many files as possible must be determined so that as many files as possible can be merged. Please refer to FIG. 4.
  • FIG. 4 shows a flowchart of a method for obtaining an associated file according to an embodiment of the present disclosure. This embodiment can determine the association among as many files as possible so as to merge as many files as possible.
  • one of the two associated files recorded in the first association relationship is a predecessor file, the other is a successor file, and the successor file is a file that is accessed after accessing the predecessor file.
  • the method shown in FIG. 4 will be described below with reference to FIG. 5.
  • Step S231 Acquire a first association set containing the first association of each of the files.
  • the first association relationship set 250 includes multiple first association relationships of each file, for example, the first association relationship (file1, file7) of file file1, the first association relationship (file3, file5) of file file3, and so on.
  • Each first association relationship includes a predecessor file and a successor file.
  • the corresponding predecessor file is file1 and the successor file is file7.
  • Step S232 In the first association relationship set, obtain the first target association relationship set, that is, the first association relationships whose predecessor file is the first file (the predecessor file that occurs most often), and obtain the second association relationship from the first target association relationship set.
  • the second association relationship is the first association relationship in the first target association relationship set whose successor file is accessed most often.
  • the first association relationships in the first association relationship set 250 whose predecessor file is the first file, the predecessor file with the highest number of occurrences, are obtained to form the first target association relationship set 260. Then, the first association relationship whose successor file is accessed most often (that is, the one with the largest first probability) is selected from the first target association relationship set 260. In the first target association relationship set 260, (file1, file7) has the largest first probability, so (file1, file7) is the second association relationship.
  • Step S233 If the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determine from the third association relationships the target association relationship whose successor file occurs most often, and determine the files in the target association relationship as associated files.
  • the successor file file7 of the second association relationship (file1, file7) is taken as a predecessor file, and the first association relationships in the first association relationship set 250 that have file7 as their predecessor file are obtained as the third association relationship 270.
  • the third association relationship 270 may be a set.
  • the third association relationship 270 includes two first association relationships (file7, file5), (file7, file3) with file7 as a predecessor file.
  • file5, as a successor file, is accessed most often (that is, its first probability is the largest), so the first association relationship (file7, file5) is used as the target association relationship, and the files file7 and file5 in the target association relationship are used as associated files.
  • the successor file file5 of the first association relationship (file7, file5) may be merged (recorded) into the second association relationship (file1, file7) to generate an updated second association relationship (file1, file7, file5), and the first association relationship (file1, file7) is deleted from the first association relationship set.
  • once the first association relationship (file7, file5) has been merged into the second association relationship (file1, file7, file5), it may be considered to have been deleted.
  • if the first association relationship (file7, file5) is not covered by the second association relationship (file1, file7, file5), it may be deleted from the first association relationship set.
  • Step S234 If the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship as the associated file.
  • Step S235 Delete the target association relationship in the first association relationship set to obtain a new first association relationship set.
  • Step S236 Repeat the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:
  • the new second association relationship is the first association relationship in the new first target association relationship set whose successor file is accessed most often;
  • the new target association relationship whose successor file occurs most often is determined from the new third association relationships, and the files in the new target association relationship are determined as associated files; and the new target association relationship is deleted to obtain the new first association relationship set.
  • file5 (here a successor file) can in turn be used as a predecessor file to check whether the first association relationship set 250 contains a first association relationship with file5 as the predecessor file. If none exists, file7 and file5 are finally used as the associated files of the first file file1; if one exists, steps S231 to S234 described above are followed to continue obtaining associated files.
  • the associated files of the first file file1 include file file7 and file file5.
  • the determined target association relationships may be deleted in sequence from the first association relationship set until the set is empty, at which point the determination of all associated files of the first file is complete.
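  • The chain-building procedure of steps S231 to S236 can be sketched as below, assuming each first association relationship carries its first probability. The function name and the probability values are hypothetical, and skipping files already in the chain is an added safeguard against cycles.

```python
from collections import Counter
from typing import Dict, List, Tuple


def associated_chain(relations: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the first file followed by its associated files, in access order.

    `relations` maps (predecessor, successor) to the first probability.
    """
    remaining = dict(relations)
    counts = Counter(pred for pred, _ in remaining)
    first_file = max(counts, key=counts.get)   # most frequent predecessor
    chain = [first_file]
    current = first_file
    while True:
        # Relations that continue from the current file and add a new file.
        candidates = {pair: p for pair, p in remaining.items()
                      if pair[0] == current and pair[1] not in chain}
        if not candidates:
            break
        best = max(candidates, key=candidates.get)   # largest first probability
        del remaining[best]                          # delete the applied relation
        chain.append(best[1])
        current = best[1]
    return chain


relation_set = {
    ("file1", "file7"): 0.6,
    ("file1", "file2"): 0.3,
    ("file3", "file5"): 0.4,
    ("file7", "file5"): 0.7,
    ("file7", "file3"): 0.2,
}
chain = associated_chain(relation_set)
```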
  • Embodiments provided by the present disclosure can use the first association relationships in the first association relationship set to obtain as many files associated with the first file as possible. After the associated files of the first file are obtained, the first file and the associated files are merged to obtain a merged file, and the merged file can meet the storage requirements of HDFS to the greatest extent possible.
  • the method may further include:
  • the first merged file may be stored in a pre-established merged file space in HDFS, and the size of the merged file space may be an integer multiple of the "block" size in HDFS. For example, when the "block" size is 64 MB, the size of the preset merged file space can be set to 64 MB, 128 MB, 256 MB, or 512 MB.
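  • A small sketch of choosing a merged file space that is an integer multiple of the block size. The function name and the round-up policy are assumptions; the disclosure only requires the space to be a multiple of the block size.

```python
MB = 1024 * 1024


def merged_space_size(total_bytes: int, block_size: int = 64 * MB) -> int:
    """Smallest multiple of `block_size` that holds `total_bytes` (at least one block)."""
    blocks = max(1, -(-total_bytes // block_size))  # ceiling division
    return blocks * block_size


space = merged_space_size(100 * MB)
```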
  • the first index information and the second index information may be stored in a local storage system to facilitate subsequent retrieval.
  • an HDFS-based file access mechanism inevitably consumes a large amount of memory on the HDFS NameNode.
  • the number of interactions between the client and the NameNode equals the number of files to be acquired. In this case, HDFS performance is reduced and file access efficiency is low.
  • the server when the server requests to obtain the target file required by the user, it also requests to obtain at least one associated file associated with the target file, and sends the acquired target file and associated file to the cache.
  • the server can match the files in the cache against the target file identifier in the file read request. Since the files in the cache are associated by access, they are likely to include the target file of this file read request. This not only improves the file reading speed and hit rate, but also reduces the memory usage of the NameNode node, reduces the number of interactions between the client and the NameNode node, and improves the performance of the system.
  • multiple associated files can be merged into a merged file to conform to the mechanism of HDFS storage and merged files, thereby improving the storage efficiency of files.
  • the use of HDFS memory and other resources is also reduced, improving the performance of the system.
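The cache-first read described above can be sketched as follows. This is a minimal illustration under stated assumptions: `PrefetchingReader`, `fetch`, `prefetch_count`, and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for the round trip to HDFS rather than any real HDFS client API.

```python
from typing import Callable, Dict, List


class PrefetchingReader:
    """Serve reads from a local cache; on a miss, fetch the target plus associated files."""

    def __init__(self, fetch: Callable[[str, int], Dict[str, bytes]], prefetch_count: int = 2):
        self._fetch = fetch                    # stand-in for the request sent to HDFS
        self._prefetch_count = prefetch_count  # number of associated sub-files to request
        self._cache: Dict[str, bytes] = {}
        self.backend_calls = 0                 # counts round trips to the backend

    def read(self, file_id: str) -> bytes:
        if file_id in self._cache:
            return self._cache[file_id]        # cache hit: no backend interaction
        self.backend_calls += 1
        fetched = self._fetch(file_id, self._prefetch_count)
        self._cache.update(fetched)            # cache the target and its associated files
        return fetched[file_id]


# Hypothetical backend: one merged file holding three access-associated sub-files.
MERGED: List[str] = ["file1", "file7", "file5"]


def fake_fetch(file_id: str, count: int) -> Dict[str, bytes]:
    start = MERGED.index(file_id)
    wanted = MERGED[start:start + 1 + count]
    return {name: name.encode() for name in wanted}


reader = PrefetchingReader(fake_fetch, prefetch_count=2)
for name in ("file1", "file7", "file5"):
    reader.read(name)
print(reader.backend_calls)
```

In this toy run the three reads cost a single backend round trip, because the two associated files arrive together with the first one.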
  • FIG. 6 shows a block diagram of a file reading device according to an embodiment of the present disclosure.
  • the device includes:
  • the receiving module 10 is configured to receive a file reading request, where the file reading request includes an identification of the target file to be read;
  • the first searching module 20 is connected to the receiving module 10, and is used for searching, according to the identifier of the target file, the mapping relationship between sub-file identifiers and merged file identifiers included in the locally stored first index information for the target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier;
  • the second search module 30 is connected to the first search module 20, and is used for searching, according to the target merged file identifier, the mapping relationship between merged file identifiers and HDFS storage block identifiers included in the locally stored second index information for the target storage block identifier corresponding to the target merged file identifier;
  • the sending module 40 is connected to the second searching module 30, and is configured to determine, according to a preset acquisition condition, the number of sub-files to be acquired that are associated with the target file, and to send a file acquisition request to the HDFS. The file acquisition request includes the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files;
  • the cache module 50 is connected to the sending module 40 and is used to receive and cache the target file and associated file returned by the HDFS.
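The two index lookups and the resulting file acquisition request handled by the modules above can be sketched as follows. The dataclass `FileAcquireRequest`, the function `build_request`, and the plain dictionaries used for the two indexes are illustrative assumptions; the source does not specify a concrete data layout.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FileAcquireRequest:
    block_id: str       # target storage block identifier
    subfile_id: str     # target sub-file identifier
    merged_id: str      # target merged file identifier
    subfile_count: int  # number of associated sub-files to return


def build_request(target_id: str,
                  first_index: Dict[str, str],
                  second_index: Dict[str, str],
                  subfile_count: int) -> FileAcquireRequest:
    """Resolve sub-file -> merged file -> storage block, then assemble the request."""
    merged_id = first_index[target_id]   # first index: sub-file id -> merged file id
    block_id = second_index[merged_id]   # second index: merged file id -> block id
    return FileAcquireRequest(block_id, target_id, merged_id, subfile_count)


# Hypothetical index contents.
first_index = {"file1": "merged-A", "file7": "merged-A"}
second_index = {"merged-A": "blk_001"}
print(build_request("file1", first_index, second_index, subfile_count=2))
```

A missing identifier raises `KeyError` here; how an unknown target file should be handled is not covered by this sketch.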
  • file reading device is a device item corresponding to the foregoing file reading method.
  • the device described in the present disclosure obtains the requested file together with other files related to it, and stores these files in the cache. When the next file read request sent by the terminal is received, the files stored in the cache can be searched first, which reduces interaction with HDFS, thereby reducing the resource usage of HDFS and improving the efficiency of HDFS in processing a large number of files.
  • FIG. 7 shows a block diagram of a file reading device according to an embodiment of the present disclosure.
  • the device further includes:
  • the first obtaining module 61 is configured to obtain historical access logs of multiple files, where the historical access logs include the accessed time and the number of accessed times of multiple files;
  • the first determining module 62 is connected to the first obtaining module 61, and is used for, for each file of the plurality of files, determining, according to the access times and access counts of the plurality of files, at least one file among the other files that is access-associated with the file after the file is accessed, and determining a plurality of first association relationships of the file, where a first association relationship indicates that the file is access-associated with one of the at least one file;
  • the second determination module 63 is connected to the first determination module 62, and is configured to obtain, according to the first association relationships of each file in the plurality of files, the first file with the largest number of first association relationships, and to determine, among the multiple files and according to the multiple first association relationships of the first file, at least one associated file that is sequentially accessed after the first file is accessed;
  • the storage module 64 is connected to the second determination module 63 and is used to store the first file and at least one associated file in the first merged file.
  • the second acquisition module 71 is connected to the storage module 64, and is used to delete, from the first association relationships of each file in the plurality of files, the first association relationships that were used to determine the at least one associated file, so as to obtain the remaining first association relationships, and to obtain, according to the remaining first association relationships, the new first file with the largest number of first association relationships;
  • a third determination module 72, connected to the second acquisition module 71, is used to trigger the second determination module to repeat the process of determining, among the multiple files and according to the multiple first association relationships of the new first file, at least one associated file that is sequentially accessed after the new first file is accessed, and storing the new first file and that at least one associated file in a new first merged file, until the second acquisition module can no longer acquire any remaining first association relationship.
  • a sending and receiving module 81 connected to the storage module 64, for sending the first merged file to the HDFS, and receiving the first storage block identifier returned by the HDFS that stores the first merged file;
  • an index creation module 82, connected to the sending and receiving module 81, is used to create first index information including the mapping relationship between the first file identifier and the first merged file identifier, and second index information including the mapping relationship between the first merged file identifier and the first storage block identifier.
  • the reading module 90, connected to the cache module 50, is used to read the file associated with the target file from the cache if the next file read request received includes a file associated with the target file and that file is stored in the cache.
  • FIG. 8 illustrates a schematic diagram of a second determination module according to an embodiment of the present disclosure.
  • one of the two associated files recorded in a first association relationship is the predecessor file and the other is the successor file, the successor file being the file that is accessed after the predecessor file is accessed.
  • the second determining module 63 includes:
  • a first association relationship acquisition sub-module 631 configured to obtain a first association relationship set including the first association relationship of each file in the plurality of files
  • a second association relationship acquisition sub-module 632, connected to the first association relationship acquisition sub-module 631, is used to obtain, in the first association relationship set, the first target association relationship set in which the first file appears most often as the predecessor file, and to obtain, in the first target association relationship set, a second association relationship, where the second association relationship is the first association relationship in the first target association relationship set whose successor file is accessed the most times;
  • the first association file determination submodule 633 is connected to the second association relationship acquisition submodule 632, and is used for, when the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determining from the third association relationships the target association relationship whose successor file occurs the most times, and determining the files in the target association relationship as associated files;
  • the second association file determination sub-module 634 is connected to the second association relationship acquisition sub-module 632, and is used for, when the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determining the successor file of the second association relationship as the associated file.
  • a deletion submodule 635 configured to delete the target association relationship in the first association set to obtain a new first association set
  • a repeat determination submodule 636, connected to the deletion submodule 635, is used for repeatedly triggering the second association relationship acquisition submodule and the first association file determination submodule to perform the following operations, until the second association file determination submodule determines that the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship: in the new first association relationship set, obtain a new first target association relationship set in which the first file appears most often as the predecessor file, and in the new first target association relationship set, obtain a new second association relationship, where the new second association relationship is the first association relationship in the new first target association relationship set whose successor file is accessed the most times; determine the files in the new target association relationship as associated files; and delete the new target association relationship to obtain the new first association relationship set.
  • FIG. 9 illustrates a schematic diagram of a first determination module according to an embodiment of the present disclosure.
  • the first determining module 62 includes:
  • the first probability obtaining submodule 621 is configured to obtain, according to the number of times the second file is accessed and the number of times the third file is accessed after the second file is accessed, a first probability that the third file is accessed after the second file is accessed, wherein the second file and the third file are any two different files in the plurality of files;
  • the second probability obtaining sub-module 622 is used to obtain, according to the number of times the third file is accessed and the total number of times all files in the historical access log are accessed, a second probability that the third file is accessed among all files;
  • the influence value obtaining sub-module 623 is used to obtain, according to the total number of times all files in the historical access log are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, the influence value of the second file being accessed on the third file being accessed;
  • the first determination submodule 624 is connected to the first probability acquisition submodule 621, the second probability acquisition submodule 622, and the influence value acquisition submodule 623, and is used to determine, when the first probability is greater than the first probability threshold, the second probability is greater than the second probability threshold, and the influence value is greater than the influence threshold, that the second file and the third file have the first association relationship.
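The three quantities used by the first determination module can be illustrated with the sketch below. The formulas are assumptions: this section names the inputs of each quantity but does not give the formulas, so the sketch borrows the familiar association-rule measures (a conditional frequency for the first probability, an overall frequency for the second, and their ratio for the influence value). The function names and the sample counts are hypothetical.

```python
from typing import Tuple


def association_metrics(n_total: int, n_second: int, n_third: int,
                        n_third_after_second: int) -> Tuple[float, float, float]:
    """Return (first probability, second probability, influence value).

    Assumed formulas, not taken from the source text:
      first  = count(third accessed after second) / count(second accessed)
      second = count(third accessed) / count(all accesses in the log)
      influence = first / second
    """
    if min(n_total, n_second, n_third) <= 0:
        raise ValueError("access counts must be positive")
    first = n_third_after_second / n_second
    second = n_third / n_total
    return first, second, first / second


def has_first_association(metrics: Tuple[float, float, float],
                          p1_threshold: float, p2_threshold: float,
                          influence_threshold: float) -> bool:
    """All three values must exceed their thresholds, as stated in the text above."""
    first, second, influence = metrics
    return first > p1_threshold and second > p2_threshold and influence > influence_threshold


# Hypothetical log: 100 accesses in total, the second file accessed 20 times,
# the third file 25 times, and the third file accessed right after the second 15 times.
m = association_metrics(100, 20, 25, 15)
print(m, has_first_association(m, 0.5, 0.1, 1.0))
```

Under these assumed formulas, an influence value above 1 would mean the third file is accessed more often after the second file than it is in general.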
  • an embodiment of the present disclosure also provides a server 900.
  • the server 900 includes a processor 901, a machine-readable storage medium 902, and a transceiver 903. The machine-readable storage medium 902 stores machine-executable instructions that can be executed by the processor 901 and the transceiver 903, and the processor 901, the transceiver 903, and the machine-readable storage medium 902 can communicate via the system bus 904.
  • the machine-executable instructions cause the transceiver 903 to receive a file reading request and send the file reading request to the processor 901, where the file reading request includes the identifier of the target file to be read;
  • the machine-executable instructions cause the processor 901 to:
  • according to the identifier of the target file, search the mapping relationship between sub-file identifiers and merged file identifiers included in the locally stored first index information for the target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier; wherein the merged file is stored in HDFS, and the sub-files in the merged file are associated;
  • according to the target merged file identifier, search the mapping relationship between merged file identifiers and HDFS storage block identifiers included in the locally stored second index information for the target storage block identifier corresponding to the target merged file identifier;
  • the machine-executable instructions also cause the transceiver 903 to:
  • the file acquisition request includes the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files;
  • the machine-executable instructions also cause the processor 901 to receive and cache the target file and associated files that are returned by the HDFS and forwarded by the transceiver 903.
  • machine executable instructions cause the processor 901 to:
  • according to the first association relationships of each file in the plurality of files, obtain the first file with the largest number of first association relationships, and determine, among the plurality of files and according to the plurality of first association relationships of the first file, at least one associated file that is sequentially accessed after the first file is accessed;
  • machine executable instructions cause the processor 901 to:
  • one of the two associated files recorded in the first association relationship is a predecessor file, the other is a successor file, and the successor file is a file that is accessed after accessing the predecessor file;
  • the machine-executable instructions cause the processor 901 to:
  • in the first association relationship set, obtain the first target association relationship set in which the first file appears most often as the predecessor file, and in the first target association relationship set, obtain a second association relationship;
  • the second association relationship is: the first association relationship in the first target association relationship set whose successor file is accessed the most times;
  • the successor file of the second association relationship is determined as the associated file.
  • machine executable instructions cause the processor 901 to:
  • the new second association relationship is: the first association relationship in the new first target association relationship set whose successor file is accessed the most times;
  • a new target association relationship with the largest number of successor file occurrences is determined from the new third association relationships, the files in the new target association relationship are determined as associated files, and the new target association relationship is deleted to obtain the new first association relationship set.
  • machine executable instructions cause the processor 901 to:
  • the multiple first associations of the file are determined in the following ways:
  • the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, to obtain the influence value of the second file being accessed on the third file being accessed;
  • the second probability is greater than the second probability threshold, and the influence value is greater than the influence threshold, it is determined that the second file and the third file have the first association relationship.
  • the machine-executable instructions further cause the transceiver 903 to send the first merged file to the HDFS, receive the first storage block identifier, returned by the HDFS, of the storage block that stores the first merged file, and send the first storage block identifier of the first merged file to the processor 901;
  • the machine-executable instructions further cause the processor 901 to receive the first storage block identifier of the first merged file, and create first index information including the mapping relationship between the first file identifier and the first merged file identifier And second index information including the mapping relationship between the first merged file identifier and the first storage block identifier.
  • machine executable instructions cause the processor 901 to:
  • if the next file read request received includes a file associated with the target file, and the file associated with the target file is stored in the cache, read the file associated with the target file from the cache.
  • the machine-readable storage medium 902 mentioned herein may be any electronic, magnetic, optical, or other physical storage system, and may contain or store information, such as executable instructions, data, and so on.
  • the machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disk (such as an optical disc or DVD), or a similar storage medium, or a combination thereof.
  • the embodiments of the present disclosure also provide a machine-readable storage medium that stores machine-executable instructions.
  • when invoked and executed by a processor, the machine-executable instructions cause the processor to implement any of the file reading method steps shown in the foregoing FIGS. 1-5.
  • the embodiments of the present disclosure also provide a machine-executable instruction.
  • when called and executed by a processor, the machine-executable instruction causes the processor to implement any of the file reading method steps shown in FIGS. 1-5.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A file reading request is received, wherein the file reading request comprises an identifier of a target file to be read; according to the identifier of the target file, a target sub-file identifier matching the identifier of the target file and a corresponding target merge file identifier are searched for in a mapping relationship, between sub-file identifiers and merge file identifiers, comprised in locally stored first index information; according to the target merge file identifier, a target storage block identifier corresponding to the target merge file identifier is searched for in a mapping relationship, between merge file identifiers and storage block identifiers of an HDFS, comprised in locally stored second index information; and the number of sub-files to be acquired that are associated with the target file is determined according to a pre-set acquisition condition, a file acquisition request is sent to the HDFS, and the target file and the associated files returned by the HDFS are received and cached.

Description

File reading
This application claims priority to Chinese patent application No. 201811541620.0, entitled "File Reading Method and Device" and filed with the Chinese Patent Office on December 17, 2018, the entire contents of which are incorporated herein by reference.
Background
With the advent of the big data era, large amounts of data are generated every day in fields such as e-commerce, social networking, and scientific computing. Traditional stand-alone systems cannot cope with the resulting storage and data analysis problems. To improve the storage efficiency of large amounts of data, distributed storage systems are now commonly used to store the data in a distributed manner.
In current distributed storage systems, Hadoop, an open-source distributed system infrastructure, is generally adopted as the storage technology. Each file stored in the Hadoop Distributed File System (HDFS) needs to correspond to a block, and the master node (NameNode) in HDFS establishes a mapping relationship between each file and its corresponding block.
Brief description of the drawings
FIG. 1-1 shows a flowchart of a file reading method according to an embodiment of the present disclosure;
FIG. 1-2 shows a schematic diagram of a system architecture to which an embodiment of the present disclosure may be applied;
FIG. 2 shows a flowchart of a file reading method according to an embodiment of the present disclosure;
FIG. 3 shows a flowchart of determining a first association relationship according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of an associated file acquisition method according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a process for acquiring file association relationships according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a file reading device according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of a file reading device according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a second determination module according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a first determination module according to an embodiment of the present disclosure;
FIG. 10 shows a structural block diagram of a server according to an embodiment of the present disclosure.
Detailed description
Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the drawings. The same reference numerals in the drawings denote elements having the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
The word "exemplary" as used herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some examples, methods, means, components, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
The inventor found that accessing a large number of small files through HDFS usually has the following problems:
HDFS is better suited to storing large files (for example, files larger than 64MB or 128MB), which can make full use of HDFS storage resources. If HDFS stores a large number of files smaller than 64MB (such as pictures and documents of only 10KB to 10MB), these files are far smaller than the HDFS block size, and storing many such small files occupies more storage blocks, which lowers HDFS storage resource utilization. Moreover, the more files stored in HDFS, the more mapping relationships must be established and the more master-node memory is occupied; this heavily consumes the master node's memory and greatly reduces the efficiency of HDFS data access.
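The effect described above can be made concrete with a rough count of file-to-block mappings. The sketch below is a back-of-the-envelope illustration only: the helper names are hypothetical, the merged case assumes the small files are packed tightly into 64MB blocks, and it ignores any index kept outside the NameNode.

```python
import math
from typing import Iterable

MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # block size used in the text above


def separate_mappings(file_sizes: Iterable[int], block_size: int = BLOCK_SIZE) -> int:
    """Mappings needed when every file is stored on its own (at least one block each)."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


def merged_mappings(file_sizes: Iterable[int], block_size: int = BLOCK_SIZE) -> int:
    """Mappings needed when the files are packed together into full blocks."""
    return max(1, math.ceil(sum(file_sizes) / block_size))


sizes = [1 * MB] * 10_000  # ten thousand hypothetical 1MB files
print(separate_mappings(sizes), merged_mappings(sizes))
```

For ten thousand 1MB files this gives 10,000 mappings when stored separately against 157 when packed, which is the kind of reduction the merging approach aims for.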
In HDFS, the metadata of massive numbers of small files (metadata is information describing data attributes, a kind of electronic catalog, such as the tree-like directory structure, file attributes, and the mapping between files and data blocks) is usually stored in the NameNode, which causes a memory bottleneck at the NameNode. In addition, reading a large number of small files causes the client to communicate frequently with the NameNode, which in turn reduces the NameNode's I/O performance. It follows that when small files are read from HDFS, the read granularity is small and the storage of a large number of small files is not sufficiently contiguous, so it is difficult to exploit the advantage of HDFS sequential file access.
Based on the above problems, the present disclosure proposes a file reading method to improve the efficiency of reading files through HDFS. The file may be a small file or a large file; the present disclosure places no particular limitation on this.
Please refer to FIG. 1-1, which shows a flowchart of a file reading method according to an embodiment of the present disclosure.
Please refer to FIG. 1-2, which shows a schematic diagram of a system architecture for the file reading method according to an embodiment of the present disclosure.
The method shown in FIG. 1-1 can be applied to server 1, so that server 1 reads files from HDFS 2.
In a possible implementation, the system may include server 1 and HDFS 2. Server 1 may be a client server: the user accesses server 1 through a client, and server 1 then uses the file reading method of the embodiments of the present disclosure to read files from HDFS 2.
In a possible implementation, the system may include server 1, server 3, and HDFS 2. The method can also be applied to other servers; for example, the user can invoke the resources of server 3 through server 1 to execute the method and thereby obtain the target file and its associated files.
In other implementations, the method described in the present disclosure can also be applied to other processing devices capable of computation (such as terminals); the system architecture shown in FIG. 1-2 is not intended to limit the present disclosure.
As shown in FIG. 1-1, the method includes steps S110 to S150. Taking application of the method to a server as an example, the steps are described below.
Step S110: receive a file reading request, where the file reading request includes the identifier of the target file to be read.
In this embodiment, the file reading request may be a file reading instruction sent by the user through a client on a terminal. When the user needs to obtain a file, the user operates the client so that the client sends a file reading request, which the server then receives.
In one example, the identifier of the target file may be unique identification information that uniquely determines the target file. For example, the unique identification information may be a hash value obtained by hashing information such as the name of the target file. When the identifier of the target file is unique identification information, the read is an exact read.
In another example, the identifier of the target file may be information other than unique identification information, used to identify a class of files or the files within some range, such as a date or a category. When the identifier of the target file is information of this kind, the read is a fuzzy read.
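The difference between the two kinds of identifier described above can be sketched as follows. The choice of SHA-256 over the file name, the `FileMeta` record, and the helper names are assumptions for illustration; the text only says that the unique identifier may be a hash of information such as the file name.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileMeta:
    name: str
    category: str


def unique_id(name: str) -> str:
    """Hash the file name into a unique identifier (SHA-256 is an assumed choice)."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def exact_read(catalog: List[FileMeta], identifier: str) -> List[FileMeta]:
    """Exact read: the identifier pins down a single file."""
    return [m for m in catalog if unique_id(m.name) == identifier]


def fuzzy_read(catalog: List[FileMeta], category: str) -> List[FileMeta]:
    """Fuzzy read: the identifier selects a class of files, here by category."""
    return [m for m in catalog if m.category == category]


catalog = [FileMeta("a.jpg", "image"), FileMeta("b.jpg", "image"), FileMeta("c.txt", "doc")]
print([m.name for m in exact_read(catalog, unique_id("a.jpg"))])
print([m.name for m in fuzzy_read(catalog, "image")])
```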
Step S120: according to the identifier of the target file, search the mapping relationship between sub-file identifiers and merged file identifiers included in locally stored first index information for a target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier.
The merged file is stored in HDFS, and the sub-files in the merged file have an association relationship; that is, the merged file is formed by merging multiple associated sub-files. The association relationship may be an access association. For example, if the next file accessed after file 1 is file 2, then file 2 is associated with file 1, and file 1 and file 2 may be merged into a merged file that is stored in HDFS.
In this embodiment, the server may store the first index information in advance; the creation of the first index information is described later. Specifically, the first index information may include the mapping relationship between sub-files and merged files, referred to in the present disclosure as the first mapping relationship. The first mapping relationship may be expressed as a correspondence between sub-file identifiers and merged file identifiers, so that the merged file corresponding to the identifier of the target file can be found through it.
In other embodiments, the first index information may also include the offset of each sub-file in its merged file and the size of the sub-file. The size of a sub-file may be the length or proportion that the sub-file occupies in the merged file, and the offset may be the starting position of the sub-file in the merged file. With this first index information, after the target sub-file identifier matching the identifier of the target file and the corresponding target merged file identifier have been found, the storage location of the target sub-file in the target merged file can also be found from the offset recorded in the first index information. It should be understood that, because the target sub-file identifier matches the identifier of the target file, in an optional embodiment the target sub-file that is found is the target file.
Step S130: according to the target merged file identifier, search the mapping relationship between merged file identifiers and HDFS storage block identifiers included in locally stored second index information for a target storage block identifier corresponding to the target merged file identifier.
In this embodiment, the server may also store the second index information in advance; the creation of the second index information is described later. Specifically, the second index information may include the mapping relationship between merged files and HDFS storage blocks, referred to in the present disclosure as the second mapping relationship.
In one example, the second mapping relationship may be expressed as a correspondence between merged file identifiers and HDFS storage block identifiers. The target merged file identifier can be looked up in the second mapping relationship to obtain the target storage block identifier of the target merged file. Optionally, an HDFS storage block identifier may include HDFS block address information.
In another example, the second mapping relationship may be expressed as a correspondence between the identifier of a merged file and the storage location of the merged file in HDFS, so that the storage location of the merged file in HDFS can be found from the second mapping relationship.
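The two lookups of steps S120 and S130 can be sketched, under stated assumptions, as two in-memory dictionaries. The identifiers, block names, offsets and the `locate` helper are hypothetical and serve only to show how the first and second index information chain together.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SubFileEntry:
    merged_file_id: str  # first mapping relationship: sub-file -> merged file
    offset: int          # starting position of the sub-file in the merged file
    size: int            # length the sub-file occupies in the merged file


# Hypothetical first index information: sub-file identifier -> entry.
FIRST_INDEX = {
    "file1": SubFileEntry("merged-0001", 0x0000, 0x1000),
    "file2": SubFileEntry("merged-0001", 0x1000, 0xE000),
    "file3": SubFileEntry("merged-0001", 0xF000, 0x1000),
}

# Hypothetical second index information: merged file identifier -> block identifier.
SECOND_INDEX = {"merged-0001": "blk_1001"}


def locate(target_id: str) -> Optional[Tuple[str, SubFileEntry]]:
    """Return (target storage block identifier, first-index entry), or None."""
    entry = FIRST_INDEX.get(target_id)                  # step S120
    if entry is None:
        return None
    block_id = SECOND_INDEX.get(entry.merged_file_id)   # step S130
    if block_id is None:
        return None
    return block_id, entry
```

Because both indexes are held locally by the server, the location of the target file is resolved before any request is sent to HDFS.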
Step S140: determine, according to a preset acquisition condition, the number of sub-files associated with the target file that are to be acquired, and send a file acquisition request to the HDFS. The file acquisition request contains the target storage block identifier, the target sub-file identifier, the target merged file identifier and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files.
After receiving the file acquisition request, the HDFS obtains the target file and the associated files, whose number equals the number of sub-files, according to the target storage block identifier, the target sub-file identifier, the target merged file identifier and the number of sub-files contained in the request. Once it has found them, the HDFS sends the target file and the associated files to the server.
In this embodiment, the HDFS may take as the associated files the sub-files in the target merged file whose storage locations are closest to that of the target sub-file, up to the number of sub-files.
For example, after receiving the file acquisition request, the HDFS queries the NameNode for the metadata corresponding to the target sub-file (that is, the target file), the target merged file and the target storage block. After the target sub-file is determined, the NameNode is used to determine the metadata of the sub-files adjacent to the target sub-file in the target merged file, up to the number of sub-files. The target file and the associated files are then obtained from the DataNode and sent to the server.
In one possible case, multiple target merged file identifiers may be matched from the first index information, and multiple target storage block identifiers may in turn be matched from the second index information. In this case, any one target merged file identifier and its corresponding target storage block identifier may be selected, and step S140 is performed.
In other examples, step S140 may be performed for every target merged file identifier and its corresponding target storage block identifier; that is, one file acquisition request is sent for each of them to obtain the target file and the associated files whose number equals the number of sub-files.
Step S150: receive and cache the target file and the associated files returned by the HDFS.
Specifically, the target file and the associated files returned by the HDFS may be cached in the cache space or other storage space of the server. The next time the server receives a file reading request for one of these files, it can obtain the file directly from the cache, which reduces the interaction between the server and HDFS, saves HDFS resources and improves HDFS access efficiency.
In the embodiments of the present disclosure, merged files are stored in HDFS, the first index information records the mapping relationship between each merged file and its sub-files, and the second index information records the mapping relationship between merged files and HDFS storage blocks. The method described in the embodiments can therefore use the identifier of the target file, the first index information and the second index information to obtain the target file and the associated files quickly and store them in the cache. In other words, while acquiring the target file, the method also acquires associated files that are likely to be accessed next and caches both. When the user next issues a file reading request, the cached associated files are queried first and are hit with high probability, which reduces interaction with HDFS, lowers HDFS resource usage, improves HDFS access efficiency and improves the efficiency with which HDFS handles large numbers of files.
In addition, each file stored in HDFS is formed by merging multiple files that have an access association, so the advantage of HDFS in sequential file access can be exploited.
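The effect of caching the associated files together with the target file can be sketched as a small read-through cache. This is a minimal sketch under stated assumptions: `CachingReader`, `fake_fetch` and the file contents are hypothetical, and the fetch function merely stands in for the file acquisition request sent to HDFS.

```python
class CachingReader:
    """Server-side cache that stores the target file and its associated files."""

    def __init__(self, fetch_from_hdfs):
        self._fetch = fetch_from_hdfs  # returns {file_id: content} for target + associated files
        self._cache = {}
        self.hdfs_requests = 0         # counts round trips to the backend

    def read(self, file_id):
        if file_id in self._cache:     # cache hit: no interaction with HDFS
            return self._cache[file_id]
        self.hdfs_requests += 1
        self._cache.update(self._fetch(file_id))  # step S150: cache everything returned
        return self._cache[file_id]


def fake_fetch(file_id):
    """Stand-in for HDFS: returns the target file plus its associated neighbours."""
    merged = {"file1": b"one", "file2": b"two", "file3": b"three"}
    if file_id not in merged:
        raise KeyError(file_id)
    return dict(merged)


reader = CachingReader(fake_fetch)
first = reader.read("file1")   # goes to the backend once
second = reader.read("file2")  # served from the cache
```

In this sketch the second read is answered from the cache, so only one backend request is made for the two reads.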
Given limited network resources, it is generally not the case that all associated files of the target file are acquired. A scheme is therefore needed that acquires the largest number of associated files while balancing network resources. Accordingly, in a possible implementation, the preset acquisition condition may include:
$M \times t_1 < t_m - t_h$, where $M$ is the number of sub-files, $t_1$ is the time taken to read one sub-file, $t_m$ is the maximum time the user is willing to wait, and $t_h$ is the time taken for HDFS to return data.
In this embodiment, the best number of sub-files to acquire can be determined from the user's maximum waiting time, the HDFS data return time and the time taken to read one sub-file, so that read efficiency is improved while the user experience (the maximum waiting time) is preserved.
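As a worked illustration of the condition M × t1 < tm − th, the largest admissible M can be computed directly. The function name and the use of integer milliseconds are assumptions of this sketch; integer arithmetic is used so that the strict inequality is evaluated exactly.

```python
def max_subfile_count(t1_ms: int, tm_ms: int, th_ms: int) -> int:
    """Largest M such that M * t1 < tm - th (all times in whole milliseconds)."""
    if t1_ms <= 0:
        raise ValueError("t1 must be positive")
    budget = tm_ms - th_ms          # time left after HDFS has returned its data
    if budget <= 0:
        return 0                    # no time left for any sub-file
    # Strict inequality: M * t1 must stay below the budget, not equal it.
    return (budget - 1) // t1_ms
```

For example, with t1 = 200 ms, tm = 2000 ms and th = 500 ms, the budget is 1500 ms and the function returns 7, since 7 × 200 = 1400 < 1500 while 8 × 200 = 1600 exceeds the budget.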
Referring to FIG. 2, FIG. 2 is a flowchart of a file reading method according to an embodiment of the present disclosure. Steps S210 to S260 mainly describe the file merging process, which may be performed before step S110 described above.
Step S210: acquire historical access logs of multiple files.
In this embodiment, the historical access logs include the access times and access counts of the multiple files.
In a possible implementation, the period covered by the historical access logs may be limited; for example, the historical access logs within a certain time period may be acquired.
In one example, the format of the historical access log may be as shown in Table 1 below.
Table 1

| Accessed file | Access time |
| --- | --- |
| File 1 | 2015/1/1 12:00:00 |
| File 2 | 2015/1/1 12:01:30 |
| File 3 | 2015/1/2 13:02:50 |
| File 1 | 2015/1/2 13:04:35 |
| File 1 | 2015/1/2 13:05:00 |
| File 3 | 2015/1/3 05:22:56 |
| File 4 | 2015/1/4 15:07:26 |
| File 5 | 2015/1/4 19:38:23 |
| File 6 | 2015/1/6 09:18:07 |
| File 5 | 2015/1/6 12:56:22 |
Suppose the historical access logs from 2015/1/1 to 2015/1/3 are acquired; the acquired logs then include the access times and access counts of files 1, 2 and 3.
Step S220: for each file among the multiple files, determine, according to the access times and access counts of the multiple files, at least one file among the other files that has an access association with that file after that file is accessed, and determine the first association relationships of that file.
A first association relationship represents the access association between the file and any one of the at least one file.
Taking Table 1 as an example, from the access times and access counts of files 1, 2 and 3 it can be determined that the files having an access association with file 1 after file 1 is accessed are file 2 and file 3, the file having an access association with file 2 after file 2 is accessed is file 3, and the file having an access association with file 3 after file 3 is accessed is file 1. File 1 therefore has two first association relationships, file 2 has one, and file 3 has one.
Suppose a first association relationship is written as (file A, file B), meaning that file B is accessed after file A; that is, after accessing file A, the user next accesses file B. Then the first association relationships of file 1 are (file 1, file 2) and (file 1, file 3), the first association relationship of file 2 is (file 2, file 3), and the first association relationship of file 3 is (file 3, file 1).
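The example above can be reproduced with a short sketch that extracts (file A, file B) pairs from the log of Table 1. The function name is hypothetical, and the sketch deliberately simplifies: it treats every pair of consecutive accesses to different files as a first association relationship and does not apply the probability thresholds that the disclosure describes for FIG. 3.

```python
from collections import Counter
from datetime import datetime

# Access log of Table 1 as (file, access time) records.
ACCESS_LOG = [
    ("file1", "2015/1/1 12:00:00"), ("file2", "2015/1/1 12:01:30"),
    ("file3", "2015/1/2 13:02:50"), ("file1", "2015/1/2 13:04:35"),
    ("file1", "2015/1/2 13:05:00"), ("file3", "2015/1/3 05:22:56"),
    ("file4", "2015/1/4 15:07:26"), ("file5", "2015/1/4 19:38:23"),
    ("file6", "2015/1/6 09:18:07"), ("file5", "2015/1/6 12:56:22"),
]


def first_associations(log, start, end):
    """Return distinct (predecessor, successor) pairs accessed in [start, end)."""
    fmt = "%Y/%m/%d %H:%M:%S"
    ordered = sorted((datetime.strptime(stamp, fmt), name) for name, stamp in log)
    window = [name for when, name in ordered if start <= when < end]
    pairs = []
    for a, b in zip(window, window[1:]):
        if a != b and (a, b) not in pairs:  # skip repeated access to the same file
            pairs.append((a, b))
    return pairs


# Window 2015/1/1 to 2015/1/3 inclusive, as in the example.
pairs = first_associations(ACCESS_LOG, datetime(2015, 1, 1), datetime(2015, 1, 4))
per_file = Counter(a for a, _ in pairs)  # number of first associations per file
```

For this window the sketch yields the four pairs named in the text, with file 1 holding two of them.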
Step S230: according to the first association relationships of each of the multiple files, obtain the first file, which is the file having the largest number of first association relationships, and, according to the first association relationships of the first file, determine among the multiple files at least one associated file that is accessed in sequence after the first file is accessed.
Continuing the foregoing example, comparing the numbers of first association relationships of files 1 to 3 shows that the first file, the one with the most first association relationships, is file 1. It can then be determined that the files accessed in sequence after file 1 is accessed are file 2 and file 3.
Step S240: store the first file and the at least one associated file in a first merged file.
In this embodiment, the first file and the at least one associated file may be merged to obtain the merged file.
In one example, the first file and the at least one associated file may be stored contiguously in the order in which they are accessed and thereby merged into the first merged file. In the present disclosure, contiguous storage in sequence means that the storage locations of the files are consecutive.
For example, file 1 of the foregoing example may be stored at addresses 0000H to 0FFFH (where H denotes hexadecimal), file 2 at addresses 1000H to EFFFH, and file 3 at addresses F000H to FFFFH. The first merged file can then be regarded as the data stored at addresses 0000H to FFFFH.
In another example, the first file and the at least one associated file may be stored contiguously, in the order in which they are accessed, in a first merged file whose storage space has been allocated beforehand.
For example, a storage space may be allocated in advance as the storage space of the first merged file. The space at addresses 0000H to FFFFH may be used as the storage space of the first merged file, and files 1 to 3 are then stored at 0000H to 0FFFH, 1000H to EFFFH and F000H to FFFFH, respectively.
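The contiguous layout in the examples above can be sketched as a simple concatenation that records each sub-file's offset and size, which is the information the first index information may hold. The function names and the placeholder file contents are assumptions of this sketch; the sizes are chosen to reproduce the address ranges 0000H to 0FFFH, 1000H to EFFFH and F000H to FFFFH.

```python
def merge_files(ordered_files):
    """Concatenate (file_id, content) pairs in access order.

    Returns the merged bytes and an index {file_id: (offset, size)}.
    """
    merged = bytearray()
    index = {}
    for file_id, content in ordered_files:
        index[file_id] = (len(merged), len(content))  # offset is the current end
        merged += content
    return bytes(merged), index


def read_subfile(merged, index, file_id):
    """Slice one sub-file back out of the merged file using its offset and size."""
    offset, size = index[file_id]
    return merged[offset:offset + size]


# Placeholder contents sized to match the address ranges in the example.
files = [
    ("file1", b"\x01" * 0x1000),
    ("file2", b"\x02" * 0xE000),
    ("file3", b"\x03" * 0x1000),
]
merged, index = merge_files(files)
```

Storing the sub-files back to back in access order means that a target file and its likely successors can be retrieved with one sequential read of the merged file.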
Step S250: from the first association relationships of the multiple files, delete the first association relationships that were used in determining the at least one associated file, and obtain the remaining first association relationships; according to the remaining first association relationships, obtain a new first file, which is the file having the largest number of remaining first association relationships.
Continuing the foregoing example, after files 1, 2 and 3 are merged, the first association relationships that were used are (file 1, file 2) and (file 2, file 3), so the remaining first association relationships are (file 1, file 3) and (file 3, file 1). The new first file with the most first association relationships is then obtained. Because file 1, with (file 1, file 3), and file 3, with (file 3, file 1), have the same number of first association relationships, either file may be chosen as the new first file; here file 3 is chosen.
Step S260: among the multiple files, repeat the process of determining, according to the first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and that at least one associated file in a new first merged file, until no remaining first association relationship can be obtained.
Continuing the foregoing example, after file 3 and file 1, and file 1 and file 3, are merged through step S260, no first association relationship remains, and the process ends.
With the above method, the embodiments provided by the present disclosure can merge associated files among the multiple files into one merged file. The merged file includes multiple sub-files, and the sub-files in the merged file are all associated with one another.
In a possible implementation, an association relationship may include the identification information of the associated files. For example, if sub-file A and sub-file B in a merged file have a file association relationship, the association relationship may be (sub-file A, sub-file B); if sub-file A, sub-file B, sub-file C, ..., sub-file N have a file association relationship, the association relationship may be (sub-file A, sub-file B, sub-file C, ..., sub-file N). In other embodiments, the association relationship of multiple files may of course be recorded in other forms, which is not limited here. The method of determining an association relationship is introduced below, taking the method of determining the first association relationship as an example.
Referring to FIG. 3, FIG. 3 is a flowchart of determining a first association relationship according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 3, the first association relationship of a file may be determined as follows.
Step S310: according to the number of times the second file is accessed and the number of times the third file is accessed after the second file is accessed, obtain a first probability that the third file is accessed after the second file is accessed.
The second file and the third file are any two different files among the multiple files.
In a possible implementation, the first probability may be obtained by the following formula: $P(B \mid A) = N_{AB} / N_A$, where $P(B \mid A)$ is the first probability, $N_{AB}$ is the number of times the third file is accessed after the second file is accessed, $N_A$ is the number of times the second file is accessed, $A$ denotes the second file and $B$ denotes the third file.
Step S320: according to the number of times the third file is accessed after the second file is accessed and the total number of accesses to all files in the historical access logs, obtain a second probability that both the second file and the third file are accessed.
In a possible implementation, the second probability is obtained by the following formula: $P(AB) = N_{AB} / N$, where $P(AB)$ is the second probability and $N$ is the total number of accesses to all files in the historical access logs.
Step S330: according to the total number of accesses to all files in the historical access logs, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed and the number of times the third file is accessed, obtain an influence value of access to the second file on access to the third file.
In a possible implementation, the influence value is obtained by the following formula: $I(B \mid A) = (N \times N_{AB}) / (N_A \times N_B)$, where $I(B \mid A)$ is the influence value and $N_B$ is the number of times the third file is accessed.
Step S340: when the first probability is greater than a first probability threshold, the second probability is greater than a second probability threshold and the influence value is greater than an influence threshold, determine that the second file and the third file have a first association relationship.
In a possible implementation, the second file and the third file are determined to have the first association relationship by the following formula:
$(A, B) = \{(A, B) \mid P(B \mid A) > \text{min\_P}(B \mid A) \land P(AB) > \text{min\_P}(AB) \land I(B \mid A) > \text{min\_I}(B \mid A)\}$.
Here $\text{min\_P}(B \mid A)$ is the first probability threshold, $\text{min\_P}(AB)$ is the second probability threshold, $\text{min\_I}(B \mid A)$ is the influence threshold, and $(A, B)$ is the first association relationship between the second file $A$ and the third file $B$.
For example, it may first be determined, for each file among the multiple files, whether its first probability with respect to each other file is greater than the first probability threshold, to obtain the set of files that exceed the first probability threshold. For instance, among files A, B, C, D, E, F and G, if the first probabilities of file A and file B, file A and file C, file A and file D, and file C and file F are greater than the first probability threshold, the file set at this point includes A, B, C, D and F.
It is then determined, for each file in the set that satisfies the first probability threshold, whether its second probability with respect to each other file is greater than the second probability threshold, to obtain the set of files that satisfy the second probability threshold. For instance, when the file set includes A, B, C, D and F, if the second probabilities of file A and file B, file A and file C, and file C and file F are greater than the second probability threshold, the file set at this point includes A, B, C and F.
Finally, it is determined, for each file in the set that satisfies the second probability threshold, whether its influence value with respect to each other file is greater than the influence threshold. For instance, when the file set includes A, B, C and F, if the influence value of file A on file C and the influence value of file C on file F are greater than the influence threshold, it can be determined that file A and file C, and file C and file F, have first association relationships. The first association relationship set then includes (file A, file C) and (file C, file F), and the corresponding file set includes the three files A, C and F.
The above process of obtaining the first association relationship set, and the set of files that satisfy the association relationships in it, is exemplary; the number of files in the example is not intended to limit the present disclosure.
As described above, a first association relationship represents the association between two files. If only the two files having a first association relationship were merged, then, since file sizes may range from 10 KB to 10 MB, the merged file would still be smaller than the HDFS block size (for example, 64 MB) and the number of merged files would still be very large, which would not minimize the number of interactions with HDFS or the memory used by the master node in HDFS. It is therefore necessary to determine association relationships among as many files as possible, so that as many files as possible are merged. Referring to FIG. 4, FIG. 4 is a flowchart of a method for obtaining associated files according to an embodiment of the present disclosure; this embodiment determines association relationships among as many files as possible so that as many files as possible are merged.
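The three quantities of steps S310 to S330 and the threshold test of step S340 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and threshold values are hypothetical, and "the third file is accessed after the second file" is interpreted here as the immediately following access in the log. Exact fractions are used so that the results can be checked by hand.

```python
from collections import Counter
from fractions import Fraction


def association_metrics(accesses, a, b):
    """Return (P(B|A), P(AB), I(B|A)) for an ordered list of accessed file names."""
    n = len(accesses)                  # N: total number of accesses
    counts = Counter(accesses)         # N_A, N_B
    if counts[a] == 0 or counts[b] == 0:
        raise ValueError("both files must appear in the log")
    # N_AB: number of times b is the access immediately after a (assumption).
    n_ab = sum(1 for x, y in zip(accesses, accesses[1:]) if (x, y) == (a, b))
    first_probability = Fraction(n_ab, counts[a])
    second_probability = Fraction(n_ab, n)
    influence = Fraction(n * n_ab, counts[a] * counts[b])
    return first_probability, second_probability, influence


def has_first_association(accesses, a, b, min_p_cond, min_p_joint, min_influence):
    """Step S340: all three values must exceed their thresholds."""
    p_cond, p_joint, influence = association_metrics(accesses, a, b)
    return p_cond > min_p_cond and p_joint > min_p_joint and influence > min_influence


# Accesses from Table 1 between 2015/1/1 and 2015/1/3, in time order.
ACCESSES = ["file1", "file2", "file3", "file1", "file1", "file3"]
```

With these six accesses, the pair (file1, file2) gives P(B|A) = 1/3, P(AB) = 1/6 and I(B|A) = 2, while (file1, file3) gives an influence value of 1, so an influence threshold between 1 and 2 would keep the first pair and reject the second.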
在本实施方式中,第一关联关系中记录的相关联的两个文件中的一个为前驱文件,另一个为后继文件,后继文件为在访问所述前驱文件之后被访问的文件。下面结合图5,对图4所示的方法进行说明。In this embodiment, one of the two associated files recorded in the first association relationship is a predecessor file, the other is a successor file, and the successor file is a file that is accessed after accessing the predecessor file. The method shown in FIG. 4 will be described below with reference to FIG. 5.
步骤S231,获取包含所述多个文件中各文件的第一关联关系的第一关联关系集合。Step S231: Acquire a first association set containing the first association of each of the files.
以图5为例,第一关联关系集合250中包括各文件的多个第一关联关系,例如文件file1的第一关联关系(file1,file7)、文件file3的第一关联关系(file3,file5)等。各个第一关联关系都包括前驱文件及后继文件,例如对于第一关联关系(file1,file7),其对应的前驱文件为file1,后继文件为file7。Taking FIG. 5 as an example, the first association relationship set 250 includes multiple first association relationships of each file, for example, the first association relationship of file file1 (file1, file7), the first association relationship of file file3 (file3, file5) Wait. Each first association relationship includes a predecessor file and a successor file. For example, for the first association relationship (file1, file7), the corresponding predecessor file is file1 and the successor file is file7.
步骤S232,在第一关联关系集合中,获取以第一文件作为前驱文件出现次数最多的第一目标关联关系集合,并在第一目标关联关系集合中,获取第二关联关系。Step S232: In the first association relationship set, obtain the first target association relationship set that uses the first file as the predecessor file most frequently, and obtain the second association relationship in the first target association relationship set.
第二关联关系为:第一目标关联关系集合中后继文件被访问次数最多的第一关联关系。The second association relationship is: the first association relationship in which the subsequent files in the first target association relationship set are accessed the most.
Taking FIG. 5 as an example, the first target association relationships are obtained from the first association relationship set 250, that is, the first association relationships in which the first file appears most often as the predecessor file, yielding the first target association relationship set 260. From the first target association relationship set 260, the first association relationship whose successor file is accessed most often (that is, the first association relationship with the largest first probability) is then selected. In the first target association relationship set 260, (file1, file7) has the largest first probability, so (file1, file7) is taken as the second association relationship.
Step S233: if the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, the target association relationship whose successor file occurs most often is determined from the third association relationships, and the files in the target association relationship are determined to be associated files.
Taking FIG. 5 as an example, the successor file file7 of the second association relationship (file1, file7) is taken as a predecessor file, and the first association relationships in the first association relationship set 250 that have file7 as the predecessor file are obtained as the third association relationship 270; the third association relationship 270 may be a set. In this example, the third association relationship 270 includes two first association relationships with file7 as the predecessor file: (file7, file5) and (file7, file3). Of these, file5 is the successor file accessed most often (that is, with the largest first probability), so the first association relationship (file7, file5) is taken as the target association relationship, and the files file7 and file5 in the target association relationship are taken as associated files.
In a possible implementation, the successor file file5 of the first association relationship (file7, file5) may be merged (recorded) into the second association relationship (file1, file7) to generate an updated second association relationship (file1, file7, file5), and the first association relationship (file1, file7) is deleted from the first association relationship set. It should be noted that, once it has been folded into the updated second association relationship (file1, file7, file5), the first association relationship (file7, file5) may be regarded as deleted. In other implementations, if the first association relationship (file7, file5) has not been overwritten by the second association relationship (file1, file7, file5), it may be deleted from the first association relationship set.
Step S234: if the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, the successor file of the second association relationship is determined to be the associated file.
Taking FIG. 5 as an example, if the first association relationship set did not contain the aforementioned first association relationships (file7, file5) and (file7, file3), the successor file file7 of the second association relationship (file1, file7) would be determined to be the associated file of the first file file1.
Step S235: delete the target association relationship from the first association relationship set to obtain a new first association relationship set.
Step S236: repeat the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:
in the new first association relationship set, obtain the new first target association relationship set in which the first file appears most often as the predecessor file, and in the new first target association relationship set, obtain a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file is accessed most often;
if the new first association relationship set contains new third association relationships whose predecessor file is the same as the successor file of the new second association relationship, determine from the new third association relationships the new target association relationship whose successor file occurs most often, and determine the files in the new target association relationship to be associated files; and delete the new target association relationship to obtain the new first association relationship set.
Taking FIG. 5 as an example, after the associated files file7 and file5 of the first file file1 are obtained, file5 (until now a successor file) may in turn be taken as a predecessor file, and the first association relationship set 250 is searched for a first association relationship with file5 as the predecessor file. If none exists, file7 and file5 are finally taken as the associated files of the first file file1; if one exists, associated files continue to be obtained according to the foregoing steps S231 to S234.
In this example, the first association relationship set 250 contains no first association relationship with file5 as the predecessor file, so the associated files of the first file file1 are file7 and file5.
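The walk from file1 to file7 to file5 described above can be sketched in a few lines. This is a simplified illustration rather than the claimed procedure: the function name `associated_chain`, the dictionary layout, and all access counts are assumptions made for the example (FIG. 5 is not reproduced here), and a cycle guard is added that the text does not mention.

```python
def associated_chain(relations, first_file):
    """Follow the most-accessed successor from first_file until no relation remains.

    relations maps (predecessor, successor) -> number of times the successor
    was accessed after the predecessor (hypothetical counts).
    Returns the ordered associated files and the relations left over.
    """
    remaining = dict(relations)          # work on a copy of the relation set
    chain = []
    visited = {first_file}
    current = first_file
    while True:
        # Relations whose predecessor is the file reached so far.
        candidates = {pair: n for pair, n in remaining.items() if pair[0] == current}
        if not candidates:
            break                        # corresponds to the "no third relation" case
        best = max(candidates, key=candidates.get)
        del remaining[best]              # delete the relation that was used
        successor = best[1]
        if successor in visited:         # assumption: stop on cycles
            break
        chain.append(successor)
        visited.add(successor)
        current = successor
    return chain, remaining


# Hypothetical counts; only the pairs named in the text are taken from it.
relations = {
    ("file1", "file7"): 9,
    ("file1", "file2"): 4,
    ("file7", "file5"): 6,
    ("file7", "file3"): 2,
}
chain, leftover = associated_chain(relations, "file1")
print(chain)     # ['file7', 'file5']
```

The leftover relations are what a later pass would use to pick a new first file, which is why the function returns them instead of discarding them.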
When no target association relationship exists in the first association relationship set, so that the process of determining the associated files of the first file ends, a new first file may be obtained, and the associated files of the new first file are obtained according to steps S231 to S235, until the first association relationship set is empty.
The above is only an exemplary description of steps S231 to S235; it is not exhaustive and does not limit the present disclosure.
It should be noted that, when associated files are obtained according to the above steps, the determined target association relationships may be deleted one by one from the first association relationship set; once the first association relationship set is empty, the associated files of all first files have been determined.
The implementations provided by the present disclosure can use the first association relationships in the first association relationship set to obtain as many associated files of the first file as possible. After the associated files of the first file are obtained, the first file and the associated files are merged into a merged file, which conforms to the storage requirements of HDFS as closely as possible.
In a possible implementation, the method may further include:
sending the first merged file to HDFS, and receiving the first storage block identifier, returned by HDFS, of the block storing the first merged file; and
creating first index information containing the mapping between the first file identifier and the first merged file identifier, and second index information containing the mapping between the first merged file identifier and the first storage block identifier.
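The two index structures can be pictured as a pair of lookup tables. The sketch below is only an illustration under stated assumptions: `store_in_hdfs` is a stand-in that fabricates a block identifier instead of contacting a real cluster, and the names `register_merged_file` and `locate` are hypothetical.

```python
first_index = {}    # sub-file identifier -> merged file identifier
second_index = {}   # merged file identifier -> storage block identifier


def store_in_hdfs(merged_id, payload):
    """Stand-in for the upload step; returns a made-up block identifier."""
    return "blk_" + merged_id


def register_merged_file(merged_id, sub_files):
    """Store a merged file and record both index entries.

    sub_files maps each sub-file identifier to its bytes.
    """
    payload = b"".join(sub_files.values())
    block_id = store_in_hdfs(merged_id, payload)
    for file_id in sub_files:
        first_index[file_id] = merged_id
    second_index[merged_id] = block_id
    return block_id


def locate(file_id):
    """Resolve a file identifier to (merged file id, block id), or None if unknown."""
    merged_id = first_index.get(file_id)
    if merged_id is None:
        return None
    return merged_id, second_index[merged_id]


register_merged_file("merged1", {"file1": b"a", "file7": b"b", "file5": b"c"})
print(locate("file7"))   # ('merged1', 'blk_merged1')
```

Keeping the two mappings separate means a merged file that moves to another block only requires one entry in the second index to change.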
In a possible implementation, the first merged file may be stored in a merged file space established in advance in HDFS. The size of the merged file space may be an integer multiple of the HDFS "block" size; for example, when one "block" is 64 MB, the preset merged file space may be set to 64 MB, 128 MB, 256 MB, 512 MB, and so on.
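As a small worked example of the sizing rule, the helper below rounds a payload up to a whole number of blocks. The text only says the space is a preset integer multiple of the block size; choosing the smallest multiple that fits is an assumption, as is the name `merged_space_size`.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB block, as in the example above


def merged_space_size(total_bytes, block_size=BLOCK_SIZE):
    """Smallest integer multiple of block_size that holds total_bytes (at least one block)."""
    blocks = max(1, -(-total_bytes // block_size))   # ceiling division
    return blocks * block_size


mb = 1024 * 1024
print(merged_space_size(100 * mb) // mb)   # 128
```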
In a possible implementation, after the first index information and the second index information are created, they may be stored in a local storage system for later retrieval.
By merging associated files (relatively small files) into a merged file (a relatively large file) and storing the merged file in HDFS, HDFS storage resources can be saved.
In one possible application scenario, after obtaining a target file in HDFS through a client, a user may go on to obtain other files. If many files are obtained, the HDFS file access mechanism inevitably consumes a large amount of memory on the HDFS NameNode, and the number of interactions between the client and the NameNode equals the number of files to be obtained. HDFS performance then degrades and file access becomes inefficient.
Based on this, when the server requests the target file needed by the user, it also requests at least one associated file of the target file, and places the obtained target file and associated files in a cache. The next time a file read request sent by the terminal is received, the server can match the files in the cache against the target file identifier in the request. Because the files in the cache are related by access, there is a high probability that the target file of this request is found in the cache. This improves file read speed and hit rate, lowers NameNode memory usage, reduces the number of interactions between the client and the NameNode, and improves system performance.
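The caching behaviour can be illustrated with a small in-memory model. Everything here is hypothetical: `PrefetchingReader`, the `fake_fetch` backend, and the file contents stand in for the server, HDFS, and real data, and eviction is ignored.

```python
class PrefetchingReader:
    """Serve reads from a cache that is filled with a file and its associated files."""

    def __init__(self, fetch_with_associates):
        # Callable: file_id -> {file_id: bytes}, covering the file and its associates.
        self._fetch = fetch_with_associates
        self._cache = {}
        self.backend_requests = 0        # how often the backend was contacted

    def read(self, file_id):
        if file_id in self._cache:       # cache hit: no backend interaction
            return self._cache[file_id]
        self.backend_requests += 1
        self._cache.update(self._fetch(file_id))
        return self._cache[file_id]


# Hypothetical backend: file1 is stored together with file7 and file5.
store = {"file1": b"one", "file7": b"seven", "file5": b"five"}
associates = {"file1": ["file7", "file5"]}


def fake_fetch(file_id):
    wanted = [file_id] + associates.get(file_id, [])
    return {name: store[name] for name in wanted}


reader = PrefetchingReader(fake_fetch)
first = reader.read("file1")
second = reader.read("file7")
third = reader.read("file5")
print(reader.backend_requests)   # 1
```

Three reads cost one backend request in this model, which is the effect the paragraph above attributes to prefetching associated files.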
With the above method, multiple associated files can be merged into a merged file so as to conform to the HDFS mechanism for storing merged files, which improves file storage efficiency. Once multiple files are stored as a merged file, the use of HDFS memory and other resources is also reduced, improving system performance.
Referring to FIG. 6, FIG. 6 shows a block diagram of a file reading apparatus according to an embodiment of the present disclosure.
As shown in FIG. 6, the apparatus includes:
a receiving module 10, configured to receive a file read request, the file read request including the identifier of the target file to be read;
a first lookup module 20, connected to the receiving module 10 and configured to look up, according to the identifier of the target file, in the mapping between sub-file identifiers and merged file identifiers included in the locally stored first index information, the target sub-file identifier matching the identifier of the target file and the corresponding target merged file identifier, where the merged file is stored in HDFS and the sub-files in the merged file are associated with one another;
a second lookup module 30, connected to the first lookup module 20 and configured to look up, according to the target merged file identifier, in the mapping between merged file identifiers and HDFS storage block identifiers included in the locally stored second index information, the target storage block identifier corresponding to the target merged file identifier;
a sending module 40, connected to the second lookup module 30 and configured to determine, according to a preset acquisition condition, the number of sub-files associated with the target file that are to be obtained, and to send a file acquisition request to the HDFS, the file acquisition request containing the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS finds the target merged file corresponding to the target merged file identifier in the target storage block corresponding to the target storage block identifier, and finds in the target merged file the target file and as many associated files as the number of sub-files;
a caching module 50, connected to the sending module 40 and configured to receive and cache the target file and associated files returned by the HDFS.
It should be understood that the file reading apparatus is the apparatus counterpart of the foregoing file reading method; for details, refer to the description of the method above, which is not repeated here.
The apparatus of the present disclosure obtains the requested file together with other files related to it and stores these files in a cache. When the next file read request sent by the terminal is received, the cached files can be searched first, which reduces interaction with HDFS, lowers HDFS resource usage, and improves the efficiency with which HDFS handles large numbers of files.
Referring to FIG. 7, FIG. 7 shows a block diagram of a file reading apparatus according to an embodiment of the present disclosure.
As shown in FIG. 7, the apparatus further includes:
a first acquisition module 61, configured to obtain a historical access log of multiple files, the historical access log including the access times and access counts of the multiple files;
a first determination module 62, connected to the first acquisition module 61 and configured to, for each file of the multiple files, determine, according to the access times and access counts of the multiple files, among the files other than that file, at least one file that has an access association with that file after that file is accessed, and determine multiple first association relationships of that file, where a first association relationship represents the access association between that file and any one of the at least one file;
a second determination module 63, connected to the first determination module 62 and configured to obtain, according to the first association relationships of each of the multiple files, the first file having the largest number of first association relationships, and to determine, according to the multiple first association relationships of the first file, at least one associated file among the multiple files that is accessed in sequence after the first file is accessed;
a storage module 64, connected to the second determination module 63 and configured to store the first file and the at least one associated file in a first merged file;
a second acquisition module 71, connected to the storage module 64 and configured to delete, from the first association relationships of each of the multiple files, the first association relationships used in determining the at least one associated file, so as to obtain the remaining first association relationships, and to obtain, according to the remaining first association relationships, a new first file having the largest number of first association relationships;
a third determination module 72, connected to the second acquisition module 71 and configured to trigger the second determination module to repeat, among the multiple files, the process of determining, according to the multiple first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and the at least one associated file accessed in sequence after it in a new first merged file, until the second acquisition module can obtain no remaining first association relationship;
a sending and receiving module 81, connected to the storage module 64 and configured to send the first merged file to the HDFS and to receive the first storage block identifier, returned by the HDFS, of the block storing the first merged file;
an index creation module 82, connected to the sending and receiving module 81 and configured to create first index information containing the mapping between the first file identifier and the first merged file identifier, and second index information containing the mapping between the first merged file identifier and the first storage block identifier;
a reading module 90, connected to the caching module 50 and configured to, when the next received file read request includes a file associated with the target file and that file is stored in the cache, read the file associated with the target file from the cache.
It should be understood that the file reading apparatus is the apparatus counterpart of the foregoing file reading method; for details, refer to the description of the method above, which is not repeated here.
Referring to FIG. 8, FIG. 8 shows a schematic diagram of the second determination module according to an embodiment of the present disclosure.
In a possible implementation, of the two associated files recorded in a first association relationship, one is the predecessor file and the other is the successor file; the successor file is the file accessed after the predecessor file is accessed.
As shown in FIG. 8, the second determination module 63 includes:
a first association relationship acquisition submodule 631, configured to obtain a first association relationship set containing the first association relationships of each of the multiple files;
a second association relationship acquisition submodule 632, connected to the first association relationship acquisition submodule 631 and configured to obtain, in the first association relationship set, the first target association relationship set in which the first file appears most often as the predecessor file, and to obtain, in the first target association relationship set, a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file is accessed most often;
a first associated file determination submodule 633, connected to the second association relationship acquisition submodule 632 and configured to, if the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determine from the third association relationships the target association relationship whose successor file occurs most often, and determine the files in the target association relationship to be associated files;
a second associated file determination submodule 634, connected to the second association relationship acquisition submodule 632 and configured to, if the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship to be the associated file;
a deletion submodule 635, configured to delete the target association relationship from the first association relationship set to obtain a new first association relationship set;
a repeat determination submodule 636, connected to the deletion submodule 635 and configured to repeatedly trigger the second association relationship acquisition submodule and the first associated file determination submodule to perform the following operations until the second associated file determination submodule determines that the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship: in the new first association relationship set, obtain the new first target association relationship set in which the first file appears most often as the predecessor file, and in the new first target association relationship set, obtain a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file is accessed most often;
if the new first association relationship set contains a new target association relationship whose predecessor file is the same as the successor file of the new second association relationship, determine the files in the new target association relationship to be associated files; and delete the new target association relationship to obtain the new first association relationship set.
It should be understood that the file reading apparatus is the apparatus counterpart of the foregoing file reading method; for details, refer to the description of the method above, which is not repeated here.
Referring to FIG. 9, FIG. 9 shows a schematic diagram of the first determination module according to an embodiment of the present disclosure.
As shown in FIG. 9, the first determination module 62 includes:
a first probability acquisition submodule 621, configured to obtain, according to the number of times a second file is accessed and the number of times a third file is accessed after the second file is accessed, a first probability that the third file is accessed after the second file is accessed, where the second file and the third file are any two different files among the multiple files;
a second probability acquisition submodule 622, configured to obtain, according to the number of times the third file is accessed after the second file is accessed and the total number of accesses to all files in the historical access log, a second probability that both the second file and the third file are accessed;
an influence value acquisition submodule 623, configured to obtain, according to the total number of accesses to all files in the historical access log, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, an influence value of access to the second file on access to the third file;
a first determination submodule 624, connected to the first probability acquisition submodule 621, the second probability acquisition submodule 622, and the influence value acquisition submodule 623 and configured to determine that the second file and the third file have the first association relationship when the first probability is greater than a first probability threshold, the second probability is greater than a second probability threshold, and the influence value is greater than an influence threshold.
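The three quantities computed by submodules 621 to 623 resemble the confidence, support, and lift of association-rule mining. This passage lists only the inputs of each quantity, not the formulas, so the formulas below are assumptions consistent with those inputs; the function names, counts, and thresholds are all hypothetical.

```python
def association_metrics(n_second, n_third, n_third_after_second, n_total):
    """Return (first probability, second probability, influence value).

    n_second:             accesses to the second file
    n_third:              accesses to the third file
    n_third_after_second: accesses to the third file after the second file
    n_total:              accesses to all files in the historical access log
    """
    first_probability = n_third_after_second / n_second     # assumed: confidence
    second_probability = n_third_after_second / n_total     # assumed: support
    # Assumed: lift, i.e. how much more likely the third file becomes
    # once the second file has been accessed.
    influence = first_probability / (n_third / n_total)
    return first_probability, second_probability, influence


def has_first_association(metrics, p1_min, p2_min, influence_min):
    """All three values must exceed their thresholds, as in submodule 624."""
    p1, p2, influence = metrics
    return p1 > p1_min and p2 > p2_min and influence > influence_min


metrics = association_metrics(n_second=10, n_third=20,
                              n_third_after_second=8, n_total=100)
related = has_first_association(metrics, p1_min=0.5, p2_min=0.05, influence_min=1.0)
print(related)   # True
```

With these made-up counts the first probability is 8/10, the second is 8/100, and the influence value is about 4, so the pair passes all three thresholds.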
It should be understood that the file reading apparatus is the apparatus counterpart of the foregoing file reading method; for details, refer to the description of the method above, which is not repeated here.
Based on the same technical concept, an embodiment of the present disclosure further provides a server 900. As shown in FIG. 10, the server 900 includes a processor 901, a machine-readable storage medium 902, and a transceiver 903. The machine-readable storage medium 902 stores machine-executable instructions that can be executed by the processor 901 and the transceiver 903, and the processor 901, the transceiver 903, and the machine-readable storage medium 902 can communicate via a system bus 904.
The machine-executable instructions cause the transceiver 903 to: receive a file read request and send the file read request to the processor 901, the file read request including the identifier of the target file to be read;
The machine-executable instructions cause the processor 901 to:
receive the file read request;
look up, according to the identifier of the target file, in the mapping between sub-file identifiers and merged file identifiers included in the locally stored first index information, the target sub-file identifier matching the identifier of the target file and the corresponding target merged file identifier, where the merged file is stored in HDFS and the sub-files in the merged file are associated with one another;
look up, according to the target merged file identifier, in the mapping between merged file identifiers and HDFS storage block identifiers included in the locally stored second index information, the target storage block identifier corresponding to the target merged file identifier;
determine, according to a preset acquisition condition, the number of sub-files associated with the target file that are to be obtained;
The machine-executable instructions further cause the transceiver 903 to:
send a file acquisition request to the HDFS, the file acquisition request containing the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS finds the target merged file corresponding to the target merged file identifier in the target storage block corresponding to the target storage block identifier, and finds in the target merged file the target file and as many associated files as the number of sub-files;
receive the target file and associated files returned by the HDFS, and send the target file and associated files returned by the HDFS to the processor 901;
The machine-executable instructions further cause the processor 901 to: receive and cache the target file and associated files returned by the HDFS and sent by the transceiver 903.
Optionally, the machine-executable instructions cause the processor 901 to:
obtain a historical access log of multiple files, the historical access log including the access times and access counts of the multiple files;
for each file of the multiple files, determine, according to the access times and access counts of the multiple files, among the files other than that file, at least one file that has an access association with that file after that file is accessed, and determine multiple first association relationships of that file, where a first association relationship represents the access association between that file and any one of the at least one file;
obtain, according to the first association relationships of each of the multiple files, the first file having the largest number of first association relationships, and determine, according to the multiple first association relationships of the first file, at least one associated file among the multiple files that is accessed in sequence after the first file is accessed;
store the first file and the at least one associated file in a first merged file.
Optionally, the machine-executable instructions cause the processor 901 to:
delete, from the first association relationships of each of the multiple files, the first association relationships used in determining the at least one associated file, so as to obtain the remaining first association relationships; and obtain, according to the remaining first association relationships, a new first file having the largest number of first association relationships;
repeat, among the multiple files, the process of determining, according to the multiple first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and the at least one associated file accessed in sequence after it in a new first merged file, until no remaining first association relationship can be obtained.
Optionally, of the two associated files recorded in a first association relationship, one is a predecessor file and the other is a successor file, the successor file being a file that is accessed after the predecessor file is accessed;
The machine-executable instructions cause the processor 901 to:
Obtain a first association relationship set containing the first association relationships of each of the plurality of files;
In the first association relationship set, obtain the first target association relationship set in which the first file appears most often as the predecessor file, and obtain, from the first target association relationship set, a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file has been accessed the most times;
If the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determine, from the third association relationships, the target association relationship whose successor file appears most often, and determine the files in the target association relationship to be associated files;
If the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship to be the associated file.
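As an illustration only, the selection of associated files described above can be sketched in Python. The sketch assumes that each first association relationship is a (predecessor, successor) pair and that access counts are taken from the historical access log. The function name `pick_associated_files`, the data layout, and the reading of "appears most often" as the number of times a file occurs as a successor in the set are assumptions made for this example, not details given in the disclosure.

```python
from collections import Counter


def pick_associated_files(associations, access_counts):
    """Return (first_file, associated_files) for one selection round.

    associations  -- list of (predecessor, successor) pairs (assumed layout)
    access_counts -- dict mapping a file to its access count (assumed layout)
    """
    # First file: the file that appears most often as a predecessor.
    pred_counts = Counter(pred for pred, _ in associations)
    first_file = pred_counts.most_common(1)[0][0]

    # First target association relationship set: pairs led by the first file.
    target_set = [a for a in associations if a[0] == first_file]

    # Second association relationship: the pair whose successor was accessed most.
    second = max(target_set, key=lambda a: access_counts.get(a[1], 0))

    # Third association relationships: pairs whose predecessor is that successor.
    third = [a for a in associations if a[0] == second[1]]
    if not third:
        # No continuation: only the successor of the second relationship is kept.
        return first_file, [second[1]]

    # Target association relationship: the continuation whose successor occurs
    # most often as a successor in the whole set (assumed interpretation).
    succ_counts = Counter(succ for _, succ in associations)
    target = max(third, key=lambda a: succ_counts[a[1]])
    return first_file, [target[0], target[1]]
```

With the pairs (a, b), (a, c), (a, d), (c, e), (c, d) and file c being the most accessed successor of a, the sketch selects a as the first file and c and d as its associated files.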
Optionally, the machine-executable instructions cause the processor 901 to:
Delete the target association relationship from the first association relationship set to obtain a new first association relationship set;
Repeat the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:
In the new first association relationship set, obtain a new first target association relationship set in which the first file appears most often as the predecessor file, and obtain, from the new first target association relationship set, a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file has been accessed the most times;
If the new first association relationship set contains new third association relationships whose predecessor file is the same as the successor file of the new second association relationship, determine, from the new third association relationships, the new target association relationship whose successor file appears most often, and determine the files in the new target association relationship to be associated files; and delete the new target association relationship to obtain the updated new first association relationship set.
Optionally, the machine-executable instructions cause the processor 901 to:
Determine the plurality of first association relationships of a file as follows:
Obtain, according to the access count of a second file and the number of times a third file is accessed after the second file is accessed, a first probability that the third file is accessed after the second file is accessed, where the second file and the third file are any two different files of the plurality of files;
Obtain, according to the number of times the third file is accessed after the second file is accessed and the total number of accesses to all files in the historical access log, a second probability that both the second file and the third file are accessed;
Obtain, according to the total number of accesses to all files in the historical access log, the number of times the third file is accessed after the second file is accessed, the access count of the second file, and the access count of the third file, an influence value of access to the second file on access to the third file;
When the first probability is greater than a first probability threshold, the second probability is greater than a second probability threshold, and the influence value is greater than an influence threshold, determine that the second file and the third file have the first association relationship.
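The three-threshold test above can be sketched as follows. This section names the inputs of each quantity but does not give the formulas, so the sketch assumes the usual association-rule measures: the first probability as a conditional frequency (confidence), the second probability as a joint frequency (support), and the influence value as their ratio to the independent case (lift). The function name, parameter names, and these formulas are assumptions for illustration.

```python
def has_first_association(n_second, n_third, n_third_after_second, n_total,
                          first_threshold, second_threshold, influence_threshold):
    """Decide whether the second and third files have a first association.

    n_second             -- access count of the second file
    n_third              -- access count of the third file
    n_third_after_second -- times the third file was accessed after the second
    n_total              -- total accesses to all files in the access log
    """
    if n_second == 0 or n_third == 0 or n_total == 0:
        return False  # no usable statistics for this pair

    # First probability: third file accessed, given the second was accessed.
    first_probability = n_third_after_second / n_second
    # Second probability: both files accessed, relative to all accesses.
    second_probability = n_third_after_second / n_total
    # Influence value (assumed to be lift): values above 1 suggest that access
    # to the second file makes access to the third file more likely.
    influence = (n_third_after_second * n_total) / (n_second * n_third)

    return (first_probability > first_threshold
            and second_probability > second_threshold
            and influence > influence_threshold)
```

For example, with 10 accesses to the second file, 8 to the third, 6 of them following the second file, and 100 accesses in total, the three values are 0.6, 0.06 and 7.5.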
Optionally, the machine-executable instructions further cause the transceiver 903 to: send the first merged file to the HDFS, receive from the HDFS a first storage block identifier of the storage block that stores the first merged file, and send the first storage block identifier of the first merged file to the processor 901;
The machine-executable instructions further cause the processor 901 to: receive the first storage block identifier of the first merged file, and create first index information containing the mapping between the first file identifier and the first merged file identifier, and second index information containing the mapping between the first merged file identifier and the first storage block identifier.
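A minimal sketch of the two indexes, assuming both are kept as in-memory dictionaries: the first maps each sub-file identifier to its merged file identifier, and the second maps each merged file identifier to the storage block identifier returned by the HDFS. The function names and the identifier values in the usage note are hypothetical.

```python
def create_indexes(sub_file_ids, merged_file_id, storage_block_id):
    """Build the first and second index information for one merged file."""
    # First index: sub-file identifier -> merged file identifier.
    first_index = {sub_id: merged_file_id for sub_id in sub_file_ids}
    # Second index: merged file identifier -> storage block identifier.
    second_index = {merged_file_id: storage_block_id}
    return first_index, second_index


def locate(file_id, first_index, second_index):
    """Resolve a file identifier to (merged file id, storage block id)."""
    merged_file_id = first_index[file_id]
    return merged_file_id, second_index[merged_file_id]
```

Calling `create_indexes(["f1", "f2"], "m1", "blk_7")` and then `locate("f2", ...)` on the result yields the pair `("m1", "blk_7")`, which is the information a later file acquisition request would carry.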
Optionally, the machine-executable instructions cause the processor 901 to:
When the next received file read request includes a file associated with the target file, and the file associated with the target file is stored in the cache, read the file associated with the target file from the cache.
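The cache behaviour can be sketched as below. The class `CachingReader` and the callable `fetch_from_hdfs` are hypothetical stand-ins: the callable is assumed to return the requested file together with its associated files as a dictionary, which takes the place of the file acquisition request sent to the HDFS.

```python
class CachingReader:
    """Serve reads from a local cache, falling back to a fetch callable."""

    def __init__(self, fetch_from_hdfs):
        # Hypothetical callable: file_id -> {file_id: content, ...} containing
        # the target file and the associated files returned with it.
        self._fetch = fetch_from_hdfs
        self._cache = {}

    def read(self, file_id):
        # An associated file cached by an earlier request is served locally.
        if file_id in self._cache:
            return self._cache[file_id]
        # Otherwise fetch the target and its associated files, then cache all.
        returned = self._fetch(file_id)
        self._cache.update(returned)
        return returned[file_id]
```

If the first read returns files a and b together, a later read of b is answered from the cache without a second request to the backend.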
The machine-readable storage medium 902 mentioned herein may be any electronic, magnetic, optical, or other physical storage system that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination thereof.
Based on the same technical concept, an embodiment of the present disclosure further provides a machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the steps of any of the file reading methods shown in FIGS. 1-5.
Based on the same technical concept, an embodiment of the present disclosure further provides machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the steps of any of the file reading methods shown in FIGS. 1-5.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (22)

  1. A file reading method, comprising:
    receiving a file read request, the file read request including an identifier of a target file to be read;
    searching, according to the identifier of the target file, the mapping between sub-file identifiers and merged file identifiers included in locally stored first index information for a target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier, wherein merged files are stored in a Hadoop Distributed File System (HDFS) and the sub-files in a merged file are associated with one another;
    searching, according to the target merged file identifier, the mapping between merged file identifiers and storage block identifiers of the HDFS included in locally stored second index information for a target storage block identifier corresponding to the target merged file identifier;
    determining, according to a preset acquisition condition, a number of sub-files to be acquired that are associated with the target file, and sending a file acquisition request to the HDFS, the file acquisition request containing the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files equal in number to the number of sub-files;
    receiving and caching the target file and the associated files returned by the HDFS.
  2. The method according to claim 1, further comprising:
    obtaining a historical access log of a plurality of files, the historical access log including the access times and access counts of the plurality of files;
    for each file of the plurality of files, determining, according to the access times and access counts of the plurality of files and from among the files other than that file, at least one file that has an access association with that file after that file is accessed, and determining a plurality of first association relationships of that file, wherein each first association relationship indicates the access association between that file and any one of the at least one file;
    obtaining, according to the first association relationships of each of the plurality of files, the first file having the largest number of first association relationships, and determining, according to the plurality of first association relationships of the first file, at least one associated file among the plurality of files that is accessed in sequence after the first file is accessed;
    storing the first file and the at least one associated file in a first merged file.
  3. The method according to claim 2, further comprising:
    deleting, from the first association relationships of each of the plurality of files, the first association relationships that were used to determine the at least one associated file, to obtain the remaining first association relationships; and obtaining, according to the remaining first association relationships, a new first file having the largest number of first association relationships;
    among the plurality of files, repeating the process of determining, according to the plurality of first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and that at least one associated file in a new first merged file, until no remaining first association relationship can be obtained.
  4. The method according to claim 2, wherein, of the two associated files recorded in a first association relationship, one is a predecessor file and the other is a successor file, the successor file being a file that is accessed after the predecessor file is accessed; and wherein obtaining, according to the first association relationships of each of the plurality of files, the first file having the largest number of first association relationships, and determining, according to the plurality of first association relationships of the first file, at least one associated file among the plurality of files that is accessed in sequence after the first file is accessed, comprises:
    obtaining a first association relationship set containing the first association relationships of each of the plurality of files;
    in the first association relationship set, obtaining the first target association relationship set in which the first file appears most often as the predecessor file, and obtaining, from the first target association relationship set, a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file has been accessed the most times;
    if the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determining, from the third association relationships, the target association relationship whose successor file appears most often, and determining the files in the target association relationship to be associated files;
    if the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determining the successor file of the second association relationship to be the associated file.
  5. The method according to claim 4, further comprising, after determining the files in the target association relationship to be associated files:
    deleting the target association relationship from the first association relationship set to obtain a new first association relationship set;
    repeating the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:
    in the new first association relationship set, obtaining a new first target association relationship set in which the first file appears most often as the predecessor file, and obtaining, from the new first target association relationship set, a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file has been accessed the most times;
    if the new first association relationship set contains new third association relationships whose predecessor file is the same as the successor file of the new second association relationship, determining, from the new third association relationships, the new target association relationship whose successor file appears most often, and determining the files in the new target association relationship to be associated files; and deleting the new target association relationship to obtain the updated new first association relationship set.
  6. The method according to claim 2, wherein the plurality of first association relationships of a file are determined as follows:
    obtaining, according to the access count of a second file and the number of times a third file is accessed after the second file is accessed, a first probability that the third file is accessed after the second file is accessed, wherein the second file and the third file are any two different files of the plurality of files;
    obtaining, according to the number of times the third file is accessed after the second file is accessed and the total number of accesses to all files in the historical access log, a second probability that both the second file and the third file are accessed;
    obtaining, according to the total number of accesses to all files in the historical access log, the number of times the third file is accessed after the second file is accessed, the access count of the second file, and the access count of the third file, an influence value of access to the second file on access to the third file;
    when the first probability is greater than a first probability threshold, the second probability is greater than a second probability threshold, and the influence value is greater than an influence threshold, determining that the second file and the third file have the first association relationship.
  7. The method according to claim 2, further comprising:
    sending the first merged file to the HDFS, and receiving from the HDFS a first storage block identifier of the storage block that stores the first merged file;
    creating first index information containing the mapping between the first file identifier and the first merged file identifier, and second index information containing the mapping between the first merged file identifier and the first storage block identifier.
  8. The method according to claim 1, further comprising:
    when the next received file read request includes a file associated with the target file, and the file associated with the target file is stored in the cache, reading the file associated with the target file from the cache.
  9. A file reading device, comprising:
    a receiving module, configured to receive a file read request, the file read request including an identifier of a target file to be read;
    a first search module, connected to the receiving module and configured to search, according to the identifier of the target file, the mapping between sub-file identifiers and merged file identifiers included in locally stored first index information for a target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier, wherein merged files are stored in a Hadoop Distributed File System (HDFS) and the sub-files in a merged file are associated with one another;
    a second search module, connected to the first search module and configured to search, according to the target merged file identifier, the mapping between merged file identifiers and storage block identifiers of the HDFS included in locally stored second index information for a target storage block identifier corresponding to the target merged file identifier;
    a sending module, connected to the second search module and configured to determine, according to a preset acquisition condition, a number of sub-files to be acquired that are associated with the target file, and to send a file acquisition request to the HDFS, the file acquisition request containing the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files equal in number to the number of sub-files;
    a cache module, connected to the sending module and configured to receive and cache the target file and the associated files returned by the HDFS.
  10. The device according to claim 9, further comprising:
    a first acquisition module, configured to obtain a historical access log of a plurality of files, the historical access log including the access times and access counts of the plurality of files;
    a first determination module, connected to the first acquisition module and configured to, for each file of the plurality of files, determine, according to the access times and access counts of the plurality of files and from among the files other than that file, at least one file that has an access association with that file after that file is accessed, and determine a plurality of first association relationships of that file, wherein each first association relationship indicates the access association between that file and any one of the at least one file;
    a second determination module, connected to the first determination module and configured to obtain, according to the first association relationships of each of the plurality of files, the first file having the largest number of first association relationships, and to determine, according to the plurality of first association relationships of the first file, at least one associated file among the plurality of files that is accessed in sequence after the first file is accessed;
    a storage module, connected to the second determination module and configured to store the first file and the at least one associated file in a first merged file.
  11. The device according to claim 10, further comprising:
    a second acquisition module, connected to the storage module and configured to delete, from the first association relationships of each of the plurality of files, the first association relationships that were used to determine the at least one associated file, to obtain the remaining first association relationships, and to obtain, according to the remaining first association relationships, a new first file having the largest number of first association relationships;
    a third determination module, connected to the second acquisition module and configured to trigger the second determination module to repeat, among the plurality of files, the process of determining, according to the plurality of first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and that at least one associated file in a new first merged file, until the second acquisition module can obtain no remaining first association relationship.
  12. The device according to claim 10, wherein, of the two associated files recorded in a first association relationship, one is a predecessor file and the other is a successor file, the successor file being a file that is accessed after the predecessor file is accessed; and wherein the second determination module comprises:
    a first association relationship acquisition submodule, configured to obtain a first association relationship set containing the first association relationships of each of the plurality of files;
    a second association relationship acquisition submodule, connected to the first association relationship acquisition submodule and configured to obtain, in the first association relationship set, the first target association relationship set in which the first file appears most often as the predecessor file, and to obtain, from the first target association relationship set, a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file has been accessed the most times;
    a first associated file determination submodule, connected to the second association relationship acquisition submodule and configured to, if the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determine, from the third association relationships, the target association relationship whose successor file appears most often, and determine the files in the target association relationship to be associated files;
    a second associated file determination submodule, connected to the second association relationship acquisition submodule and configured to, if the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship to be the associated file.
  13. The device according to claim 12, wherein the second determination module further comprises:
    a deletion submodule, configured to delete the target association relationship from the first association relationship set to obtain a new first association relationship set;
    a repeat determination submodule, connected to the deletion submodule and configured to repeatedly trigger the second association relationship acquisition submodule and the first associated file determination submodule to perform the following operations until the second associated file determination submodule determines that the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:
    in the new first association relationship set, obtaining a new first target association relationship set in which the first file appears most often as the predecessor file, and obtaining, from the new first target association relationship set, a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file has been accessed the most times;
    if the new first association relationship set contains new third association relationships whose predecessor file is the same as the successor file of the new second association relationship, determining, from the new third association relationships, the new target association relationship whose successor file appears most often, and determining the files in the new target association relationship to be associated files; and deleting the new target association relationship to obtain the updated new first association relationship set.
  14. A server, comprising a processor, a machine-readable storage medium, and a transceiver, the machine-readable storage medium storing machine-executable instructions executable by the processor and the transceiver; the machine-executable instructions cause the transceiver to: receive a file read request and send the file read request to the processor, the file read request including an identifier of a target file to be read;
    the machine-executable instructions cause the processor to:
    receive the file read request;
    search, according to the identifier of the target file, the mapping between sub-file identifiers and merged file identifiers included in locally stored first index information for a target sub-file identifier matching the identifier of the target file and for the corresponding target merged file identifier, wherein merged files are stored in the Hadoop Distributed File System (HDFS) and the sub-files in a merged file are associated with one another;
    search, according to the target merged file identifier, the mapping between merged file identifiers and HDFS storage block identifiers included in locally stored second index information for a target storage block identifier corresponding to the target merged file identifier;
    determine, according to a preset acquisition condition, the number of sub-files associated with the target file that are to be acquired;
    the machine-executable instructions further cause the transceiver to:
    send a file acquisition request to the HDFS, the file acquisition request including the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files equal in number to the number of sub-files;
    receive the target file and the associated files returned by the HDFS, and send the target file and the associated files returned by the HDFS to the processor;
    the machine-executable instructions further cause the processor to: receive and cache the target file and the associated files returned by the HDFS and sent by the transceiver.
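For illustration only, the two-step index lookup recited in claim 14 can be sketched in Python. The dictionaries first_index and second_index, the FileFetchRequest type, and the function build_fetch_request are hypothetical names introduced for this sketch. The claim does not prescribe any particular data structure, and the exchange with the HDFS itself is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FileFetchRequest:
    """Hypothetical container for the fields the claim places in the request."""
    block_id: str          # target storage block identifier
    sub_file_id: str       # target sub-file identifier
    merged_file_id: str    # target merged file identifier
    associated_count: int  # number of associated sub-files to prefetch


def build_fetch_request(
    target_id: str,
    first_index: Dict[str, str],   # assumed shape: sub-file id -> merged file id
    second_index: Dict[str, str],  # assumed shape: merged file id -> block id
    associated_count: int,
) -> Optional[FileFetchRequest]:
    """Resolve a target file to its merged file and storage block.

    Returns None when either index has no entry for the lookup key.
    """
    merged_file_id = first_index.get(target_id)
    if merged_file_id is None:
        return None
    block_id = second_index.get(merged_file_id)
    if block_id is None:
        return None
    return FileFetchRequest(block_id, target_id, merged_file_id, associated_count)
```

Both lookups are served from locally stored index information, so in this sketch the only remote call left to make is the single file acquisition request that carries the resolved identifiers.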
  15. The server according to claim 14, wherein the machine-executable instructions cause the processor to:
    obtain historical access logs of a plurality of files, the historical access logs including the access times and access counts of the plurality of files;
    for each file of the plurality of files, determine, according to the access times and access counts of the plurality of files, among the files other than that file, at least one file that has an access association with that file after that file is accessed, and determine a plurality of first association relationships of that file, wherein a first association relationship represents the access association between that file and any one of the at least one file;
    obtain, according to the first association relationships of the files in the plurality of files, a first file having the largest number of first association relationships, and determine, among the plurality of files and according to the plurality of first association relationships of the first file, at least one associated file that is accessed in sequence after the first file is accessed;
    store the first file and the at least one associated file in a first merged file.
  16. The server according to claim 15, wherein the machine-executable instructions cause the processor to:
    delete, from the first association relationships of the files in the plurality of files, the first association relationships used in determining the at least one associated file, to obtain the remaining first association relationships; and obtain, according to the remaining first association relationships, a new first file having the largest number of first association relationships;
    repeat, among the plurality of files, the process of determining, according to the plurality of first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and the at least one associated file accessed in sequence after the new first file in a new first merged file, until no remaining first association relationship can be obtained.
  17. The server according to claim 15, wherein, of the two associated files recorded in a first association relationship, one is a predecessor file and the other is a successor file, the successor file being a file accessed after the predecessor file is accessed; the machine-executable instructions cause the processor to:
    obtain a first association relationship set containing the first association relationships of the files in the plurality of files;
    in the first association relationship set, obtain a first target association relationship set in which the first file appears most often as the predecessor file, and, in the first target association relationship set, obtain a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file has been accessed the most times;
    if the first association relationship set contains a third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine, from the third association relationships, the target association relationship whose successor file occurs most often, and determine the files in the target association relationship as associated files;
    if the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship as an associated file.
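One possible reading of the selection step in claim 17 is sketched below in Python, for illustration only. Association relationships are modeled as (predecessor, successor) tuples, and the function name pick_associated_files is hypothetical. Two points are assumptions of this sketch rather than statements of the claim: the first target set is taken to be every relationship whose predecessor is the first file, and the "occurrence count" of a successor file is taken to be the number of relationships in the set in which it appears as a successor.

```python
from collections import Counter
from typing import Dict, List, Tuple

Association = Tuple[str, str]  # (predecessor file, successor file)


def pick_associated_files(
    associations: List[Association],
    first_file: str,
    access_counts: Dict[str, int],
) -> List[str]:
    """Return the associated file(s) selected for first_file."""
    # First target set: relationships whose predecessor is the first file.
    first_target = [a for a in associations if a[0] == first_file]
    if not first_target:
        return []

    # Second relationship: the one whose successor was accessed most often.
    second = max(first_target, key=lambda a: access_counts.get(a[1], 0))

    # Third relationships: predecessor equals the successor of `second`.
    third = [a for a in associations if a[0] == second[1]]
    if not third:
        # No continuation: the successor of `second` is the associated file.
        return [second[1]]

    # Assumed reading of "occurs most often": count appearances as successor.
    successor_occurrences = Counter(succ for _, succ in associations)
    target = max(third, key=lambda a: successor_occurrences[a[1]])
    return [target[0], target[1]]
```

Ties are broken by list order in this sketch, because Python's max returns the first maximal element; the claim does not state a tie-breaking rule.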
  18. The server according to claim 17, wherein the machine-executable instructions cause the processor to:
    delete the target association relationship from the first association relationship set to obtain a new first association relationship set;
    repeat the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:
    in the new first association relationship set, obtain a new first target association relationship set in which the first file appears most often as the predecessor file, and, in the new first target association relationship set, obtain a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file has been accessed the most times;
    if the new first association relationship set contains a new third association relationship whose predecessor file is the same as the successor file of the new second association relationship, determine, from the new third association relationships, the new target association relationship whose successor file occurs most often, and determine the files in the new target association relationship as associated files; and delete the new target association relationship to obtain the new first association relationship set.
  19. The server according to claim 15, wherein the machine-executable instructions cause the processor to:
    determine the plurality of first association relationships of a file in the following manner:
    obtain, according to the number of times a second file is accessed and the number of times a third file is accessed after the second file is accessed, a first probability that the third file is accessed after the second file is accessed, wherein the second file and the third file are any two different files of the plurality of files;
    obtain, according to the number of times the third file is accessed after the second file is accessed and the total number of times all files in the historical access logs are accessed, a second probability that both the second file and the third file are accessed;
    obtain, according to the total number of times all files in the historical access logs are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, an influence value of the access to the second file on the access to the third file;
    when the first probability is greater than a first probability threshold, the second probability is greater than a second probability threshold, and the influence value is greater than the influence threshold, determine that the second file and the third file have the first association relationship.
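The three quantities in claim 19 can be illustrated with a short Python sketch. The claim names the inputs to each quantity but does not give formulas, so the ratios below are assumptions: the first probability is computed as a conditional frequency, the second as a joint frequency, and the influence value as a lift-style ratio. The names AssociationMetrics, association_metrics, and has_first_association are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationMetrics:
    first_probability: float   # assumed: pair_count / second_count
    second_probability: float  # assumed: pair_count / total_accesses
    influence: float           # assumed lift-style ratio


def association_metrics(
    total_accesses: int,  # total accesses of all files in the log
    second_count: int,    # accesses of the second file
    third_count: int,     # accesses of the third file
    pair_count: int,      # accesses of the third file after the second file
) -> AssociationMetrics:
    if min(total_accesses, second_count, third_count) <= 0:
        raise ValueError("access counts must be positive")
    first = pair_count / second_count
    second = pair_count / total_accesses
    # Assumed influence: joint frequency divided by the product of the two
    # individual frequencies; values above 1 suggest positive association.
    influence = (pair_count * total_accesses) / (second_count * third_count)
    return AssociationMetrics(first, second, influence)


def has_first_association(
    m: AssociationMetrics,
    first_threshold: float,
    second_threshold: float,
    influence_threshold: float,
) -> bool:
    """All three quantities must exceed their thresholds, as in the claim."""
    return (
        m.first_probability > first_threshold
        and m.second_probability > second_threshold
        and m.influence > influence_threshold
    )
```

For example, with 100 total accesses, 20 accesses of the second file, 10 of the third file, and 8 accesses of the third file after the second, this sketch gives a first probability of 0.4, a second probability of 0.08, and an influence value of 4.0.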
  20. The server according to claim 15, wherein:
    the machine-executable instructions further cause the transceiver to: send the first merged file to the HDFS, receive from the HDFS a first storage block identifier of the storage block storing the first merged file, and send the first storage block identifier of the first merged file to the processor;
    the machine-executable instructions further cause the processor to: receive the first storage block identifier of the first merged file, and create first index information containing the mapping between the first file identifier and the first merged file identifier, and second index information containing the mapping between the first merged file identifier and the first storage block identifier.
  21. The server according to claim 14, wherein the machine-executable instructions cause the processor to:
    when a next received file read request includes a file associated with the target file, and the file associated with the target file is stored in the cache, read the file associated with the target file from the cache.
  22. A machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
PCT/CN2019/126003 2018-12-17 2019-12-17 File reading WO2020125630A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811541620.0A CN109766318B (en) 2018-12-17 2018-12-17 File reading method and device
CN201811541620.0 2018-12-17

Publications (1)

Publication Number Publication Date
WO2020125630A1 true WO2020125630A1 (en) 2020-06-25

Family

ID=66450771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126003 WO2020125630A1 (en) 2018-12-17 2019-12-17 File reading

Country Status (2)

Country Link
CN (1) CN109766318B (en)
WO (1) WO2020125630A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766318B (en) * 2018-12-17 2021-03-02 新华三大数据技术有限公司 File reading method and device
CN110826697B (en) * 2019-10-31 2023-06-06 深圳市商汤科技有限公司 Method and device for acquiring sample, electronic equipment and storage medium
CN113553306B (en) * 2021-07-27 2023-07-21 重庆紫光华山智安科技有限公司 Data processing method and data storage management system
CN114489510A (en) * 2022-01-28 2022-05-13 维沃移动通信有限公司 Data reading method and device
CN116991333B (en) * 2023-09-25 2024-01-26 苏州元脑智能科技有限公司 Distributed data storage method, device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577123A (en) * 2013-11-12 2014-02-12 河海大学 Small file optimization storage method based on HDFS
CN105573667A (en) * 2015-12-10 2016-05-11 华为技术有限公司 Data reading method and storage server
WO2016202199A1 (en) * 2015-06-18 2016-12-22 阿里巴巴集团控股有限公司 Distributed file system and file meta-information management method thereof
CN108804566A (en) * 2018-05-22 2018-11-13 广东技术师范学院 A kind of mass small documents read method based on Hadoop
CN109766318A (en) * 2018-12-17 2019-05-17 新华三大数据技术有限公司 File reading and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9576000B2 (en) * 2014-04-25 2017-02-21 International Business Machines Corporation Adaptive fragment assignment for processing file data in a database
KR101672901B1 (en) * 2014-12-03 2016-11-07 충북대학교 산학협력단 Cache Management System for Enhancing the Accessibility of Small Files in Distributed File System
CN104679898A (en) * 2015-03-18 2015-06-03 成都汇智远景科技有限公司 Big data access method
WO2016183545A1 (en) * 2015-05-14 2016-11-17 Walleye Software, LLC Distributed and optimized garbage collection of remote and exported table handle links to update propagation graph nodes
CN105843841A (en) * 2016-03-07 2016-08-10 青岛理工大学 Small file storing method and system
CN107168802A (en) * 2017-05-18 2017-09-15 郑州云海信息技术有限公司 The merging method and device of a kind of cloud storage small file
CN108363643B (en) * 2018-03-27 2021-06-15 东北大学 HDFS copy management method based on file access heat
CN108595567A (en) * 2018-04-13 2018-09-28 郑州云海信息技术有限公司 A kind of merging method of small documents, device, equipment and readable storage medium storing program for executing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张春明 等 (ZHANG, CHUNMING ET AL.): "一种Hadoop小文件存储和读取的方法 (Non-official translation: Method for Storing and Reading Hadoop Small Files)", 计算机应用与软件 (COMPUTER APPLICATIONS AND SOFTWARE), vol. 29, no. 11, 30 November 2012 (2012-11-30), pages 95 - 100, DOI: 20200304161505X *
许俊杰 (XU, JUNJIE): "海量小文件存储系统的研究与实现 (Non-official translation: Research and Implementation of Massive Small File Storage System)", 中国优秀硕士学位论文全文数据库(信息科技辑) (CHINESE MASTER’S THESES FULL-TEXT DATABASE INFORMATION & TECHNOLOGY), no. 10, 15 October 2018 (2018-10-15), DOI: 20200304161259X *
顾玉宛 等 (GU, YUWAN, ETC.): "一种面向HDFS中海量小文件的存取优化方法 (Non-official translation: Optimization of Massive Small Files Storage and Accessing on HDFS)", 计算机应用研究 (APPLICATION RESEARCH OF COMPUTERS), vol. 34, no. 8, 31 August 2017 (2017-08-31), DOI: 20200307084440A *

Also Published As

Publication number Publication date
CN109766318B (en) 2021-03-02
CN109766318A (en) 2019-05-17

Similar Documents

Publication Publication Date Title
WO2020125630A1 (en) File reading
US10754562B2 (en) Key value based block device
CN109213772B (en) Data storage method and NVMe storage system
KR102462781B1 (en) KVS tree database
US10275489B1 (en) Binary encoding-based optimizations at datastore accelerators
US10114908B2 (en) Hybrid table implementation by using buffer pool as permanent in-memory storage for memory-resident data
US9411840B2 (en) Scalable data structures
JP5996088B2 (en) Cryptographic hash database
CN106294190B (en) Storage space management method and device
WO2020186549A1 (en) Metadata management method, system and medium
US10210191B2 (en) Accelerated access to objects in an object store implemented utilizing a file storage system
CN105677826A (en) Resource management method for massive unstructured data
US9262511B2 (en) System and method for indexing streams containing unstructured text data
CN104850572A (en) HBase non-primary key index building and inquiring method and system
CN103595797B (en) Caching method for distributed storage system
US10503693B1 (en) Method and system for parallel file operation in distributed data storage system with mixed types of storage media
CN109144413A (en) A kind of metadata management method and device
US11775480B2 (en) Method and system for deleting obsolete files from a file system
WO2023179787A1 (en) Metadata management method and apparatus for distributed file system
US10515055B2 (en) Mapping logical identifiers using multiple identifier spaces
US10146833B1 (en) Write-back techniques at datastore accelerators
WO2020215580A1 (en) Distributed global data deduplication method and device
WO2021037072A1 (en) Buffer information updating method and apparatus, device, and medium
WO2024021808A1 (en) Data query request processing method and apparatus, device and storage medium
CN111752941B (en) Data storage and access method and device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19900518

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19900518

Country of ref document: EP

Kind code of ref document: A1