WO2020125630A1

WO2020125630A1 - File reading

Info

Publication number: WO2020125630A1
Application number: PCT/CN2019/126003
Authority: WO
Inventors: 王勇
Original assignee: 新华三大数据技术有限公司
Priority date: 2018-12-17
Filing date: 2019-12-17
Publication date: 2020-06-25
Also published as: CN109766318B; CN109766318A

Abstract

A file reading request is received, wherein the file reading request comprises an identifier of a target file to be read; according to the identifier of the target file, a target sub-file identifier matching the identifier of the target file and a corresponding target merge file identifier are searched for in a mapping relationship, between sub-file identifiers and merge file identifiers, comprised in locally stored first index information; according to the target merge file identifier, a target storage block identifier corresponding to the target merge file identifier is searched for in a mapping relationship, between merge file identifiers and storage block identifiers of an HDFS, comprised in locally stored second index information; and the number of sub-files to be acquired that are associated with the target file is determined according to a pre-set acquisition condition, a file acquisition request is sent to the HDFS, and the target file and the associated files returned by the HDFS are received and cached.

Description

File reading

This application claims the priority of the Chinese patent application filed on December 17, 2018 in the Chinese Patent Office with the application number 201811541620.0 and the invention titled "File Reading Method and Device", the entire contents of which are incorporated by reference in this application.

Background

With the advent of the era of big data, a large amount of data is generated every day in fields such as e-commerce, social networking sites, and scientific research and computing. Traditional stand-alone systems cannot handle the resulting storage and data analysis problems. To improve the storage efficiency of large amounts of data, distributed storage systems are now commonly used to store data in a distributed manner.

Current distributed storage systems generally adopt Hadoop, an open source distributed system infrastructure, as the storage technology. Each file stored in the Hadoop Distributed File System (HDFS) needs to correspond to a block, and the master node (NameNode) in HDFS establishes a mapping relationship between each file and its corresponding block.

Brief description of the drawings

FIG. 1-1 shows a flowchart of a file reading method according to an embodiment of the present disclosure;

FIG. 1-2 shows a schematic diagram of a possible application system architecture according to Embodiment 1 of the present disclosure;

FIG. 2 shows a flowchart of a file reading method according to an embodiment of the present disclosure;

FIG. 3 shows a flowchart of determining a first association relationship according to an embodiment of the present disclosure;

FIG. 4 shows a flowchart of an associated file acquisition method according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of a process for acquiring a file association relationship according to an embodiment of the present disclosure;

FIG. 6 shows a block diagram of a file reading device according to an embodiment of the present disclosure;

FIG. 7 shows a block diagram of a file reading device according to an embodiment of the present disclosure;

FIG. 8 shows a schematic diagram of a second determination module according to an embodiment of the present disclosure;

FIG. 9 shows a schematic diagram of a first determination module according to an embodiment of the present disclosure;

FIG. 10 shows a structural block diagram of a server according to an embodiment of the present disclosure.

Detailed description

Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the drawings. The same reference numerals in the drawings denote elements having the same or similar functions. Although various aspects of the embodiments are shown in the drawings, unless specifically noted, the drawings are not necessarily drawn to scale.

The word "exemplary" as used herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" need not be construed as superior to or better than other embodiments.

In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following specific implementations. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some examples, methods, means, components and circuits well known to those skilled in the art are not described in detail in order to highlight the gist of the present disclosure.

The inventor found that accessing a large number of files with a small amount of data through HDFS usually has the following problems:

HDFS is better suited to storing files with a large amount of data (for example, files larger than 64 MB or 128 MB), which allow the storage resources of HDFS to be fully utilized. If HDFS stores a large number of files smaller than 64 MB (such as pictures and documents of only 10 KB to 10 MB), these files are much smaller than the HDFS block size, and a large number of such small files occupy more storage blocks, which reduces the storage resource utilization of HDFS. Moreover, the more files are stored in HDFS, the more mapping relationships must be established and the more memory of the master node is occupied, which greatly reduces the efficiency with which HDFS accesses data.

In HDFS, the metadata information of the stored files (metadata information describes data attributes and is a kind of electronic directory, for example, the tree-like directory structure, file attributes, and the mapping relationship between files and data blocks) is usually stored in the NameNode, which causes a memory bottleneck in the NameNode. Moreover, reading a large number of files with a small amount of data causes the client to communicate frequently with the NameNode, which in turn reduces the I/O performance of the NameNode. It can be seen that, when files with a small amount of data are read from HDFS, the read granularity is small and the storage space of a large number of such files lacks continuity, making it difficult to exploit the advantage of HDFS in sequential file access.

Based on the above problems, the present disclosure proposes a file reading method to improve the efficiency of reading files through HDFS. The file may be a file with a small amount of data or a file with a large amount of data, and the disclosure is not particularly limited.

Please refer to FIG. 1-1, which shows a flowchart of a file reading method according to an embodiment of the present disclosure.

Please refer to FIG. 1-2, which shows a schematic diagram of a system architecture of a file reading method according to an embodiment of the present disclosure.

The method shown in Figure 1-1 can be applied to server 1 to enable server 1 to read files from HDFS2.

In a possible implementation, the system may include server 1 and HDFS2. The server 1 may be a client server, and a user accesses the server 1 through the client, so that the server 1 uses the file reading method of the embodiment of the present disclosure to read files from the HDFS2.

In a possible implementation manner, the system may include server 1, server 3, and HDFS2. The method can also be applied to other servers. For example, the user can call the resources of the server 3 through the server 1 to execute the method, thereby obtaining the target file and the associated file.

In other embodiments, the method described in the present disclosure can also be applied to other processing devices (such as terminals) that can perform calculations, and the system architecture shown in FIG. 1-2 is not intended to limit the present disclosure.

As shown in FIG. 1-1, the method includes steps S110-S150. The following description takes the method applied to a server as an example. Each step is described below.

Step S110: Receive a file reading request, where the file reading request includes the identifier of the target file to be read.

In this embodiment, the file reading request may be a file reading instruction sent by the user through the client in the terminal. When the user needs to obtain a certain file, the user may operate the client so that the client sends the file reading request, and the server then receives the file reading request sent by the client.

In one example, the identification of the target file may be the unique identification information of the target file, which is used to uniquely determine the target file. For example, the unique identification information may be a hash value obtained by hashing information such as the name of the target file. When the identification of the target file is unique identification information, the reading of the file is exact reading.

In another example, the identification of the target file may also be other information that is different from the unique identification information, which is used to identify a certain type of file or a certain range of files, for example, information such as date and category. When the identification of the target file is such information, the reading of the file is fuzzy reading.
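The two kinds of identification can be illustrated with a minimal sketch. The hash algorithm (SHA-256), the `FileRecord` type, and the sample records are assumptions made for illustration only; the disclosure does not specify them.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical catalog entry used only for this illustration."""
    name: str
    category: str


def exact_id(name: str) -> str:
    # Unique identification: a hash of the file name (SHA-256 is an assumption).
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def exact_read(records: List[FileRecord], file_id: str) -> List[FileRecord]:
    # Exact reading: the identifier determines a single file.
    return [r for r in records if exact_id(r.name) == file_id]


def fuzzy_read(records: List[FileRecord], category: str) -> List[FileRecord]:
    # Fuzzy reading: the identifier selects a type or range of files.
    return [r for r in records if r.category == category]


RECORDS = [
    FileRecord("a.jpg", "picture"),
    FileRecord("b.jpg", "picture"),
    FileRecord("c.doc", "document"),
]
```

With these sample records, an exact read by the hash of "a.jpg" selects one file, while a fuzzy read by the category "picture" selects two.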

Step S120: According to the identifier of the target file, search the mapping relationship between sub-file identifiers and merged file identifiers included in the locally stored first index information for a target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier.

The merged file is stored in HDFS, and the sub-files in the merged file have an association relationship; that is, the merged file is formed by merging a plurality of sub-files that have an association relationship. The association relationship may be an access association relationship. For example, if the next file accessed after file 1 is file 2, then file 2 and file 1 have an association relationship, and file 1 and file 2 may be merged into a merged file, which is then stored in HDFS.

In this embodiment, the server may store the first index information in advance; the process of creating the first index information will be described later. Specifically, the first index information may include the mapping relationship between sub-files and merged files, which is referred to in the present disclosure as the first mapping relationship. The first mapping relationship may be expressed as a correspondence between sub-file identifiers and merged file identifiers. Through the first mapping relationship, the merged file corresponding to the identifier of the target file can be found.

In other embodiments, the first index information may also include the offset of each sub-file in the merged file and the size of the sub-file. The size of the sub-file may be the length or proportion that the sub-file occupies in the merged file, and the offset may be the starting position of the sub-file in the merged file. With the first index information, after the target sub-file identifier that matches the identifier of the target file and the corresponding target merged file identifier are found, the offset of the sub-file in the merged file included in the first index information can also be used to find the storage location of the target sub-file in the target merged file. It should be understood that, since the target sub-file identifier matches the identifier of the target file, in an alternative embodiment the target sub-file found is the target file.

Step S130: According to the target merged file identifier, search the mapping relationship between merged file identifiers and HDFS storage block identifiers included in the locally stored second index information for a target storage block identifier corresponding to the target merged file identifier.

In this embodiment, the server may also store the second index information in advance; the process of creating the second index information will be described later. Specifically, the second index information may include the mapping relationship between merged files and HDFS storage blocks, which is referred to in the present disclosure as the second mapping relationship.

In one example, the second mapping relationship may be expressed as a correspondence between merged file identifiers and HDFS storage block identifiers. Through the second mapping relationship, the target merged file identifier can be used to look up the target storage block identifier of the target merged file. Optionally, the HDFS storage block identifier may include HDFS block address information.

In another example, the second mapping relationship may also be expressed as a correspondence between the identification of the merged file and the storage location of the merged file in HDFS, and the storage location of the merged file in HDFS may be found according to the second mapping relationship.
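The two lookups of steps S120 and S130 can be sketched as two in-memory dictionaries. This is only an illustration under stated assumptions: the `SubFileEntry` type, the identifiers `merged-001` and `blk_0001`, and the dictionary layout are hypothetical and are not prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SubFileEntry:
    """One row of the (hypothetical) first index."""
    merged_file_id: str  # merged file that contains the sub-file
    offset: int          # starting position of the sub-file in the merged file
    size: int            # length of the sub-file in bytes


# First index information: sub-file identifier -> merged file, offset, size.
FIRST_INDEX: Dict[str, SubFileEntry] = {
    "file1": SubFileEntry("merged-001", 0x0000, 0x1000),
    "file2": SubFileEntry("merged-001", 0x1000, 0xE000),
}

# Second index information: merged file identifier -> HDFS storage block identifier.
SECOND_INDEX: Dict[str, str] = {"merged-001": "blk_0001"}


def locate(target_file_id: str) -> Optional[Tuple[SubFileEntry, str]]:
    """Resolve a target file to its sub-file entry and storage block (S120, S130)."""
    entry = FIRST_INDEX.get(target_file_id)
    if entry is None:
        return None  # no matching sub-file identifier
    block_id = SECOND_INDEX.get(entry.merged_file_id)
    if block_id is None:
        return None  # merged file has no recorded storage block
    return entry, block_id
```

Because both indexes are stored locally on the server, neither lookup requires a round trip to the NameNode.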

Step S140: Determine, according to a preset acquisition condition, the number of sub-files to be acquired that are associated with the target file, and send a file acquisition request to the HDFS. The file acquisition request includes the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS finds the target merged file corresponding to the target merged file identifier in the target storage block corresponding to the target storage block identifier, and finds, in the target merged file, the target file and as many associated files as the number of sub-files.

After receiving the file acquisition request, HDFS acquires the target file and as many associated files as the number of sub-files, according to the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files included in the file acquisition request. After finding the target file and the associated files, HDFS sends them to the server.

In this embodiment, HDFS may acquire, in the target merged file, as many sub-files (that is, associated files) as the number of sub-files, namely those whose storage locations are close to that of the target sub-file.

For example, after receiving the file acquisition request, HDFS queries, through the NameNode, the metadata information corresponding to the target sub-file (that is, the target file), the target merged file, and the target storage block. After determining the target sub-file, HDFS determines, through the NameNode, the metadata information of the sub-files adjacent to the target sub-file in the target merged file, as many as the number of sub-files, and then obtains the target file and those associated files from the DataNode and sends them to the server.
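The contents of the file acquisition request and the selection of adjacent sub-files can be sketched as follows. The `FileAcquisitionRequest` type and the `select_sub_files` helper are hypothetical, and the sketch assumes that "adjacent" means the sub-files stored immediately after the target sub-file in the merged file; the disclosure does not fix that choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileAcquisitionRequest:
    """Fields carried by the request in step S140 (type name is hypothetical)."""
    block_id: str         # target storage block identifier
    merged_file_id: str   # target merged file identifier
    sub_file_id: str      # target sub-file identifier
    sub_file_count: int   # number of associated sub-files to acquire


def select_sub_files(layout: List[str], request: FileAcquisitionRequest) -> List[str]:
    """Return the target sub-file followed by up to `sub_file_count` neighbours.

    `layout` lists the sub-file identifiers of the merged file in storage order.
    Assumption: associated files are the ones stored right after the target.
    """
    start = layout.index(request.sub_file_id)
    return layout[start:start + 1 + request.sub_file_count]


LAYOUT = ["file1", "file2", "file3", "file4"]
REQUEST = FileAcquisitionRequest("blk_0001", "merged-001", "file2", 2)
```

If fewer sub-files follow the target than were requested, the slice simply returns those that exist.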

In a possible situation, multiple target merged file identifiers may be matched according to the first index information, and multiple target storage block identifiers may then be matched according to the second index information. In this case, one target merged file identifier and its corresponding target storage block identifier can be selected from them, and step S140 is executed.

In other examples, step S140 is executed for each target merged file identifier and its corresponding target storage block identifier; that is, a file acquisition request is sent for each of them to obtain the target file and as many associated files as the number of sub-files.

Step S150: Receive and cache the target file and associated file returned by HDFS.

Specifically, the target file and associated files returned by HDFS can be cached in the server's cache space or other storage space. The next time the server receives a file read request for the same file, it can directly obtain the file from the cache, thereby reducing the interaction between the server and HDFS, saving HDFS resources, and improving HDFS access efficiency.
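The caching behaviour of step S150 can be sketched as a read-through cache. The class name `ReadThroughCache` and the stand-in `fake_hdfs_fetch` function are hypothetical; a real deployment would replace the stand-in with the file acquisition request sent to HDFS.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Caches the target file together with the associated files returned with it."""

    def __init__(self, fetch: Callable[[str], Dict[str, bytes]]) -> None:
        self._fetch = fetch                   # stands in for the HDFS request
        self._store: Dict[str, bytes] = {}    # server-side cache space
        self.backend_calls = 0                # number of interactions with HDFS

    def read(self, file_id: str) -> bytes:
        if file_id not in self._store:
            # Cache miss: one request returns the target and its associated files.
            self.backend_calls += 1
            self._store.update(self._fetch(file_id))
        return self._store[file_id]


def fake_hdfs_fetch(file_id: str) -> Dict[str, bytes]:
    # Hypothetical merged file holding three associated sub-files.
    merged = {"file1": b"one", "file2": b"two", "file3": b"three"}
    if file_id not in merged:
        raise KeyError(file_id)
    return dict(merged)
```

In this sketch, reading file1 and then file2 triggers only one backend call, because file2 is cached as an associated file of file1.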

Since the embodiment of the present disclosure stores the merged file in HDFS and records the mapping relationship between the merged file and each sub-file in the first index information, and the mapping relationship between the merged file and the storage block of HDFS in the second index information, the method described in the embodiments of the present disclosure can quickly obtain the target file and the associated files by using the identifier of the target file, the first index information, and the second index information, and store them in the cache. As can be seen from the above, the method provided by the embodiment of the present disclosure can obtain, while acquiring the target file, the associated files that may be accessed next, and store the target file and the associated files in the cache. When the user issues a file read request at the next moment, the associated files stored in the cache can be queried first and are hit with a high probability, which reduces interaction with HDFS, reduces the resource usage of HDFS, and improves both the access efficiency of HDFS and the efficiency with which HDFS handles a large number of files.

In addition, the merged files stored in HDFS are composed of multiple files that have an access association, so the advantage of HDFS in sequential file access can be exploited.

Considering the network resources, in general, all the associated files of the target file will not be obtained, so it is necessary to provide a solution that can obtain the maximum number of associated files under the condition of balancing network resources. Therefore, in a possible implementation manner, the preset acquisition condition may include:

M×t1<tm-th, where M is the number of sub-files, t1 is the time it takes to read a sub-file, tm is the user’s maximum waiting time, and th is the return time for obtaining HDFS data.

In this embodiment, the optimal number of sub-files can be determined from the user's maximum waiting time, the HDFS data return time, and the time spent reading a sub-file, which preserves the user experience (the maximum waiting time is respected) while improving read efficiency.
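As a worked illustration of the condition M×t1<tm-th, the largest admissible M can be computed directly. The function name and the use of integer milliseconds are assumptions made here to avoid floating-point rounding; the disclosure does not specify units.

```python
def max_sub_files(t1_ms: int, tm_ms: int, th_ms: int) -> int:
    """Largest integer M satisfying M * t1 < tm - th (all times in milliseconds).

    t1_ms: time to read one sub-file
    tm_ms: maximum time the user is willing to wait
    th_ms: time HDFS needs to return data
    """
    if t1_ms <= 0:
        raise ValueError("t1_ms must be positive")
    budget = tm_ms - th_ms
    if budget <= 0:
        return 0  # no time left for associated sub-files
    # Strict inequality: M * t1 <= budget - 1 for integer values.
    return (budget - 1) // t1_ms
```

For example, with t1 = 20 ms, tm = 150 ms, and th = 50 ms, the budget is 100 ms and the largest M with M×20 < 100 is 4.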

Please refer to FIG. 2, which shows a flowchart of a file reading method according to an embodiment of the present disclosure, where steps S210-S260 mainly describe a file merging process, which may be performed before the foregoing step S110.

Step S210: Acquire historical access logs of multiple files.

In this embodiment, the historical access log includes the access time and access times of multiple files.

In a possible implementation manner, the acquisition time of the historical access log may be limited, for example, the historical access log may be acquired within a certain period of time.

In an example, the format of the historical access log may be as shown in Table 1 below.

Table 1

Accessed file	Access time
File 1	2015/1/1 12:00:00
File 2	2015/1/1 12:01:30
File 3	2015/1/2 13:02:50
File 1	2015/1/2 13:04:35
File 1	2015/1/2 13:05:00
File 3	2015/1/3 05:22:56
File 4	2015/1/4 15:07:26
File 5	2015/1/4 19:38:23
File 6	2015/1/6 09:18:07
File 5	2015/1/6 12:56:22

Assuming that the historical access logs from 2015/1/1 to 2015/1/3 are acquired, the acquired historical access logs include the access times and access counts of files 1, 2, and 3.

Step S220: For each file among the plurality of files, determine, according to the access times and access counts of the plurality of files, at least one file among the other files that is accessed in association with the file after the file is accessed, and determine multiple first association relationships of the file.

Each first association relationship indicates that the file is associated with one of the at least one file.

Taking the above Table 1 as an example, according to the access times and access counts of files 1, 2, and 3, it can be determined that the files accessed in association with file 1 after file 1 is accessed include file 2 and file 3, the files accessed in association with file 2 after file 2 is accessed include file 3, and the files accessed in association with file 3 after file 3 is accessed include file 1. Accordingly, two first association relationships of file 1, one first association relationship of file 2, and one first association relationship of file 3 can be determined.

Suppose that the first association relationship is expressed as (File A, File B), which indicates that File B is accessed after File A is accessed, that is, the user accesses File B next after accessing File A. Then the first association relationships of file 1 may be (file 1, file 2) and (file 1, file 3), the first association relationship of file 2 may be (file 2, file 3), and the first association relationship of file 3 may be (file 3, file 1).
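The derivation of these pairs from the Table 1 entries dated 2015/1/1 to 2015/1/3 can be sketched as follows. This is a simplified illustration that treats every pair of consecutive accesses to different files as a candidate (File A, File B) pair; it ignores the probability and influence thresholds described later, and the function name `first_associations` is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

LogEntry = Tuple[str, datetime]

# Table 1 entries from 2015/1/1 to 2015/1/3.
LOG: List[LogEntry] = [
    ("File 1", datetime(2015, 1, 1, 12, 0, 0)),
    ("File 2", datetime(2015, 1, 1, 12, 1, 30)),
    ("File 3", datetime(2015, 1, 2, 13, 2, 50)),
    ("File 1", datetime(2015, 1, 2, 13, 4, 35)),
    ("File 1", datetime(2015, 1, 2, 13, 5, 0)),
    ("File 3", datetime(2015, 1, 3, 5, 22, 56)),
]


def first_associations(log: List[LogEntry]) -> List[Tuple[str, str]]:
    """Collect (predecessor, successor) pairs from consecutive accesses."""
    ordered = sorted(log, key=lambda entry: entry[1])  # order by access time
    pairs: List[Tuple[str, str]] = []
    for (prev, _), (nxt, _) in zip(ordered, ordered[1:]):
        # Skip repeated accesses to the same file and duplicate pairs.
        if prev != nxt and (prev, nxt) not in pairs:
            pairs.append((prev, nxt))
    return pairs
```

Applied to `LOG`, the sketch yields the four pairs named in the text: two with File 1 as predecessor, one with File 2, and one with File 3.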

Step S230: According to the first association relationships of each file in the plurality of files, acquire the first file with the largest number of first association relationships, and determine, among the plurality of files and according to the first association relationships of the first file, at least one associated file that is accessed in sequence after the first file is accessed.

Still taking the foregoing example, comparing the numbers of first association relationships of files 1-3 shows that the first file with the largest number of first association relationships is file 1. It can then be determined that the files accessed in sequence after file 1 is accessed are file 2 and file 3.

Step S240: Store the first file and at least one associated file in the first merged file.

In this embodiment, the first file and at least one associated file may be combined to obtain a combined file.

In one example, the first file and the at least one associated file may be stored sequentially and merged into the first merged file in the order of being accessed. In the present disclosure, sequential succession means that the storage location of each file is consecutive.

For example, file 1 in the preceding example may be stored at addresses 0000H to 0FFFH (where H denotes hexadecimal), file 2 at addresses 1000H to EFFFH, and file 3 at addresses F000H to FFFFH. The first merged file can then be regarded as the data stored at addresses 0000H to FFFFH.

In another example, the first file and the at least one associated file may be sequentially and successively stored in the first merged file in the order of being accessed.

For example, a storage space may be allocated in advance as the storage space of the first merged file. For example, the space indicated by the addresses 0000H to FFFFH may be used as the storage space of the first merged file, and files 1-3 are then stored at 0000H to 0FFFH, 1000H to EFFFH, and F000H to FFFFH, respectively.
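Storing files consecutively in access order, while recording each sub-file's offset and size, can be sketched as below. The function name `merge_in_access_order` is hypothetical, and the file contents are placeholder bytes sized to match the address ranges in the example (1000H, E000H, and 1000H bytes).

```python
from typing import Dict, List, Tuple


def merge_in_access_order(
    files: List[Tuple[str, bytes]]
) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate files in access order and record (offset, size) per sub-file."""
    merged = bytearray()
    index: Dict[str, Tuple[int, int]] = {}
    for file_id, data in files:
        index[file_id] = (len(merged), len(data))  # offset is the current end
        merged.extend(data)
    return bytes(merged), index


# Placeholder contents sized like the address ranges in the example above.
FILES = [
    ("file1", b"a" * 0x1000),   # 0000H to 0FFFH
    ("file2", b"b" * 0xE000),   # 1000H to EFFFH
    ("file3", b"c" * 0x1000),   # F000H to FFFFH
]
MERGED, INDEX = merge_in_access_order(FILES)
```

The recorded offsets and sizes are the kind of information the first index information may carry for locating a sub-file inside its merged file.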

Step S250, in the first association relationship of each file in the plurality of files, delete the first association relationship applied when determining at least one associated file, and obtain the remaining first association relationship; according to the remaining first association relationship, Obtain the new first file with the largest number of first associations.

Still taking the foregoing example, after files 1, 2, and 3 are merged, the first association relationships applied are (file 1, file 2) and (file 2, file 3), so the remaining first association relationships are (file 1, file 3) and (file 3, file 1). The new first file with the largest number of first association relationships is then acquired. Since file 1, with (file 1, file 3), and file 3, with (file 3, file 1), have the same number of first association relationships, either file can be arbitrarily selected as the new first file, for example file 3.

Step S260: Repeat the process of determining, among the plurality of files and according to the first association relationships of the new first file, at least one associated file that is sequentially accessed after the new first file is accessed, and storing the new first file and the at least one associated file in a new first merged file, until no remaining first association relationship can be obtained.

Still taking the foregoing example, after file 3 and file 1, and then file 1 and file 3, are merged through step S260, no first association relationship remains, and the process ends.
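The loop formed by steps S230 to S260 can be sketched as a greedy procedure. This is an illustration under stated assumptions, not the claimed method: the function name `build_merge_groups` is hypothetical, a chain stops when the only successors are already in it, the successor is chosen in list order rather than by first probability, and ties between candidate first files are broken by first occurrence (the text allows an arbitrary choice and picks file 3, whereas this sketch picks file 1 first).

```python
from typing import Dict, List, Tuple

Relation = Tuple[str, str]  # (predecessor, successor)


def build_merge_groups(relations: List[Relation]) -> List[List[str]]:
    """Greedily turn first association relationships into merge groups."""
    # Self-relations are dropped so that every round consumes a relation.
    remaining = [r for r in relations if r[0] != r[1]]
    groups: List[List[str]] = []
    while remaining:
        # S230/S250: pick the file with the most remaining relations.
        counts: Dict[str, int] = {}
        for pred, _ in remaining:
            counts[pred] = counts.get(pred, 0) + 1
        first = max(counts, key=counts.get)
        chain = [first]
        current = first
        while True:
            # Follow the chain of successors not yet in this merge group.
            step = next(
                (r for r in remaining if r[0] == current and r[1] not in chain),
                None,
            )
            if step is None:
                break
            remaining.remove(step)  # S250: delete the applied relation
            chain.append(step[1])
            current = step[1]
        groups.append(chain)        # S240: one merged file per chain
    return groups


RELATIONS = [
    ("file1", "file2"),
    ("file1", "file3"),
    ("file2", "file3"),
    ("file3", "file1"),
]
```

For `RELATIONS`, the sketch produces the groups [file1, file2, file3], [file1, file3], and [file3, file1], which are the same three merges as in the example, with the last two in the opposite order because of the tie-break.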

Through the above method, the embodiments provided by the present disclosure can merge files that have relevance among multiple files into one merged file, and the merged file includes multiple sub-files, and each sub-file in the merged file has relevance.

In a possible implementation manner, the association relationship may include identification information of the associated files. For example, if sub-file A and sub-file B in the merged file have a file association relationship, the association relationship may be (sub-file A, sub-file B); if sub-file A, sub-file B, sub-file C, ..., sub-file N have a file association relationship, the association relationship may be (sub-file A, sub-file B, sub-file C, ..., sub-file N). Of course, in other embodiments, other forms may be used to record the association relationship of multiple files, which is not limited herein. The following takes the determination of the first association relationship as an example to describe how an association relationship is determined.

Please refer to FIG. 3, which illustrates a flowchart of determining a first association relationship according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 3, the first association relationship of the file may be determined in the following manner.

Step S310, according to the number of times the second file is accessed and the number of times the third file is accessed after the second file is accessed, obtain a first probability that the third file is accessed after the second file is accessed.

Here, the second file and the third file are any two different files among the multiple files.

In a possible implementation manner, the first probability can be obtained by the following formula: P(B|A)=NAB/NA, where P(B|A) is the first probability, NAB is the number of times the third file is accessed after the second file is accessed, NA is the number of times the second file is accessed, A denotes the second file, and B denotes the third file.

Step S320, according to the number of times the third file is accessed after the second file is accessed and the total number of times all files in the historical access log are accessed, obtain a second probability that both the second file and the third file are accessed.

In a possible implementation manner, the second probability is obtained by the following formula: P(AB)=NAB/N, where P(AB) is the second probability, and N is the total number of times all files in the historical access log are accessed.

Step S330: According to the total number of times all files in the historical access log are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, obtain the influence value of the second file being accessed on the third file being accessed.

In a possible implementation manner, the influence value is obtained by the following formula: I(B|A)=(N×NAB)/(NA×NB), where I(B|A) is the influence value and NB is the number of times the third file is accessed.

Step S340: When the first probability is greater than the first probability threshold, the second probability is greater than the second probability threshold, and the influence value is greater than the influence threshold, it is determined that the second file and the third file have a first association relationship.

In a possible implementation manner, it is determined that the second file and the third file have the first association relationship by the following formula:

(A, B)={(A, B)|P(B|A)>min_P(B|A)&&P(AB)>min_P(AB)&&I(B|A)>min_I(B|A)}.

Here, min_P(B|A) is the first probability threshold, min_P(AB) is the second probability threshold, min_I(B|A) is the influence threshold, and (A, B) is the first association relationship between the second file A and the third file B.
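The three quantities of steps S310 to S330 and the test of step S340 can be computed from an access sequence as sketched below. The function names, the representation of the log as a time-ordered list, and the reading of "accessed after" as "accessed immediately after" are assumptions for illustration; the threshold values in the usage note are likewise arbitrary.

```python
from collections import Counter
from typing import List, Tuple


def association_metrics(sequence: List[str], a: str, b: str) -> Tuple[float, float, float]:
    """Return (P(B|A), P(AB), I(B|A)) for files a and b in an access sequence."""
    n = len(sequence)                       # N: total number of accesses
    counts = Counter(sequence)
    n_a, n_b = counts[a], counts[b]         # NA, NB
    # NAB: times b is accessed immediately after a (assumed interpretation).
    n_ab = sum(1 for x, y in zip(sequence, sequence[1:]) if x == a and y == b)
    if n_a == 0 or n_b == 0:
        return 0.0, 0.0, 0.0                # no evidence for either file
    p_b_given_a = n_ab / n_a                # P(B|A) = NAB / NA
    p_ab = n_ab / n                         # P(AB) = NAB / N
    influence = (n * n_ab) / (n_a * n_b)    # I(B|A) = (N x NAB) / (NA x NB)
    return p_b_given_a, p_ab, influence


def has_first_association(
    sequence: List[str], a: str, b: str,
    min_p: float, min_p_ab: float, min_i: float,
) -> bool:
    """Step S340: all three values must exceed their thresholds."""
    p, p_ab, influence = association_metrics(sequence, a, b)
    return p > min_p and p_ab > min_p_ab and influence > min_i


# Access order of files 1-3 in Table 1 (2015/1/1 to 2015/1/3).
SEQUENCE = ["File 1", "File 2", "File 3", "File 1", "File 1", "File 3"]
```

For File 1 and File 2 in `SEQUENCE`, N = 6, NA = 3, NB = 1, and NAB = 1, giving P(B|A) = 1/3, P(AB) = 1/6, and I(B|A) = 2. An influence value above 1 means that File 2 follows File 1 more often than its overall access frequency alone would suggest.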

For example, it may first be judged whether the first probability between each file and the other files is greater than the first probability threshold, to obtain a file set that meets the first probability threshold. For example, among files A, B, C, D, E, F, and G, if the first probabilities of file A and file B, file A and file C, file A and file D, and file C and file F are greater than the first probability threshold, the file set at this time includes A, B, C, D, and F.

Then, it is determined whether the second probability between each file and the other files in the file set that meets the first probability threshold is greater than the second probability threshold, to obtain a file set that meets the second probability threshold. For example, when the file set includes A, B, C, D, and F, if the second probabilities of file A and file B, file A and file C, and file C and file F are greater than the second probability threshold, the file set at this time includes A, B, C, and F.

Finally, it is judged whether the influence value between each file and the other files in the file set that meets the second probability threshold is greater than the influence threshold. For example, when the file set includes A, B, C, and F, if the influence value of file A on file C and the influence value of file C on file F are greater than the influence threshold, it can be determined that file A and file C, and file C and file F, have a first association relationship. The first association relationship set at this time may include (file A, file C) and (file C, file F), and correspondingly the file set at this time includes the three files A, C, and F.

The above process of obtaining the first association relationship set and the file collection that meets the association relationship in the first association relationship set is exemplary, and the number of files in the example is not used to limit the present disclosure.

As can be seen from the foregoing, the first association relationship represents the association between two files. If only two files having the first association relationship are merged, then, since the file size may be between 10 KB and 10 MB, the merged file will still be smaller than the HDFS block size (for example, 64 MB), and the number of merged files will still be huge; this does not minimize the number of interactions with HDFS or the memory usage of the master node in HDFS. Therefore, it is necessary to determine the association relationships among as many files as possible so as to merge as many files as possible. Please refer to FIG. 4, which shows a flowchart of a method for obtaining associated files according to an embodiment of the present disclosure. This embodiment can determine the association relationships among as many files as possible so as to merge as many files as possible.

In this embodiment, one of the two associated files recorded in the first association relationship is a predecessor file, the other is a successor file, and the successor file is a file that is accessed after accessing the predecessor file. The method shown in FIG. 4 will be described below with reference to FIG. 5.

Step S231: Acquire a first association set containing the first association of each of the files.

Taking FIG. 5 as an example, the first association relationship set 250 includes multiple first association relationships of each file, for example, the first association relationship (file1, file7) of file1, the first association relationship (file3, file5) of file3, and so on. Each first association relationship includes a predecessor file and a successor file. For example, for the first association relationship (file1, file7), the predecessor file is file1 and the successor file is file7.

Step S232: In the first association relationship set, obtain a first target association relationship set, namely the first association relationships whose predecessor file is the first file, the first file being the file that appears most often as a predecessor file; then obtain a second association relationship from the first target association relationship set.

The second association relationship is the first association relationship in the first target association relationship set whose successor file is accessed the most.

Taking FIG. 5 as an example, the first association relationships in the first association relationship set 250 whose predecessor file is the first file, which appears most often as a predecessor file, are collected to obtain the first target association relationship set 260. Then, the first association relationship whose successor file is accessed the most (that is, the one with the largest first probability) is selected from the first target association relationship set 260. In the first target association relationship set 260, (file1, file7) has the largest first probability, so (file1, file7) is the second association relationship.

Step S233: If the first association relationship set contains a third association relationship whose predecessor file is the same as the successor file of the second association relationship, the target association relationship whose successor file is accessed the most is determined from the third association relationship, and the files in the target association relationship are determined to be associated files.

Taking FIG. 5 as an example, the successor file file7 of the second association relationship (file1, file7) is taken as a predecessor file, and the first association relationships in the first association relationship set 250 that have file7 as the predecessor file are obtained as the third association relationship 270, where the third association relationship 270 may be a set. In this example, the third association relationship 270 includes two first association relationships with file7 as the predecessor file: (file7, file5) and (file7, file3). Of these, the successor file file5 is accessed the most (that is, its first probability is the largest), so the first association relationship (file7, file5) is taken as the target association relationship, and the files file7 and file5 in the target association relationship are taken as associated files.

In a possible implementation manner, the successor file file5 of the first association relationship (file7, file5) may be merged (recorded) into the second association relationship (file1, file7) to generate an updated second association relationship (file1, file7, file5), and the first association relationship (file1, file7) is deleted from the first association relationship set. It should be noted that, once the first association relationship (file7, file5) has been merged into the updated second association relationship (file1, file7, file5), it may be considered to have been deleted. In other embodiments, if the first association relationship (file7, file5) is not covered by the second association relationship (file1, file7, file5), it may be deleted from the first association relationship set.

Step S234: If the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, the successor file of the second association relationship is determined to be the associated file.

Taking FIG. 5 as an example, if the aforementioned first association relationships (file7, file5) and (file7, file3) do not exist in the first association relationship set, the successor file file7 of the second association relationship (file1, file7) can be determined to be the associated file of the first file file1.

Step S235: Delete the target association relationship in the first association relationship set to obtain a new first association relationship set.

Step S236: The following operations are repeated until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:

In the new first association relationship set, obtain a new first target association relationship set with the first file as the predecessor file, and obtain a new second association relationship from the new first target association relationship set, where the new second association relationship is the first association relationship in the new first target association relationship set whose successor file is accessed the most;

If the new first association relationship set contains a new third association relationship whose predecessor file is the same as the successor file of the new second association relationship, determine from the new third association relationship the new target association relationship whose successor file is accessed the most, determine the files in the new target association relationship to be associated files, and delete the new target association relationship to obtain the new first association relationship set.

Taking FIG. 5 as an example, after the associated files file7 and file5 of the first file file1 are obtained, file5 (which is a successor file at this point) can in turn be taken as a predecessor file to check whether the first association relationship set 250 contains a first association relationship with file5 as the predecessor file. If none exists, file7 and file5 are finally taken as the associated files of the first file file1; if one exists, the associated files continue to be obtained according to steps S231 to S234 described above.

In this example, in the first association relationship set 250, there is no first association relationship using file5 as a predecessor file, so the associated files of the first file file1 include file file7 and file file5.
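The chain-following idea in the FIG. 5 example can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed method: the function name, the dictionary layout, the probability values, the extra pair (file1, file2), and the guard against revisiting a file are all hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (predecessor file, successor file)


def associated_files(first_file: str, first_prob: Dict[Pair, float]) -> List[str]:
    """Follow predecessor -> successor links, always taking the largest first probability."""
    remaining = dict(first_prob)  # work on a copy; used relationships are deleted from it
    chain: List[str] = []
    current = first_file
    while True:
        # Relationships whose predecessor is the current file.
        # Assumption: a file already in the chain is never revisited (avoids cycles).
        candidates = {
            pair: prob for pair, prob in remaining.items()
            if pair[0] == current and pair[1] != first_file and pair[1] not in chain
        }
        if not candidates:
            return chain  # no further relationship: the chain is complete
        best = max(candidates, key=candidates.get)
        chain.append(best[1])
        del remaining[best]  # the chosen relationship is removed from the set
        current = best[1]


# Pairs follow FIG. 5 except (file1, file2); all probabilities are hypothetical.
fig5 = {
    ("file1", "file7"): 0.6,
    ("file1", "file2"): 0.3,
    ("file3", "file5"): 0.5,
    ("file7", "file5"): 0.7,
    ("file7", "file3"): 0.2,
}
chain_for_file1 = associated_files("file1", fig5)
```

With these assumed probabilities the chain for file1 is file7 followed by file5, which matches the outcome described for FIG. 5.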

When no target association relationship remains in the first association relationship set, so that the process of determining the associated files of the first file ends, a new first file can be obtained, and the associated files of the new first file are obtained according to steps S231 to S235, until the first association relationship set is empty.

The above is only an exemplary description of the process from step S231 to step S235, and is not intended to be exhaustive or to limit the present disclosure.

It should be noted that, when the associated files are acquired according to the above steps, each determined target association relationship may be deleted in turn from the first association relationship set until the first association relationship set is empty, at which point the determination of all associated files of the first file is complete.
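The repeated selection of a new first file until the first association relationship set is empty can be sketched as an outer loop. This sketch makes a simplifying assumption that is not stated in the disclosure: after a group is formed, every relationship touching a file in that group is dropped, so that each file lands in at most one group. The helper names, the probabilities, and the pairs involving file2, file4, file8, and file9 are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (predecessor file, successor file)


def follow_chain(first_file: str, probs: Dict[Pair, float]) -> List[str]:
    """Compact chain following: repeatedly take the most probable successor."""
    chain: List[str] = []
    current = first_file
    while True:
        options = {p: v for p, v in probs.items()
                   if p[0] == current and p[1] != first_file and p[1] not in chain}
        if not options:
            return chain
        best = max(options, key=options.get)
        chain.append(best[1])
        current = best[1]


def plan_merged_files(first_prob: Dict[Pair, float]) -> List[List[str]]:
    """Group files for merging until no first association relationship remains."""
    remaining = dict(first_prob)
    groups: List[List[str]] = []
    while remaining:
        # The first file is the predecessor with the most remaining relationships.
        counts = Counter(pred for pred, _ in remaining)
        first_file = counts.most_common(1)[0][0]
        group = [first_file] + follow_chain(first_file, remaining)
        groups.append(group)
        merged = set(group)
        # Assumption: drop every relationship that touches an already grouped file.
        remaining = {p: v for p, v in remaining.items()
                     if p[0] not in merged and p[1] not in merged}
    return groups


relationships = {
    ("file1", "file7"): 0.6, ("file1", "file2"): 0.3, ("file1", "file4"): 0.1,
    ("file7", "file5"): 0.7, ("file7", "file3"): 0.2,
    ("file3", "file5"): 0.5, ("file8", "file9"): 0.9,
}
merge_plan = plan_merged_files(relationships)
```

Under this simplification, files such as file2, file3, and file4 end up in no group; an implementation closer to step S236 would keep processing the remaining relationships of the same first file before moving on.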

Embodiments provided by the present disclosure can use the first association relationships in the first association relationship set to obtain as many associated files of the first file as possible. After the associated files of the first file are obtained, the first file and the associated files are merged to obtain a merged file, and the merged file can meet the storage requirements of HDFS to the greatest extent possible.

In a possible implementation manner, the method may further include:

Send the first merged file to HDFS, and receive the first storage block identifier returned by HDFS that stores the first merged file.

Create first index information including the mapping relationship between the first file identifier and the first merged file identifier, and create second index information including the mapping relationship between the first merged file identifier and the first storage block identifier.
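As an illustration of the two pieces of index information, the following minimal sketch keeps each mapping in an in-memory dictionary. The function name and the identifier strings are hypothetical, and a real implementation would persist the indexes in the local storage system rather than hold them only in memory.

```python
from typing import Dict, Iterable


def create_indexes(subfile_ids: Iterable[str],
                   merged_file_id: str,
                   storage_block_id: str,
                   first_index: Dict[str, str],
                   second_index: Dict[str, str]) -> None:
    """Record sub-file -> merged file and merged file -> storage block mappings."""
    for subfile_id in subfile_ids:
        first_index[subfile_id] = merged_file_id      # first index information
    second_index[merged_file_id] = storage_block_id   # second index information


first_index: Dict[str, str] = {}
second_index: Dict[str, str] = {}

# Hypothetical identifiers; the storage block identifier stands in for the value HDFS returns.
create_indexes(["file1", "file7", "file5"], "merged-0001", "block-1001",
               first_index, second_index)
```

Looking up a sub-file then takes two dictionary reads: first_index gives the merged file identifier, and second_index gives the storage block identifier.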

In a possible implementation manner, the first merged file may be stored in a pre-established merged file space in HDFS, and the size of the merged file space may be an integer multiple of the "block" size in HDFS. For example, when the "block" size is 64MB, the size of the preset merged file space can be set to 64MB, 128MB, 256MB, or 512MB.

In a possible implementation manner, after the first index information and the second index information are created, the first index information and the second index information may be stored in a local storage system to facilitate subsequent retrieval.

By merging associated files (files with a relatively small data volume) into merged files (files with a relatively large data volume) and storing the merged files in HDFS, HDFS storage resources can be saved.

In a possible application scenario, after the user obtains the target file in HDFS through the client, other files may also be obtained. If the number of files to be acquired is large, the HDFS file access mechanism inevitably consumes a large amount of memory on the HDFS NameNode, and the number of interactions between the client and the NameNode equals the number of files to be acquired. In this case, HDFS performance is reduced and file access is inefficient.

Based on this, when the server requests the target file required by the user, it also requests at least one associated file associated with the target file, and places the acquired target file and associated files in the cache. The next time a file read request sent by the terminal is received, the server can match the files in the cache against the target file identifier in the file read request. Since the files in the cache are associated by access, the target file of this read request is likely to be found in the cache. This not only improves the file reading speed and hit rate, but also reduces the memory usage of the NameNode, reduces the number of interactions between the client and the NameNode, and improves the performance of the system.
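The effect of caching associated files can be illustrated with a small read-through cache. This is a sketch under stated assumptions: the class name, the callable standing in for the HDFS request, and the file contents are hypothetical, and cache eviction is not modelled.

```python
from typing import Callable, Dict


class PrefetchingReader:
    """Serve reads from a local cache; on a miss, fetch the target plus associated files."""

    def __init__(self, fetch_with_associated: Callable[[str], Dict[str, bytes]]):
        # Stand-in for the file acquisition request sent to HDFS (assumption).
        self._fetch = fetch_with_associated
        self._cache: Dict[str, bytes] = {}
        self.backend_calls = 0  # number of simulated interactions with HDFS

    def read(self, file_id: str) -> bytes:
        if file_id in self._cache:
            return self._cache[file_id]        # cache hit: no interaction with HDFS
        self.backend_calls += 1
        fetched = self._fetch(file_id)         # returns target and associated files
        self._cache.update(fetched)
        return fetched[file_id]


# Hypothetical merged group: requesting any member returns the whole group.
group = {"file1": b"one", "file7": b"seven", "file5": b"five"}
reader = PrefetchingReader(lambda file_id: dict(group))

first = reader.read("file1")    # miss: one backend call, the whole group is cached
second = reader.read("file7")   # hit: served from the cache
```

In this toy run, two reads cost a single backend call, which is the reduction in client-to-NameNode interactions that the passage describes.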

Through the above method, multiple associated files can be merged into one merged file to suit the HDFS storage mechanism, thereby improving the storage efficiency of files. After multiple files are merged and stored as a merged file, the use of HDFS memory and other resources is also reduced, improving the performance of the system.

Please refer to FIG. 6, which shows a block diagram of a file reading device according to an embodiment of the present disclosure.

As shown in FIG. 6, the device includes:

The receiving module 10 is configured to receive a file reading request, where the file reading request includes an identification of the target file to be read;

The first searching module 20 is connected to the receiving module 10, and is configured to search, according to the identifier of the target file, the mapping relationship between sub-file identifiers and merged file identifiers included in the locally stored first index information for the target sub-file identifier matching the identifier of the target file and the corresponding target merged file identifier; wherein the merged file is stored in HDFS, and the sub-files in the merged file are associated;

The second search module 30 is connected to the first search module 20, and is configured to search, according to the target merged file identifier, the mapping relationship between merged file identifiers and HDFS storage block identifiers included in the locally stored second index information for the target storage block identifier corresponding to the target merged file identifier;

The sending module 40 is connected to the second search module 30, and is configured to determine, according to a preset acquisition condition, the number of sub-files to be acquired that are associated with the target file, and to send a file acquisition request to the HDFS, where the file acquisition request includes the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files;

The cache module 50 is connected to the sending module 40 and is used to receive and cache the target file and associated file returned by the HDFS.
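The cooperation of the search modules and the sending module can be sketched as two index lookups followed by the construction of a request object. The class name, field names, and identifier strings are hypothetical; the number of sub-files is passed in directly because the preset acquisition condition is defined elsewhere in the disclosure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FileAcquisitionRequest:
    """The four items the file acquisition request is described as carrying."""
    storage_block_id: str
    target_subfile_id: str
    target_merged_file_id: str
    subfile_count: int


def build_request(target_file_id: str,
                  first_index: Dict[str, str],
                  second_index: Dict[str, str],
                  subfile_count: int) -> FileAcquisitionRequest:
    """Resolve the target file through both indexes and assemble the request."""
    merged_file_id = first_index[target_file_id]       # first index lookup
    storage_block_id = second_index[merged_file_id]    # second index lookup
    return FileAcquisitionRequest(storage_block_id, target_file_id,
                                  merged_file_id, subfile_count)


# Hypothetical index contents.
request = build_request(
    "file7",
    {"file1": "merged-0001", "file7": "merged-0001"},
    {"merged-0001": "block-1001"},
    subfile_count=2,
)
```

In this sketch a target file that is absent from the first index raises KeyError; how such a miss is handled in practice is outside what this passage specifies.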

It should be understood that the file reading device is a device item corresponding to the foregoing file reading method. For a specific introduction, please refer to the previous description of the method, and no more details are provided here.

The device described in the present disclosure obtains the file to be obtained together with other files associated with it, and stores these files in the cache. When the next file read request sent by the terminal is received, the files stored in the cache can be searched first to reduce interaction with HDFS, thereby reducing the resource usage of HDFS and improving the efficiency of HDFS in processing a large number of files.

Please refer to FIG. 7, which shows a block diagram of a file reading device according to an embodiment of the present disclosure.

As shown in FIG. 7, the device further includes:

The first obtaining module 61 is configured to obtain historical access logs of multiple files, where the historical access logs include the accessed time and the number of accessed times of multiple files;

The first determining module 62 is connected to the first obtaining module 61, and is configured to, for each file of the plurality of files, determine, according to the access times and the numbers of accesses of the plurality of files, at least one file among the files other than this file whose access is associated with this file after this file is accessed, and determine a plurality of first association relationships of the file, where a first association relationship indicates that the file is access-associated with any one of the at least one file;

The second determination module 63 is connected to the first determination module 62, and is configured to obtain, according to the first association relationships of each file in the plurality of files, the first file with the largest number of first association relationships, and to determine, among the plurality of files and according to the plurality of first association relationships of the first file, at least one associated file that is sequentially accessed after the first file is accessed;

The storage module 64 is connected to the second determination module 63 and is used to store the first file and at least one associated file in the first merged file.

The second acquisition module 71 is connected to the storage module 64, and is configured to delete, from the first association relationships of each file in the plurality of files, the first association relationships used in determining the at least one associated file, to obtain the remaining first association relationships; and to obtain, according to the remaining first association relationships, a new first file with the largest number of first association relationships;

A third determination module 72, connected to the second acquisition module 71, is configured to trigger the second determination module to repeat the process of determining, among the plurality of files and according to the plurality of first association relationships of the new first file, at least one associated file that is sequentially accessed after the new first file is accessed, and storing the new first file and the at least one associated file in a new first merged file, until the second acquisition module can no longer obtain any remaining first association relationship.

A sending and receiving module 81, connected to the storage module 64, for sending the first merged file to the HDFS, and receiving the first storage block identifier returned by the HDFS that stores the first merged file;

An index creation module 82, connected to the sending and receiving module 81, is configured to create first index information including the mapping relationship between the first file identifier and the first merged file identifier, and second index information including the mapping relationship between the first merged file identifier and the first storage block identifier.

The reading module 90, connected to the cache module 50, is configured to, when the next received file read request includes a file associated with the target file, read the file associated with the target file from the cache if that file is stored in the cache.

Please refer to FIG. 8, which illustrates a schematic diagram of a second determination module according to an embodiment of the present disclosure.

In a possible implementation manner, one of the two associated files recorded in the first association relationship is a predecessor file, the other is a successor file, and the successor file is a file that is accessed after the predecessor file is accessed.

As shown in FIG. 8, the second determining module 63 includes:

A first association relationship acquisition sub-module 631, configured to obtain a first association relationship set including the first association relationship of each file in the plurality of files;

A second association relationship acquisition sub-module 632, connected to the first association relationship acquisition sub-module 631, is configured to obtain, in the first association relationship set, the first target association relationship set whose predecessor file is the first file, which appears most often as a predecessor file, and to obtain a second association relationship from the first target association relationship set, where the second association relationship is the first association relationship in the first target association relationship set whose successor file is accessed the most;

The first association file determination submodule 633 is connected to the second association relationship acquisition submodule 632, and is configured to, when the first association relationship set contains a third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine from the third association relationship the target association relationship whose successor file is accessed the most, and determine the files in the target association relationship to be associated files;

The second association file determination sub-module 634 is connected to the second association relationship acquisition sub-module 632, and is configured to, when the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship to be the associated file.

A deletion submodule 635, configured to delete the target association relationship in the first association set to obtain a new first association set;

Repeat determination submodule 636, connected to the deletion submodule 635, is configured to repeatedly trigger the second association relationship acquisition submodule and the first association file determination submodule to perform the following operations, until the second association file determination submodule determines that the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship: in the new first association relationship set, obtain a new first target association relationship set with the first file as the predecessor file, and obtain a new second association relationship from the new first target association relationship set, where the new second association relationship is the first association relationship in the new first target association relationship set whose successor file is accessed the most;

If the new first association relationship set contains a new target association relationship whose predecessor file is the same as the successor file of the new second association relationship, the files in the new target association relationship are determined to be associated files, and the new target association relationship is deleted to obtain the new first association relationship set.

Please refer to FIG. 9, which illustrates a schematic diagram of a first determination module according to an embodiment of the present disclosure.

As shown in FIG. 9, the first determining module 62 includes:

The first probability obtaining submodule 621 is configured to obtain, according to the number of times the second file is accessed and the number of times the third file is accessed after the second file is accessed, the first probability that the third file is accessed after the second file is accessed, wherein the second file and the third file are any two different files in the plurality of files;

The second probability obtaining sub-module 622 is configured to obtain, according to the number of times the third file is accessed after the second file is accessed and the total number of times all files in the historical access log are accessed, the second probability that the second file and the third file are both accessed;

The influence value obtaining sub-module 623 is configured to obtain, according to the total number of times all files in the historical access log are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, the influence value of the second file being accessed on the third file being accessed;

The first determination submodule 624 is connected to the first probability acquisition submodule 621, the second probability acquisition submodule 622, and the influence value acquisition submodule 623, and is configured to determine that the second file and the third file have the first association relationship when the first probability is greater than the first probability threshold, the second probability is greater than the second probability threshold, and the influence value is greater than the influence threshold.
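How the three quantities might be derived from an access log can be sketched as follows. Several points here are assumptions made for illustration only: "accessed after" is taken to mean "accessed immediately next in the log", the second probability is taken as the share of all accesses in which the third file follows the second, and the influence value uses a lift-style ratio built from the four listed counts. The formulas given in the method description of the disclosure take precedence over this sketch.

```python
from collections import Counter
from typing import List, Tuple


def association_metrics(log: List[str], second: str, third: str) -> Tuple[float, float, float]:
    """Return (first probability, second probability, influence value) for one file pair."""
    total = len(log)            # total number of accesses in the log
    counts = Counter(log)       # number of accesses per file
    if counts[second] == 0 or counts[third] == 0:
        raise ValueError("both files must appear in the access log")

    # Number of times `third` is accessed right after `second` (assumption: adjacency).
    followed = sum(1 for prev, nxt in zip(log, log[1:])
                   if prev == second and nxt == third)

    first_probability = followed / counts[second]
    second_probability = followed / total
    # Assumed lift-style influence value: how much more often `third` follows
    # `second` than its overall access frequency alone would suggest.
    influence = (followed * total) / (counts[second] * counts[third])
    return first_probability, second_probability, influence


# Hypothetical access log, in chronological order.
access_log = ["a", "b", "a", "b", "c", "b", "a", "b"]
p1, p2, influence_value = association_metrics(access_log, "a", "b")
```

For this log, file a is accessed 3 times, file b 4 times, and b follows a 3 times out of 8 accesses, so the sketch yields 1.0, 0.375, and 2.0 respectively.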

Based on the same technical concept, an embodiment of the present disclosure also provides a server 900. As shown in FIG. 10, the server 900 includes a processor 901, a machine-readable storage medium 902, and a transceiver 903, and the machine-readable storage medium 902 stores machine-executable instructions executable by the processor 901 and the transceiver 903. The processor 901, the transceiver 903, and the machine-readable storage medium 902 can communicate via the system bus 904.

The machine-executable instructions cause the transceiver 903 to receive a file reading request and send the file reading request to the processor 901, where the file reading request includes the identifier of the target file to be read;

The machine-executable instructions cause the processor 901 to:

Receiving the file reading request;

According to the identifier of the target file, search the mapping relationship between sub-file identifiers and merged file identifiers included in the locally stored first index information for the target sub-file identifier matching the identifier of the target file and the corresponding target merged file identifier; wherein the merged file is stored in HDFS, and the sub-files in the merged file are associated;

According to the target merged file identifier, in the mapping relationship between the merged file identifier included in the locally stored second index information and the HDFS storage block identifier, search for the target storage block identifier corresponding to the target merged file identifier;

According to the preset acquisition condition, determine the number of sub-files to be acquired associated with the target file;

The machine-executable instructions also cause the transceiver 903 to:

Send a file acquisition request to the HDFS, where the file acquisition request includes the target storage block identifier, the target sub-file identifier, the target merged file identifier, and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files;

Receiving and storing the target file and associated file returned by the HDFS, and sending the target file and associated file returned by the HDFS to the processor 901;

The machine executable instructions also cause the processor 901 to receive and cache the target file and associated file returned by the HDFS sent by the transceiver 903.

Optionally, the machine executable instructions cause the processor 901 to:

Obtain historical access logs of multiple files, where the historical access logs include the access time and the number of accesses of multiple files;

For each file in the plurality of files, according to the access times and the numbers of accesses of the plurality of files, determine, among the files other than this file, at least one file whose access is associated with this file after this file is accessed, and determine a plurality of first association relationships of the file, wherein a first association relationship indicates that the file is access-associated with any one of the at least one file;

According to the first association relationships of each file in the plurality of files, obtain the first file with the largest number of first association relationships, and determine, among the plurality of files and according to the plurality of first association relationships of the first file, at least one associated file that is sequentially accessed after the first file is accessed;

Storing the first file and at least one associated file in the first merged file.

Optionally, the machine executable instructions cause the processor 901 to:

From the first association relationships of each file in the plurality of files, delete the first association relationships used in determining the at least one associated file, to obtain the remaining first association relationships; according to the remaining first association relationships, obtain a new first file with the largest number of first association relationships;

Repeat the process of determining, among the plurality of files and according to the plurality of first association relationships of the new first file, at least one associated file that is sequentially accessed after the new first file is accessed, and storing the new first file and the at least one associated file in a new first merged file, until no remaining first association relationship can be obtained.

Optionally, one of the two associated files recorded in the first association relationship is a predecessor file, the other is a successor file, and the successor file is a file that is accessed after accessing the predecessor file;

The machine-executable instructions cause the processor 901 to:

Acquiring a first association set containing the first association of each file in the plurality of files;

In the first association relationship set, obtain the first target association relationship set whose predecessor file is the first file, which appears most often as a predecessor file, and obtain a second association relationship from the first target association relationship set, where the second association relationship is the first association relationship in the first target association relationship set whose successor file is accessed the most;

If the first association relationship set contains a third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine from the third association relationship the target association relationship whose successor file is accessed the most, and determine the files in the target association relationship to be associated files;

If the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, the successor file of the second association relationship is determined to be the associated file.

Optionally, the machine executable instructions cause the processor 901 to:

Deleting the target association relationship in the first association relationship set to obtain a new first association relationship set;

Repeat the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:

If the new first association relationship set contains a new third association relationship whose predecessor file is the same as the successor file of the new second association relationship, determine from the new third association relationship the new target association relationship whose successor file is accessed the most, determine the files in the new target association relationship to be associated files, and delete the new target association relationship to obtain the new first association relationship set.

Optionally, the machine executable instructions cause the processor 901 to:

The multiple first association relationships of the file are determined in the following manner:

Obtaining the first probability that the third file is accessed after the second file is accessed according to the number of times the second file is accessed and the number of times the third file is accessed after the second file is accessed, wherein the second file and the third file are any two different files in the plurality of files;

According to the number of times the third file is accessed after the second file is accessed and the total number of times all files in the historical access log are accessed, obtain the second probability that the second file and the third file are both accessed;

According to the total number of times all files in the historical access log are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, obtain the influence value of the second file being accessed on the third file being accessed;

When the first probability is greater than the first probability threshold, the second probability is greater than the second probability threshold, and the influence value is greater than the influence threshold, it is determined that the second file and the third file have the first association relationship.

Optionally, the machine-executable instructions further cause the transceiver 903 to send the first merged file to the HDFS, receive the first storage block identifier returned by the HDFS for the storage block that stores the first merged file, and send the first storage block identifier of the first merged file to the processor 901;

The machine-executable instructions further cause the processor 901 to receive the first storage block identifier of the first merged file, and to create first index information including the mapping relationship between the first file identifier and the first merged file identifier, and second index information including the mapping relationship between the first merged file identifier and the first storage block identifier.
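As a rough illustration of the two-level index described above, the following Python sketch keeps both mappings as plain dictionaries. The function name `build_indexes`, its parameters and the identifier strings are hypothetical. The storage block identifier is simply passed in, because the sketch does not model the exchange with the HDFS.

```python
def build_indexes(merged_file_id, sub_file_ids, storage_block_id,
                  first_index=None, second_index=None):
    """Record one merged file in both indexes and return them.

    first_index:  sub-file identifier    -> merged file identifier
    second_index: merged file identifier -> storage block identifier
    """
    first_index = {} if first_index is None else first_index
    second_index = {} if second_index is None else second_index
    for sub_file_id in sub_file_ids:
        first_index[sub_file_id] = merged_file_id
    # storage_block_id stands in for the value the HDFS returns after storing the file.
    second_index[merged_file_id] = storage_block_id
    return first_index, second_index
```

Passing the returned dictionaries back in on later calls accumulates entries for further merged files.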

Optionally, the machine executable instructions cause the processor 901 to:

When the next file reading request received is for a file associated with the target file, and that associated file is stored in the cache, read the file associated with the target file from the cache.
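The cache behaviour described above might be sketched in Python as follows. `ReadCache` and the `fetch_from_hdfs` callable are hypothetical names; the callable is assumed to return the target file together with its associated files as a dictionary, and the `hdfs_requests` counter exists only to make the effect observable.

```python
class ReadCache:
    """Serve reads from a local cache, going to the HDFS only on a miss."""

    def __init__(self, fetch_from_hdfs):
        # Hypothetical callable: file_id -> {file_id: content, associated_id: content, ...}
        self._fetch = fetch_from_hdfs
        self._cache = {}
        self.hdfs_requests = 0

    def read(self, file_id):
        if file_id in self._cache:
            # The file was cached by an earlier request for an associated file.
            return self._cache[file_id]
        self.hdfs_requests += 1
        # Cache the target file and every associated file returned with it.
        self._cache.update(self._fetch(file_id))
        return self._cache[file_id]
```

In this sketch a read of an associated file that follows a read of the target file is answered from the cache without a second request to the HDFS.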

The machine-readable storage medium 902 mentioned herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disk (such as an optical disc or DVD), a similar storage medium, or a combination thereof.

Based on the same technical concept, the embodiments of the present disclosure also provide a machine-readable storage medium that stores machine-executable instructions. When invoked and executed by a processor, the machine-executable instructions cause the processor to implement any of the file reading method steps shown in the foregoing FIGS. 1-5.

Based on the same technical concept, the embodiments of the present disclosure also provide machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement any of the file reading method steps shown in FIGS. 1-5.

The embodiments of the present disclosure have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The selection of terms used herein is intended to best explain the principles, practical applications or technical improvements of the technologies in the embodiments, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A file reading method, the method comprising:

Receiving a file reading request, where the file reading request includes the identifier of the target file to be read;

According to the identifier of the target file, searching the mapping relationship between sub-file identifiers and merged file identifiers included in locally stored first index information for a target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier; wherein the merged files are stored in the Hadoop Distributed File System (HDFS), and the sub-files in each merged file are associated with one another;

According to the target merged file identifier, searching the mapping relationship between merged file identifiers and HDFS storage block identifiers included in locally stored second index information for the target storage block identifier corresponding to the target merged file identifier;

According to a preset acquisition condition, determining the number of sub-files to be acquired that are associated with the target file, and sending a file acquisition request to the HDFS, the file acquisition request including the target storage block identifier, the target sub-file identifier, the target merged file identifier and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files;

Receiving and caching the target file and the associated files returned by the HDFS.
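For illustration only, the two index lookups and the four fields of the file acquisition request could be sketched in Python as below. The class and function names are hypothetical, the indexes are assumed to be plain dictionaries, and the number of sub-files is passed in rather than derived, since the preset acquisition condition is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileAcquisitionRequest:
    """The four fields the claim lists for the request sent to the HDFS."""
    storage_block_id: str
    merged_file_id: str
    sub_file_id: str
    sub_file_count: int


def build_acquisition_request(target_file_id, first_index, second_index, sub_file_count):
    """Resolve the target file through both local indexes and assemble the request."""
    if target_file_id not in first_index:
        raise KeyError(f"no sub-file entry for {target_file_id!r}")
    # First index: sub-file identifier -> merged file identifier.
    merged_file_id = first_index[target_file_id]
    # Second index: merged file identifier -> HDFS storage block identifier.
    storage_block_id = second_index[merged_file_id]
    return FileAcquisitionRequest(
        storage_block_id, merged_file_id, target_file_id, sub_file_count
    )
```

Because both lookups run against locally stored indexes, the request already names the storage block and the merged file before it is sent.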
The method according to claim 1, further comprising:

Obtaining a historical access log of a plurality of files, the historical access log including the access times and access counts of the plurality of files;

For each file in the plurality of files, determining, according to the access times and access counts of the plurality of files, at least one file among the other files of the plurality of files that has an access association with the file after the file is accessed, and determining a plurality of first association relationships of the file, wherein each first association relationship indicates that the file has an access association with one of the at least one file;

According to the first association relationships of each file in the plurality of files, obtaining the first file, which has the largest number of first association relationships, and determining among the plurality of files, according to the plurality of first association relationships of the first file, at least one associated file that is accessed in sequence after the first file is accessed;

Storing the first file and the at least one associated file in a first merged file.
The method according to claim 2, further comprising:

Among the first association relationships of the files in the plurality of files, deleting the first association relationships used in determining the at least one associated file, to obtain remaining first association relationships; and obtaining, according to the remaining first association relationships, a new first file, which has the largest number of first association relationships;

Repeatedly performing the process of determining among the plurality of files, according to the plurality of first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and the at least one associated file that is accessed in sequence after the new first file is accessed in a new first merged file, until no remaining first association relationship can be obtained.
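The repeated grouping described above can be read as a greedy loop, sketched here in Python under several assumptions: each first association relationship is a (predecessor, successor) pair, the "number of first association relationships" of a file is counted over the pairs in which it is the predecessor, and ties are broken by file name. The function name `plan_merged_files` is hypothetical, and the sketch does not address a file that ends up in more than one group.

```python
def plan_merged_files(associations):
    """Group files into merged-file candidates.

    associations: iterable of (predecessor, successor) pairs.
    Returns a list of groups, each starting with the chosen first file.
    """
    remaining = set(associations)
    groups = []
    while remaining:
        # Count the remaining relationships per predecessor file.
        counts = {}
        for pred, _succ in remaining:
            counts[pred] = counts.get(pred, 0) + 1
        # Pick the file with the most relationships; sorted() makes ties deterministic.
        first_file = max(sorted(counts), key=counts.get)
        used = {rel for rel in remaining if rel[0] == first_file}
        associated = sorted(succ for _pred, succ in used)
        groups.append([first_file] + associated)
        # Delete the relationships that were just used, then repeat.
        remaining -= used
    return groups
```

Each pass removes at least one relationship, so the loop ends once no first association relationship remains.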
The method according to claim 2, wherein, of the two associated files recorded in a first association relationship, one is a predecessor file and the other is a successor file, the successor file being a file accessed after the predecessor file is accessed; and wherein obtaining the first file with the largest number of first association relationships according to the first association relationships of each file in the plurality of files, and determining among the plurality of files, according to the plurality of first association relationships of the first file, the at least one associated file that is accessed in sequence after the first file is accessed, includes:

Acquiring a first association relationship set containing the first association relationships of each file in the plurality of files;

In the first association relationship set, obtaining a first target association relationship set in which the most frequently occurring first file is the predecessor file, and in the first target association relationship set, obtaining a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file is accessed the most times;

If the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determining, from the third association relationships, a target association relationship whose successor file occurs the most times, and determining the files in the target association relationship as associated files;

If the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determining the successor file of the second association relationship as the associated file.
The method according to claim 4, further comprising, after determining the files in the target association relationship as associated files:

Deleting the target association relationship from the first association relationship set to obtain a new first association relationship set;

Repeating the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:

In the new first association relationship set, obtaining a new first target association relationship set in which the most frequently occurring first file is the predecessor file, and in the new first target association relationship set, obtaining a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file is accessed the most times;

If the new first association relationship set contains new third association relationships whose predecessor file is the same as the successor file of the new second association relationship, determining, from the new third association relationships, a new target association relationship whose successor file occurs the most times, determining the files in the new target association relationship as associated files, and deleting the new target association relationship to obtain a new first association relationship set.
The method according to claim 2, wherein the plurality of first association relationships of a file are determined by:

Obtaining, according to the number of times a second file is accessed and the number of times a third file is accessed after the second file is accessed, a first probability that the third file is accessed after the second file is accessed, wherein the second file and the third file are any two different files in the plurality of files;

Obtaining, according to the number of times the third file is accessed after the second file is accessed and the total number of times all files in the historical access log are accessed, a second probability of the second file and the third file being accessed;

Obtaining, according to the total number of times all files in the historical access log are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, an influence value of the access to the second file on the access to the third file;

When the first probability is greater than a first probability threshold, the second probability is greater than a second probability threshold, and the influence value is greater than an influence threshold, determining that the second file and the third file have the first association relationship.
The method according to claim 2, further comprising:

Sending the first merged file to the HDFS, and receiving the first storage block identifier returned by the HDFS that stores the first merged file;

Creating first index information that includes the mapping relationship between the first file identifier and the first merged file identifier, and second index information that includes the mapping relationship between the first merged file identifier and the first storage block identifier.
The method according to claim 1, further comprising:

When the next file reading request received is for a file associated with the target file, and that associated file is stored in the cache, reading the file associated with the target file from the cache.
A file reading device, the device includes:

A receiving module, configured to receive a file reading request, where the file reading request includes the identification of the target file to be read;

A first search module, connected to the receiving module, configured to search, according to the identifier of the target file, the mapping relationship between sub-file identifiers and merged file identifiers included in locally stored first index information for a target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier; wherein the merged files are stored in the Hadoop Distributed File System (HDFS), and the sub-files in each merged file are associated with one another;

A second search module, connected to the first search module, configured to search, according to the target merged file identifier, the mapping relationship between merged file identifiers and HDFS storage block identifiers included in locally stored second index information for the target storage block identifier corresponding to the target merged file identifier;

A sending module, connected to the second search module, configured to determine, according to a preset acquisition condition, the number of sub-files to be acquired that are associated with the target file, and to send a file acquisition request to the HDFS, the file acquisition request including the target storage block identifier, the target sub-file identifier, the target merged file identifier and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files;

A cache module, connected to the sending module, configured to receive and cache the target file and the associated files returned by the HDFS.
The device according to claim 9, further comprising:

A first acquisition module, configured to obtain a historical access log of a plurality of files, the historical access log including the access times and access counts of the plurality of files;

A first determination module, connected to the first acquisition module, configured to, for each file in the plurality of files, determine, according to the access times and access counts of the plurality of files, at least one file among the other files of the plurality of files that has an access association with the file after the file is accessed, and determine a plurality of first association relationships of the file, wherein each first association relationship indicates that the file has an access association with one of the at least one file;

A second determination module, connected to the first determination module, configured to obtain, according to the first association relationships of each file in the plurality of files, the first file, which has the largest number of first association relationships, and to determine among the plurality of files, according to the plurality of first association relationships of the first file, at least one associated file that is accessed in sequence after the first file is accessed;

A storage module, connected to the second determination module, configured to store the first file and the at least one associated file in a first merged file.
The device according to claim 10, further comprising:

A second acquisition module, connected to the storage module, configured to delete, among the first association relationships of the files in the plurality of files, the first association relationships used in determining the at least one associated file, to obtain remaining first association relationships, and to obtain, according to the remaining first association relationships, a new first file, which has the largest number of first association relationships;

A third determination module, connected to the second acquisition module, configured to trigger the second determination module to repeatedly perform the process of determining among the plurality of files, according to the plurality of first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and the at least one associated file that is accessed in sequence after the new first file is accessed in a new first merged file, until the second acquisition module can obtain no remaining first association relationship.
The device according to claim 10, wherein, of the two associated files recorded in a first association relationship, one is a predecessor file and the other is a successor file, the successor file being a file accessed after the predecessor file is accessed; and wherein the second determination module includes:

A first association relationship acquisition submodule, configured to obtain a first association relationship set containing the first association relationships of each file in the plurality of files;

A second association relationship acquisition submodule, connected to the first association relationship acquisition submodule, configured to obtain, in the first association relationship set, a first target association relationship set in which the most frequently occurring first file is the predecessor file, and to obtain, in the first target association relationship set, a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file is accessed the most times;

A first associated file determination submodule, connected to the second association relationship acquisition submodule, configured to: if the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determine, from the third association relationships, a target association relationship whose successor file occurs the most times, and determine the files in the target association relationship as associated files;

A second associated file determination submodule, connected to the second association relationship acquisition submodule, configured to: if the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship as the associated file.
The device according to claim 12, wherein the second determination module further includes:

A deletion submodule, configured to delete the target association relationship from the first association relationship set to obtain a new first association relationship set;

A repeat determination submodule, connected to the deletion submodule, configured to repeatedly trigger the second association relationship acquisition submodule and the first associated file determination submodule to perform the following operations until the second associated file determination submodule determines that the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:

In the new first association relationship set, obtaining a new first target association relationship set with the first file as the predecessor file, and in the new first target association relationship set, obtaining a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file is accessed the most times;

If the new first association relationship set contains new third association relationships whose predecessor file is the same as the successor file of the new second association relationship, determining, from the new third association relationships, a new target association relationship whose successor file occurs the most times, determining the files in the new target association relationship as associated files, and deleting the new target association relationship to obtain a new first association relationship set.
A server, including a processor, a machine-readable storage medium, and a transceiver, the machine-readable storage medium storing machine-executable instructions executable by the processor and the transceiver; the machine-executable instructions cause the transceiver to receive a file reading request and send the file reading request to the processor, where the file reading request includes the identifier of the target file to be read;

The machine-executable instructions cause the processor to:

Receiving the file reading request;

According to the identifier of the target file, search the mapping relationship between sub-file identifiers and merged file identifiers included in locally stored first index information for a target sub-file identifier that matches the identifier of the target file and for the corresponding target merged file identifier; wherein the merged files are stored in the Hadoop Distributed File System (HDFS), and the sub-files in each merged file are associated with one another;

According to the target merged file identifier, search the mapping relationship between merged file identifiers and HDFS storage block identifiers included in locally stored second index information for the target storage block identifier corresponding to the target merged file identifier;

According to a preset acquisition condition, determine the number of sub-files to be acquired that are associated with the target file;

The machine-executable instructions also cause the transceiver to:

Send a file acquisition request to the HDFS, the file acquisition request including the target storage block identifier, the target sub-file identifier, the target merged file identifier and the number of sub-files, so that the HDFS searches the target storage block corresponding to the target storage block identifier for the target merged file corresponding to the target merged file identifier, and searches the target merged file for the target file and for associated files whose number equals the number of sub-files;

Receive the target file and the associated files returned by the HDFS, and send the target file and the associated files returned by the HDFS to the processor;

The machine-executable instructions also cause the processor to receive and cache the target file and the associated files that are returned by the HDFS and sent by the transceiver.
The server of claim 14, the machine-executable instructions cause the processor to:

Obtain a historical access log of a plurality of files, the historical access log including the access times and access counts of the plurality of files;

For each file in the plurality of files, determine, according to the access times and access counts of the plurality of files, at least one file among the other files of the plurality of files that has an access association with the file after the file is accessed, and determine a plurality of first association relationships of the file, wherein each first association relationship indicates that the file has an access association with one of the at least one file;

According to the first association relationships of each file in the plurality of files, obtain the first file, which has the largest number of first association relationships, and determine among the plurality of files, according to the plurality of first association relationships of the first file, at least one associated file that is accessed in sequence after the first file is accessed;

Store the first file and the at least one associated file in a first merged file.
The server of claim 15, the machine-executable instructions cause the processor to:

Among the first association relationships of the files in the plurality of files, delete the first association relationships used in determining the at least one associated file, to obtain remaining first association relationships; and obtain, according to the remaining first association relationships, a new first file, which has the largest number of first association relationships;

Repeatedly perform the process of determining among the plurality of files, according to the plurality of first association relationships of the new first file, at least one associated file that is accessed in sequence after the new first file is accessed, and storing the new first file and the at least one associated file that is accessed in sequence after the new first file is accessed in a new first merged file, until no remaining first association relationship can be obtained.
The server according to claim 15, wherein, of the two associated files recorded in a first association relationship, one is a predecessor file and the other is a successor file, the successor file being a file accessed after the predecessor file is accessed; the machine-executable instructions cause the processor to:

Acquire a first association relationship set containing the first association relationships of each file in the plurality of files;

In the first association relationship set, obtain a first target association relationship set in which the most frequently occurring first file is the predecessor file, and in the first target association relationship set, obtain a second association relationship, the second association relationship being the first association relationship in the first target association relationship set whose successor file is accessed the most times;

If the first association relationship set contains third association relationships whose predecessor file is the same as the successor file of the second association relationship, determine, from the third association relationships, a target association relationship whose successor file occurs the most times, and determine the files in the target association relationship as associated files;

If the first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the second association relationship, determine the successor file of the second association relationship as the associated file.
The server of claim 17, the machine-executable instructions cause the processor to:

Delete the target association relationship from the first association relationship set to obtain a new first association relationship set;

Repeat the following operations until the new first association relationship set contains no third association relationship whose predecessor file is the same as the successor file of the new second association relationship:

In the new first association relationship set, obtain a new first target association relationship set with the first file as the predecessor file, and in the new first target association relationship set, obtain a new second association relationship, the new second association relationship being the first association relationship in the new first target association relationship set whose successor file is accessed the most times;

If the new first association relationship set contains new third association relationships whose predecessor file is the same as the successor file of the new second association relationship, determine, from the new third association relationships, a new target association relationship whose successor file occurs the most times, determine the files in the new target association relationship as associated files, and delete the new target association relationship to obtain a new first association relationship set.
The server of claim 15, the machine-executable instructions cause the processor to:

Determine the plurality of first association relationships of a file in the following manner:

Obtain, according to the number of times a second file is accessed and the number of times a third file is accessed after the second file is accessed, a first probability that the third file is accessed after the second file is accessed, wherein the second file and the third file are any two different files in the plurality of files;

Obtain, according to the number of times the third file is accessed after the second file is accessed and the total number of times all files in the historical access log are accessed, a second probability of the second file and the third file being accessed;

Obtain, according to the total number of times all files in the historical access log are accessed, the number of times the third file is accessed after the second file is accessed, the number of times the second file is accessed, and the number of times the third file is accessed, an influence value of the access to the second file on the access to the third file;

When the first probability is greater than a first probability threshold, the second probability is greater than a second probability threshold, and the influence value is greater than an influence threshold, determine that the second file and the third file have the first association relationship.
The server according to claim 15,

The machine-executable instructions also cause the transceiver to send the first merged file to the HDFS, receive the first storage block identifier, returned by the HDFS, of the storage block that stores the first merged file, and send the first storage block identifier of the first merged file to the processor;

The machine-executable instructions further cause the processor to receive the first storage block identifier of the first merged file, and to create first index information including the mapping relationship between the first file identifier and the first merged file identifier, and second index information including the mapping relationship between the first merged file identifier and the first storage block identifier.
The server of claim 14, the machine-executable instructions cause the processor to:

When the next file reading request received is for a file associated with the target file, and that associated file is stored in the cache, read the file associated with the target file from the cache.
A machine-readable storage medium storing machine-executable instructions which, when called and executed by a processor, cause the processor to perform the method of any one of claims 1-8.