CN117667853B - Data reading method, device, computer equipment and storage medium

Data reading method, device, computer equipment and storage medium

Info

Publication number
CN117667853B
CN117667853B
Authority
CN
China
Prior art keywords
file
block
files
target
reading
Prior art date
Legal status
Active
Application number
CN202410129220.8A
Other languages
Chinese (zh)
Other versions
CN117667853A (en)
Inventor
王继玉
陈培
荆荣讯
Current Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410129220.8A
Publication of CN117667853A
Application granted
Publication of CN117667853B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and discloses a data reading method, a device, computer equipment and a storage medium. The method comprises the following steps: acquiring block identifiers of a plurality of file blocks and determining the file identifier corresponding to each file in the file blocks, the file blocks being data blocks generated by aggregating files in a file data set; grouping the block identifiers of the plurality of file blocks to form a plurality of block identifier groups; performing inter-group shuffling on the plurality of block identifier groups and intra-group shuffling on the file identifiers in at least some of the block identifier groups, so as to form a shuffled file identifier sequence; and reading, in batches and in the order of the file identifiers in the file identifier sequence, the files corresponding to the file identifiers contained in the current batch. The invention shuffles the whole file data set conveniently and quickly, improves shuffling efficiency while preserving model training accuracy, and ensures reading efficiency.

Description

Data reading method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data reading method, a data reading device, a computer device, and a storage medium.
Background
AI (Artificial Intelligence) model training generally takes a long time, and is performed after the dataset has been localized in an earlier stage. With the development of big data and AI technology, training AI models in complex scenarios typically requires a massive number of small files as the training set.
At present, even after a large number of small files have been localized, the way they are read is inefficient, which in turn limits training efficiency.
Disclosure of Invention
In view of the above, the present invention provides a data reading method, apparatus, computer device and storage medium, so as to solve the problem of low reading efficiency.
In a first aspect, the present invention provides a data reading method, including:
Acquiring block identifiers of a plurality of file blocks, and determining a file identifier corresponding to each file in the file blocks; the file blocks are data blocks generated by aggregating files in a file data set;
Grouping the block identifiers of the plurality of file blocks to form a plurality of block identifier groups;
Performing inter-group shuffling on the plurality of block identifier groups, and intra-group shuffling on the file identifiers in at least some of the block identifier groups, so as to form a shuffled file identifier sequence;
And reading, in batches and in the order of the file identifiers in the file identifier sequence, the files corresponding to the file identifiers contained in the current batch.
In a second aspect, the present invention provides a data reading apparatus comprising:
The acquisition module is used for acquiring block identifiers of a plurality of file blocks and determining file identifiers corresponding to each file in the file blocks; the file blocks are data blocks generated by aggregating files in a file data set;
The grouping module is used for grouping the block identifiers of the plurality of file blocks to form a plurality of block identifier groups;
the shuffling module is used for performing inter-group shuffling on the plurality of block identifier groups and intra-group shuffling on the file identifiers in at least some of the block identifier groups, to form a shuffled file identifier sequence;
and the reading module is used for reading, in batches and in the order of the file identifiers in the file identifier sequence, the files corresponding to the file identifiers contained in the current batch.
In a third aspect, the present invention provides a computer device comprising a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions to perform the data reading method of the first aspect or any of its corresponding embodiments.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the data reading method of the first aspect or any of its corresponding embodiments.
In a fifth aspect, the present invention provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the data reading method of the first aspect or any of its corresponding embodiments.
By grouping, on the basis of their block identifiers, the file blocks required across multiple iterations to form a plurality of block identifier groups, and by using these groups to shuffle both between groups and among the files within each group, the invention shuffles the whole file data set conveniently and quickly, improving shuffling efficiency while preserving model training accuracy and ensuring reading efficiency. Moreover, because the file blocks are grouped, the files read in the current batch are more likely to hit file blocks of the same group; this raises the hit ratio, so that only the files in a small number of file blocks need to be processed per batch, which further improves reading efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present invention, and that those skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of the framework of the main operational content of a massive doclet dataset according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a process for data aggregation by a distributed system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for aggregating a plurality of file blocks according to an embodiment of the present invention;
FIG. 4 is a flow chart of a data reading method according to an embodiment of the invention;
FIG. 5 is a schematic structural diagram of a training hierarchy in accordance with an embodiment of the present invention;
FIG. 6 is a workflow diagram of a training task in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of a training architecture according to an embodiment of the present invention;
FIG. 8 is a flow chart of another data reading method according to an embodiment of the invention;
FIG. 9 is a schematic diagram of a process for forming a global file list according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a process for shuffling indexes according to an embodiment of the present invention;
FIG. 11 is a flowchart of a further data reading method according to an embodiment of the present invention;
FIG. 12 is a flow chart of batch read data according to an embodiment of the invention;
FIG. 13 is a schematic diagram of a process for file addressing based on file blocks according to an embodiment of the invention;
FIG. 14 is a timing diagram for addressing based on addressing information objects according to an embodiment of the present invention;
FIG. 15 is a graph of test results for SSD-based performance comparison, according to an embodiment of the present invention;
FIG. 16 is a graph of test results for performance comparison based on NVME in accordance with an embodiment of the present invention;
FIG. 17 is a graph of test results for performance comparison based on HDD according to an embodiment of the present invention;
fig. 18 is a block diagram of a data reading apparatus according to an embodiment of the present invention;
Fig. 19 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The data set used by a training task generally resides in a remote storage system or in cloud storage, and must be localized when the task actually runs. Pulling a remote data set to the local side therefore calls for a sound scheme covering pull efficiency, caching, management, and related concerns. To handle massive data sets, a number of distributed storage systems have been developed at home and abroad that aggregate storage resources and store such data sets. To address the training problem on massive data sets, several related components on the market likewise tackle the problem of training tasks reading massive amounts of data.
At present, in order to accommodate cloud-native environments and to mitigate the data access latency and the heavy bandwidth overhead of fetching remote data sets that arise from the separation of computing and storage in those environments, the data set processing frameworks promoted on the market each have their own architectural characteristics and market positioning. Because they aim to support different storage back ends, data types, file formats, and application scenarios, these general-purpose frameworks perform poorly on small file datasets at the scale of tens of millions to hundreds of millions of files, and many issues remain to be optimized. For example, commercially available systems such as Alluxio (a distributed, very large-scale data orchestration platform) and Fluid are relatively heavyweight, general-purpose solutions that are not particularly friendly to massive small-file datasets. Some data set processing tools mainly interface with object storage systems, or use HDFS (Hadoop Distributed File System, a distributed file system) and the like at the bottom layer, and there is essentially no framework dedicated to processing massive small-file data.
In a large-scale computing cluster, many training tasks run concurrently, and each training task may use a different data set. The training tasks issue a large number of small-file read requests to the underlying storage system. Such access patterns challenge the storage system, which handles small files inefficiently, resulting in slower training. Current systems are essentially metadata (meta) intensive workloads.
The smaller the KB-level files used for training, the higher the proportion of time consumed by metadata operations; and the smaller the I/O (input/output), the more latency and QPS (queries per second), rather than throughput, become the dominant constraint. Thus, for small-file shuffling and reading, keeping latency low and QPS high is the key to improving performance.
Specifically, the main operations on a training dataset are read operations and write operations. The write operation consists of pulling a remote data set and writing it into local storage (memory, NVMe, SSD, HDD, a locally built shared storage system, a distributed storage system, an object storage system, or the like); the read operation consists of reading the pulled data set batch by batch while the training task runs. However, if the remote data set is prefetched into a locally built storage system, the metadata (meta) information of the data set must be recorded, and the way this metadata is managed affects both the efficiency of obtaining the training file list and the shuffling efficiency during training. Even after localization, the way massive numbers of small files are read also affects the reading efficiency of the training task, and thereby the training efficiency.
Processing schemes for massive small-file data sets have long been a pain point in industry and academic research, and no effective technical scheme or open-source architecture has emerged that resolves the series of problems they raise. For a massive small-file data set with tens of thousands of directories and tens of millions of small files, the current processing methods are as follows:
(1) Using a shared distributed storage system, the data set directory is mounted to each compute node, and the training task reads data directly from the shared storage. However, owing to the high frequency of metadata (meta) and storage interactions and the highly concurrent, continuous I/O reads, tens of millions of small files require tens of millions of metadata and data exchanges, so existing distributed file systems saturate easily. The efficiency with which the distributed storage system handles small files severely constrains training efficiency.
(2) Packaging the large number of small files in an optimized data format. While custom formats can reduce the metadata workload, the read workload is unchanged, and the limited bandwidth between the storage system and the compute nodes remains a performance bottleneck. At the same time, packaging the small files requires defining packing and interpretation rules, maintaining a management system between the storage system and the training side, and processing and maintaining the read data, all of which add complexity.
(3) Spending heavily on high-performance hardware and scaling out by physical means. This alleviates the performance bottleneck, but at a high cost.
At present, systems for massive data sets essentially adopt an architecture that separates metadata from stored data. The drawback of this architecture is that massive small-file data sets cause high-density metadata interactions and high-frequency small-file I/O (input/output) overhead, and file systems designed on this architecture essentially deliver no performance when facing massive numbers of small files.
The main operation of a massive small file data set is shown in fig. 1, and problems that may be involved include: read/write data problems, metadata management and intensive interactions, data consistency, data caching, data storage, storage I/O loads, and the like.
As shown in fig. 1, the write operation includes a write operation on a file block (chunk) and metadata (meta) thereof, where the file block may be stored in a corresponding storage system, and the file block in the storage system may be read and parsed during the read operation; the metadata may be in the form of key-value pairs (key-value) that are stored in a database, and metadata snapshots may also be stored. During the reading operation, the metadata list of the mass metadata can be read. And, the read operation mainly involves data cache management, shuffle, check data integrity, k8s scheduling (kubernetes, an open source system for automated deployment, extension and management of containerized applications), etc.
Problems that may be faced in AI training using massive doclet datasets include:
(1) Disk I/O bottleneck: because each file requires a separate I/O operation, training that reads a large number of small files incurs significant disk I/O overhead, which slows data loading and processing and thereby degrades overall training performance.
(2) Metadata overhead: massive numbers of small files usually carry additional metadata, such as file names and timestamps, which not only consumes storage space but also causes intensive metadata interaction; metadata processing performance then governs the processing performance of the small files themselves. This metadata overhead is a challenge when handling massive small data sets and affects file system performance.
(3) File system limitations: some file systems have a limit on the maximum number of files that can be stored in a directory or partition. These limitations can present challenges in file organization and storage management when handling millions or billions of small files.
(4) Data preprocessing complexity: preprocessing a massive data set consisting of small files may require a significant amount of computation and time. Data cleansing, enhancement, or feature extraction operations need to be effectively applied to each file, which may increase overall training time.
(5) Data loading and conversion: loading a large number of small files into memory can be memory intensive and may require careful memory management. Furthermore, converting data to a corresponding training format can be challenging due to variability in file format and structure.
(6) Parallelization and distributed computing: training a model using a massive small file dataset may require a distributed computing framework or parallel processing techniques to efficiently handle the computational load. Coordinating data distribution and synchronization across multiple nodes can be complex.
For massive data sets, loading the massive small files into memory or into a distributed storage system for subsequent processing and training is a commonly used approach at present. The remote data set is loaded into a locally built file system through preloading and preprocessing, and can also be loaded on demand during training; however, this involves techniques such as data prefetching and data streaming, and the efficiency and speed of data loading still need to be optimized. For massive numbers of small files, loading efficiency and loading performance remain poor.
The massive small-file data sets used by AI training tasks are generally classified and sorted in advance, and the scenario is read-write separated; to improve efficiency, in scenarios with frequent read access the data set is essentially pulled once. Furthermore, AI training tasks typically go through multiple iterations (epochs); each iteration reads the entire data set, but each file is read only once per iteration. In order to solve at least some of the problems massive small-file data sets pose for AI training tasks, a distributed system is built around this pull-once, read-many characteristic. In the data set reading stage of an AI training task, a training library interacts with the distributed system to accelerate the reading of training data, shortening training time, improving training efficiency, and addressing the reading efficiency and performance problems of massive small-file data sets.
In this embodiment, a distributed system is used to store the file data set and its corresponding metadata; for example, the distributed system may be the CacheFS distributed system. The distributed system can pull file data sets from remote data sources and store them in aggregated form, so that subsequent stages can read the aggregated file data sets from the distributed system as training data.
Fig. 2 shows a schematic diagram of the data aggregation process of the distributed system. As shown in fig. 2, the distributed system includes a server (e.g., CacheFS Server) and a client (e.g., CacheFS Client) deployed on each computing node. In addition, the distributed system can aggregate the storage units of all the computing nodes: multiple disks of multiple computing nodes (the computing nodes form a computing cluster) are aggregated into one storage space and connected to the distributed system, which prevents the situation in which a single computing node cannot cache the data of a remote data source because the massive file data set is too large. The storage unit may be, for example, NVMe (Non-Volatile Memory Express), an SSD (solid state drive), an HDD (hard disk drive, i.e., a conventional mechanical hard disk), or the like.
As shown in fig. 2, a client of the distributed system may initiate a pull request to a remote data source, which responds to the pull request, so that a file data set may be issued to the distributed system. Specifically, the client may work in a multi-channel and multi-process manner, scan a file data set in a remote data source, maintain an original relative directory structure of the file data set, aggregate the file data set into file blocks (chunk) in a lossless manner according to an aggregation rule, and process the file block numbers to form an ID of each file block, i.e., a chunk ID.
The client can mount a designated path on each computing node using FUSE, and the file data set can be read, written, modified, deleted, and so on under the mounted path. Because the data in the client's aggregated storage consists of aggregated file blocks rather than the directory-and-file hierarchy of the original file data set, the client also provides a data view function, so that the file data set hierarchy a user sees resembles the original directory hierarchy. Meanwhile, to improve reading efficiency during training, the file data set can be pre-cached in a storage unit of the local computing node (such as memory, NVMe, SSD, or HDD); the client therefore also provides local data set caching and management, metadata reporting, and related functions.
As shown in fig. 2, the file blocks generated by aggregation are written into the aggregated storage, namely the storage space aggregated from the computing cluster, and the metadata of the file blocks is reported to the service node. After storage is completed, the client reports information about the file blocks that were successfully stored.
The server on the service node mainly records and synchronizes the metadata of the aggregated file blocks. The metadata may specifically include: the ID of the file block, the list of file names aggregated in the file block, the relative directory structure, the total number of files aggregated in the file block, the file block size, the file block creation/modification timestamp, and the like. The database of the service node may be redis, mysql, or the like, and records the aggregated storage information, the relevant information of each client, and so on.
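As a concrete illustration, the recorded chunk metadata could be modeled as in the following minimal Python sketch; all field names are assumptions for illustration, since the embodiment only enumerates the kinds of information recorded.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ChunkMeta:
        """Illustrative metadata record for one aggregated file block (chunk)."""
        chunk_id: str                                          # unique ID of the file block
        file_names: List[str] = field(default_factory=list)   # files aggregated into this block
        relative_dirs: List[str] = field(default_factory=list)  # original relative directory structure
        file_count: int = 0                                    # total number of files aggregated
        chunk_size: int = 0                                    # size of the block in bytes
        modified_at: float = 0.0                               # creation/modification timestamp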
FIG. 3 illustrates a schematic diagram of the process of aggregating file data into a plurality of file blocks. As shown in fig. 3, the file data set contains a plurality of files; for example, the file data set is a data set of massive small files, in which each file is a small file. A plurality of files that can be aggregated are extracted from the file data set to form a file block of a certain size. As shown in fig. 3, file 1 of 4KB, file 2 of 128KB, and other files (not shown in fig. 3) may be aggregated into one file block, i.e., file block 1, whose size is, for example, 4MB; other file blocks, such as file block 2, may be formed in the same way.
The file blocks can be stored in the storage space aggregated from the computing cluster; all file blocks are stored in this aggregated manner, with m file blocks taken as a whole. As shown in fig. 3, file block 1 contains file 1, file 2, …, file n, and so on, thereby realizing the storage of small files in the form of file blocks (chunks).
And, the metadata of the file list contained in the file block can be recorded and uploaded to the service node while the file block is formed by aggregation.
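The aggregation step of figs. 2 and 3 could look roughly like the following sketch, assuming a fixed 4MB chunk budget and sequential chunk numbering; the function name and ID format are hypothetical, and the real client additionally preserves the relative directory structure and reports metadata to the service node.

    import os

    CHUNK_BUDGET = 4 * 1024 * 1024  # assumed 4MB chunk size, per the example above

    def aggregate(paths):
        """Greedily pack small files into chunks of at most CHUNK_BUDGET bytes.

        Returns a list of (chunk_id, member_paths) pairs; real chunk IDs would be
        assigned by the client, this sketch just numbers them sequentially.
        """
        chunks, current, used = [], [], 0
        for path in paths:
            size = os.path.getsize(path)
            if current and used + size > CHUNK_BUDGET:
                chunks.append(("chunk-%d" % len(chunks), current))
                current, used = [], 0
            current.append(path)
            used += size
        if current:
            chunks.append(("chunk-%d" % len(chunks), current))
        return chunks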
Because a massive small-file data set is too large, the local storage unit of a computing node may be unable to cache the entire data set needed for one iteration (epoch) of training, leading to a delete-while-pulling situation; since both deleting and pulling are time-consuming, the training task has to wait, increasing the total time and harming training efficiency. In addition, training generally requires multiple iterations; even in distributed mode, each computing node essentially has to pull a complete file data set for its training, and if storage space is insufficient, content deleted during one iteration must be pulled again in the next, which seriously harms training efficiency.
The distributed system in this embodiment can pull the remote massive small-file data set into its aggregated storage space, either pre-pulled or triggered by a task, so that it is pulled once and read many times. Each data file needs to interact with the remote data source only once, and during repeated iterative training each computing node interacts only with its locally mounted client. After task training is completed, the client can clean up a data set that was pulled along with the task as the task ends, and can also set up timed jobs that periodically clean up data sets unused for longer than a preset period, thereby providing data set management functions.
According to an embodiment of the present invention, there is provided a data reading method embodiment, it being noted that the steps shown in the flowcharts of the drawings may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that herein.
In this embodiment, a data reading method is provided, which may be used for the above-mentioned computing node, and fig. 4 is a flowchart of the data reading method according to an embodiment of the present invention, as shown in fig. 4, and the flowchart includes the following steps.
Step S401, block identifiers of a plurality of file blocks are obtained, and file identifiers corresponding to each file in the file blocks are determined; the file block is a data block generated by aggregating files in a file data set.
In this embodiment, as described above, a file data set pulled from a remote data source may be aggregated into a plurality of file blocks (chunk), which may be stored in the storage space of the aggregated store; and, the metadata (meta) of the file block may be stored to the service node, for example, in a database of the service node. During the data set reading stage of the training task, the file blocks need to be read, and at this time, the block identifiers of the file blocks can be obtained. Typically, the training task is to read file blocks of the entire file data set.
Wherein the block identifier is an identifier uniquely representing a file block; for example, the block identification may specifically be an ID of the file block. The file identification uniquely represents the identification of the corresponding file; for example, the file identification may specifically be an ID of the file. Wherein all file blocks and their metadata for the file may be extracted from a data center for storing metadata, which may be, for example, a database of service nodes as shown in fig. 2, and corresponding block identifications and file identifications are determined based on the obtained metadata.
Training tasks (e.g., AI training tasks) are typically divided into multiple iterations (epochs), each iteration is in turn divided into multiple batches, and the required files are read batch by batch. At the start of a training task, the metadata of the file data set is acquired first; it includes the entire file list of the file data set as well as label information for training, such as classification labels. At each training iteration, to improve the accuracy of model training, all files of the file data set generally need to be shuffled, which makes reading a huge number of files inefficient. For example, when small files are read iteratively, existing training libraries, such as the official pytorch/tensorflow training libraries, open each small file individually to read its data, which takes a long time and is inefficient.
In this embodiment, a dedicated training library is built to accelerate the reading of massive small-file data sets in AI training, reduce the number of interactions with the storage system when small files are read, reduce frequent I/O overhead, and address the pain points massive small-file data sets cause in training. Fig. 5 is a schematic structural diagram of the training hierarchy provided in this embodiment. As shown in fig. 5, the training library instructs the client to read the corresponding data from the distributed system to complete the training tasks required by the application.
Specifically, when performing the training task, the training library may be used to interact with the data center storing metadata according to the related path information of the specified file data set, so as to extract the metadata of the file data set from the data center, and may specifically include the metadata of the file block and the metadata of the file therein. For example, the metadata of the file block may specifically include: an ID (chunk ID) of a file block, a file block name (chunk name), a corresponding inode (inode), a file block size, and the like; the metadata of the file includes a file name, a file ID, and the like.
From these metadata, a block identification of the file block and a file identification of the file may be generated. For example, the ID of the file block is taken as the corresponding block identifier, and the ID of the file is taken as the corresponding file identifier.
It will be appreciated that the metadata of the file blocks is stored separately from the file blocks into which the files are aggregated; as shown in fig. 2, the metadata of a file block is stored on the service node, while the file block itself is stored in the aggregated storage space. Therefore, in this embodiment, when reading a file block, the metadata and the corresponding file block are read separately; a schematic workflow for completing the training task can be seen in fig. 6.
As shown in FIG. 6, the training reading may be triggered based on a training task, which may be determined by an application, such as pytorch/tensorflow training tasks, or the like. The corresponding metadata is then read from the service node by the training library and the corresponding file blocks are read based on the client.
Step S402, grouping block identifiers of a plurality of file blocks to form a plurality of block identifier groups.
In this embodiment, as described above, in order to improve the accuracy of model training, all files of the file data set generally need to be shuffled. Because the file data set contains a huge number of files, at least in the millions, shuffling it directly would be inefficient. In this embodiment, the metadata of the file data set is acquired first, and shuffling is performed on the block identifiers and file identifiers determined from that metadata, which reduces the amount of data the shuffling process must handle.
The shuffling may be implemented by the training library. Fig. 7 shows a schematic diagram of the training architecture in this embodiment. As shown in fig. 7, after the training library obtains the metadata from the service node, it performs shuffling on that metadata; based on the shuffled result, the client can then be instructed to read the corresponding file blocks from the aggregated storage and cache them in a local cache, which may specifically be local memory or a local disk such as NVMe or SSD. The loading of the file blocks may be implemented by a data loader (DataLoader).
Specifically, when the shuffling operation is performed, the plurality of file blocks are first grouped; grouping the block identifiers of the plurality of file blocks represents a grouping of the file blocks themselves, and each group of block identifiers is referred to as a block identifier group. It will be appreciated that a block identifier group comprises a plurality of block identifiers.
For example, if the file blocks are grouped four at a time, every four adjacent block IDs may be grouped to form a block identifier group containing those four block IDs.
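A minimal sketch of this grouping step, assuming the block IDs arrive as a flat list; group_block_ids is a hypothetical helper, and how a remainder smaller than the group size is handled is not specified by the embodiment.

    def group_block_ids(block_ids, group_size=4):
        """Split a flat list of chunk IDs into consecutive groups of group_size.

        A trailing group may be smaller if the total is not a multiple of group_size.
        """
        return [block_ids[i:i + group_size]
                for i in range(0, len(block_ids), group_size)]

    # e.g. group_block_ids(["c1", "c2", "c3", "c4", "c5"], 4)
    # -> [["c1", "c2", "c3", "c4"], ["c5"]]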
Step S403, performing inter-group shuffling on the plurality of block identifier groups, and intra-group shuffling on the file identifiers in at least some of the block identifier groups, to form a shuffled file identifier sequence.
In this embodiment, in order to shuffle the file data set, inter-group shuffling may be performed on the plurality of block identifier groups, that is, the block identifier groups are shuffled with one another taking each group as a unit. For example, suppose the file data set is divided into four block identifier groups: block identifier group 1, block identifier group 2, block identifier group 3, and block identifier group 4. Inter-group shuffling of these four groups may yield, for example: block identifier group 3, block identifier group 1, block identifier group 4, block identifier group 2.
Furthermore, each block identifier group corresponds to a plurality of files and therefore contains the file identifiers of those files. In this embodiment, the file identifiers within the block identifier groups are additionally shuffled in-group, where some or all of the block identifier groups may be shuffled. For example, suppose block identifier group 1 contains the file identifiers of three files: file identifier 1-1, file identifier 1-2, and file identifier 1-3. These may be shuffled within the group, yielding, for example: file identifier 1-2, file identifier 1-3, file identifier 1-1. The other block identifier groups may be shuffled in-group in a similar manner, which is not described in detail herein.
It will be appreciated that inter-group shuffling may be performed before intra-group shuffling, or intra-group shuffling may be performed first; the present invention is not limited in this respect.
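The two shuffling passes of step S403 can be sketched as follows, assuming each block identifier group is represented as a list of its file identifiers; two_level_shuffle is a hypothetical name, and as noted above the order of the two passes could equally be reversed.

    import random

    def two_level_shuffle(groups):
        """Shuffle the groups themselves (inter-group), then the file IDs
        inside each group (intra-group), and flatten into one sequence.

        `groups` is a list of lists of file identifiers, one inner list
        per block identifier group.
        """
        random.shuffle(groups)       # inter-group shuffle
        for g in groups:
            random.shuffle(g)        # intra-group shuffle
        return [fid for g in groups for fid in g]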
In this embodiment, the entire file data set includes a plurality of files, which correspond to a certain file list; and since each file has a specific file identifier, the file data set also has an original file identifier list. After intra-group and inter-group shuffling, this file identifier list is effectively shuffled; for convenience of description, the shuffled file identifier list is referred to as the file identifier sequence.
Because the file data set contains a huge number of files, and correspondingly a huge number of file identifiers, the combination of intra-group and inter-group shuffling adopted in this embodiment shuffles the whole file data set far more simply.
For example, divide all file blocks of the file data set into m groups to form m file block groups, each corresponding to one block identifier group, and let each file block group contain n files, i.e., each block identifier group contains n file identifiers. The file data set then contains m × n files. In this embodiment, inter-group shuffling permutes only m elements (the m block identifier groups), and intra-group shuffling permutes only n elements (the n file identifiers), which greatly reduces the cardinality of the shuffled collections; the shuffle can therefore be completed more simply and more quickly.
Step S404, reading, in batches and in the order of the file identifiers in the file identifier sequence, the files corresponding to the file identifiers contained in the current batch.
In this embodiment, the file identifier sequence contains only file identifiers, not the files themselves; however, based on a file identifier, the corresponding file in the file data set can be determined and then read. In particular, the corresponding file block may be read; this process can be seen in figs. 6 and 7.
The files read in batches can be used for model training, such as training a recognition model, a classification model and the like, and the files can be specifically image data, text data and the like.
In one iteration, the files in the file data set are read batch by batch until the whole data set has been read; for example, a batch of files may be read in each batch based on a preset batch size. For the current batch, it is determined which file identifiers the batch contains, and the corresponding files are then read based on those identifiers, completing the reading of the current batch.
At the next iteration, this batch-by-batch reading of the file data set is repeated, and training is completed over multiple iterations.
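A minimal sketch of the batch-by-batch traversal, assuming the shuffled file identifier sequence is a flat Python list; resolving each identifier to its file block and reading the data is left to the client, per figs. 6 and 7.

    def iter_batches(file_id_sequence, batch_size):
        """Yield consecutive batches of file IDs from the shuffled sequence.

        Each batch would then be handed to the client, which resolves the
        IDs to file blocks and reads the files.
        """
        for start in range(0, len(file_id_sequence), batch_size):
            yield file_id_sequence[start:start + batch_size]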
In this embodiment, because the shuffling is performed on the basis of grouped file blocks, any set of file identifiers belonging to the same block identifier group maps to only a small number of file blocks in the file identifier sequence. Consequently, when files are read in batches, the file identifiers of the current batch correspond to only a small number of file blocks, so the current batch only needs to process the files in those few blocks.
For example, suppose the file data set includes file block 1, file block 2, file block 3, and file block 4, where file block 1 and file block 2 form one file block group and file block 3 and file block 4 form another. If the batch size during training is smaller than the number of files contained in a file block group, then with high probability the file identifiers of a given batch correspond only to the files in one of the file block groups, or at least most of them do. For example, if the file identifiers of the current batch correspond only to the first file block group, only the data in file block 1 and file block 2 need to be processed, reducing the amount of data to handle; if they also correspond to the second file block group, file block 3 and file block 4 must be processed as well, for example queried.
It will be appreciated that the batch size may be smaller or larger than the number of file identifiers in a block identifier group; this embodiment does not limit it. Even if the batch size exceeds the number of file identifiers in a block identifier group, so that a batch spans several file block groups, the number of file blocks corresponding to the batch's file identifiers is still relatively small; the file identifiers of the current batch are thus still guaranteed to hit only a small number of file blocks, which keeps file reading based on the hit blocks convenient.
In the data reading method provided above, the file blocks required across multiple iterations are grouped on the basis of their block identifiers to form a plurality of block identifier groups, and these groups are used to shuffle both between groups and among the files within each group. The whole file data set can thus be shuffled conveniently and quickly, improving shuffling efficiency while preserving model training accuracy and ensuring reading efficiency. Moreover, because the file blocks are grouped, the files read in the current batch are more likely to hit file blocks of the same group; this raises the hit ratio, and since only the files in a small number of file blocks need to be processed per batch, reading efficiency is further improved.
In this embodiment, a data reading method is provided, which may be used for the above-mentioned computing node, and fig. 8 is a flowchart of the data reading method according to an embodiment of the present invention, as shown in fig. 8, and the flowchart includes the following steps.
Step S801, block identifiers of a plurality of file blocks are obtained, and file identifiers corresponding to each file in the file blocks are determined; a file block is a data block generated by aggregating files in a file data set.
Please refer to step S401 in the embodiment shown in fig. 4 in detail, which is not described herein.
Step S802, grouping block identifiers of a plurality of file blocks to form a plurality of block identifier groups.
Please refer to step S402 in the embodiment shown in fig. 4 in detail, which is not described herein.
In some alternative embodiments, the step S802 "groups the block identifiers of the plurality of file blocks to form a plurality of block identifier groups" may specifically include the following steps A1 to A2.
Step A1, determining the group size according to the number of files in the file blocks, the number of files in a file block being negatively correlated with the group size.
In this embodiment, a file block formed by aggregation includes a plurality of files. For example, if the file data set is a small file dataset, each small file is 1KB, and each aggregated file block is 4MB, then the file block contains 4000 small files in total, i.e., the file count is 4000.
When file blocks are grouped, each file block group includes a plurality of file blocks; thus the more files each file block holds, the more files the file block group holds. Similarly, the larger the group size of the file block group, the more files it holds; here, the group size denotes the number of file blocks contained in a file block group.
If the number of files in a file block group is large, the number of file identifiers in the block identifier group divided according to that group size is also large. Since the file identifiers within a block identifier group are shuffled, the file identifiers to be read in the current batch might then correspond to many file blocks, which is unfavorable for subsequent processing. Therefore, in this embodiment, the number of files per file block and the group size are negatively correlated: the more files each file block holds, the smaller the configured group size, so that the group size can be set adaptively. A suitable group size ensures the rate at which the current batch hits files within the same file block group, and thereby the block-based file reading efficiency of the current batch.
Alternatively, when the file blocks are formed by aggregation, a fixed file block size may be preset, and files are aggregated subject to that size; the number of files aggregated into a block then depends on the size of each file. For example, if the file block size is 4MB, the total size of all files aggregated into the block cannot exceed 4MB. In addition, the group size can also be set according to the size of the file data set, namely the number of files (or file blocks) it contains; the larger the file data set, the larger the group size may be.
Alternatively, since each batch is usually processed with a power-of-two size, i.e., the batch size is a power of 2 such as 64 or 128, the file block size need not be limited when aggregating files; instead, 2^m files may be aggregated into one file block, i.e., the number of files per file block is 2^m, and the group size is set to 2^n, where m and n are integers, e.g., both greater than 1. Suppose the number of files processed per batch is 2^p, i.e., the batch size is 2^p; the files to be processed in the current batch then belong to the same file block group with high probability. In particular, if p ≤ m + n, the files to be processed in the current batch necessarily belong to the same file block group, which guarantees subsequent reading efficiency.
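The alignment argument behind the p ≤ m + n condition can be checked with a short sketch: group boundaries fall at multiples of 2^(m+n) files, and a batch is a consecutive slice of 2^p file identifiers, so every batch stays inside one group exactly when 2^p divides 2^(m+n).

    def batches_stay_in_group(m, n, p):
        """With 2**m files per block and 2**n blocks per group, group
        boundaries fall at multiples of 2**(m+n); batches are consecutive
        slices of size 2**p, so if p <= m + n every batch lies entirely
        inside one block identifier group.
        """
        files_per_group = 2 ** (m + n)
        batch = 2 ** p
        return files_per_group % batch == 0  # equivalent to p <= m + n

    # 64 files/block, 4 blocks/group, batch size 128: 7 <= 6 + 2 holds
    assert batches_stay_in_group(m=6, n=2, p=7)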
Step A2, grouping the block identifiers of the plurality of file blocks according to the group size to form the plurality of block identifier groups.
In this embodiment, after the group size is determined, the block identifiers of the file blocks are grouped according to it, each resulting block identifier group containing group-size block identifiers.
Step S803, performing inter-group shuffling on the plurality of block identifier groups, and intra-group shuffling on the file identifiers in at least some of the block identifier groups, to form a shuffled file identifier sequence.
The file identifier could be the ID of the corresponding file, but such IDs are themselves relatively complex; for example, the ID of a file in a certain picture dataset is "ILSVRC2012_test_00000001.JPEG". The file ID may even incorporate the name of the file block it belongs to, for example: "imagenet_4m/imagenet_1/imagenet_2/ILSVRC2012_test_00000001.JPEG". Shuffling such file IDs directly is therefore complex to implement, and storing the shuffled file identifier sequence would also require considerable storage space.
In this embodiment, the index of the file within the whole file data set is used as its file identifier; that is, the file identifier is the index of the corresponding file among all the files contained in the plurality of file blocks.
Specifically, when a training task is started, a global file list covering all files of the file data set can be generated. The global file list represents the mapping between the files in the file data set and their metadata, so the corresponding files can be located through it when reading; it also determines the position of each file within it, and that position serves as the file's index. The global file list may be a list of the description data of all files in the file data set, the description data being the metadata of the corresponding file; it will be appreciated that the order of the elements in the global file list also represents an ordering of all files of the file data set.
Alternatively, the process of determining the file index using the global file list may include the following steps B1 to B3.
Step B1, in the first iteration of model training, generating, per block identifier group, a file list containing the description data of the group's files.
Step B2, shuffling the description data within each file list, and assembling the shuffled file lists, in block identifier group order, into a global file list comprising the description data of all the lists.
Step B3, using the position index of a file's description data in the global file list as the index of the corresponding file.
In this embodiment, during the first iteration (epoch), shuffling is performed per file list; this not only produces a shuffled global file list but also lets the shuffled files be used directly for the subsequent training, thereby ensuring the accuracy of the first training iteration.
For example, when a training task is started, a file block ID list containing all file block IDs may be generated and split adaptively according to the number of file block IDs and a specified thread count (50 by default); the size of each sublist obtained by splitting is the number of file block IDs divided by the thread count, rounded up. The specified number of threads is then started, each thread queries the aggregation information of the file blocks in its own sublist in batches, the small files aggregated in each file block are decompressed, and a mapping is established between each small file and metadata such as the chunk ID and chunk name.
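A sketch of this splitting and parallel query, assuming Python threads and a stand-in query_chunk function, since the embodiment does not spell out the metadata query interface; query_chunk is assumed to return (chunk_id, chunk_name, file_names) for one block.

    import math
    from concurrent.futures import ThreadPoolExecutor

    def split_for_threads(chunk_ids, num_threads=50):
        """Split the chunk-ID list into per-thread sublists; sublist size is
        ceil(len(chunk_ids) / num_threads), the rounded-up division above."""
        size = max(1, math.ceil(len(chunk_ids) / num_threads))
        return [chunk_ids[i:i + size] for i in range(0, len(chunk_ids), size)]

    def build_file_mapping(chunk_ids, query_chunk, num_threads=50):
        """Query each sublist in a thread and merge the resulting
        file -> (chunk_id, chunk_name) mappings."""
        def worker(ids):
            part = {}
            for cid in ids:
                chunk_id, chunk_name, file_names = query_chunk(cid)
                for name in file_names:
                    part[name] = (chunk_id, chunk_name)
            return part

        mapping = {}
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            for part in pool.map(worker, split_for_threads(chunk_ids, num_threads)):
                mapping.update(part)
        return mapping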
At the first iteration, the file block ID list is grouped according to a suitable group size to form a plurality of block identifier groups. After grouping, a corresponding file list (group_files) of the small files aggregated in each file block (chunk) is generated per block identifier group, and the file list of each block identifier group is shuffled; the shuffled file lists are then placed into the global file list in block identifier group order, forming a global file list containing all files.
Fig. 9 shows a schematic diagram of the process of forming the global file list. As shown in fig. 9, the file data set includes 6 file blocks: file block 1, file block 2, file block 3, file block 4, file block 5, and file block 6. These can be grouped to form the corresponding file block groups. As shown in fig. 9, all file blocks may first be shuffled taking the file block as the unit; the shuffled result in fig. 9 is: file block 4, file block 1, file block 6, file block 2, file block 5, file block 3. Taking every three file blocks as a group then yields file block group A, comprising file block 4, file block 1, and file block 6, and file block group B, comprising file block 2, file block 5, and file block 3.
Each file block group comprises a plurality of files arranged in order, so each file block group corresponds to a file list. As shown in fig. 9, file block 4 includes file 1, file 2, file 3, and file 4; file block 1 includes file 5, file 6, file 7, and file 8; and so on. Thus the file list of file block group A runs from file 1 to file 12, and that of file block group B from file 13 to file 24.
The files in each group's file list are then shuffled separately. As shown in fig. 9, the shuffled file list of file block group A is: file 5, file 2, file 11, file 9, file 1, file 8, file 12, file 4, file 7, file 3, file 6, file 10; and that of file block group B is: file 21, file 17, file 13, file 18, file 22, file 15, file 14, file 20, file 24, file 16, file 23, file 19. The two shuffled file lists are spliced, in the order of file block group A followed by file block group B, into the complete global file list, shown in the last row of fig. 9.
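The first-epoch construction of fig. 9 can be sketched as follows, assuming the file block groups have already been formed from the shuffled block list; build_global_file_list is a hypothetical name.

    import random

    def build_global_file_list(chunk_groups):
        """Shuffle each group's file list in place, then concatenate the
        lists in group order, mirroring fig. 9. `chunk_groups` is a list of
        block identifier groups, each a list of chunks, each chunk a list
        of file descriptions.
        """
        global_list = []
        for group in chunk_groups:
            group_files = [f for chunk in group for f in chunk]  # group_files
            random.shuffle(group_files)        # intra-group shuffle
            global_list.extend(group_files)    # append in group order
        return global_list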
After the global file list is determined, an index for each file may be determined based on it. The position of a file in the global file list may be represented by a position index, which accordingly serves as the file index.
For example, if the file list of each block identification group has length len, an index list covering idx to idx+len-1 may be generated for the files of each group, with the index idx starting from 0. The per-group index lists, arranged in sequence, form a global index list that represents the index of every file. An element value idx in the global index list is the index of the corresponding file in the global file list, i.e., samples[idx], so the file information at the corresponding position of the global file list can be obtained through samples[idx], where samples denotes the global file list.
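The construction of the global index list may be sketched as follows, where group_file_lists stands for the per-group file lists described above (an illustrative name):

def build_global_index_list(group_file_lists):
    # Each block identification group of length len contributes indexes
    # idx .. idx+len-1; idx starts from 0 and the per-group index lists
    # are arranged in sequence.
    global_index_list, idx = [], 0
    for file_list in group_file_lists:
        global_index_list.extend(range(idx, idx + len(file_list)))
        idx += len(file_list)
    return global_index_list

# samples[idx] then gives the file information at the corresponding
# position of the global file list samples.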
Also, referring to fig. 8, the above-described step S803 of "inter-group shuffling the plurality of block identification groups and intra-group shuffling the file identifications in at least part of the block identification groups" may include the following steps S8031 to S8033.
Step S8031, constructing a two-dimensional array; each first-dimension element in the two-dimensional array represents a corresponding block identification group, and the second-dimension elements under a first-dimension element represent the file identifications in the corresponding block identification group.
In the first iteration, since the global file list has already been shuffled, file reading can be performed directly based on it; that is, the corresponding files can be read in batches. For example, if the batch size is 4, then referring to fig. 9, file 5, file 2, file 11 and file 9 are read in the first batch, file 1, file 8, file 12 and file 4 in the next, i.e., second, batch, and so on.
In this embodiment, shuffling must be performed in every iteration, and in particular from the second iteration onward; to make shuffling of the file data set convenient, a two-dimensional array is constructed based on the indexes of the files.
Specifically, the block identification groups correspond to the first-dimension elements of the two-dimensional array, and the file identifications (i.e., indexes) in each block identification group serve as the second-dimension elements under the corresponding first-dimension element, so that a two-dimensional array containing all file indexes is formed.
For example, the two-dimensional array is: { { idx_11, idx_12, …, idx_1n }, { idx_21, idx_22, …, idx_2n }, …, { idx_m1, idx_m2, …, idx_mn } }, where idx_ij represents the j-th index in the i-th block identification group. For the first-dimension element { idx_11, idx_12, …, idx_1n }, each second-dimension element (e.g., idx_11, idx_12, etc.) represents an index belonging to the corresponding block identification group. The index idx_ij is essentially an index in the global file list; it is written in terms of i and j only for convenience, to indicate the block identification group to which it belongs and its position within that group.
Step S8032, the plurality of first dimension elements in the two-dimensional array are disordered, and at least part of second dimension elements corresponding to the first dimension elements are disordered.
In this embodiment, shuffling the plurality of first-dimension elements of the two-dimensional array is equivalent to the inter-group shuffle of the block identification groups, and shuffling the second-dimension elements under a first-dimension element is equivalent to the intra-group shuffle of the file identifications in that block identification group. By shuffling the two-dimensional array in both dimensions, intra-group and inter-group shuffling of the identifications of all files can be achieved conveniently.
Step S8033, converting the disordered two-dimensional array into a one-dimensional global index list according to the disordered sequence of the elements in the two-dimensional array; the global index list serves as a file identification sequence.
It can be understood that the order of the first-dimension elements in the two-dimensional array is the grouping order of the file blocks, i.e., the order of the block identification groups, and the order of the second-dimension elements under a first-dimension element is the order of the file list within that block identification group, i.e., the order of the file identifications. After the two-dimensional array is shuffled, the shuffled order of its elements is kept and the array is converted into a one-dimensional list; this list is a global index list containing all indexes, and the global index list is the file identification sequence.
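Steps S8031 to S8033 may be sketched in Python as follows, assuming for simplicity that every block identification group has the same file list length group_len:

import random

def shuffle_via_two_dim_array(global_index_list, group_len):
    # Step S8031: build the two-dimensional array; each first-dimension
    # element holds the indexes of one block identification group.
    two_dim = [global_index_list[i:i + group_len]
               for i in range(0, len(global_index_list), group_len)]
    # Step S8032: inter-group shuffle of the first-dimension elements,
    # then intra-group shuffle of the second-dimension elements.
    random.shuffle(two_dim)
    for row in two_dim:
        random.shuffle(row)
    # Step S8033: flatten the shuffled array into the one-dimensional
    # global index list, i.e. the file identification sequence.
    return [idx for row in two_dim for idx in row]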
For example, in the second iteration, the index of each file may be determined based on the global file list determined in the first iteration, and a corresponding two-dimensional array generated from those indexes. Shuffling the one-dimensional arrays (i.e., the first-dimension elements) within the two-dimensional array is equivalent to shuffling the order of the block identification groups while keeping the order of files within each group unchanged; shuffling the elements of each one-dimensional array (i.e., the second-dimension elements) is equivalent to shuffling the files within each block identification group. It can be understood that the shuffle operates on the two-dimensional array, shuffling the files indirectly through their indexes, so the massive files themselves never need to be shuffled directly. After shuffling, converting the shuffled two-dimensional array into a list yields the global index list required by this iteration, whose elements are indexes rather than specific file information. The files are then returned in batches from the global index list, batch size at a time, in order.
Fig. 10 shows a schematic diagram of the process of shuffling the indexes during the second iteration. As shown in fig. 10, an index corresponding to each file can be determined based on the global file list shown in fig. 9; starting with index 1 in fig. 10, index 1, index 2, …, index 24 are determined sequentially. The first one-dimensional array of the two-dimensional array includes indexes 1 to 12, corresponding to file block group A, and the second includes indexes 13 to 24, corresponding to file block group B. The one-dimensional arrays of the two-dimensional array are shuffled first, i.e., the inter-group shuffle; as shown in fig. 10, this changes the order of the file block groups, i.e., the order of the block identification groups. Then the elements within each one-dimensional array are shuffled, changing the order of the indexes in each block identification group, which is equivalent to changing the order of the files in the file group. This finally yields the shuffled global index list required by the second iteration, shown in the last row of fig. 10.
In each batch of the second iteration, the files corresponding to the respective indexes can then be read in batches according to the corresponding batch size. As shown in fig. 10, if index 23, index 17, index 22 and index 13 are read in the first batch, then since index 23 corresponds to file 23, index 17 to file 22, index 22 to file 16 and index 13 to file 21, it can be determined that files 23, 22, 16 and 21 need to be read in the first batch.
In this embodiment, the file identifications are represented by indexes, and the intra-group and inter-group shuffling of the files can be implemented simply and quickly with the two-dimensional array, which tracks both the order of the file block groups and the order of the files within each group after shuffling, so that subsequent reading efficiency can be ensured. In addition, this shuffling mode improves the hit rate on the same group of file blocks within each batch; that is, most of the files hit in a batch belong to the same file block group.
In step S804, the files corresponding to the file identifications included in the current batch are read in batches according to the order of the file identifications in the file identification sequence.
Please refer to step S404 in the embodiment shown in fig. 4 in detail, which is not described herein.
According to the data reading method provided by this embodiment, shuffling of the entire file data set can be realized conveniently and quickly by using the two-dimensional array; shuffling efficiency is improved while model training accuracy is ensured, and reading efficiency is also ensured.
In this embodiment, a data reading method is provided, which may be used for the above-mentioned computing node, and fig. 11 is a flowchart of the data reading method according to an embodiment of the present invention, as shown in fig. 11, and the flowchart includes the following steps.
Step 1101, obtaining block identifiers of a plurality of file blocks, and determining a file identifier corresponding to each file in the file blocks; a file block is a data block generated by aggregating files in a file data set.
Please refer to step S401 in the embodiment shown in fig. 4 in detail, which is not described herein.
In step S1102, block identifiers of a plurality of file blocks are grouped to form a plurality of block identifier groups.
Please refer to step S402 in the embodiment shown in fig. 4 in detail, which is not described herein.
In step S1103, the plurality of block identifier groups are disordered between groups, and the file identifiers in at least some of the block identifier groups are disordered within groups, so as to form a disordered file identifier sequence.
Please refer to step S403 in the embodiment shown in fig. 4 in detail, which is not described herein.
Step S1104, addressing the files corresponding to the file identifications contained in the current batch in batches according to the order of the file identifications in the file identification sequence, and determining corresponding addressing information.
In this embodiment, since a file identifier is not the file itself and the files are stored in the aggregated storage space, addressing based on the file identifier is required in order to locate the desired file in the aggregated storage space, i.e., to determine the addressing information of the desired file. After the addressing information is determined, the corresponding file can be read from the aggregated storage space.
In some alternative embodiments, the addressing may be performed in units of file blocks, specifically, the above step S1104 "addressing the file corresponding to the file identifier included in the current batch, and determining the corresponding addressing information" may include the following steps C1 to C3.
And step C1, determining a file block to which the target file hit by the current batch belongs according to the file identifier contained in the current batch.
In this embodiment, the file identifier included in the current batch is a file identifier of a file to be read in the current batch, and for convenience of description, the file to be read in the current batch is referred to as a target file, and the target file is also a file hit in the current batch; the file block to which the target file belongs may also be referred to as a target file block.
For example, if the file identifier is an index, the file (i.e., the target file) mapped by the file index and the mapped file block may be determined according to the file index that needs to be read by the current batch. As shown in fig. 10, in the first batch of the second iteration, for the index 23, it has a mapping relationship with the file 23, and accordingly, the index 23 maps the file block to which the file 23 belongs, as shown in fig. 9, to be the file block 3; while for index 17 it has a mapping relation with file 22, and accordingly, as shown in fig. 9, this index 17 also has a mapping relation with file block 3.
And C2, classifying the target files belonging to the same file block into a group, and generating a target file list containing at least one file identifier corresponding to the target file.
For example, as shown in fig. 10, in the first batch of the second iteration, the indexes 23, 17, 22 and 13 have mapping relations with the files 23, 22, 16 and 21 respectively; the files 23, 22 and 21 all belong to the file block 3, while the file 16 belongs to the file block 2. Therefore, the file identifications corresponding to the files 23, 22 and 21, that is, the indexes 23, 17 and 13, can be grouped to form a target file list containing these three indexes; the index 22 corresponding to the file 16 forms another group, giving another target file list containing the index 22.
And C3, respectively determining the addressing information of the target files corresponding to each target file list by taking the target file list as a unit.
In this embodiment, since only a small number of file blocks will be hit in any batch, the target files belonging to the same file block are grouped together, so that each generated target file list corresponds to exactly one file block; the addressing information of the files then only needs to be determined per target file list, i.e., within one file block at a time.
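Steps C1 to C3 may be sketched as follows; samples denotes the global file list as above, and chunk_of_file is a hypothetical mapping from a file to the name of the file block that aggregates it:

from collections import defaultdict

def group_batch_by_chunk(batch_indexes, samples, chunk_of_file):
    # Step C1: map each file identifier (index) of the current batch to its
    # target file and to the file block that aggregates that file.
    # Steps C2/C3: target files of the same block form one target file
    # list, which is then addressed as a unit.
    target_lists = defaultdict(list)
    for idx in batch_indexes:
        target_file = samples[idx]
        target_lists[chunk_of_file[target_file]].append(target_file)
    return target_lists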
Optionally, the method may further include: starting a plurality of working processes, and reading files of corresponding batches based on the working processes; different working processes are used for reading files of different batches, and after the working process finishes reading files of the current batch, the working process is used for reading files of the next batch until the file data set is read.
And in the file reading process of the current batch, starting working threads matched with the number of the hit file blocks, wherein each working thread is used for determining addressing information of the files in the corresponding file blocks and reading the corresponding files. By respectively starting each thread for each file block and addressing in a multithreading parallel processing mode, the addressing efficiency can be improved.
In this embodiment, since each iteration is processed in multiple batches, multiple worker processes can be started to process the data of different batches in parallel, each process handling the data of its corresponding batch; after a process finishes, it goes on to process the data of another batch. This parallel processing improves the efficiency of batch data processing.
FIG. 12 shows a schematic flow chart of batch data reading implemented based on the training library. As shown in fig. 12, the file data set required for training can be obtained from a remote data source storing massive small-file data sets; fig. 12 takes a picture data set as an example. After the picture data set is acquired, multiple worker processes can be started based on the data loader (DataLoader), enabling batch iterations in parallel, and multiple worker threads can be started within each process. As shown in fig. 12, during the processing of each batch, the files mapped by the indexes of the current batch (i.e., the target files) and the mapped file blocks are determined, and worker threads matching the number of hit file blocks are opened. Taking the first batch of the second iteration shown in fig. 10 as an example, it involves at most three file blocks of file block group B, namely file block 2, file block 5 and file block 3, so three threads can be started to handle the file addressing of file block 2, file block 5 and file block 3 respectively, to determine the addressing information required by the current batch. After the addressing information is determined, the corresponding files can be read from the distributed file system based on the read function open() and converted (transform) into the desired format for subsequent training.
For the first batch of the second iteration shown in fig. 10, the process of file addressing based on file blocks is shown in fig. 13. For example, if index 23 corresponds to file 23 in file block 3 and the opened thread 3 handles the data related to file block 3, then thread 3 may read the corresponding file 23 from the distributed file system according to the addressing information corresponding to index 23.
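A simplified sketch of this per-batch threading model follows; address_and_read(chunk_name, files) is a hypothetical per-block routine that determines addressing information and reads (and returns) the files:

from concurrent.futures import ThreadPoolExecutor

def read_current_batch(target_lists, address_and_read):
    # Start worker threads matching the number of hit file blocks; each
    # thread determines the addressing information of the files in its
    # block and reads them.
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(target_lists))) as pool:
        futures = [pool.submit(address_and_read, chunk_name, files)
                   for chunk_name, files in target_lists.items()]
        for future in futures:
            results.extend(future.result())
    return results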
In some alternative embodiments, the step S1104 "addressing the file corresponding to the file identifier included in the current batch, and determining the corresponding addressing information" may include the following step D1.
Step D1, determining a target file hit by a file identification contained in the current batch, and generating an addressing information object of the target file; the addressing information object includes location information of the target file in the corresponding file block.
If addressing were performed based on information such as the metadata of the files, the entire file block would generally have to be traversed, for example file by file with a for loop, which is inefficient; operations such as file deletion and verification may also be involved, and when the amount of data is large the addressing efficiency suffers further. In this embodiment, an object (object) for addressing, i.e., an addressing information object, is therefore set for each target file, and addressing of the target file is achieved based on the addressing information object.
Specifically, the addressing information object only needs to include the location information required when reading the target file, which may specifically include the handle (file handle), position, and offset of the target file in the file block. Since lightweight addressing information objects are used for addressing, the addressing efficiency is high.
In some alternative embodiments, the step D1 "generating the addressing information object of the target file" may specifically include the following step D11 or step D12.
Step D11, in a batch addressing mode, addressing information objects of a plurality of target files are generated in batches.
Step D12, addressing is carried out according to the constructed global dictionary, and addressing information objects of a plurality of target files are determined; the global dictionary is used for maintaining addressing information objects of all files of the file block where the target file is located.
In this embodiment, when determining the addressing information object, a suitable manner may be selected according to the actual situation; among them, the alternative ways include: a batch addressing approach, and a global dictionary based addressing approach.
Alternatively, the file block to which the target file hit in the current batch belongs is referred to as a target file block, and an appropriate addressing manner may be selected based on the number of files in the target file block, or the like. Specifically, the above step D1 "generating the addressing information object of the target file" may further include the following step D13.
Step D13, determining a first number of files in the target file block hit by the file identifier contained in the current batch, and determining a second number of target files belonging to the target file block hit by the file identifier contained in the current batch.
A threshold for the first number, i.e., a first threshold, and a threshold for the second number, i.e., a second threshold, are set. If the first number exceeds the first threshold, the file block itself contains many files; if the second number is large, most of the files in the target file block are hit in the current batch. Therefore, when the first number exceeds the first threshold and the second number exceeds the second threshold, the batch addressing manner may be adopted, since in this case a global dictionary would be costly to construct and would have to be traversed many times; that is, step D11 is performed: generating the addressing information objects of a plurality of target files in batches in a batch addressing manner.
Conversely, if the first number does not exceed the first threshold and the second number does not exceed the second threshold, the global dictionary will not be excessively large even if created, so addressing can be performed based on the global dictionary; that is, the above step D12 can be performed: addressing according to the constructed global dictionary to determine the addressing information objects of a plurality of target files.
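The selection logic of step D13 may be sketched as follows; note that the embodiment does not prescribe the mixed cases, so the fallback branch is an assumption of this sketch:

def choose_addressing_mode(first_number, second_number,
                           first_threshold, second_threshold):
    # first_number: total number of files in the hit target file block;
    # second_number: number of target files of the current batch in it.
    if first_number > first_threshold and second_number > second_threshold:
        return "batch_addressing"       # step D11
    if first_number <= first_threshold and second_number <= second_threshold:
        return "global_dictionary"      # step D12
    # Mixed cases are left open by the embodiment; defaulting to batch
    # addressing here is an assumption of this sketch.
    return "batch_addressing"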
Alternatively, the step D11 "of generating the addressing information object of the plurality of target files in batch in a batch addressing manner" may include steps D111 to D115.
Step D111, determining the target file block hit by the file identification contained in the current batch, and creating the block object of the target file block.
Step D112, traversing the header information of the file in the target file block, and judging whether the header information of the file is matched with the file identification contained in the current batch.
Step D113, when the header information of the file is matched with the file identification contained in the current batch, determining the position information of the file in the target file block according to the header information of the file; the location information includes a handle to the file, a location in the target file block, and an offset.
Step D114, generating an addressing information object of the matched file according to the position information, and returning the addressing information object to the block object.
And step D115, stopping traversing the target file block when the number of the addressing information objects in the block object is consistent with the number of the file identifications belonging to the target file block in the file identifications contained in the current batch.
In this embodiment, for a target file block hit by the file identifications contained in the current batch, an object (object) of the target file block, i.e., a block object, may be created. During batch addressing, the files in the target file block are traversed; the traversal is realized based on the header information of the files, that is, only the header information of each file is visited, and the file contents themselves do not need to be traversed.
The header information of a file can directly or indirectly represent the file identification, so whether the file matches a file identification contained in the current batch can be judged based on its header information. A matched file is a target file to be read in the current batch, and the header information of the matched file is the header information of that target file.
Moreover, the header information of a file can represent the location information of the file in the target file block. Specifically, the location information may include the handle of the file in the target file block, as well as its position, offset, etc., and can be used to form the corresponding addressing information. Thus an addressing information object for the matched file can be generated based on the location information and returned to the block object.
If all the file identifications contained in the current batch have been matched, batch addressing is complete, and the target file block no longer needs to be traversed. Specifically, among the file identifications contained in the current batch, the number belonging to the target file block is fixed, say a. If, by traversing the header information in the target file block, a pieces of matched header information have been found, then a addressing information objects exist in the block object of the target file block; at that point all file identifications of the current batch associated with the target file block are matched and the block has been fully addressed, so its traversal can stop and other target file blocks can be traversed.
For example, a thread that processes a target file block may determine the total number of files of the target file block, i.e., the first number, and the number of target files that hit the target file block, i.e., the second number; if the number of hit target files is high, for example exceeds a configured fraction (e.g., 1/3) of the total number of files aggregated in the target file block, the batch addressing manner is used.
For example, the distributed file system is a CacheFs system, the block object of the target file block is CacheFsChunkIO, and the addressing information object of a file is CacheFsChunkFile. During batch addressing, the target file list can be passed in when the CacheFsChunkIO object is created. When the CacheFsChunkIO object is initialized, the header information of the target file block is traversed in order; for each piece of header information addressed, it is judged whether the small-file name corresponds to a target file in the target file list. If so, the addressing information object of that target file, i.e., a CacheFsChunkFile object, is generated, mapped with the small-file name, and put into an addressing information object list. When the length of the addressing information object list is consistent with that of the target file list, addressing of the target file list is complete, and the CacheFsChunkFile objects are returned. It will be appreciated that there is typically no need to maintain a global CacheFsChunkFile object in this case, i.e., no need to maintain the addressing information objects of all files in the target file block.
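A simplified sketch of this batch addressing with early stop follows; header_records, standing for an iterator over the (file name, fh, pos, offset) header entries of the target file block, is an assumption of the sketch, and the classes are reduced to the minimum needed to show the technique:

class CacheFsChunkFile:
    # Lightweight addressing information object: just the location data
    # needed to read a small file out of its file block.
    def __init__(self, fh, pos, offset):
        self.fh, self.pos, self.offset = fh, pos, offset

def batch_address(header_records, target_file_list):
    wanted = set(target_file_list)
    found = {}
    for name, fh, pos, offset in header_records:
        if name in wanted:
            found[name] = CacheFsChunkFile(fh, pos, offset)
            if len(found) == len(wanted):
                break   # every identifier of this block matched: stop early
    return found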
Alternatively, the above step D12 "addressing according to the constructed global dictionary, determining the addressing information objects of the plurality of target files" may include steps D121 to D126.
Step D121, determining whether the target file block hit by the file identifier included in the current batch is the first hit.
Step D122, in the case of hitting the target file block for the first time, creating a block object of the target file block.
Step D123, traversing the header information of all files in the target file block, and determining the position information of the files in the target file block according to the header information of the files; the location information includes a handle to the file, a location in the target file block, and an offset.
Step D124, determining the addressing information objects of all files according to the position information, and returning the addressing information objects of all files to the block object.
And step D125, the block object is put into a global dictionary, and the addressing information object of the target file corresponding to the file identification contained in the current batch is read from the global dictionary.
Step D126, under the condition that the target file block is not hit for the first time, reading the addressing information object of the target file corresponding to the file identifier contained in the current batch from the global dictionary.
In this embodiment, if the target file block is hit for the first time, the addressing information objects of all files in the target file block are determined when its block object is created, forming a global dictionary containing all addressing information objects. The addressing information object of a target file can then be determined from the global dictionary both in the current batch and in subsequent batches, so the file block does not need to be read repeatedly and the number of IO operations is reduced.
For example, a thread that processes a target file block may determine the total number of files of the target file block, i.e., the first number, and the number of target files that hit the target file block, i.e., the second number; if the number of hit target files is small and the target file block contains only a few files, addressing can be performed through the constructed global dictionary.
For example, the distributed file system is a CacheFs system, the block object of the target file block is CacheFsChunkIO, and the addressing information object of a file is CacheFsChunkFile. If a global dictionary needs to be built, then when the target file block is hit for the first time, its block object, i.e., the CacheFsChunkIO object, can be created directly. When this object is initialized, the small-file header information of the target file block is traversed in order to obtain the location information of all small files in the block, such as the handle fh, position pos and offset; based on this location information, the addressing information object of each small file, i.e., a CacheFsChunkFile object, is generated, mapped with the small-file name, and put into the global dictionary members (also called the global member dictionary) of the CacheFsChunkIO object. After the traversal is completed, the CacheFsChunkIO object is mapped with the file block name of the target file block and put into the global dictionary chunk_stream of the dataset. During DataLoader multiprocess concurrency, the global addressing of each file block is thus traversed only once. Afterwards, when other processes hit the block, the CacheFsChunkIO object is extracted directly from the global dictionary chunk_stream, and the small-file addressing information, i.e., the addressing information objects, is found in the global dictionary members. It will be appreciated that this process only involves the header information of the target file block and logical data processing; reading of the actual file data in the target file block is not yet triggered.
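A condensed sketch of the global-dictionary addressing follows; for brevity the CacheFsChunkIO block object is collapsed into a plain members dictionary, read_header_records() is a hypothetical header traversal, and CacheFsChunkFile is the class from the sketch above:

chunk_stream = {}   # dataset-level global dictionary: chunk name -> members

def get_addressing_object(chunk_name, file_name, read_header_records):
    # First hit of the block: traverse all header information once and
    # build the members dictionary (file name -> CacheFsChunkFile).
    if chunk_name not in chunk_stream:
        chunk_stream[chunk_name] = {
            name: CacheFsChunkFile(fh, pos, offset)
            for name, fh, pos, offset in read_header_records(chunk_name)
        }
    # Later hits, in this batch or subsequent batches, are served directly
    # from the global dictionary without re-reading the block headers.
    return chunk_stream[chunk_name][file_name]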
For example, the file identification is an index. During iterative batch data reading, if in the current iteration the index list received by the training library is [2, 3, 35, 36, 98, 35, 17, 9, 8, 20], the index list may be grouped according to the file blocks to which the files corresponding to the indexes belong, generating, for each file block, the list of files belonging to it, i.e., a target file list, mapped with the file block name (chunk name) of that block.
For example, [2, 98, 35, 17, 9, 8, 20] forms one group, all belonging to the file block imagenet_4M_1, with the corresponding files as follows:
[
'imagenet_4M/imagenet_3/imagenet_2/ILSVRC2012_test_00000008.JPEG',
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_00000035.JPEG',
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_0000006.JPEG',
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_0000007.JPEG',
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_0000008.JPEG',
'imagenet_4M/imagenet_2/imagenet_1/ILSVRC2012_test_00000014.JPEG',
'imagenet_4M/imagenet_2/imagenet_1/ILSVRC2012_test_00000017.JPEG'
]
And [3, 35, 36] forms another group, all belonging to the file block imagenet_4M_2, with the corresponding files as follows:
[
'imagenet_4M/imagenet_1/imagenet_1/ILSVRC2012_test_00000015.JPEG',
'imagenet_4M/imagenet_1/imagenet_1/ILSVRC2012_test_00000016.JPEG',
'imagenet_4M/imagenet_1/imagenet_1/ILSVRC2012_test_00000017.JPEG'
]
Thereafter, a global dictionary chunk_stream containing the block objects of the target file blocks may be constructed. For example, if the global dictionary chunk_stream contains the block object of the file block imagenet_4M_1 described above, one form of the global dictionary chunk_stream may be as follows:
CachefsDataset self.chunk_stream
{
'/mnt/jfs/pack/imagenet_4M_1': <cachefs.CacheFsChunkIO.CacheFsChunkIO object at 0x7f2db285be10>
}
Also, the header information of the target file block imagenet_4M_1 may be traversed to determine the addressing information object CacheFsChunkFile of each file, including the handle fh, position pos, offset, and the like. The addressing information objects of all files are then returned to the block object, forming the corresponding global dictionary members. For example, one form of the global dictionary members may be as follows:
CacheFsChunkIO self.members
{
'imagenet_4M/imagenet_3/imagenet_2/ILSVRC2012_test_00000008.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4f87e65d90>,
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_00000035.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65ee50>,
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_0000006.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65ee58>,
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_0000007.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65ee72>,
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_0000008.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65ee69>,
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_00000038.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65ae50>,
'imagenet_4M/imagenet_1/imagenet_2/ILSVRC2012_test_00000031.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65gy50>,
'imagenet_4M/imagenet_2/imagenet_1/ILSVRC2012_test_00000013.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65fg90>,
'imagenet_4M/imagenet_2/imagenet_1/ILSVRC2012_test_00000014.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65hjk0>,
'imagenet_4M/imagenet_2/imagenet_1/ILSVRC2012_test_00000017.JPEG':<cachefs.CacheFsChunkFile.CacheFsChunkFile object at 0x7f4faf65vb35>
……
}
A timing diagram of the training library implementing addressing based on the addressing information objects is shown in fig. 14.
Step S1105, reading the corresponding file from the hit file block according to the addressing information.
In this embodiment, the corresponding target file can be quickly read out according to the addressing information object. It will be appreciated that if a global dictionary has been built, the addressing information object is retrieved from the global dictionary.
In some alternative embodiments, the step S1105 "reading the corresponding file from the hit file block according to the addressing information" may include the following steps E1 to E3.
And E1, judging whether the target file block hit by the file identification contained in the current batch is the first hit.
And E2, under the condition that the target file block is hit for the first time, caching the target file block, and reading the corresponding target file from the cached target file block according to the addressing information.
And E3, under the condition that the target file block is not hit for the first time, reading the corresponding target file from the cached target file block according to the addressing information.
In this embodiment, since each batch generally hits only a small number of file blocks, and subsequent batches may read other files in the same file block, when a target file block is hit for the first time the entire block is read, i.e., all data in the target file block is read and cached, for example in the memory or disk space of the local computing node. When a file in the target file block subsequently needs to be read, the corresponding target file is read directly from the cache, so only one IO interaction with the storage system is required, effectively reducing the number of IO interactions. The storage system may be, for example, the storage space of the aggregate storage shown in fig. 2, or a third-party storage system such as Ceph, GFS, Lustre, HDFS or Swift, which is not limited in this embodiment.
Specifically, when the reading of any file in a file block is triggered, the file block is cached in memory or on a local cache disk after one interaction with the storage system. On the computing node, if other files of that file block are read in this batch or in subsequent batches, the file data is read directly from the cached file block using the addressing strategy and addressing information, without further interaction with the storage system, thereby improving file reading efficiency.
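Steps E1 to E3 may be sketched as follows; fetch_chunk() is a hypothetical whole-block read from the storage system, and interpreting pos as the start position and offset as the length of the file within the block is an assumption of this sketch:

chunk_cache = {}   # e.g. in-memory cache on the local computing node

def read_target_file(chunk_name, addr, fetch_chunk):
    # First hit: a single IO interaction reads the whole file block from
    # the storage system and caches it.
    if chunk_name not in chunk_cache:
        chunk_cache[chunk_name] = fetch_chunk(chunk_name)
    block = chunk_cache[chunk_name]
    # Locate the target file inside the cached block by its addressing
    # information (pos as start and offset as length: an assumption here).
    return block[addr.pos:addr.pos + addr.offset]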
Alternatively, the above step E3 "reading the corresponding target file from the cached target file block according to the addressing information" may include the following steps E31 to E32.
Step E31, transmitting an addressing information object of the target file to an image reading function to generate a picture file object under the condition that the target file is picture data; and reading corresponding picture data according to the picture file object, and performing format conversion.
And E32, when the target file is text data, reading out corresponding text data according to the addressing information object of the target file, and performing format conversion.
In this embodiment, for each worker thread of a file block, after addressing is completed, the target file list may be traversed and the addressing information object CacheFsChunkFile in the global dictionary hit according to the file name. If the target file is picture data, the addressing information object may be passed to an image reading function, for example Image.open(), to generate the corresponding picture file object; for example, a picture file object may be: <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x495 at 0x7F5394EE6B50>. Then, during conversion (transform), the corresponding picture data can be read based on the picture file object and format conversion performed.
For example, one type of converted picture data is:
tensor([[[ 0.4851,  0.5536,  0.5878,  ..., -1.1589, -1.1932, -1.3130],
         [ 0.5022,  0.5022,  0.5022,  ..., -0.7479, -0.7479, -0.7479],
         [ 0.5878,  0.6049,  0.5707,  ..., -0.6452, -0.6794, -0.6452],
         ...
It will be appreciated that upon conversion, a read of the file is triggered.
If text data currently needs to be read, that is, the target file is text data, the text data can be read out directly using the read() method of the addressing information object CacheFsChunkFile and then converted into the specified format as needed, thereby completing the reading of the text data.
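Steps E31 and E32 may be sketched as follows, assuming the addressing information object exposes a file-like read/seek interface so that it can be passed to Image.open():

from PIL import Image

def load_sample(addr_obj, is_image, transform):
    if is_image:
        # Pass the addressing information object to the image reading
        # function to obtain a picture file object.
        picture = Image.open(addr_obj)
        # Reading of the picture data is triggered during conversion.
        return transform(picture)
    # Text data: read directly via the addressing object, then convert.
    return transform(addr_obj.read())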
Optionally, if a global dictionary is maintained, the data in it needs to be deleted at the appropriate time. Specifically, the method further includes: after the target file corresponding to an addressing information object in the global dictionary has been read, deleting that addressing information object; and after the target files corresponding to all addressing information objects in a block object have been read, deleting the corresponding block object.
In this embodiment, after the file data is read, if a global dictionary chunk_stream was generated, the addressing information object in the global dictionary members of the CacheFsChunkIO object of the file block to which the file belongs is deleted; and when the member information of the global dictionary members has been completely deleted, the CacheFsChunkIO object of that file block is cleared from the global dictionary chunk_stream. This reduces the maintenance burden of the global dictionary under massive data and improves processing and hit efficiency.
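Continuing the global-dictionary sketch above, the timely deletion may look as follows:

def release_after_read(chunk_name, file_name):
    # Delete the addressing information object once its target file has
    # been read (chunk_stream is the dictionary from the sketch above).
    members = chunk_stream.get(chunk_name)
    if members is None:
        return
    members.pop(file_name, None)
    # When a block object no longer holds any addressing information
    # objects, clear it from the dataset-level global dictionary too.
    if not members:
        del chunk_stream[chunk_name]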
The reading method provided by this embodiment is compared with reading data directly from NVMe/SSD/HDD disks; the performance comparison can be seen in figs. 15 to 17. Figs. 15, 16 and 17 are test result graphs of the performance comparison based on SSD, NVMe and HDD respectively, each showing the running time (elapsed time), i.e., the read time in seconds (s), of three data sets of different sizes in the first iteration (epoch 1). The test conditions are batch size = 128 and 16 processes, with data sets of different sizes tested; for example, the data set "4KB-100W" contains 1,000,000 files (W denotes 10,000) of 4KB each, and the other data set labels "4KB-1000W", "128KB-100W", and so on, are analogous. The ordinate in the test result graphs indicates the running time. The test results show that the reading method provided by this embodiment (implemented based on the CacheFs distributed system) reduces the running time by different magnitudes, which further proves that the method can accelerate the reading of massive small-file data sets for AI training.
The embodiment also provides a data reading device, which is used for implementing the above embodiment and the preferred implementation, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a data reading apparatus, as shown in fig. 18, including:
An obtaining module 1801, configured to obtain block identifiers of a plurality of file blocks, and determine a file identifier corresponding to each file in the file blocks; the file blocks are data blocks generated by aggregating files in a file data set;
A grouping module 1802, configured to group block identifiers of the plurality of file blocks to form a plurality of block identifier groups;
The disorder module 1803 is configured to disorder groups of the plurality of block identifier groups, and disorder in groups of file identifiers in at least some of the block identifier groups to form a disordered file identifier sequence;
The reading module 1804 is configured to read, in batches, the files corresponding to the file identifications included in the current batch according to the order of the file identifications in the file identification sequence.
In some alternative embodiments, the file identification is an index of the corresponding file in all files contained by the plurality of file blocks;
The disorder module 1803 performs inter-group disorder on the plurality of block identifier groups, and performs intra-group disorder on file identifiers in at least part of the block identifier groups, so as to form a disordered file identifier sequence, including: constructing a two-dimensional array; a first dimension element in the two-dimension array represents a corresponding block identification group, and a second dimension element corresponding to the first dimension element represents a file identification in the corresponding block identification group; the first dimension elements in the two-dimensional array are disordered, and at least part of the second dimension elements corresponding to the first dimension elements are disordered; according to the sequence of the disordered elements in the two-dimensional array, converting the disordered two-dimensional array into a one-dimensional global index list; the global index list serves as a file identification sequence.
In some alternative embodiments, the batch read files are used for model training, and the out-of-order module 1803 is further configured to: generating a file list containing description data of a plurality of files by taking the block identification group as a unit in a first iteration process of model training; the description data in the file list is disordered, and the disordered file list is formed into a global file list comprising the description data in a plurality of file lists according to the sequence of the block identification groups; and the position index of the description data of the file in the global file list is used as the index of the corresponding file.
In some alternative embodiments, the grouping module 1802 groups block identifications of the plurality of file blocks to form a plurality of block identification groups, including: determining the grouping size according to the number of files in the file block; the number of files in the file block and the grouping size are in a negative correlation; and grouping the block identifications of the plurality of file blocks according to the grouping size to form a plurality of block identification groups.
In some alternative embodiments, the number of files in a file block is the number of files aggregated under the limitation of the file block size, and the grouping size is a size set according to the size of the file data set; or the number of files in a file block is 2^m and the grouping size is 2^n, where m and n are integers.
In some alternative embodiments, the reading module 1804 reads a file corresponding to a file identifier contained in the current lot, including: addressing the file corresponding to the file identifier contained in the current batch, and determining corresponding addressing information; and reading the corresponding file from the hit file block according to the addressing information.
In some alternative embodiments, the reading module 1804 addresses a file corresponding to a file identifier included in the current batch, and determines corresponding addressing information, including: determining a target file hit by a file identification contained in a current batch, and generating an addressing information object of the target file; the addressing information object includes location information of the target file in a corresponding file block.
In some alternative embodiments, the reading module 1804 generates an addressing information object for the target file, including: generating addressing information objects of a plurality of target files in batches in a batch addressing mode; or addressing is carried out according to the constructed global dictionary, and addressing information objects of a plurality of target files are determined; the global dictionary is used for maintaining addressing information objects of all files of the file block where the target file is located.
In some alternative embodiments, the reading module 1804 generates addressing information objects of a plurality of the target files in a batch addressing manner, including: determining a target file block hit by a file identifier contained in a current batch, and creating a block object of the target file block; traversing the header information of the file in the target file block, and judging whether the header information of the file is matched with the file identifier contained in the current batch; when the header information of the file is matched with the file identification contained in the current batch, determining the position information of the file in the target file block according to the header information of the file; the location information includes a handle of the file, a location in the target file block, and an offset; generating an addressing information object of the matched file according to the position information, and returning the addressing information object to the block object; and stopping traversing the target file block when the number of the addressing information objects in the block object is consistent with the number of the file identifications belonging to the target file block in the file identifications contained in the current batch.
In some alternative embodiments, the reading module 1804 addresses according to a constructed global dictionary to determine addressing information objects for a plurality of the target files, including: judging whether a target file block hit by a file identifier contained in the current batch is hit for the first time; creating a block object of the target file block under the condition of hitting the target file block for the first time; traversing the header information of all files in the target file block, and determining the position information of the files in the target file block according to the header information of the files; the location information includes a handle of the file, a location in the target file block, and an offset; determining addressing information objects of all files according to the position information, and returning the addressing information objects of all files to the block object; the block object is put into a global dictionary, and an addressing information object of a target file corresponding to a file identifier contained in the current batch is read from the global dictionary; and under the condition that the target file block is not hit for the first time, reading the addressing information object of the target file corresponding to the file identification contained in the current batch from the global dictionary.
In some alternative embodiments, the apparatus further comprises a deletion module for: deleting the addressing information object corresponding to the read target file after the target file corresponding to the addressing information object in the global dictionary is read; and deleting the corresponding block object after the target file corresponding to any addressing information object in the block objects is read.
In some alternative embodiments, the reading module 1804 generates an addressing information object for the target file, including: determining a first number of files in a target file block hit by a file identifier contained in a current batch, and determining a second number of target files belonging to the target file block hit by the file identifier contained in the current batch; generating addressing information objects of a plurality of target files in a batch addressing mode under the condition that the first number exceeds a first threshold value and the second number exceeds a second threshold value; and under the condition that the first quantity does not exceed a first threshold value and the second quantity does not exceed a second threshold value, addressing is carried out according to the constructed global dictionary, and addressing information objects of a plurality of target files are determined.
In some alternative embodiments, the reading module 1804 addresses a file corresponding to a file identifier included in the current batch, and determines corresponding addressing information, including: determining a file block to which a target file hit by the current batch belongs according to a file identifier contained by the current batch; grouping target files belonging to the same file block into a group, and generating a target file list containing at least one file identifier corresponding to the target file; and respectively determining addressing information of the target files corresponding to each target file list by taking the target file list as a unit.
In some alternative embodiments, the read module 1804 is further configured to: starting a plurality of working processes, and reading files of corresponding batches based on the working processes; different working processes are used for reading files of different batches, and after the working process reads the files of the current batch, the working process is used for reading the files of the next batch until the file data set is read; and in the file reading process of the current batch, starting working threads matched with the number of the hit file blocks, wherein each working thread is used for determining addressing information of the files in the corresponding file blocks and reading the corresponding files.
In some alternative embodiments, the reading module 1804 reads a corresponding file from the hit file block according to the addressing information, including: judging whether a target file block hit by a file identifier contained in the current batch is hit for the first time; under the condition of hitting the target file block for the first time, caching the target file block, and reading a corresponding target file from the cached target file block according to the addressing information; and under the condition that the target file block is not hit for the first time, reading the corresponding target file from the cached target file block according to the addressing information.
In some alternative embodiments, the reading module 1804 reads the corresponding target file from the cached target file block according to the addressing information, including:
Transmitting an addressing information object of the target file to an image reading function to generate a picture file object under the condition that the target file is picture data; reading corresponding picture data according to the picture file object, and performing format conversion;
And under the condition that the target file is text data, corresponding text data is read out according to the addressing information object of the target file, and format conversion is carried out.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The data reading device in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above functions.
The embodiment of the invention also provides computer equipment which can comprise the data reading device shown in the figure 18.
Referring to fig. 19, fig. 19 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention, as shown in fig. 19, the computer device includes: one or more processors 10, memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the computer device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 19.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.
The memory 20 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for functions, and the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device further comprises an input device 30 and an output device 40. The processor 10, the memory 20, the input device 30, and the output device 40 may be connected by a bus or by other means; connection by a bus is exemplified in fig. 19.
The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer device, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointer stick, one or more mouse buttons, a trackball, or a joystick. The output device 40 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
The embodiments of the present invention also provide a computer readable storage medium. The method according to the embodiments described above may be implemented in hardware or firmware, or as computer code recorded on a storage medium, or as computer code originally stored in a remote storage medium or a non-transitory machine readable storage medium, downloaded over a network, and stored in a local storage medium, so that the method described herein may be executed from code stored on such a storage medium by a general purpose computer, a special purpose processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of memories of the kind described above. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
The methods of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions of the present application are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user device, a core network device, an OAM, or another programmable apparatus.
The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, or tape), an optical medium (e.g., digital video disc), or a semiconductor medium (e.g., solid state disk). The computer readable storage medium may be a volatile or nonvolatile storage medium, or may include both volatile and nonvolatile types of storage media.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (19)

1. A method of reading data, the method comprising:
Acquiring block identifiers of a plurality of file blocks, and determining a file identifier corresponding to each file in the file blocks; the file blocks are data blocks generated by aggregating files in a file data set;
Grouping the block identifiers of the plurality of file blocks to form a plurality of block identifier groups;
Inter-group disorder is carried out on the plurality of block identification groups, and intra-group disorder is carried out on file identifications in at least part of the block identification groups, so that a file identification sequence after disorder is formed;
Reading files corresponding to the file identifications contained in the current batch in batches according to the sequence of the file identifications in the file identification sequence;
Wherein the file identifier is an index of the corresponding file in all files contained in the plurality of file blocks;
The method for performing inter-group disorder on the plurality of block identification groups and performing intra-group disorder on file identifications in at least part of the block identification groups to form a disordered file identification sequence comprises the following steps:
Constructing a two-dimensional array; a first dimension element in the two-dimension array represents a corresponding block identification group, and a second dimension element corresponding to the first dimension element represents a file identification in the corresponding block identification group;
the first dimension elements in the two-dimensional array are disordered, and at least part of the second dimension elements corresponding to the first dimension elements are disordered;
According to the sequence of the disordered elements in the two-dimensional array, converting the disordered two-dimensional array into a one-dimensional global index list; the global index list serves as a file identification sequence.
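By way of illustration only, a minimal Python sketch of the disorder scheme of claim 1 follows. It assumes the two-dimensional array is a plain list of lists of integer file identifications; unlike the claim, which permits intra-group disorder on only part of the groups, it shuffles every group.

```python
import random


def shuffle_file_ids(groups):
    """groups[i] is the list of file identifications of block identification group i."""
    random.shuffle(groups)       # inter-group disorder (first dimension elements)
    for group in groups:
        random.shuffle(group)    # intra-group disorder (second dimension elements)
    # flatten the disordered two-dimensional array into the one-dimensional
    # global index list, which serves as the file identification sequence
    return [file_id for group in groups for file_id in group]


# e.g. shuffle_file_ids([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
# might yield [9, 8, 11, 10, 1, 3, 0, 2, 5, 4, 7, 6]: group order and
# within-group order change, but identifiers stay with their groups
```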
2. The method of claim 1, wherein the batch read file is used for model training, and the method further comprises:
Generating a file list containing description data of a plurality of files by taking the block identification group as a unit in a first iteration process of model training;
the description data in the file list is disordered, and the disordered file list is formed into a global file list comprising the description data in a plurality of file lists according to the sequence of the block identification groups;
And the position index of the description data of the file in the global file list is used as the index of the corresponding file.
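A minimal sketch of this first-iteration bookkeeping, assuming a hypothetical helper read_description_data that returns the description data of the files covered by one block identification group:

```python
import random


def build_global_file_list(block_id_groups, read_description_data):
    global_file_list = []
    for group in block_id_groups:
        file_list = read_description_data(group)  # file list for this group
        random.shuffle(file_list)                 # disorder the description data
        global_file_list.extend(file_list)        # concatenate in group order
    # the position index of an entry in global_file_list is the file's index
    return global_file_list
```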
3. The method of claim 1, wherein grouping the block identifications of the plurality of file blocks to form a plurality of block identification groups comprises:
determining the grouping size according to the number of files in the file block; the number of files in the file block and the grouping size are in a negative correlation;
And grouping the block identifications of the plurality of file blocks according to the grouping size to form a plurality of block identification groups.
4. The method of claim 3, wherein:
The number of files in the file block is the number of files aggregated under the limit of the size of the file block, and the grouping size is a size set according to the size of the file data set;
Or the number of files in the file block is 2^m and the grouping size is 2^n, where m and n are integers.
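A minimal sketch of the power-of-two variant of claims 3 and 4; the exponent n (and its negative correlation with the per-block file count 2^m) is assumed to be chosen elsewhere:

```python
def group_block_ids(block_ids, n):
    group_size = 2 ** n  # grouping size, a power of two
    return [block_ids[i:i + group_size]
            for i in range(0, len(block_ids), group_size)]


# e.g. group_block_ids(list(range(10)), 2) -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```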
5. The method of claim 1, wherein the reading a file corresponding to a file identification contained in the current batch comprises:
addressing the file corresponding to the file identifier contained in the current batch, and determining corresponding addressing information;
And reading the corresponding file from the hit file block according to the addressing information.
6. The method of claim 5, wherein the addressing the file corresponding to the file identifier contained in the current batch to determine the corresponding addressing information comprises:
determining a target file hit by a file identification contained in a current batch, and generating an addressing information object of the target file; the addressing information object includes location information of the target file in a corresponding file block.
7. The method of claim 6, wherein generating the addressing information object for the target file comprises:
Generating addressing information objects of a plurality of target files in batches in a batch addressing mode;
Or addressing is carried out according to the constructed global dictionary, and addressing information objects of a plurality of target files are determined; the global dictionary is used for maintaining addressing information objects of all files of the file block where the target file is located.
8. The method of claim 7, wherein the batch generating addressing information objects for a plurality of the target files in a batch addressing manner comprises:
Determining a target file block hit by a file identifier contained in a current batch, and creating a block object of the target file block;
Traversing the header information of the file in the target file block, and judging whether the header information of the file is matched with the file identifier contained in the current batch;
When the header information of the file is matched with the file identification contained in the current batch, determining the position information of the file in the target file block according to the header information of the file; the location information includes a handle of the file, a location in the target file block, and an offset;
Generating an addressing information object of the matched file according to the position information, and returning the addressing information object to the block object;
And stopping traversing the target file block when the number of the addressing information objects in the block object is consistent with the number of the file identifications belonging to the target file block in the file identifications contained in the current batch.
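A minimal sketch of this batch addressing mode; block.iter_headers() is a hypothetical iterator over the header information of the files in the target file block, yielding (file_id, handle, position, offset) tuples:

```python
def batch_address(block, wanted_ids):
    block_object = {}           # file_id -> addressing information object
    wanted = set(wanted_ids)    # file identifications of the current batch
    for file_id, handle, position, offset in block.iter_headers():
        if file_id in wanted:   # header matches a wanted file identification
            block_object[file_id] = (handle, position, offset)
            if len(block_object) == len(wanted):
                break           # all wanted files found: stop traversing
    return block_object
```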
9. The method of claim 7, wherein said addressing according to the constructed global dictionary determines addressing information objects for a plurality of said target files, comprising:
judging whether a target file block hit by a file identifier contained in the current batch is hit for the first time;
Creating a block object of the target file block under the condition of hitting the target file block for the first time;
Traversing the header information of all files in the target file block, and determining the position information of the files in the target file block according to the header information of the files; the location information includes a handle of the file, a location in the target file block, and an offset;
determining addressing information objects of all files according to the position information, and returning the addressing information objects of all files to the block object;
The block object is put into a global dictionary, and an addressing information object of a target file corresponding to a file identifier contained in the current batch is read from the global dictionary;
And under the condition that the target file block is not hit for the first time, reading the addressing information object of the target file corresponding to the file identification contained in the current batch from the global dictionary.
10. The method as recited in claim 9, further comprising:
deleting the addressing information object corresponding to the read target file after the target file corresponding to the addressing information object in the global dictionary is read;
And deleting the corresponding block object after the target file corresponding to any addressing information object in the block objects is read.
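A minimal sketch of the global dictionary mode of claims 9 and 10; address_all_files is a hypothetical helper that traverses a block once and returns the addressing information objects of all its files:

```python
class GlobalDictionary:
    def __init__(self, address_all_files):
        self.address_all_files = address_all_files
        self.blocks = {}  # block_id -> {file_id: addressing information object}

    def consume(self, block_id, file_id):
        if block_id not in self.blocks:  # first hit: address the whole block
            self.blocks[block_id] = self.address_all_files(block_id)
        addr = self.blocks[block_id].pop(file_id)  # delete the read object
        if not self.blocks[block_id]:
            del self.blocks[block_id]  # drop the block object once emptied
        return addr
```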
11. The method of claim 7, wherein generating the addressing information object for the target file comprises:
Determining a first number of files in a target file block hit by a file identifier contained in a current batch, and determining a second number of target files belonging to the target file block hit by the file identifier contained in the current batch;
Generating addressing information objects of a plurality of target files in a batch addressing mode under the condition that the first number exceeds a first threshold value and the second number exceeds a second threshold value;
and under the condition that the first quantity does not exceed a first threshold value and the second quantity does not exceed a second threshold value, addressing is carried out according to the constructed global dictionary, and addressing information objects of a plurality of target files are determined.
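A minimal sketch of this selection rule; the threshold values are illustrative assumptions, and the mixed cases the claim leaves open default to the global dictionary here:

```python
FIRST_THRESHOLD = 4096   # assumed: files in the hit target file block
SECOND_THRESHOLD = 256   # assumed: wanted target files inside that block


def choose_addressing_mode(first_number, second_number):
    if first_number > FIRST_THRESHOLD and second_number > SECOND_THRESHOLD:
        return "batch_addressing"   # traverse headers for this batch only
    return "global_dictionary"      # address the whole block once and cache
```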
12. The method of claim 5, wherein the addressing the file corresponding to the file identifier contained in the current batch to determine the corresponding addressing information comprises:
determining a file block to which a target file hit by the current batch belongs according to a file identifier contained by the current batch;
grouping target files belonging to the same file block into a group, and generating a target file list containing at least one file identifier corresponding to the target file;
And respectively determining addressing information of the target files corresponding to each target file list by taking the target file list as a unit.
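A minimal sketch of this per-block grouping; block_of is a hypothetical mapping from file identification to block identification:

```python
from collections import defaultdict


def group_by_block(batch_file_ids, block_of):
    target_file_lists = defaultdict(list)
    for file_id in batch_file_ids:
        target_file_lists[block_of[file_id]].append(file_id)
    return dict(target_file_lists)  # block_id -> target file list
```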
13. The method as recited in claim 12, further comprising:
starting a plurality of working processes, and reading files of corresponding batches based on the working processes; different working processes are used for reading files of different batches, and after the working process reads the files of the current batch, the working process is used for reading the files of the next batch until the file data set is read;
And in the file reading process of the current batch, starting working threads matched with the number of the hit file blocks, wherein each working thread is used for determining addressing information of the files in the corresponding file blocks and reading the corresponding files.
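A minimal sketch of the per-block working threads of claim 13, using Python's ThreadPoolExecutor; the hypothetical read_from_block callable addresses and reads the files of one block, and the surrounding working processes (one batch each) are omitted:

```python
from concurrent.futures import ThreadPoolExecutor


def read_batch(target_file_lists, read_from_block):
    # one working thread per hit file block
    with ThreadPoolExecutor(max_workers=max(1, len(target_file_lists))) as pool:
        futures = {block_id: pool.submit(read_from_block, block_id, ids)
                   for block_id, ids in target_file_lists.items()}
        return {block_id: fut.result() for block_id, fut in futures.items()}
```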
14. The method of claim 5, wherein said reading the corresponding file from the hit file block according to the addressing information comprises:
judging whether a target file block hit by a file identifier contained in the current batch is hit for the first time;
Under the condition of hitting the target file block for the first time, caching the target file block, and reading a corresponding target file from the cached target file block according to the addressing information;
and under the condition that the target file block is not hit for the first time, reading the corresponding target file from the cached target file block according to the addressing information.
15. The method of claim 14, wherein the reading the corresponding target file from the cached target file block according to the addressing information comprises:
Transmitting an addressing information object of the target file to an image reading function to generate a picture file object under the condition that the target file is picture data; reading corresponding picture data according to the picture file object, and performing format conversion;
And under the condition that the target file is text data, corresponding text data is read out according to the addressing information object of the target file, and format conversion is carried out.
16. A data reading apparatus, the apparatus comprising:
The acquisition module is used for acquiring block identifiers of a plurality of file blocks and determining file identifiers corresponding to each file in the file blocks; the file blocks are data blocks generated by aggregating files in a file data set;
The grouping module is used for grouping the block identifiers of the plurality of file blocks to form a plurality of block identifier groups;
The disorder module is used for carrying out inter-group disorder on the plurality of block identification groups and carrying out intra-group disorder on file identifications in at least part of the block identification groups to form a disordered file identification sequence;
The reading module is used for reading the files corresponding to the file identifications contained in the current batch in batches according to the sequence of the file identifications in the file identification sequence;
Wherein the file identifier is an index of the corresponding file in all files contained in the plurality of file blocks;
The disorder module performs inter-group disorder on the plurality of block identification groups and intra-group disorder on file identifications in at least part of the block identification groups to form a disordered file identification sequence, including:
Constructing a two-dimensional array; a first dimension element in the two-dimension array represents a corresponding block identification group, and a second dimension element corresponding to the first dimension element represents a file identification in the corresponding block identification group;
the first dimension elements in the two-dimensional array are disordered, and at least part of the second dimension elements corresponding to the first dimension elements are disordered;
According to the sequence of the disordered elements in the two-dimensional array, converting the disordered two-dimensional array into a one-dimensional global index list; the global index list serves as a file identification sequence.
17. A computer device, comprising:
A memory and a processor in communication with each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the data reading method of any of claims 1 to 15.
18. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the data reading method of any one of claims 1 to 15.
19. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the data reading method of any of claims 1 to 15.
CN202410129220.8A 2024-01-30 2024-01-30 Data reading method, device, computer equipment and storage medium Active CN117667853B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410129220.8A CN117667853B (en) 2024-01-30 2024-01-30 Data reading method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN117667853A (en) 2024-03-08
CN117667853B (en) 2024-05-03

Family

ID=90079225


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140025712A1 (en) * 2012-07-19 2014-01-23 Microsoft Corporation Global Recently Used Files List
CN109597903A (en) * 2018-11-21 2019-04-09 北京市商汤科技开发有限公司 Image file processing apparatus and method, document storage system and storage medium
US20200245040A1 (en) * 2019-01-25 2020-07-30 International Business Machines Corporation Securing and segmental sharing of multimedia files
CN114968942A (en) * 2022-06-15 2022-08-30 每平每屋(上海)科技有限公司 Data processing method, prediction method, device, storage medium and program product
CN116185308A (en) * 2023-04-25 2023-05-30 山东英信计算机技术有限公司 Data set processing method, device, equipment, medium and model training system



Similar Documents

Publication Publication Date Title
US20230126005A1 (en) Consistent filtering of machine learning data
US10048937B2 (en) Flash optimized columnar data layout and data access algorithms for big data query engines
US20180081798A1 (en) System and method for executing data processing tasks using resilient distributed datasets (rdds) in a storage device
US8949224B2 (en) Efficient query processing using histograms in a columnar database
US10061834B1 (en) Incremental out-of-place updates for datasets in data stores
US20150379072A1 (en) Input processing for machine learning
US20170177597A1 (en) Biological data systems
CN109189995B (en) Data redundancy elimination method in cloud storage based on MPI
US20160239527A1 (en) Systems, apparatuses, methods, and computer readable media for processing and analyzing big data using columnar index data format
CN116185308B (en) Data set processing method, device, equipment, medium and model training system
Hu et al. A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data
Merceedi et al. A comprehensive survey for hadoop distributed file system
Su et al. Effective and efficient data sampling using bitmap indices
Aggarwal et al. Small files’ problem in Hadoop: A systematic literature review
CN112965939A (en) File merging method, device and equipment
CN117667853B (en) Data reading method, device, computer equipment and storage medium
Akhtar et al. Parallel processing of image segmentation data using Hadoop
Nguyen et al. An efficient similar image search framework for large-scale data on cloud
US20220300321A1 (en) Data pipeline
JP2022053542A (en) Computer system, computer program and computer-implemented method (workload-driven database reorganization)
Karunarathna et al. Scalable graph convolutional network based link prediction on a distributed graph database server
Park et al. KV-CSD: A Hardware-Accelerated Key-Value Store for Data-Intensive Applications
CN113448957A (en) Data query method and device
CN114911886B (en) Remote sensing data slicing method and device and cloud server
Shen et al. A unified storage system for whole-time-range data analytics over unbounded data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant