CN116483748A - Efficient data loading method, equipment and storage medium based on deep learning - Google Patents
- Publication number
- CN116483748A (application number CN202310267674.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- field
- cache
- loading
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0877—Cache access modes
- G06F12/0884—Parallel mode, e.g. in parallel with main memory or CPU
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an efficient deep-learning-based data loading method, together with equipment and a storage medium. The method comprises: opening each small file in binary read mode, appending the binary data of each small file to a data block in sequence, and generating the data block's data structure; starting data loading by data block, and caching all data in memory for reuse when the first epoch loads its first miniBatch; thereafter reusing half of the previously loaded data and loading the other half of the required data from disk for each miniBatch; integrating the data loaded from disk with the data taken from the cache, sending the combined data to the GPU for training, and proceeding to the next miniBatch; and deleting accessed data from the cache to release memory, so that the cache fits within a limited memory size. By compressing small files into data blocks and designing a reasonable caching strategy, the invention greatly reduces data loading time and overall running time without affecting accuracy.
Description
Technical Field
The invention relates to the technical field of Internet of Things communication, and in particular to an efficient data loading method based on deep learning, as well as computer equipment and a storage medium for implementing the method.
Background
Deep Neural Networks (DNNs) have achieved excellent results in a wide range of applications such as image classification, speech recognition, autonomous driving, cancer detection, and playing complex games. However, DNN training requires repeatedly reading and processing large datasets to achieve high accuracy, making it both data and computation intensive. Improving runtime performance without sacrificing accuracy or increasing hardware cost is therefore critical to DNN training. In DNN training, a portion of the data must first be loaded before the AI accelerator (GPU) can start computing on it, so unbalanced I/O and compute performance degrades the whole training process. Because the I/O performance of the disk that typically stores the dataset is limited, each training process must wait for data to be delivered to the accelerator's computing units. As GPUs have greatly improved compute performance, the performance of data loading has become critical to the overall runtime of deep neural network training. Especially when data is loaded from a single hard drive to train on multiple GPUs, data loading takes significant time and becomes the key bottleneck limiting overall training performance.
Accelerating DNN training has been studied from many angles. On the hardware side, specialized accelerators such as FPGAs and GPUs, regarded as artificial intelligence accelerators, are typically employed and optimized for DNN algorithms. On the software side, machine learning libraries such as TensorFlow and PyTorch optimize computational tasks and accelerate data loading through prefetching and multithreading. Data loading is one of the main time-consuming operations in DNN training and has mainly been studied and optimized through data parallelism and caching.
Data parallelism is a common technique for improving data loading efficiency when training deep neural networks. The work "Importance of data loading pipeline in training deep neural networks" points out two main problems with data loading. First, reading data directly from individual files is inefficient because of the frequent open, read, and close operations. Second, allocating resources for additional operations on the data can overload the CPU. It suggests using binary data formats such as HDF5 and TensorFlow's TFRecord to achieve better data loading performance, and it shows that the improvement obtainable by using more threads to perform data loading, e.g., in PyTorch, is limited.
Since the IOPS and throughput of a disk may not meet data loading performance requirements, caching is a strategy commonly used in large-scale systems, such as Memcached and RAMCloud, to accelerate data access.
The disadvantages of the prior art include the following:
As shown in FIG. 1, data loading in deep learning training must read from disk and then perform open, read, and close operations on each file, while current dataset sizes reach hundreds of GB with tens of thousands of files. If every epoch operates on tens of thousands of files, disk I/O becomes a significant expense. Moreover, obtaining good accuracy requires training for tens to hundreds of epochs, so the cumulative data loading time is very large.
In terms of data parallelism, the data pipeline is a fine-grained data loading technique for reducing the I/O overhead of training deep neural networks. It uses multithreading to perform data loading while training tasks run on multiple GPUs. Although data formats like HDF5 can support data parallelism, files in HDF5 are merely compressed: open, read, and close operations are still required for each file during data loading, so little disk I/O overhead is eliminated.
In terms of caching, fast data loading can be achieved if the dataset is cached entirely in memory, but for some large datasets this requires hundreds of GB or even TB of memory. A reasonable caching strategy must therefore be designed: data must be evicted from the cache sensibly, and the model must not be trained on the same cached data for too long, otherwise the final accuracy of model training is affected.
Disclosure of Invention
The invention aims to provide an efficient data loading method, equipment, and a storage medium based on deep learning, to solve the prior-art problems of long data loading time, high disk I/O overhead, and reduced model training accuracy.
In a first aspect, the present invention provides a method for efficient data loading based on deep learning, the method comprising the steps of: opening each small file in a binary reading mode, sequentially adding binary data of each small file into a data block, and generating a data structure of the data block;
starting data loading of a data block, and caching all data in a memory for reuse when a first epoch loads a first miniBatch;
half of the loaded data is repeatedly used, and half of the needed data is loaded in the disk every other miniBatch;
integrating the data loaded from the disk and the cache, then sending the integrated data to the GPU for training, and loading the next miniBatch;
and deleting the access data in the cache to release the memory, so that the cache can adapt to the limited memory size.
According to the invention, the data structure comprises an N field, an Offset Field, a Size Field, a Label Field, and a Raw Data Field, wherein N indicates that the data block integrates N small files; the Offset Field contains N integers indexed from 1 to N, each integer describing in order the starting position of each file within the block's raw data; the Size Field contains N integers describing in order the size of each small file in the block, each size taking a fixed number of bytes; the Label Field contains N integers that represent, in order, the label of each file; and the Raw Data Field contains the raw data of the N small files. The fields and data in the data block are all stored in binary form.
According to the efficient deep-learning-based data loading method, when a data block is loaded, the number of data items in the block is obtained by reading the first n bytes; the raw data of a specific data item can then be located directly through the Offset Field and Size Field, and the label of that data found in the Label Field.
According to the efficient deep-learning-based data loading method of the present invention, locating the raw data of a specific data item through the Offset Field and the Size Field comprises:
the first file has an offset of zero and a size given by the first integer in the Size Field; the second file has an offset given by the second integer in the Offset Field and a size given by the second integer in the Size Field;
from the number of bytes required by each field in the data block, the extra bytes needed to compress N small files into one data block are calculated, and each small file can be located precisely within the data block.
According to the efficient deep-learning-based data loading method, the Offset Field stores in sequence the starting positions of small files 1 to N within the Raw Data Field. To find the m-th small file in a data block, its starting position is read from the Offset Field and its size from the Size Field;
the start and end positions of the m-th file's binary data within the Raw Data Field are thereby determined, achieving precise positioning of the small file within the block.
According to the efficient deep-learning-based data loading method provided by the invention, the following steps are further executed:
in the training process, each data block serves as the unit of shuffling, each data block having a unique number as its index (for example 0, 1, 2, and 3), and the set of indices of all data blocks is the input to the shuffle algorithm;
once the index set has been permuted, the blocks are loaded for training one by one in the permuted index order.
According to the efficient deep-learning-based data loading method provided by the invention, reusing half of the loaded data while loading the other half of the required data from disk for each miniBatch comprises:
when data exists in the cache, half of each miniBatch is taken from disk and half from the cache, tracked by counters; when the counters for data loaded from disk and from the cache each equal miniBatch/2, loading stops; the data loaded from disk and from the cache is then integrated and sent together to the GPU for training, the counters are reset to 0, and the next miniBatch is loaded.
According to the efficient deep-learning-based data loading method, when data loading is executed:
data is loaded by reading data blocks;
when data is loaded from disk, the data block structure is used to load only half of the files for a miniBatch, and the other half is obtained from the in-memory cache;
getCachedData() immediately evicts each accessed file from the cache, avoiding excessive reuse of cached data;
for the first getCachedData() call in deep learning training there is no data in the cache, so the data is loaded by disk I/O; thereafter the cache holds data that getCachedData() can return. After a miniBatch is obtained, the files loaded from disk are put into the cache through putCachedData().
It can be seen that the invention has the following beneficial effects over the prior art:
1. By designing the data block structure, the invention reduces open, read, and close operations on files, which greatly reduces disk I/O operations, improves I/O efficiency, and shortens data loading time.
2. Through a reasonable caching strategy, the cache is used sensibly without affecting final accuracy: operations that load data from disk are reduced, the time spent waiting for data before model training drops sharply, the server is better utilized for deep learning training, and both data loading time and overall training time are greatly reduced.
In a second aspect, the present invention also provides an electronic device, including:
a memory storing computer executable instructions;
a processor configured to execute the computer-executable instructions,
wherein the computer executable instructions, when executed by the processor, implement the steps of any of the deep learning based efficient data loading methods described above.
In a third aspect, the present invention also provides a storage medium having stored thereon a computer program for implementing the steps of any one of the above-described deep learning-based efficient data loading methods when executed by a processor.
It can be seen that the invention also provides an electronic device and a storage medium for the efficient deep-learning-based data loading method, comprising one or more memories and one or more processors. The memory stores the program code, intermediate data generated while the program runs, model outputs, and model parameters; the processor supplies the processing resources needed to run the code and to train the model.
The invention is described in further detail below with reference to the drawings and the detailed description.
Drawings
FIG. 1 is a schematic diagram of a data loading process in prior art deep learning.
FIG. 2 is a flow chart of an embodiment of a deep learning based efficient data loading method of the present invention.
Fig. 3 is a schematic structural diagram of compressing a small file into a data block according to an embodiment of a deep learning-based efficient data loading method of the present invention.
FIG. 4 is a schematic diagram of a data structure related to compressing small files into data blocks in an embodiment of a deep learning-based efficient data loading method according to the present invention.
FIG. 5 is a schematic diagram of a deep learning-based efficient data loading method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an algorithm for loading and caching data with high efficiency in an embodiment of a method for loading data with high efficiency based on deep learning according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 2, the invention provides a high-efficiency data loading method based on deep learning, which comprises the following steps:
step S1, opening each small file in a binary reading mode, and sequentially adding binary data of each small file into a data block to generate a data structure of the data block;
step S2, starting data loading of the data block, and caching all data in a memory for reuse when a first epoch loads a first miniBatch;
step S3, half of the loaded data is repeatedly used, and half of the needed data is loaded in the disk every other miniBatch;
s4, integrating the data loaded from the disk and the cache, then sending the integrated data to the GPU for training, and loading the next miniBatch;
and S5, deleting the access data in the cache to release the memory, so that the cache can adapt to the limited memory size.
In this embodiment, the data structure includes an N field, an Offset Field, a Size Field, a Label Field, and a Raw Data Field, the N field indicating that the data block is an integration of N small files; the Offset Field contains N integers indexed from 1 to N, each integer describing in order the starting position of each file within the block's raw data; the Size Field contains N integers describing in order the size of each small file in the block; the Label Field contains N integers that represent, in order, the label of each file; and the Raw Data Field contains the raw data of the N small files. The fields and data in the data block are stored in binary form.
In this embodiment, upon loading a data block, the number of data items in it is obtained by reading the first 4 bytes; the raw data of a particular data item can be located directly through the Offset Field and Size Field, and its label found in the Label Field.
In this embodiment, locating the raw data of a specific data item through the Offset Field and the Size Field includes:
the first file has an offset of zero and a size given by the first integer in the Size Field; the second file has an offset given by the second integer in the Offset Field and a size given by the second integer in the Size Field;
from the number of bytes required by each field in the data block, the extra bytes needed to compress N small files into one data block are calculated, and each small file can be located precisely within the data block.
The Offset Field stores in sequence the starting positions of small files 1 to N within the Raw Data Field. To find the m-th small file in a data block, its starting position is read from the Offset Field and its size from the Size Field;
the start and end positions of the m-th file's binary data within the Raw Data Field are thereby determined, achieving precise positioning of the small file within the block.
In this embodiment, the Offset Field of a block records the starting position of each small file's binary data in the block; one block contains N small files, and the Offset Field stores the starting positions of files 1 to N in the Raw Data Field in sequence, while the Size Field records each file's size. For example, to find the 5th small file in a block: its starting position is read from the Offset Field, its size (say 100 KB) from the Size Field, and together these determine the start and end positions of the 5th file's binary data in the Raw Data Field, precisely locating the file within the block.
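The block layout and lookup described above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the function names `pack_block` and `read_item` are hypothetical, and 4-byte little-endian unsigned integers are assumed for the N, offset, size, and label fields, consistent with the 4-bytes-per-integer figures given later in the embodiment.

```python
import struct

def pack_block(files):
    """Pack a list of (label, raw_bytes) pairs into one binary block.

    Assumed layout (all integers little-endian uint32):
      [N][offset_1..offset_N][size_1..size_N][label_1..label_N][raw data]
    Offsets are relative to the start of the Raw Data Field.
    """
    n = len(files)
    sizes = [len(raw) for _, raw in files]
    offsets, pos = [], 0
    for s in sizes:
        offsets.append(pos)
        pos += s
    header = struct.pack(f"<{1 + 3 * n}I", n, *offsets, *sizes,
                         *(label for label, _ in files))
    return header + b"".join(raw for _, raw in files)

def read_item(block, m):
    """Return (label, raw_bytes) of the m-th file (1-indexed) in a block."""
    (n,) = struct.unpack_from("<I", block, 0)
    (offset,) = struct.unpack_from("<I", block, 4 + 4 * (m - 1))
    (size,)   = struct.unpack_from("<I", block, 4 + 4 * n + 4 * (m - 1))
    (label,)  = struct.unpack_from("<I", block, 4 + 8 * n + 4 * (m - 1))
    start = 4 + 12 * n + offset          # raw data begins after the header
    return label, block[start:start + size]
```

With this layout, locating file m costs three fixed-offset header reads plus one slice, with no per-file open/read/close.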
In the training process, each data block serves as the unit of shuffling, each data block having a unique number as its index (for example 0, 1, 2, and 3), and the set of indices of all data blocks is the input to the shuffle algorithm;
once the index set has been permuted, the blocks are loaded for training one by one in the permuted index order.
In this embodiment, reusing half of the loaded data while loading the other half of the required data from disk for each miniBatch includes:
Data loading for a miniBatch ends when the number of loaded items reaches the miniBatch size, after which the data is sent to the GPU for training. When data exists in the cache (which is the case for every miniBatch except the first miniBatch of the first epoch), two counters, both initialized to 0, track how many items have been taken from disk and from the cache respectively. When each counter equals miniBatch/2, loading stops; the data loaded from disk and from the cache is integrated and sent together to the GPU for training, and both counters are reset to 0 for loading the next miniBatch.
In this embodiment, when data loading is performed:
data is loaded by reading data blocks;
when data is loaded from disk, the data block structure is used to load only half of the files for a miniBatch, and the other half is obtained from the in-memory cache;
getCachedData() immediately evicts each accessed file from the cache, avoiding excessive reuse of cached data;
for the first getCachedData() call in deep learning training there is no data in the cache, so the data is loaded by disk I/O; thereafter the cache holds data that getCachedData() can return. After a miniBatch is obtained, the files loaded from disk are put into the cache through putCachedData().
Specifically, this embodiment focuses on two aspects: designing the data structure of an efficient data loading block, and designing a reasonable caching strategy.
First, to increase I/O efficiency while reducing the number of threads needed to perform disk reads, this embodiment compresses small files into blocks. Reading many files from a single block greatly reduces the number of open, read, and close operations, and aggregating multiple files into one block improves read throughput by reducing I/O seeks on the disk.
Second, to cache loaded data within a limited memory size, loaded data is reused for efficient loading while the cache is kept small enough to fit in memory. This embodiment uses two strategies. First, when the first epoch loads the first miniBatch, all of its data is cached in memory for reuse; thereafter half of each miniBatch is reused from the cache and only the other half is loaded from disk. In this way the required disk throughput and the time needed to load data are both halved, and each file is accessed the same number of times. Second, because memory is limited, accessed data is deleted from the cache to free memory for loading other files in the dataset.
Design for efficient data loading block:
As shown in fig. 3, 256 small files are read into one block: each small file is first opened in binary read mode, its binary data is appended to the block in sequence, and the remaining data fields are then laid out according to fig. 4.
Designing the data structure of a data block is not just a matter of putting each small file into one block in binary form; the specific location of each file within the block must also be known. As shown in fig. 4, which illustrates how the data structure within one block is designed, the value of N can be set to a multiple of the batch size.
As shown in fig. 4, the data structure of this embodiment comprises five fields. The first field, N, indicates that the data block integrates N small files; in this embodiment N is 256, so only 4 bytes are needed to represent it. The second field, the Offset Field, contains N integers indexed from 1 to N, each describing in order the starting position of a file within the block's raw data; each offset takes 4 bytes, so N files require 4N bytes. The Size Field contains N integers describing in order the size of each small file; each size takes 4 bytes, for 4N bytes in total. The Label Field likewise contains N integers representing the label of each file in turn, consuming 4 bytes per file and 4N bytes overall. The Raw Data Field contains the raw data of the N small files, stored unchanged in size. All fields and data in the block are stored in binary form. This data structure not only packs multiple small files into one block but also supports data positioning: the Offset Field and Size Field together locate one file, or a subset of files, within the block, which provides effective support for later adding data to the cache. For example, the offset of the first file is zero and its size is given by the first integer in the Size Field; the offset of the second file is given by the second integer in the Offset Field and its size by the second integer in the Size Field.
From the number of bytes required by each field, compressing N small files into one block requires at most (12N+4) extra bytes. When N is 256, a data block is only 3076 bytes, about 3 KB, larger than the original small files; since each small file is about 100 KB and the 256 files total about 25600 KB, the extra 3 KB is negligible, roughly one ten-thousandth of the block size. The block is therefore barely larger than the original files and causes no excessive overhead in system resources.
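The overhead figures above can be checked with a few lines of arithmetic. This sketch assumes, as in the embodiment, 4 bytes per integer in the N, Offset, Size, and Label fields and an average file size of 100 KB:

```python
n = 256                          # small files per block (embodiment's value)
extra_bytes = 12 * n + 4         # Offset + Size + Label fields (4N each) + N field (4 bytes)
print(extra_bytes)               # metadata bytes added per block

avg_file_bytes = 100 * 1024      # each small file is about 100 KB
block_bytes = n * avg_file_bytes # about 25600 KB of raw data per block
overhead = extra_bytes / block_bytes
print(overhead)                  # fractional overhead, on the order of 1e-4
```

The metadata comes to 3076 bytes, a fraction of roughly 1.2 × 10⁻⁴ of the raw data, confirming that the extra size is negligible.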
Apart from the Raw Data Field, the other regions of the data structure in FIG. 4 can be regarded as metadata for each data item; the invention can therefore extend these fields to accommodate different types of training datasets. When a block is loaded, reading the first four bytes reveals how many data items the block contains. Through the Offset Field and Size Field, the raw data of a particular data item can be located directly and its label found in the Label Field. Because the entire data structure is loaded into main memory together, reading each file in a block is very efficient.
By compressing every N small files into one data block, the present embodiment reduces the number of file open and close operations when reading multiple files to 1/N of the baseline algorithm version. When small files are read individually, performance is mainly limited by the IOPS of the disk; when N small files are integrated into one data block, performance improves and the disk throughput limitation is also alleviated.
An important operation during training is data shuffling, which avoids overfitting during deep learning training. Since each data item is read independently in the baseline algorithm version, shuffling can easily be arranged in units of individual images. However, this shuffling method is not suitable once N small files are compressed into one block: many blocks would have to be loaded just to read N files, which results in inefficient data loading.
In order to perform shuffling while maintaining efficient loading performance, the present embodiment uses each data block as the unit of shuffling. As shown in fig. 5, each block has a unique number as an index, here 0, 1, 2, and 3. The index set of all blocks is the input to the shuffle algorithm. Once the set of indices has been permuted, training can proceed block by block in the permuted index order, i.e., 3, 1, 0, and 2.
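The block-level shuffle described above can be sketched as follows; the function name is hypothetical, and the permutation 3, 1, 0, 2 of fig. 5 is just one possible outcome:

```python
import random

def block_shuffle(block_indices, seed=None):
    """Permute the training order at block granularity, not per image."""
    order = list(block_indices)
    random.Random(seed).shuffle(order)
    return order

blocks = [0, 1, 2, 3]            # one unique index per data block
order = block_shuffle(blocks)
assert sorted(order) == blocks   # every block is still visited exactly once
# Training then loads blocks one by one in the shuffled order, e.g. [3, 1, 0, 2].
```

Shuffling indices instead of data keeps the shuffle itself O(number of blocks) and leaves each block's on-disk layout untouched.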
For these reasons, the present embodiment compresses every N files into one block to achieve efficient data loading, while shuffling in units of blocks during training to avoid the problem of overfitting.
For the design of the caching strategy:
as shown in fig. 6, the algorithm in fig. 6 demonstrates how a cache is used to obtain efficient data loading when the optimization strategy of the present invention is applied. When line 6 of the algorithm loads data from disk, only half of the files of a miniBatch are loaded; the other half is obtained from the cache in line 7. getCachedData() immediately evicts each accessed file from the cache. Note that for the first getCachedData() call there is no data in the cache, so disk I/O is performed to load the data; thereafter the cache holds data that getCachedData() can return. After line 8 assembles a miniBatch, line 9 places the files loaded from disk into the cache.
Therefore, when the first miniBatch of the first epoch is loaded, the cache contains no data, so the data of the first miniBatch is loaded entirely from disk: getCachedData() on line 7 returns nothing, and the full miniBatch assembled on line 8 comes from disk. The data just loaded from disk is then put into the cache, so the cache now holds one miniBatch of data. For each subsequent miniBatch, half of the data is loaded from disk and half from the cache; the data taken from the cache is deleted from the cache, and the data just loaded from disk is placed into the cache.
In the above algorithm, which uses memory as a cache, the present invention caches loaded data for reuse. At the same time, data that has been accessed in the cache is deleted, so that the cache can adapt to a limited memory size. Because only half of the files of each miniBatch need to be loaded from disk, the loading time can be reduced by half.
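A minimal sketch of this caching strategy, under the assumption that `load_from_disk` stands in for the disk I/O of the algorithm's line 6 and that `get_cached_data`/`put_cached_data` behave as described (evict on access, insert after each miniBatch); the class and function names are illustrative, not the patent's implementation:

```python
from collections import deque

class MiniBatchCache:
    """Half of each miniBatch is reused from cache; accessed items are evicted."""
    def __init__(self):
        self._items = deque()

    def get_cached_data(self, count):
        # Pop (and thereby evict) up to `count` cached items.
        out = []
        while self._items and len(out) < count:
            out.append(self._items.popleft())
        return out

    def put_cached_data(self, items):
        self._items.extend(items)

def load_minibatch(load_from_disk, cache, batch_size):
    half = batch_size // 2
    cached = cache.get_cached_data(half)                   # reuse cached half
    from_disk = load_from_disk(batch_size - len(cached))   # load the remainder
    cache.put_cached_data(from_disk)                       # cache the disk half
    return cached + from_disk                              # full miniBatch

# First miniBatch: the cache is empty, so everything comes from disk;
# every later miniBatch reads only half of its files from disk.
cache = MiniBatchCache()
disk = lambda k: [f"item{i}" for i in range(k)]
b1 = load_minibatch(disk, cache, 4)
b2 = load_minibatch(disk, cache, 4)
assert len(b1) == 4 and len(b2) == 4
```

Evicting on access keeps the cache bounded at one miniBatch of data, which is how the strategy adapts to a limited memory size.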
In summary, since data loading dominates the running time, the embodiment achieves efficient data loading by compressing small files into blocks and reducing disk I/O operations, thereby reducing the overall running time. With a reasonably designed caching strategy, the data loading time of DNN training can be improved by up to 10 times without affecting accuracy, increasing overall performance.
In one embodiment, an electronic device is provided, which may be a server. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The database of the electronic device is used for storing data. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements an efficient data loading method based on deep learning.
It will be appreciated by those skilled in the art that the electronic device structure shown in this embodiment is merely a partial structure related to the present application and does not constitute a limitation of the electronic device to which the present application is applied, and that a specific electronic device may include more or fewer components than those shown in this embodiment, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration, and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It can be seen that the present invention also provides an electronic device and a storage medium for the deep learning-based efficient data loading method, including one or more memories and one or more processors. The memory is used for storing the program code, the intermediate data generated while the program runs, the output results of the model, and the model parameters; the processor provides the processing resources occupied by running the code, including the processing resources occupied when training the model.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, but any insubstantial changes and substitutions made by those skilled in the art on the basis of the present invention are intended to be within the scope of the present invention as claimed.
Claims (10)
1. A deep learning-based efficient data loading method, characterized by comprising the following steps:
opening each small file in a binary reading mode, sequentially adding binary data of each small file into a data block, and generating a data structure of the data block;
starting data loading of a data block, and caching all data in a memory for reuse when a first epoch loads a first miniBatch;
reusing half of the loaded data, and loading half of the required data from the disk for each subsequent miniBatch;
integrating the data loaded from the disk and the cache, then sending the integrated data to the GPU for training, and loading the next miniBatch;
and deleting the access data in the cache to release the memory, so that the cache can adapt to the limited memory size.
2. The method according to claim 1, characterized in that:
the data structure includes an N field, an Offset Field, a Size Field, a Label Field, and a Raw Data Field, the N field indicating that the data block is an integration of N small files; the Offset Field contains N integers indexed from 1 to N, each integer describing in sequence the starting position of the corresponding file within the Raw Data Field of the block; the Size Field contains N integers describing in sequence the size of each small file in the block, each small file requiring 4 bytes to represent its size; the Label Field contains N integers, which in turn represent the labels of each file; the Raw Data Field contains the raw data of the N small files; wherein the fields and the data in the data block are stored in a binary manner.
3. The method according to claim 2, characterized in that:
when loading a data block, the number of data items in the data block is obtained by reading the first 4 bytes, and through the Offset Field and the Size Field, the raw data of a particular data item can be located directly and the label of that data can be found in the Label Field.
4. A method according to claim 3, wherein locating the original data of a particular data item via the Offset Field and the Size Field comprises:
the first file has an Offset of zero, a Size given by a first integer in the Size Field, and the second file has an Offset given by a second integer in the Offset Field, a Size given by a second integer in the Size Field;
according to the number of bytes required by each field in the data block, the extra bytes required by compressing N small files into one data block in the data block are calculated, and the specific positioning of the small files in the data block is realized.
5. The method according to claim 4, wherein:
in the Offset Field, the starting positions of the 1st to Nth small files in the Raw Data Field are stored in sequence; when locating the mth small file in the data block, its starting position is obtained from the Offset Field and its size from the Size Field;
the start position and end position of the binary data of the mth small file in the Raw Data Field are determined from the starting position in the Offset Field and the file size in the Size Field, thereby achieving specific positioning of the small file within the block.
6. The method of claim 1, further performing:
in the training process, each data block is used as a unit of a shuffle, wherein each data block has a unique number as an index, which is respectively 0, 1, 2 and 3, and the index set of all data blocks is the input of the shuffle algorithm;
when a set of indexes is arranged, the arranged index order is used to train the loading blocks one by one.
7. The method of claim 1, wherein reusing half of the loaded data and loading half of the required data from the disk for each subsequent miniBatch comprises:
when data exists in the cache, half of a miniBatch is taken from the disk and half from the cache, tracked by counters; when the counters for data loaded from the disk and from the cache each equal miniBatch/2, loading stops; the data loaded from the disk and the cache are then integrated and sent together to the GPU for training, the counters are reset to 0, and the next miniBatch is loaded.
8. The method of claim 1, wherein, when performing data loading:
carrying out data loading in a data block reading mode;
when data is loaded from a disk, only half of the files are loaded for miniBatch by utilizing the data structure of the data block, and then the other half of the files are obtained from the memory cache;
for getCachedData (), it will immediately evict the accessed file from the cache, avoiding the excessive reuse of data in the cache;
for the first getCachedData () in the deep learning training, there is no data in the cache, so disk I/O loads the data; thereafter the cache holds data that getCachedData () can return; after one miniBatch is obtained, the files loaded from the disk are put into the cache through putCachedData ().
9. An electronic device, comprising:
a memory storing computer executable instructions;
a processor configured to execute the computer-executable instructions,
wherein the computer executable instructions, when executed by the processor, implement the steps of the deep learning based efficient data loading method as recited in any one of claims 1-8.
10. A storage medium having stored thereon a computer program for implementing the steps of the deep learning based efficient data loading method according to any of claims 1-8 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310267674.7A CN116483748A (en) | 2023-03-14 | 2023-03-14 | Efficient data loading method, equipment and storage medium based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116483748A true CN116483748A (en) | 2023-07-25 |
Family
ID=87210984
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||