CN116483748A - Efficient data loading method, equipment and storage medium based on deep learning - Google Patents

Efficient data loading method, equipment and storage medium based on deep learning

Info

Publication number
CN116483748A
Authority
CN
China
Prior art keywords
data
field
cache
loading
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310267674.7A
Other languages
Chinese (zh)
Inventor
刘成健
史鹏程
李艺鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Technology University
Original Assignee
Shenzhen Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Technology University filed Critical Shenzhen Technology University
Priority to CN202310267674.7A
Publication of CN116483748A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0877 Cache access modes
    • G06F 12/0884 Parallel mode, e.g. in parallel with main memory or CPU
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an efficient data loading method, device and storage medium based on deep learning. The method comprises the steps of: opening each small file in binary read mode, appending the binary data of each small file to a data block in sequence, and generating the data structure of the data block; starting data loading by data block, and caching all data in memory for reuse when the first epoch loads the first miniBatch; for each subsequent miniBatch, reusing half of the previously loaded data and loading only the other half of the required data from disk; integrating the data loaded from the disk and from the cache, sending the integrated data to the GPU for training, and then loading the next miniBatch; and deleting accessed data from the cache to release memory, so that the cache fits within the limited memory size. By compressing small files into data blocks and designing a reasonable caching strategy, the invention greatly reduces the data loading time and the overall running time without affecting accuracy.

Description

Efficient data loading method, equipment and storage medium based on deep learning
Technical Field
The invention relates to the technical field of Internet of Things communication, and in particular to an efficient data loading method based on deep learning, as well as a computer device and a storage medium for implementing the method.
Background
Deep Neural Networks (DNNs) have achieved excellent results in a wide range of applications such as image classification, speech recognition, autonomous driving, cancer detection, and playing complex games. However, DNN training requires repeated reading and processing of large data sets to achieve high accuracy, making it both a data-intensive and a computation-intensive task. Improving runtime performance without sacrificing application accuracy or increasing hardware cost is therefore critical to DNN training. In DNN training, a portion of the data must be loaded first before the AI accelerator (GPU) can start the training computation on the loaded data. Unbalanced I/O and computational performance thus degrade performance throughout the training process. Because the disk that typically stores the data set has limited I/O performance, each training process must wait for data to be supplied to the computing units of the AI accelerator. Since GPUs have greatly improved computational performance, data loading performance is critical to the overall runtime performance of deep neural network training. Especially when loading data from a single hard drive to train on multiple GPUs, data loading takes up significant time and becomes a critical bottleneck limiting overall training performance.
Accelerating DNN training has been studied in many ways. On the hardware side, specialized accelerators such as FPGAs and GPUs, regarded as artificial intelligence accelerators, are typically employed and optimized for DNN algorithms. On the software side, machine learning libraries such as TensorFlow and PyTorch provide computational optimization and accelerate data loading through prefetching and multithreading. Data loading accounts for a major portion of DNN training time, and has mainly been studied and optimized through data parallelism and caching.
Data parallelism is a common technique for improving data loading efficiency when training deep neural networks. The work "Importance of data loading pipeline in training deep neural networks" points out two main problems with data loading. First, reading data directly from individual files is inefficient because of frequent open, read and close operations. At the same time, allocating resources for additional operations on the data can overload the CPU. It suggests using binary data formats such as HDF5 and TensorFlow's TFRecord to achieve better data loading performance, and shows that the performance gain obtainable by using more threads for data loading, for example in PyTorch, is limited.
Since the IOPS and throughput of a disk may not meet the requirements of data loading performance, caching is a strategy commonly used in large-scale systems to accelerate data access, as in Memcached and RAMCloud.
The disadvantages of the prior art include the following:
As shown in FIG. 1, in deep learning training, data must be read from disk, with open, read and close operations performed on each file; current data sets reach hundreds of GB and contain tens of thousands of files. If tens of thousands of files are operated on in every epoch, disk I/O becomes a significant expense. In addition, tens to hundreds of epochs must be trained to obtain good accuracy, so the time consumed by data loading is very large.
In terms of data parallelism, the data pipeline is a fine-grained data loading technique for reducing the I/O overhead of training deep neural networks. It uses multithreading to perform data loading while training tasks run on multiple GPUs. Although data formats such as HDF5 can support data parallelism, files in HDF5 are only compressed, and open, read and close operations are still required for each file during data loading, so not much disk I/O overhead is removed.
In terms of caching, fast data loading can be achieved if the data set is cached entirely in memory. However, some large data sets would require hundreds of GB or even TBs of memory. A reasonable caching strategy must therefore be designed: data in the cache must be evicted sensibly, and the model must not be trained on the same cached data for a long time, otherwise the final accuracy of model training will be affected.
Disclosure of Invention
The invention aims to provide an efficient data loading method, device and storage medium based on deep learning, to solve the problems of time-consuming data loading, high disk I/O overhead, and reduced model training accuracy in the prior art.
In a first aspect, the present invention provides an efficient data loading method based on deep learning, the method comprising the steps of: opening each small file in binary read mode, appending the binary data of each small file to a data block in sequence, and generating the data structure of the data block;
starting data loading by data block, and caching all data in memory for reuse when the first epoch loads the first miniBatch;
reusing half of the previously loaded data, and loading only the other half of the required data from disk for each subsequent miniBatch;
integrating the data loaded from the disk and from the cache, sending the integrated data to the GPU for training, and then loading the next miniBatch;
and deleting accessed data from the cache to release memory, so that the cache fits within the limited memory size.
According to the invention, the data structure comprises an N field, an Offset Field, a Size Field, a Label Field and a Raw Data Field, where the N field indicates that the data block integrates N small files; the Offset Field contains N integers indexed from 1 to N, each integer describing in order the starting position of each file's data within the Raw Data Field of the block; the Size Field contains N integers describing in order the size of each small file in the block, each size being stored in a fixed number of bytes (4 bytes in the described embodiment); the Label Field contains N integers representing, in turn, the label of each file; and the Raw Data Field contains the raw data of the N small files. The fields and data in the data block are stored in binary form.
According to the efficient data loading method based on deep learning, when a data block is loaded, the number of data items in the block is obtained by reading the first n bytes (the N field); the raw data of a specific data item can then be located directly through the Offset Field and the Size Field, and the label of that data can be found in the Label Field.
According to the invention, locating the raw data of a specific data item through the Offset Field and the Size Field comprises:
the first file has an offset of zero and a size given by the first integer in the Size Field, and the second file has an offset given by the second integer in the Offset Field and a size given by the second integer in the Size Field;
from the number of bytes required by each field in the data block, the extra bytes required to compress N small files into one data block are calculated, and the small files can be specifically located within the data block.
According to the efficient data loading method based on deep learning, the Offset Field stores, in order, the starting positions of small files 1 to N within the Raw Data Field; when locating the m-th small file in a data block, its starting position is obtained from the Offset Field and its size from the Size Field;
the starting and ending positions of the m-th small file's binary data within the Raw Data Field are then determined from the starting position in the Offset Field and the file size in the Size Field, thereby specifically locating the small file within the block.
According to the high-efficiency data loading method based on deep learning provided by the invention, the following steps are further executed:
in the training process, each data block is used as the unit of shuffling, where each data block has a unique number as an index (e.g., 0, 1, 2 and 3), and the index set of all data blocks is the input of the shuffle algorithm;
after the index set is permuted, the blocks are loaded and used for training one by one in the permuted index order.
According to the efficient data loading method based on deep learning provided by the invention, reusing half of the loaded data and loading half of the required data from disk for each subsequent miniBatch comprises:
when data exists in the cache, half of the miniBatch is taken from the disk and half from the cache, tracked by counters; loading stops when the counters for data loaded from the disk and from the cache each equal miniBatch/2; the data loaded from the disk and from the cache are then integrated and sent together to the GPU for training, the counters are reset to 0, and the next miniBatch is loaded.
According to the high-efficiency data loading method based on deep learning, when data loading is executed:
data loading is carried out by reading data blocks;
when loading data from disk, only half of the files for a miniBatch are loaded, using the data structure of the data block, and the other half of the files are obtained from the in-memory cache;
getCachedData() immediately evicts each accessed file from the cache, avoiding excessive reuse of cached data;
for the first getCachedData() call in deep learning training there is no data in the cache, so the data is loaded through disk I/O; afterwards the cache holds data that can be returned by getCachedData(); after one miniBatch is obtained, the files loaded from disk are put into the cache through putCachedData().
It can be seen that the invention has the following beneficial effects over the prior art:
1. By designing the data structure of the data block, the invention reduces the open, read and close operations on files, which greatly reduces disk I/O operations, improves I/O efficiency and shortens data loading time.
2. Through a reasonable caching strategy, the cache can be used sensibly without affecting the final accuracy: the number of disk loads is reduced, the time spent waiting for data before model training is greatly reduced, the server is better utilized for deep learning training, and both the data loading time and the overall deep learning training time are greatly reduced.
In a second aspect, the present invention also provides an electronic device, including:
a memory storing computer executable instructions;
a processor configured to execute the computer-executable instructions,
wherein the computer executable instructions, when executed by the processor, implement the steps of any of the deep learning based efficient data loading methods described above.
In a third aspect, the present invention also provides a storage medium having stored thereon a computer program for implementing the steps of any one of the above-described deep learning-based efficient data loading methods when executed by a processor.
It can be seen that the present invention also provides an electronic device and a storage medium for the deep learning-based efficient data loading method, comprising one or more memories and one or more processors. The memory stores the program code, intermediate data generated while the program runs, the output results of the model and the model parameters; the processor provides the processor resources occupied by running the code and the processor resources occupied when training the model.
The invention is described in further detail below with reference to the drawings and the detailed description.
Drawings
FIG. 1 is a schematic diagram of a data loading process in prior art deep learning.
FIG. 2 is a flow chart of an embodiment of a deep learning based efficient data loading method of the present invention.
Fig. 3 is a schematic structural diagram of compressing a small file into a data block according to an embodiment of a deep learning-based efficient data loading method of the present invention.
FIG. 4 is a schematic diagram of a data structure related to compressing small files into data blocks in an embodiment of a deep learning-based efficient data loading method according to the present invention.
FIG. 5 is a schematic diagram of a deep learning-based efficient data loading method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an algorithm for loading and caching data with high efficiency in an embodiment of a method for loading data with high efficiency based on deep learning according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 2, the invention provides a high-efficiency data loading method based on deep learning, which comprises the following steps:
Step S1, opening each small file in binary read mode, and sequentially appending the binary data of each small file to a data block to generate the data structure of the data block;
Step S2, starting data loading by data block, and caching all data in memory for reuse when the first epoch loads the first miniBatch;
Step S3, for each subsequent miniBatch, reusing half of the previously loaded data and loading only the other half of the required data from disk;
Step S4, integrating the data loaded from the disk and from the cache, sending the integrated data to the GPU for training, and then loading the next miniBatch;
Step S5, deleting accessed data from the cache to release memory, so that the cache fits within the limited memory size.
In this embodiment, the data structure includes an N field, an Offset Field, a Size Field, a Label Field and a Raw Data Field, the N field indicating that the data block integrates N small files; the Offset Field contains N integers indexed from 1 to N, each integer describing in order the starting position of each file's data within the Raw Data Field of the block; the Size Field contains N integers describing in order the size of each small file in the block, each size being stored in a fixed number of bytes (4 bytes in this embodiment); the Label Field contains N integers representing, in turn, the label of each file; and the Raw Data Field contains the raw data of the N small files. The fields and data in the data block are stored in binary form.
In this embodiment, when a data block is loaded, the number of data items in the block is obtained by reading the first n bytes (the N field); the raw data of a particular data item can then be located directly through the Offset Field and the Size Field, and the label of that data can be found in the Label Field.
In this embodiment, locating the raw data of a specific data item through the Offset Field and the Size Field includes:
the first file has an offset of zero and a size given by the first integer in the Size Field, and the second file has an offset given by the second integer in the Offset Field and a size given by the second integer in the Size Field;
from the number of bytes required by each field in the data block, the extra bytes required to compress N small files into one data block are calculated, and the small files can be specifically located within the data block.
The Offset Field stores, in order, the starting positions of small files 1 to N within the Raw Data Field; when locating the m-th small file in a data block, its starting position is obtained from the Offset Field and its size from the Size Field;
the starting and ending positions of the m-th small file's binary data within the Raw Data Field are then determined from the starting position in the Offset Field and the file size in the Size Field, thereby specifically locating the small file within the block.
In this embodiment, the Offset Field in a block represents the starting position of the binary data of each small file in the block; one block contains N small files, and the Offset Field sequentially stores the starting positions of small files 1 to N within the Raw Data Field. The Size Field represents the file size of each small file. For example, to find the 5th small file in a block, its starting position is obtained from the Offset Field, and the Size Field shows that its size is 100 KB; the starting and ending positions of the 5th file's binary data within the Raw Data Field are then determined from the starting position in the Offset Field and the file size in the Size Field, thereby specifically locating the small file within the block.
In the training process, each data block is used as the unit of shuffling, where each data block has a unique number as an index (e.g., 0, 1, 2 and 3), and the index set of all data blocks is the input of the shuffle algorithm;
after the index set is permuted, the blocks are loaded and used for training one by one in the permuted index order.
In this embodiment, reusing half of the previously loaded data and loading only half of the required data from disk for each subsequent miniBatch includes:
when data exists in the cache (which is the case for every miniBatch except the first miniBatch of the first epoch), half of the miniBatch is taken from the disk and half from the cache, tracked by two counters that are initialized to 0; loading from each source stops when its counter equals miniBatch/2. The data loaded from the disk and from the cache are then integrated and sent together to the GPU for training, both counters are reset to 0, and the next miniBatch is loaded. Data loading for a miniBatch ends when the number of loaded items reaches the miniBatch size, after which the data is sent to the GPU for training.
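As an illustration of the counter mechanism described above, the following Python sketch assembles one miniBatch from two sources; the iterators disk_items and cached_items and the function name assemble_minibatch are hypothetical placeholders under assumed inputs, not the patented implementation.

    def assemble_minibatch(disk_items, cached_items, mini_batch_size):
        """Build one miniBatch: half from disk, half from cache, tracked by counters."""
        half = mini_batch_size // 2
        disk_count, cache_count = 0, 0           # both counters start at 0
        batch = []
        while disk_count < half:                 # stop loading from disk at miniBatch/2
            batch.append(next(disk_items))
            disk_count += 1
        while cache_count < half:                # stop loading from cache at miniBatch/2
            batch.append(next(cached_items))
            cache_count += 1
        return batch                             # counters start again at 0 on the next call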
In the present embodiment, when data loading is performed:
data loading is carried out by reading data blocks;
when loading data from disk, only half of the files for a miniBatch are loaded, using the data structure of the data block, and the other half of the files are obtained from the in-memory cache;
getCachedData() immediately evicts each accessed file from the cache, avoiding excessive reuse of cached data;
for the first getCachedData() call in deep learning training there is no data in the cache, so the data is loaded through disk I/O; afterwards the cache holds data that can be returned by getCachedData(); after one miniBatch is obtained, the files loaded from disk are put into the cache through putCachedData().
Specifically, the embodiment mainly considers two aspects of designing a data structure of an efficient data loading block and a reasonable caching strategy:
First, to increase I/O efficiency while reducing the number of threads required to perform disk reads, this embodiment compresses small files into blocks, which greatly reduces the number of open, read and close operations because multiple files are read within a single block. At the same time, aggregating multiple files into one block can improve read throughput by reducing I/O seeks on a given disk.
Second, to cache the loaded data within a limited memory size, the loaded data is reused for efficient loading while the cache size is kept within the limited memory. This embodiment uses two strategies to cache data. First, when the first epoch loads the first miniBatch, all data is cached in memory for reuse; afterwards, half of the loaded data can be reused, and only half of the required data is loaded from disk for each subsequent miniBatch. In this way, the required disk throughput is halved, the time required to load data is halved, and each file is accessed the same number of times. Because of the limited memory size, the second strategy is to delete accessed data from the cache to free memory so that other files in the data set can be loaded.
Design of the efficient data loading block:
As shown in FIG. 3, 256 small files are read into one block: each small file is first opened in binary read mode, its binary data is appended to the block in sequence, and the other data fields are then laid out according to FIG. 4.
The data structure of a data block does not simply place each small file into one block in binary form; it must also record the specific location of each file within the block. FIG. 4 illustrates how the data structure in one block is designed, where the value of N can be set to a multiple of the batch size.
As shown in FIG. 4, the data structure of this embodiment includes five fields. The first field, N, indicates that the data block integrates N small files; with N = 256 in this embodiment, only 4 bytes are needed to represent the number of small files in the data block. The second field, the Offset Field, contains N integers indexed from 1 to N, each integer describing in order the starting position of each file's data within the Raw Data Field of the block; each small file requires 4 bytes for its offset, so N files require 4N bytes. The Size Field contains N integers describing in order the size of each small file in the data block; each small file requires 4 bytes for its size, so N small files require 4N bytes in total. The Label Field likewise contains N integers representing, in turn, the label of each file, with each small file consuming 4 bytes and N small files consuming 4N bytes. The Raw Data Field contains the raw data of the N small files, stored in the data block without any increase or decrease in size. The fields and data in the data block are stored in binary form. The data structure of the data block therefore not only packs multiple small files into one block but also supports data positioning: the location of one small file, or of a subset of small files, within the block can be determined from the two parameters Offset Field and Size Field, which effectively supports subsequently adding data to the cache. For example, the offset of the first file is zero and its size is given by the first integer in the Size Field, while the offset of the second file is given by the second integer in the Offset Field and its size by the second integer in the Size Field. From the number of bytes required by each field, at most (12N+4) extra bytes are needed to compress N small files into one block. When N is 256, one data block is only 3076 bytes, i.e. about 3 KB, larger than the original small files, whereas each small file is about 100 KB and the 256 small files total about 25600 KB, so the extra 3 KB is negligible compared with the original data. The additional overhead is thus only on the order of one ten-thousandth of the data size; the block is barely larger than the original collection of small files and causes no excessive consumption of system resources.
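The following Python sketch illustrates one possible way to pack N small files into a block with the layout N | Offset Field | Size Field | Label Field | Raw Data Field. It assumes 4-byte integers as in this embodiment; the little-endian encoding, the helper name pack_block and the use of the struct module are illustrative assumptions, not the patented implementation.

    import struct
    from pathlib import Path

    def pack_block(file_paths, labels, block_path):
        """Concatenate small files into one block with binary metadata fields."""
        raw_parts, offsets, sizes = [], [], []
        position = 0
        for path in file_paths:
            data = Path(path).read_bytes()         # open each small file once, in binary mode
            offsets.append(position)               # start position inside the Raw Data Field
            sizes.append(len(data))                # size of this small file in bytes
            raw_parts.append(data)
            position += len(data)
        n = len(file_paths)
        header = struct.pack("<I", n)              # N field: 4 bytes
        header += struct.pack(f"<{n}I", *offsets)  # Offset Field: 4N bytes
        header += struct.pack(f"<{n}I", *sizes)    # Size Field: 4N bytes
        header += struct.pack(f"<{n}I", *labels)   # Label Field: 4N bytes
        Path(block_path).write_bytes(header + b"".join(raw_parts))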
Apart from the Raw Data Field, the other regions of the data structure in FIG. 4 can be regarded as metadata for each data item. The invention can therefore also extend these fields to accommodate different types of training data sets. When a block is loaded, the first four bytes can be read to determine how many data items are in the block. Through the Offset Field and the Size Field, the raw data of a particular data item can be located directly, and its label can be found in the Label Field. Because the entire data structure is loaded into main memory together, reading each file within a block is very efficient.
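Under the same assumed 4-byte integer layout, the following sketch reads the N field and locates the m-th data item directly through the Offset Field and Size Field; read_item is a hypothetical helper, not code from the patent.

    import struct

    def read_item(block_bytes, m):
        """Return (label, raw_bytes) of the m-th small file (1-indexed) in a block."""
        n = struct.unpack_from("<I", block_bytes, 0)[0]               # first 4 bytes: N
        offsets = struct.unpack_from(f"<{n}I", block_bytes, 4)        # Offset Field
        sizes = struct.unpack_from(f"<{n}I", block_bytes, 4 + 4 * n)  # Size Field
        labels = struct.unpack_from(f"<{n}I", block_bytes, 4 + 8 * n) # Label Field
        raw_start = 4 + 12 * n                                        # Raw Data Field begins here
        start = raw_start + offsets[m - 1]
        end = start + sizes[m - 1]
        return labels[m - 1], block_bytes[start:end]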
By compressing every N small files into one data block, this embodiment reduces the number of open and close operations when reading multiple files to 1/N of the baseline version. When reading small files individually, performance is mainly limited by the IOPS of the disk; when N small files are integrated into one data block, performance improves and disk throughput becomes less of a limitation.
An important function during training is data shuffling, which avoids overfitting in deep learning training. Since each data item is read independently in the baseline version, shuffling can easily be arranged in units of individual images. However, this shuffling method is not suitable once N small files are compressed into one block: many blocks would have to be loaded to read N files, resulting in inefficient data loading.
To perform shuffling while maintaining efficient loading performance, this embodiment uses each data block as the unit of shuffling. FIG. 5 shows an example of shuffling by data block: each block has a unique number as an index, 0, 1, 2 and 3 respectively, and the index set of all blocks is the input to the shuffle algorithm. After the index set is permuted, training is performed block by block in the permuted index order, i.e., 3, 1, 0 and 2.
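A minimal sketch of block-level shuffling follows, assuming blocks are stored in hypothetical files named block_<index>.bin; only the permutation of block indices is taken from the description above.

    import random

    def shuffled_block_order(num_blocks, seed=None):
        """Return a permuted list of block indices, e.g. [3, 1, 0, 2] for 4 blocks."""
        indices = list(range(num_blocks))        # every block has a unique index
        random.Random(seed).shuffle(indices)     # permute block indices, not single images
        return indices

    # Usage: load and train block by block in the permuted order.
    # for block_id in shuffled_block_order(4):
    #     with open(f"block_{block_id}.bin", "rb") as f:
    #         block_bytes = f.read()
    #     ...  # feed the items of this block to training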
Thus, this embodiment compresses every N files into one block to achieve efficient data loading, while shuffling in units of blocks during training to avoid overfitting.
Design of the caching strategy:
As shown in FIG. 6, the algorithm demonstrates how the cache is used to obtain efficient data loading when the optimization strategy of the invention is applied. When line 6 of the algorithm loads data from the disk, only half of the files for a miniBatch are loaded, and the other half is obtained from the cache in line 7. getCachedData() immediately evicts each accessed file from the cache. Note that for the first getCachedData() call there is no data in the cache, so the data is loaded through disk I/O; afterwards the cache holds data that can be returned by getCachedData(). After line 8 obtains a miniBatch, the files loaded from disk are put into the cache at line 9.
Therefore, when the first miniBatch of the first epoch is loaded, the cache holds no data and the data of the first miniBatch is loaded entirely from disk: getCachedData() on line 7 returns nothing, so both halves of the miniBatch obtained at line 8 come from disk, and the data just loaded from disk is then put into the cache, leaving one miniBatch of data in the cache. For each subsequent miniBatch, half of the data is loaded from disk; the data taken from the cache is deleted from the cache, and the data just loaded from disk is placed into the cache.
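The following sketch approximates the caching loop of FIG. 6. The helpers get_cached_data() and put_cached_data() stand in for the getCachedData()/putCachedData() routines of the embodiment, and the deque container and load_from_disk callback are assumptions; only the immediate eviction of accessed items and the half-from-disk/half-from-cache split follow the description above.

    from collections import deque

    cache = deque()                                    # in-memory cache of (label, bytes) items

    def get_cached_data(count):
        """Return up to `count` cached items, evicting each one as soon as it is accessed."""
        items = []
        while cache and len(items) < count:
            items.append(cache.popleft())              # delete accessed data from the cache
        return items

    def put_cached_data(items):
        """Put the items just loaded from disk into the cache for reuse by the next miniBatch."""
        cache.extend(items)

    def next_minibatch(load_from_disk, mini_batch_size):
        """Assemble one miniBatch: half from the cache when available, the rest from disk."""
        half = mini_batch_size // 2
        cached = get_cached_data(half)                 # empty only for the first miniBatch of the first epoch
        disk_items = load_from_disk(mini_batch_size - len(cached))
        put_cached_data(disk_items)                    # freshly loaded data becomes the next cache content
        return cached + disk_items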
In the above algorithm, using memory as the cache, the invention caches loaded data for reuse. At the same time, accessed data is deleted from the cache so that the cache fits within the limited memory size. The loading time can be reduced by half because only half of the files need to be loaded from disk for each miniBatch.
In summary, since data loading dominates the running time, this embodiment achieves efficient data loading by compressing small files into blocks and reducing disk I/O operations, thereby reducing the overall running time; with a reasonable caching strategy, the DNN data loading time can be improved by up to a factor of 10 without affecting accuracy, increasing overall performance.
In one embodiment, an electronic device is provided, which may be a server. The electronic device includes a processor, a memory and a network interface connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the electronic device is used for storing data. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the efficient data loading method based on deep learning.
It will be appreciated by those skilled in the art that the electronic device structure shown in this embodiment is merely a partial structure related to the present application and does not constitute a limitation of the electronic device to which the present application is applied, and that a specific electronic device may include more or fewer components than those shown in this embodiment, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It can be seen that the present invention also provides an electronic device and a storage medium for the deep learning-based efficient data loading method, comprising one or more memories and one or more processors. The memory stores the program code, intermediate data generated while the program runs, the output results of the model and the model parameters; the processor provides the processor resources occupied by running the code and the processor resources occupied when training the model.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, but any insubstantial changes and substitutions made by those skilled in the art on the basis of the present invention are intended to be within the scope of the present invention as claimed.

Claims (10)

1. An efficient data loading method based on deep learning, characterized by comprising the following steps:
opening each small file in a binary reading mode, sequentially adding binary data of each small file into a data block, and generating a data structure of the data block;
starting data loading of a data block, and caching all data in a memory for reuse when a first epoch loads a first miniBatch;
reusing half of the previously loaded data, and loading only the other half of the required data from disk for each subsequent miniBatch;
integrating the data loaded from the disk and the cache, then sending the integrated data to the GPU for training, and loading the next miniBatch;
and deleting accessed data from the cache to release memory, so that the cache fits within the limited memory size.
2. The method according to claim 1, characterized in that:
the data structure includes an N field, an Offset Field, a Size Field, a Label Field and a Raw Data Field, the N field indicating that the data block integrates N small files; the Offset Field contains N integers indexed from 1 to N, each integer describing in order the starting position of each file's data within the Raw Data Field of the block; the Size Field contains N integers describing in order the size of each small file in the block, each size being stored in a fixed number of bytes; the Label Field contains N integers representing, in turn, the label of each file; and the Raw Data Field contains the raw data of the N small files; wherein the fields and data in the data block are stored in binary form.
3. The method according to claim 2, characterized in that:
when loading a data block, the number of data items in the data block is obtained by reading the first n bytes; through the Offset Field and the Size Field, the raw data of a particular data item can be located directly and the label of that data can be found in the Label Field.
4. The method according to claim 3, wherein locating the raw data of a particular data item through the Offset Field and the Size Field comprises:
the first file has an offset of zero and a size given by the first integer in the Size Field, and the second file has an offset given by the second integer in the Offset Field and a size given by the second integer in the Size Field;
from the number of bytes required by each field in the data block, the extra bytes required to compress N small files into one data block are calculated, and the small files can be specifically located within the data block.
5. The method according to claim 4, wherein:
the Offset Field stores, in order, the starting positions of small files 1 to N within the Raw Data Field; when locating the m-th small file in a data block, its starting position is obtained from the Offset Field and its size from the Size Field;
the starting and ending positions of the m-th small file's binary data within the Raw Data Field are then determined from the starting position in the Offset Field and the file size in the Size Field, thereby specifically locating the small file within the block.
6. The method of claim 1, further performing:
in the training process, each data block is used as the unit of shuffling, where each data block has a unique number as an index (e.g., 0, 1, 2 and 3), and the index set of all data blocks is the input of the shuffle algorithm;
after the index set is permuted, the blocks are loaded and used for training one by one in the permuted index order.
7. The method of claim 1, wherein reusing half of the loaded data and loading half of the required data from disk for each subsequent miniBatch comprises:
when data exists in the cache, half of the miniBatch is taken from the disk and half from the cache, tracked by counters; loading stops when the counters for data loaded from the disk and from the cache each equal miniBatch/2; the data loaded from the disk and from the cache are then integrated and sent together to the GPU for training, the counters are reset to 0, and the next miniBatch is loaded.
8. The method of claim 1, wherein, when performing data loading:
carrying out data loading by reading data blocks;
when loading data from disk, only half of the files for a miniBatch are loaded, using the data structure of the data block, and the other half of the files are obtained from the in-memory cache;
getCachedData() immediately evicts each accessed file from the cache, avoiding excessive reuse of cached data;
for the first getCachedData() call in deep learning training there is no data in the cache, so the data is loaded through disk I/O; afterwards the cache holds data that can be returned by getCachedData(); after one miniBatch is obtained, the files loaded from disk are put into the cache through putCachedData().
9. An electronic device, comprising:
a memory storing computer executable instructions;
a processor configured to execute the computer-executable instructions,
wherein the computer executable instructions, when executed by the processor, implement the steps of the deep learning based efficient data loading method as recited in any one of claims 1-8.
10. A storage medium having stored thereon a computer program for implementing the steps of the deep learning based efficient data loading method according to any of claims 1-8 when executed by a processor.
CN202310267674.7A 2023-03-14 2023-03-14 Efficient data loading method, equipment and storage medium based on deep learning Pending CN116483748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310267674.7A CN116483748A (en) 2023-03-14 2023-03-14 Efficient data loading method, equipment and storage medium based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310267674.7A CN116483748A (en) 2023-03-14 2023-03-14 Efficient data loading method, equipment and storage medium based on deep learning

Publications (1)

Publication Number Publication Date
CN116483748A (en) 2023-07-25

Family

ID=87210984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310267674.7A Pending CN116483748A (en) 2023-03-14 2023-03-14 Efficient data loading method, equipment and storage medium based on deep learning

Country Status (1)

Country Link
CN (1) CN116483748A (en)

Similar Documents

Publication Publication Date Title
US10644721B2 (en) Processing core data compression and storage system
US10884744B2 (en) System and method of loop vectorization by compressing indices and data elements from iterations based on a control mask
Lemire et al. Consistently faster and smaller compressed bitmaps with roaring
JP6605573B2 (en) Parallel decision tree processor architecture
Andrzejewski et al. GPU-WAH: Applying GPUs to compressing bitmap indexes with word aligned hybrid
JPWO2020190808A5 (en)
Al Sideiri et al. CUDA implementation of fractal image compression
Stehle et al. ParPaRaw: Massively parallel parsing of delimiter-separated raw data
CN112015473A (en) Sparse convolution neural network acceleration method and system based on data flow architecture
CN112771546A (en) Operation accelerator and compression method
CN113407343A (en) Service processing method, device and equipment based on resource allocation
CN111026736B (en) Data blood margin management method and device and data blood margin analysis method and device
US20230385258A1 (en) Dynamic random access memory-based content-addressable memory (dram-cam) architecture for exact pattern matching
CN115836346A (en) In-memory computing device and data processing method thereof
CN116483748A (en) Efficient data loading method, equipment and storage medium based on deep learning
CN112200310A (en) Intelligent processor, data processing method and storage medium
US12001237B2 (en) Pattern-based cache block compression
Lu et al. G-Match: a fast GPU-friendly data compression algorithm
Fuentes-Alventosa et al. Cuvle: Variable-length encoding on cuda
CN114518841A (en) Processor in memory and method for outputting instruction using processor in memory
WO2021061183A1 (en) Shuffle reduce tasks to reduce i/o overhead
Ahmed et al. Efficient GPU Acceleration for Computing Maximal Exact Matches in Long DNA Reads
CN112906728A (en) Feature comparison method, device and equipment
KR102519210B1 (en) Method and apparatus for accelerating artificial neural network using double-stage weight sharing
US11442643B2 (en) System and method for efficiently converting low-locality data into high-locality data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination