CN112465046B - Method, system, equipment and medium for artificial intelligence training of mass small files - Google Patents

Method, system, equipment and medium for artificial intelligence training of mass small files

Info

Publication number
CN112465046B
CN112465046B CN202011394898.7A
Authority
CN
China
Prior art keywords
data
block
training
data blocks
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011394898.7A
Other languages
Chinese (zh)
Other versions
CN112465046A (en)
Inventor
刘慧兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011394898.7A
Publication of CN112465046A
Application granted
Publication of CN112465046B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for artificial intelligence training on massive small files, wherein the method comprises the following steps: in response to starting the artificial intelligence training task, acquiring a data set from remote central storage and merging the small files in the data set into data blocks according to the block structure definition; in response to starting training or updating an epoch, generating a training task data set list based on a synchronous shuffle mechanism between data blocks and within data blocks; obtaining file list information of the data blocks according to the training task data set list; and acquiring file data according to the file list information of the data blocks, caching the file data locally at the granularity of one or more data blocks, and training the artificial intelligence task. The method solves the problem of low I/O bandwidth utilization when massive small files are read during training, alleviates the mismatch between the I/O read rate and the GPU computing rate, improves the utilization of computing resources, and accelerates the whole training process over massive small files.

Description

Method, system, equipment and medium for artificial intelligence training of mass small files
Technical Field
The present invention relates to the field of AI training, and more particularly, to a method, a system, a computer device, and a readable medium for artificial intelligence training of a large number of small files.
Background
AI (artificial intelligence) training on large-scale data sets of massive small files generally has the following characteristics: 1. a large-scale data set is usually placed in an external storage medium or system, such as NFS, BeeGFS, or the cloud; 2. in a traditional file system, the metadata of a large number of small files (including access time, permissions, modification time, and the like) resides on disk; acquiring a file requires loading the metadata from disk into memory, locating the file on disk from the metadata, and only then reading the file content from disk, so the overall file read performance is poor; 3. reading small files wastes I/O resources, and because the I/O read rate does not match the CPU and GPU computing rates, the utilization of computing resources such as the CPU and GPU is low; 4. an AI training task typically performs multiple epochs of training on the same data set.
For AI training, in order to avoid the performance penalty of every batch and epoch directly accessing remote central storage, the prior art usually pulls the data set from remote central storage to a local cache before training. To cope with limited single-machine storage space, a common approach is to deploy a virtual distributed file caching system, such as Alluxio, on the AI computing cluster. However, although the data is pulled from remote central storage to the local cache, reading it still requires multiple RPC calls, so training performance on massive small files is worse than reading directly from a local disk; in addition, the eviction granularity is uncontrollable, and when the cache space is insufficient, the swap-in/swap-out performance on every cache read and write is poor.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method, a system, a computer device, and a computer-readable storage medium for artificial intelligence training on massive small files, in which a data block structure is defined, remote small files are merged into the defined data block format, and the synthesized data blocks are used, together with a local cache policy, to train the AI model, so as to improve the I/O bandwidth utilization and alleviate the mismatch between the I/O rate and the computing rates of the CPU, GPU, and the like, while a shuffle mechanism between data blocks and within data blocks avoids overfitting in AI model training, thereby improving the speed of the entire AI training task while ensuring the effectiveness and robustness of the trained model.
Based on the above purpose, one aspect of the embodiments of the present invention provides a method for artificial intelligence training on massive small files, comprising the following steps: in response to starting an artificial intelligence training task, acquiring a data set from remote central storage and merging the small files in the data set into data blocks according to the block structure definition; in response to starting training or updating an epoch, generating a training task data set list based on a synchronous shuffle mechanism between the data blocks and within the data blocks; obtaining file list information of the data blocks according to the training task data set list; and acquiring file data according to the file list information of the data blocks, caching the file data locally at the granularity of one or more data blocks, and performing artificial intelligence task training.
In some embodiments, acquiring the data set from remote central storage and merging the small files in the data set into data blocks according to the block structure definition comprises: serializing the metadata field information in the data block header.
In some embodiments, obtaining the file list information of the data blocks according to the training task data set list comprises: obtaining the block identifier list corresponding to the training task data set list, reading the data block data according to the block identifier list, parsing the metadata field of each data block, and sequentially deserializing the metadata fields in the data blocks to obtain the file list information of the data blocks.
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: determining whether all the block identifiers corresponding to the training task data set list exist in memory; and in response to the block identifiers corresponding to the training task data set list not all existing in memory, traversing the local distributed file system to obtain all the block identifier information and storing the block identifier information in memory.
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: shuffling the block identifiers in memory using a random number generation method to obtain a first block identifier sequence.
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: sequentially deserializing the block metadata fields of the data blocks in the first block identifier sequence to obtain the small-file list information within the data blocks, shuffling the file lists again using a random number generation method, and removing the invalid small files.
In some embodiments, the method further comprises: in response to creating an artificial intelligence training task, scheduling the artificial intelligence training task onto a node where a dataset has been cached.
On the other hand, the embodiment of the invention also provides an artificial intelligence training system for mass small files, which comprises the following components: the merging module is configured to respond to starting of an artificial intelligence training task, acquire a data set from a remote center storage and merge small files in the data set into data blocks according to the structure definition of the blocks; a shuffle module configured to generate a list of training task data sets based on a synchronous shuffle mechanism between the data blocks and within the data blocks in response to starting training or updating an epoch; the acquisition module is configured to obtain file list information of the data blocks according to the training task data set list; and the execution module is configured to acquire file data according to the file list information of the data blocks, cache the file data locally at one or more data block granularities and train an artificial intelligence task.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method as above.
In another aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored which, when executed by a processor, implements the above method steps.
The invention has the following beneficial technical effects: by defining a data block structure, remote small files are merged into the defined data block format, and the synthesized data blocks are used, together with a local cache policy, to train the AI model; this improves the I/O bandwidth utilization, alleviates the mismatch between the I/O rate and the computing rates of the CPU, GPU, and the like, and avoids overfitting in AI model training through a shuffle mechanism between data blocks and within data blocks, thereby improving the speed of the entire AI training task while ensuring the effectiveness and robustness of the trained model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a schematic diagram of an embodiment of a method for artificial intelligence training of a large number of small files provided by the present invention;
FIG. 2 is a block diagram of a data block defined in accordance with an embodiment of the present invention;
fig. 3 is a schematic diagram of a hardware structure of an embodiment of a computer device for artificial intelligence training of a large number of small files according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two non-identical entities or non-identical parameters with the same name; "first" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and this is not repeated in the following embodiments.
Based on the above purpose, a first aspect of the embodiments of the present invention provides an embodiment of a method for artificial intelligence training of a large number of small files. Fig. 1 is a schematic diagram illustrating an embodiment of a method for artificial intelligence training of a large number of small files provided by the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
s1, responding to the starting of an artificial intelligence training task, acquiring a data set from a remote center storage, and combining small files in the data set into data blocks according to the structural definition of the blocks;
s2, responding to the start of training or updating the epochs, and generating a training task data set list based on a synchronous shuffle mechanism between data blocks and in the data blocks;
s3, obtaining file list information of the data blocks according to the training task data set list; and
and S4, acquiring file data according to the file list information of the data blocks, caching the file data locally at one or more data block granularities, and training an artificial intelligence task.
The method mainly involves defining the structure and organization of the data blocks, a synchronous shuffle mechanism between data blocks and within data blocks, and a local cache mechanism with multi-dimensional granularity such as data set subdirectory and data block. When an AI training task is started, the remote small-file images are reorganized and aggregated into large data blocks according to the defined data block structure and size and stored in the local distributed cache file system. When the AI training task subsequently reads image data, it obtains the content of a whole group of images directly from one large data block instead of one image at a time; the file name, offset, size and other information of each small-file image within the data block are obtained from the deserialized metadata field of the data block, and the source file data are cached locally at the granularity of one or more data blocks to train the AI task. This data-block-based approach creates an artificial "hot data" effect for AI training, thereby improving I/O bandwidth utilization, alleviating the mismatch between I/O read performance and the computing rates of the CPU, GPU, and the like, and increasing the speed of AI training tasks.
In some embodiments, the method comprises: in response to creating the artificial intelligence training task, scheduling the artificial intelligence training task onto a node where the data set has been cached. When an AI training task is created, the scheduling algorithm preferentially schedules the task onto nodes that have already cached the data set. If the data set required by the training task is not cached locally, the corresponding data set in remote central storage needs to be pulled to the node to which the training task is assigned.
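The scheduling preference described above can be illustrated with a short sketch. This is a minimal example of cache-aware placement, not the patented scheduling algorithm; the node representation, the load metric, and the tie-breaking rule are assumptions made purely for illustration.

```python
# Minimal sketch of cache-aware task placement, assuming each node exposes the set of
# data sets it has cached and a simple load metric; both fields are illustrative.
def schedule_training_task(dataset_id, nodes):
    """nodes: list of dicts such as {"name": str, "cached_datasets": set, "load": float}."""
    # Prefer nodes that already hold the data set in their local cache.
    cached = [n for n in nodes if dataset_id in n["cached_datasets"]]
    candidates = cached if cached else nodes
    # Break ties (or fall back) by picking the least-loaded node.
    target = min(candidates, key=lambda n: n["load"])
    # If the data set is not cached on the chosen node, it must be pulled from remote storage.
    needs_pull = dataset_id not in target["cached_datasets"]
    return target["name"], needs_pull


nodes = [
    {"name": "node-1", "cached_datasets": set(), "load": 0.2},
    {"name": "node-2", "cached_datasets": {"imagenet-mini"}, "load": 0.7},
]
print(schedule_training_task("imagenet-mini", nodes))  # ('node-2', False)
```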
In response to starting the artificial intelligence training task, the data set is acquired from remote central storage and the small files in the data set are merged into data blocks according to the block structure definition. Fig. 2 is a schematic diagram of the data block structure defined by an embodiment of the present invention. The embodiment of the invention defines the data size of a data block (for example, 16 MB, 32 MB, and the like) and reorganizes and merges the small files of the remote data set into large data blocks according to the data block structure in Fig. 2. When large data blocks are merged, the principle that data blocks synthesized from different subdirectories do not cross one another is followed as far as possible; although this may increase the number of merged data blocks, it makes updating and evicting the data set and controlling the cache granularity more flexible and convenient.
In some embodiments, acquiring the data set from remote central storage and merging the small files in the data set into data blocks according to the block structure definition comprises: serializing the metadata field information in the data block header. The metadata field information in the data block header needs to be serialized (for example, using a protobuf message structure), and the data section holds the actual content of the contained small-file images.
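The following sketch illustrates one possible realization of this merge step. The patent serializes the header with a protobuf message and lays the block out as in Fig. 2; for a self-contained example the header is approximated here with a length-prefixed JSON blob, and the field names (name, offset, size, status), the block size limit, and the grouping helper are assumptions rather than the exact on-disk format.

```python
# Illustrative sketch, not the patent's exact format: pack small files of one
# subdirectory into fixed-size data blocks laid out as
#   [4-byte header length][serialized metadata field][file payloads].
import json
import os
import struct

BLOCK_DATA_SIZE = 16 * 1024 * 1024  # e.g. 16 MB of payload per block (32 MB is also mentioned)


def group_files_by_size(file_paths, limit=BLOCK_DATA_SIZE):
    """Greedily group files from a single subdirectory so each block stays under `limit`."""
    groups, current, current_size = [], [], 0
    for path in file_paths:
        size = os.path.getsize(path)
        if current and current_size + size > limit:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


def merge_small_files_into_block(file_paths, block_path):
    """Merge one group of small files into a single data block file."""
    entries, payload, offset = [], bytearray(), 0
    for path in file_paths:
        with open(path, "rb") as f:
            data = f.read()
        entries.append({"name": os.path.basename(path), "offset": offset,
                        "size": len(data), "status": 0})  # status 0: file is valid
        payload += data
        offset += len(data)
    header = json.dumps(entries).encode()  # stand-in for the serialized protobuf metadata field
    with open(block_path, "wb") as block:
        block.write(struct.pack(">I", len(header)))
        block.write(header)
        block.write(payload)
    return entries
```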
In response to starting training or updating the epoch, a list of training task data sets is generated based on a synchronous shuffle mechanism between and within the data blocks.
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: determining whether all the block identifiers corresponding to the training task data set list exist in memory; and in response to the block identifiers not all existing in memory, traversing the local distributed file system to obtain all the block identifier information and storing it in memory. In other words, it is determined whether all block IDs (block identifiers) of the data set corresponding to the AI training task exist in memory; if not, the local distributed file system is traversed to obtain all block ID information, the corresponding information is stored in memory, and the subsequent shuffle operation then works on it.
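A minimal sketch of this in-memory block-ID registry follows; the on-disk layout (one file per data block under a per-data-set directory in the local cache) and the file-name convention are assumptions for illustration only.

```python
# Sketch of the block-ID lookup: hit the in-memory registry first and scan the local
# cached file system only when the data set's block IDs are not yet resident.
import os

_block_id_registry = {}  # dataset_id -> list of block IDs held in memory


def get_block_ids(dataset_id, cache_root="/local-cache"):
    if dataset_id not in _block_id_registry:
        dataset_dir = os.path.join(cache_root, dataset_id)
        _block_id_registry[dataset_id] = sorted(
            name for name in os.listdir(dataset_dir) if name.endswith(".block")
        )
    return _block_id_registry[dataset_id]
```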
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: shuffling the block identifiers in memory using a random number generation method to obtain a first block identifier sequence. The shuffle mechanism between data blocks works as follows: the block IDs in memory are shuffled using a random number generation method to produce a block ID sequence. For example, if there are N blocks whose IDs are stored in an array of length N, a random number K in [0, N-1] is generated first and array[K] is swapped with array[N-1]; a random number M in [0, N-2] is generated next and array[M] is swapped with array[N-2]; the random swaps continue in this manner until all data blocks contained in the training data set have been shuffled.
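The swap procedure described here is the classic Fisher-Yates shuffle; the sketch below transcribes it directly, using Python's random module as a stand-in for the random number generation method.

```python
# Fisher-Yates shuffle of the block ID array, following the swap steps described above.
import random


def shuffle_block_ids(block_ids):
    ids = list(block_ids)
    for end in range(len(ids) - 1, 0, -1):
        k = random.randint(0, end)            # random index in [0, end]
        ids[k], ids[end] = ids[end], ids[k]   # swap array[K] with array[end]
    return ids
```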
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: sequentially deserializing the block metadata fields of the data blocks in the first block identifier sequence to obtain the small-file list information within the data blocks, shuffling the file lists again using a random number generation method, and removing the invalid small files. The shuffle mechanism within a data block works as follows: for the generated block ID sequence, the block metadata field of each data block is deserialized in turn to obtain the small-file list within the data block, and the file list is shuffled again using a random number generation method. status = 1 indicates that a file within a data block has become invalid (the file may have been deleted or updated at the source end), so the files with status = 1 in each block are filtered out; finally, the file lists of all data blocks are merged to generate the training list for the epoch.
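The sketch below combines the two shuffle stages into the epoch list generation; read_block_entries() is a hypothetical helper assumed to return the deserialized entry list of one block (entries shaped as in the merge sketch above), and shuffle_block_ids() is reused from the previous sketch.

```python
# Build the epoch's training list: shuffle the block ID sequence, then within each block
# drop invalid files (status == 1), shuffle the remaining entries, and concatenate.
def build_epoch_file_list(block_ids, read_block_entries):
    epoch_list = []
    for block_id in shuffle_block_ids(block_ids):          # shuffle between data blocks
        entries = read_block_entries(block_id)              # deserialize the metadata field
        valid = [e for e in entries if e["status"] != 1]    # filter deleted/updated files
        for entry in shuffle_block_ids(valid):              # shuffle within the data block
            epoch_list.append((block_id, entry))
    return epoch_list
```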
File list information of the data blocks is then obtained according to the training task data set list.
In some embodiments, obtaining the file list information of the data blocks according to the training task data set list comprises: obtaining the block identifier list corresponding to the training task data set list, reading the data blocks according to the block identifier list, parsing the metadata fields of the data blocks, and sequentially deserializing the metadata fields in the data blocks to obtain the file list information of the data blocks. When small-file image data are read, the corresponding data block and several subsequent data blocks are read in turn into the cache of the local computing node, according to the configured size and policy of the local data block cache, and the block metadata field of each data block is deserialized to obtain the file list and each file's offset and size within the block. Subsequent files in the same data block therefore only need to be looked up and read from the local cache.
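The following sketch shows how a single small file could be served from a locally cached data block using the offset and size recorded in the deserialized metadata field; it assumes the length-prefixed JSON header layout from the merge sketch above, not the patent's exact encoding.

```python
# Read one small file out of a cached data block: deserialize the metadata field once,
# then seek to the file's offset within the payload and read exactly `size` bytes.
import json
import struct


def read_block_header(block_path):
    with open(block_path, "rb") as block:
        (header_len,) = struct.unpack(">I", block.read(4))
        entries = json.loads(block.read(header_len))
    return entries, 4 + header_len  # entry list and the payload start offset


def read_small_file(block_path, file_name):
    entries, payload_start = read_block_header(block_path)
    entry = next(e for e in entries if e["name"] == file_name)
    with open(block_path, "rb") as block:
        block.seek(payload_start + entry["offset"])
        return block.read(entry["size"])
```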
Finally, the file data are acquired according to the file list information of the data blocks, cached locally at the granularity of one or more data blocks, and used to train the artificial intelligence task.
By defining a data block structure, the embodiment of the invention merges the training data set (a large number of small files) into large data blocks, stores the data block files with a local distributed cache file system, and manages the metadata of the data block files in memory with a metadata server, thereby accelerating data access; based on the shuffle mechanism between data blocks and within data blocks, overfitting in AI training is avoided, and the training speed is improved while the accuracy of the trained model is preserved.
It should be particularly noted that, the steps in the above-mentioned embodiments of the method for artificial intelligence training of a large amount of small files may be mutually intersected, replaced, added, and deleted, so that these methods for artificial intelligence training of a large amount of small files, which are transformed by reasonable permutation and combination, should also belong to the scope of the present invention, and should not limit the scope of the present invention to the embodiments.
Based on the above object, a second aspect of the embodiments of the present invention provides a system for artificial intelligence training on massive small files, comprising: a merging module configured to, in response to starting an artificial intelligence training task, acquire a data set from remote central storage and merge the small files in the data set into data blocks according to the block structure definition; a shuffle module configured to, in response to starting training or updating an epoch, generate a training task data set list based on a synchronous shuffle mechanism between the data blocks and within the data blocks; an acquisition module configured to obtain file list information of the data blocks according to the training task data set list; and an execution module configured to acquire file data according to the file list information of the data blocks, cache the file data locally at the granularity of one or more data blocks, and train the artificial intelligence task.
In some embodiments, the merging module is configured to: serialize the metadata field information in the data block header.
In some embodiments, the acquisition module is configured to: obtain the block identifier list corresponding to the training task data set list, read the data blocks according to the block identifier list, parse the metadata fields of the data blocks, and sequentially deserialize the metadata fields in the data blocks to obtain the file list information of the data blocks.
In some embodiments, the shuffle module is configured to: determine whether all the block identifiers corresponding to the training task data set list exist in memory; and in response to the block identifiers not all existing in memory, traverse the local distributed file system to obtain all the block identifier information and store it in memory.
In some embodiments, the shuffle module is configured to: shuffle the block identifiers in memory using a random number generation method to obtain a first block identifier sequence.
In some embodiments, the shuffle module is configured to: sequentially deserialize the block metadata fields of the data blocks in the first block identifier sequence to obtain the small-file list information within the data blocks, shuffle the file lists again using a random number generation method, and remove the invalid small files.
In some embodiments, the system further comprises a creation module configured to: in response to creating an artificial intelligence training task, scheduling the artificial intelligence training task onto a node where a dataset has been cached.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: s1, responding to the starting of an artificial intelligence training task, acquiring a data set from a remote center storage, and combining small files in the data set into data blocks according to the structure definition of the blocks; s2, responding to the start of training or updating the epochs, and generating a training task data set list based on a synchronous shuffle mechanism between data blocks and in the data blocks; s3, obtaining file list information of the data blocks according to the training task data set list; and S4, acquiring file data according to the file list information of the data blocks, caching the file data locally at one or more data block granularities, and performing artificial intelligence task training.
In some embodiments, acquiring the data set from remote central storage and merging the small files in the data set into data blocks according to the block structure definition comprises: serializing the metadata field information in the data block header.
In some embodiments, obtaining the file list information of the data blocks according to the training task data set list comprises: obtaining the block identifier list corresponding to the training task data set list, reading the data blocks according to the block identifier list, parsing the metadata fields of the data blocks, and sequentially deserializing the metadata fields in the data blocks to obtain the file list information of the data blocks.
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: determining whether all the block identifiers corresponding to the training task data set list exist in memory; and in response to the block identifiers not all existing in memory, traversing the local distributed file system to obtain all the block identifier information and storing it in memory.
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: shuffling the block identifiers in memory using a random number generation method to obtain a first block identifier sequence.
In some embodiments, generating the training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: sequentially deserializing the block metadata fields of the data blocks in the first block identifier sequence to obtain the small-file list information within the data blocks, shuffling the file lists again using a random number generation method, and removing the invalid small files.
In some embodiments, the steps further comprise: in response to creating an artificial intelligence training task, scheduling the artificial intelligence training task onto a node where a dataset has been cached.
Fig. 3 is a schematic hardware structural diagram of an embodiment of the computer device for artificial intelligence training of the above-mentioned mass small files provided by the present invention.
Taking the apparatus shown in fig. 3 as an example, the apparatus includes a processor 301 and a memory 302, and may further include: an input device 303 and an output device 304.
The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means, and the bus connection is taken as an example in fig. 3.
The memory 302 is used as a non-volatile computer-readable storage medium for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for artificial intelligence training of massive small files in the embodiments of the present application. The processor 301 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 302, that is, implements the method for artificial intelligence training of massive small files of the above method embodiments.
The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of a method of artificial intelligence training of a large number of small files, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 302 may optionally include memory located remotely from processor 301, which may be connected to local modules over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 303 may receive information such as a user name and a password that are input. The output means 304 may comprise a display device such as a display screen.
One or more program instructions/modules corresponding to the method for artificial intelligence training of massive small files are stored in the memory 302 and, when executed by the processor 301, perform the method for artificial intelligence training of massive small files in any of the above-described method embodiments.
Any embodiment of the computer device for executing the method for artificial intelligence training of the mass small files can achieve the same or similar effects as any corresponding method embodiment.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program; the program of the method for artificial intelligence training of massive small files can be stored in a computer-readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM). The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
The foregoing are exemplary embodiments of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant only to be exemplary, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of an embodiment of the invention, also combinations between technical features in the above embodiments or in different embodiments are possible, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit or scope of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (6)

1. A method for artificial intelligence training of a mass of small files is characterized by comprising the following steps:
in response to starting an artificial intelligence training task, acquiring a data set from remote central storage and combining small files in the data set into data blocks according to the structural definition of the blocks;
in response to starting training or updating the epoch, generating a training task data set list based on a synchronous shuffle mechanism between the data blocks and within the data blocks;
obtaining file list information of the data blocks according to the training task data set list; and
obtaining file data according to the file list information of the data blocks, caching the file data locally with one or more data block granularities, carrying out artificial intelligence task training,
the obtaining of the data sets from the remote central storage and the merging of the small files in the data sets into data blocks according to the structure definition of the blocks comprises: serializing metadata field information in the data block header,
the generating a training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: judging whether all block identifications corresponding to the training task data set list exist in a memory or not; and responding to the situation that all the block identifiers corresponding to the training task data set list do not exist in the memory completely, traversing the local distributed file system to acquire all the block identifier information, and storing the block identifier information in the memory,
the generating a training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: disturbing the block identifier in the memory by using a random number generation method to obtain a first block identifier sequence,
the generating a training task data set list based on the synchronous shuffle mechanism between the data blocks and within the data blocks comprises: and sequentially deserializing the block metadata fields of the data blocks in the first block identification sequence to acquire the small file list information in the data blocks, and disturbing the file list by using a random number generation method again to delete the invalid small files.
2. The method of claim 1, wherein obtaining file list information for the data block according to the training task data set list comprises:
and acquiring a block identification list corresponding to the training task data set list, reading data block data according to the block identification list, analyzing a metadata domain of the data block, and sequentially deserializing the metadata domain in the data block to obtain file list information of the data block.
3. The method of claim 1, further comprising:
in response to creating the artificial intelligence training task, scheduling the artificial intelligence training task onto a node where the data set has been cached.
4. A system for artificial intelligence training of a mass of small files, comprising:
the merging module is configured to respond to starting of an artificial intelligence training task, acquire a data set from a remote center storage and merge small files in the data set into data blocks according to the structure definition of the blocks;
a shuffle module configured to generate a list of training task data sets based on a synchronous shuffle mechanism between the data blocks and within the data blocks in response to starting training or updating an epoch;
the acquisition module is configured to obtain file list information of the data blocks according to the training task data set list; and
the execution module is configured to acquire file data according to the file list information of the data blocks, locally cache the file data in one or more data block granularities and train an artificial intelligence task,
the merge module is configured to: serializing metadata field information in the data block header,
the shuffling module is configured to: judging whether all block identifications corresponding to the training task data set list exist in a memory or not; and responding to the situation that all the block identifiers corresponding to the training task data set list do not exist in the memory completely, traversing the local distributed file system to acquire all the block identifier information, and storing the block identifier information in the memory,
the shuffling module is configured to: disturbing the block identifier in the memory by using a random number generation method to obtain a first block identifier sequence,
the shuffling module is configured to: and sequentially deserializing the block metadata fields of the data blocks in the first block identification sequence to acquire the small file list information in the data blocks, and disturbing the file list by using a random number generation method again to delete the invalid small files.
5. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 3.
6. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
CN202011394898.7A 2020-12-03 2020-12-03 Method, system, equipment and medium for artificial intelligence training of mass small files Active CN112465046B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011394898.7A CN112465046B (en) 2020-12-03 2020-12-03 Method, system, equipment and medium for artificial intelligence training of mass small files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011394898.7A CN112465046B (en) 2020-12-03 2020-12-03 Method, system, equipment and medium for artificial intelligence training of mass small files

Publications (2)

Publication Number Publication Date
CN112465046A CN112465046A (en) 2021-03-09
CN112465046B (en) 2022-11-29

Family

ID=74805337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011394898.7A Active CN112465046B (en) 2020-12-03 2020-12-03 Method, system, equipment and medium for artificial intelligence training of mass small files

Country Status (1)

Country Link
CN (1) CN112465046B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114267404A (en) * 2022-03-03 2022-04-01 深圳佰维存储科技股份有限公司 eMMC test method, device, readable storage medium and electronic equipment
CN115904263B (en) * 2023-03-10 2023-05-23 浪潮电子信息产业股份有限公司 Data migration method, system, equipment and computer readable storage medium
CN116185308B (en) * 2023-04-25 2023-08-04 山东英信计算机技术有限公司 Data set processing method, device, equipment, medium and model training system
CN117472296B (en) * 2023-12-27 2024-03-15 苏州元脑智能科技有限公司 Data processing method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236173A1 (en) * 2018-01-31 2019-08-01 Accenture Global Solutions Limited Utilizing artificial intelligence to integrate data from multiple diverse sources into a data structure
CN110516817A (en) * 2019-09-03 2019-11-29 北京华捷艾米科技有限公司 A kind of model training data load method and device
CN110688230A (en) * 2019-10-17 2020-01-14 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN111124277A (en) * 2019-11-21 2020-05-08 苏州浪潮智能科技有限公司 Deep learning data set caching method, system, terminal and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236173A1 (en) * 2018-01-31 2019-08-01 Accenture Global Solutions Limited Utilizing artificial intelligence to integrate data from multiple diverse sources into a data structure
CN110516817A (en) * 2019-09-03 2019-11-29 北京华捷艾米科技有限公司 A kind of model training data load method and device
CN110688230A (en) * 2019-10-17 2020-01-14 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN111124277A (en) * 2019-11-21 2020-05-08 苏州浪潮智能科技有限公司 Deep learning data set caching method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN112465046A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN112465046B (en) Method, system, equipment and medium for artificial intelligence training of mass small files
CN109254733B (en) Method, device and system for storing data
CN107169083B (en) Mass vehicle data storage and retrieval method and device for public security card port and electronic equipment
US11809726B2 (en) Distributed storage method and device
JP6383110B2 (en) Data search method, apparatus and terminal
US10747739B1 (en) Implicit checkpoint for generating a secondary index of a table
US10102230B1 (en) Rate-limiting secondary index creation for an online table
WO2019219005A1 (en) Data processing system and method
CN110704438B (en) Method and device for generating bloom filter in blockchain
CN112463290A (en) Method, system, apparatus and storage medium for dynamically adjusting the number of computing containers
Mehmood et al. Distributed real-time ETL architecture for unstructured big data
CN116016702A (en) Application observable data acquisition processing method, device and medium
WO2016014333A1 (en) Distributing and processing streams over one or more networks for on-the-fly schema evolution
CN111177159B (en) Data processing system and method and data updating equipment
CN115185679A (en) Task processing method and device for artificial intelligence algorithm, server and storage medium
Shan et al. KubeAdaptor: a docking framework for workflow containerization on Kubernetes
CN112965939A (en) File merging method, device and equipment
JP4362839B1 (en) Meta information sharing distributed database system on virtual single memory storage
WO2016008317A1 (en) Data processing method and central node
CN110955461B (en) Processing method, device, system, server and storage medium for computing task
CN111427920A (en) Data acquisition method, device, system, computer equipment and storage medium
Liu et al. A large-scale rendering system based on hadoop
CN114942727A (en) Micro-kernel file system extensible page cache system and method
US11711220B1 (en) System and methods for computation, storage, and consensus in distributed systems
CN110288309B (en) Data interaction method, device, system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant