CN112749127A - Data providing method and system for model training - Google Patents

Data providing method and system for model training

Info

Publication number
CN112749127A
Authority
CN
China
Prior art keywords
file
model training
target object
identifier
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011609668.8A
Other languages
Chinese (zh)
Inventor
余虹建
李锦丰
朱军
李秋庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Juyun Technology Co ltd
Original Assignee
Beijing Juyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Juyun Technology Co ltd filed Critical Beijing Juyun Technology Co ltd
Priority to CN202011609668.8A priority Critical patent/CN112749127A/en
Publication of CN112749127A publication Critical patent/CN112749127A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a data providing method and system for model training, relate to the field of computer technology, and can effectively improve the data acquisition efficiency of a model training task. The method comprises the following steps: receiving a data request of a model training task, wherein the data request carries a data set identifier and a file identifier of a target file required by the model training task; locating, in a hierarchical directory and according to the data set identifier and the file identifier, the storage location of a target object corresponding to the target file in an object storage server; outputting the target object from the object storage server according to the storage location; and converting the target object into a corresponding file in a preset file system, wherein the preset file system is the file system on which the model training task is based. The invention can be applied to machine learning.

Description

Data providing method and system for model training
Technical Field
The invention relates to the technical field of computers, in particular to a data providing method and system for model training.
Background
In recent years, artificial intelligence technology has been applied ever more widely in industry and daily life. Machine learning is an important branch of the artificial intelligence field, in which a good mathematical model can be obtained by training on a large amount of data. Because model training requires a huge amount of data and heavy computation, data reading and computation must in many cases rely on a computer cluster. How to support such a large deep learning training cluster in rapidly reading models and training samples has become an urgent problem in the field.
Existing file systems for machine learning training mainly include parallel file systems and network file systems, such as CephFS, BeeGFS, and GPFS, but these file systems were not designed with the particular characteristics of clustered deep learning jobs in mind, so their data reading efficiency is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data providing method and system for model training, which can effectively improve data reading efficiency in a model training task.
In a first aspect, an embodiment of the present invention provides a data providing method for model training, including: receiving a data request of a model training task, wherein the data request carries a data set identifier and a file identifier of a target file required by the model training task; according to the data set identification and the file identification, positioning the storage position of a target object corresponding to the target file in an object storage server in a hierarchical directory; outputting the target object from the object storage server according to the storage position; and converting the target object into a corresponding file in a preset file system, wherein the preset file system is a file system on which the model training task is based.
Optionally, before the locating, according to the dataset identifier and the file identifier, a storage location of a target object corresponding to the target file in an object storage server in a hierarchical directory, the method further includes: and generating the hierarchical directory according to the object identification of each object stored in the object storage server.
Optionally, after the target object is output from the object storage server according to the storage location and before the target object is converted into a corresponding file in a preset file system, the method further includes: and caching the target object according to a preset caching strategy.
Optionally, before the locating, according to the dataset identifier and the file identifier, a storage location of a target object corresponding to the target file in an object storage server in a hierarchical directory, the method further includes: searching a target object corresponding to the data set identifier and the file identifier in a cache according to the data set identifier and the file identifier; the converting the target object into a corresponding file in a preset file system comprises: and responding to the target object in the cache, and converting the target object into a corresponding file in a preset file system.
Optionally, before the locating, according to the dataset identifier and the file identifier, a storage location of a target object corresponding to the target file in an object storage server in a hierarchical directory, the method further includes: searching a target object corresponding to the data set identifier and the file identifier in a cache according to the data set identifier and the file identifier; the positioning, according to the dataset identifier and the file identifier, a storage location of a target object corresponding to the target file in an object storage server in a hierarchical directory includes: and responding to the fact that the target object does not exist in the cache, and positioning the storage position of the target object corresponding to the target file in the object storage server in the hierarchical directory according to the data set identification and the file identification.
Optionally, caching the target object based on a data request of a first model training task; and searching a target object corresponding to the data set identification and the file identification in a cache based on a data request of a second model training task, wherein the first model training task is different from or the same as the second model training task.
In a second aspect, an embodiment of the present invention further provides a storage system for model training, including: the device comprises a request receiving unit, a model training task processing unit and a model matching unit, wherein the request receiving unit is used for receiving a data request of a model training task, and the data request carries a data set identifier and a file identifier of a target file required by the model training task; the positioning unit is used for positioning the storage position of a target object corresponding to the target file in the object storage server in a hierarchical directory according to the data set identifier and the file identifier; a data output unit, configured to output the target object from the object storage server according to the storage location; and the conversion unit is used for converting the target object into a corresponding file in a preset file system, wherein the preset file system is a file system on which the model training task is based.
Optionally, the system further includes: and the directory generation unit is used for generating the hierarchical directory according to the object identifiers of all the objects stored in the object storage server before positioning the storage positions of the object objects corresponding to the object files in the hierarchical directory in the object storage server according to the data set identifiers and the file identifiers.
Optionally, the system further includes: and the caching unit is used for caching the target object according to a preset caching strategy after the target object is output from the object storage server according to the storage position and before the target object is converted into a corresponding file in a preset file system.
Optionally, the system further includes: the searching unit is used for searching the target object corresponding to the data set identifier and the file identifier in the cache according to the data set identifier and the file identifier before positioning the storage position of the target object corresponding to the target file in the hierarchical directory in the object storage server; the conversion unit is specifically configured to, in response to the target object existing in the cache, convert the target object into a corresponding file in a preset file system.
Optionally, the system further includes: the searching unit is used for searching the target object corresponding to the data set identifier and the file identifier in the cache according to the data set identifier and the file identifier before positioning the storage position of the target object corresponding to the target file in the hierarchical directory in the object storage server; the location unit is specifically configured to, in response to that the target object does not exist in the cache, locate, in a hierarchical directory, a storage location of a target object in an object storage server, where the target object corresponds to the target file, according to the dataset identifier and the file identifier.
Optionally, caching the target object based on a data request of a first model training task; and searching a target object corresponding to the data set identification and the file identification in a cache based on a data request of a second model training task, wherein the first model training task is different from or the same as the second model training task.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, and is used for executing any data providing method for model training provided by the embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement any one of the data providing methods for model training provided by the embodiments of the present invention.
The data providing method, apparatus, electronic device, and storage medium for model training provided by the embodiments of the present invention can receive a data request of a model training task, where the data request carries a data set identifier and a file identifier of a target file required by the model training task; then locate, in a hierarchical directory and according to the data set identifier and the file identifier, the storage location of a target object corresponding to the target file in an object storage server; output the target object from the object storage server according to the storage location; and convert the target object into a corresponding file in a preset file system, where the preset file system is the file system on which the model training task is based. In this way, on the one hand, the model training task can benefit from the strengths of object storage, namely high read speed and easy capacity expansion; on the other hand, converting objects in object storage into files in a file system lets the model training system use the files conveniently and efficiently for training, which effectively improves the data reading efficiency of a clustered model training service.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a data providing method for model training provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application of the data providing method for model training according to the embodiment of the present invention;
FIG. 3 is a flow chart of a data providing method for model training according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data providing apparatus for model training according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In machine learning, model training requires, on the one hand, computers with powerful computing capability and, on the other hand, sufficient data samples for the computers to learn from. Because model training requires a huge amount of data and heavy computation, data reading and computation must in many cases rely on a computer cluster. How to support such large-scale machine learning, and especially a deep learning training cluster, in rapidly reading models and training samples has become a problem to be solved in the field.
Existing file systems for machine learning training mainly include parallel file systems and network file systems, such as CephFS, BeeGFS, and GPFS, but these file systems were not designed with the particular characteristics of clustered deep learning jobs in mind, so their data reading efficiency is low.
For example, Quiver is an informed storage cache for deep learning jobs on a GPU cluster. It is integrated with the training framework: it fetches the shuffle list in advance and then preloads data from a remote storage system into a number of local file systems, forming a dedicated cache storage system. However, because Quiver is a cache system, a user reading a data set for the first time experiences slow access; moreover, to support deep learning training with the cache system, deep learning users must modify their code, which is very inconvenient. As another example, DeepIO explores the pipelining of I/O and computation by wrapping an object interface into a library tightly coupled to the framework and using an entropy-aware sampling technique. However, DeepIO considers each deep learning training (DLT) job in isolation, and the benefit of caching for a single job is marginal unless the entire data set fits in a small cache memory.
In view of the above, the inventors found through research that, with a suitable storage system architecture, a storage system for deep learning data sets can keep I/O from becoming a bottleneck even across many discrete concurrent tasks, and can effectively support training tasks without requiring any change to the training code, thereby effectively improving data reading efficiency.
To enable those skilled in the art to better understand the technical ideas, implementations, and advantageous technical effects of the embodiments of the present invention, they are described in detail below with reference to specific examples.
In a first aspect, embodiments of the present invention provide a data providing method for model training, which can effectively improve data reading efficiency of a cluster model training service.
As shown in fig. 1, a data providing method for model training according to an embodiment of the present invention includes:
s11, receiving a data request of a model training task, wherein the data request carries a data set identifier and a file identifier of a target file required by the model training task;
the model training task may read the training data in a manner that reads a file. Each file may be stored in the object storage server in the form of an object. All training data required for a model training task may form a data set. Each data set may be stored in a bucket of the object store server. Because of the large amount of data required for model training, a data set can often include tens of millions of file numbers.
Optionally, the model training task may be any of various machine learning tasks, such as deep learning, running on a single machine or a computer cluster. Each model training task may train a corresponding model. Data such as the samples required for model training can be obtained by sending data requests to the storage system. A data request may carry a data set identifier and a file identifier of a target file required by the model training task, such as dataset1-filename1.
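A minimal sketch of what such a data request might carry, per S11. The `DataRequest` class and its field names are assumptions for illustration, not part of the patent:

```python
# Hypothetical shape of the data request described in S11: it carries a
# data set identifier and a file identifier, e.g. "dataset1-filename1".
from dataclasses import dataclass

@dataclass(frozen=True)
class DataRequest:
    dataset_id: str   # identifies the data set (e.g. the object-store bucket)
    file_id: str      # identifies the target file within the data set

req = DataRequest(dataset_id="dataset1", file_id="filename1")
request_key = f"{req.dataset_id}-{req.file_id}"  # "dataset1-filename1"
```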
S12, according to the data set identification and the file identification, locating the storage position of the target object corresponding to the target file in the object storage server in the hierarchical directory;
In the embodiment of the invention, the storage system may be implemented on object storage while the model training task is based on a file system. To find the target object corresponding to the target file in the object storage server quickly, a hierarchical directory may be built in advance over the objects in the object storage server. Optionally, the hierarchical directory may be a multi-way tree directory structure indicating the storage path of each object. Because the hierarchical directory records only object names and storage paths, its data volume is very small, yet it provides clear guidance for retrieving the target object and greatly shortens the time required to obtain it.
Optionally, the hierarchical directory may be independent of the object storage server, or may be integrated in the object storage server, which is not limited in the embodiment of the present invention.
After receiving a data request sent by a model training task, the storage system can recognize the data set identifier and the file identifier in it and search the pre-established hierarchical directory for the target object corresponding to the target file, thereby obtaining the storage location of the target object.
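As an illustrative sketch (not the patent's implementation), the directory walk described above might look like the following, where the hierarchical directory is modeled as a nested dictionary and `find_storage_location`, along with the sample paths, are hypothetical names:

```python
# Sketch of step S12: walk the hierarchical directory level by level using
# the data set identifier and file identifier, and return the target
# object's storage location recorded at the leaf.

def find_storage_location(directory, dataset_id, file_id):
    """Locate the target object's storage position in the hierarchical directory."""
    node = directory.get(dataset_id)
    if node is None:
        return None                      # unknown data set
    for part in file_id.split("/"):
        if not isinstance(node, dict) or part not in node:
            return None                  # path component missing
        node = node[part]
    return node                          # storage location at the leaf

# A tiny example directory for one data set (hypothetical contents).
directory = {
    "dataset1": {"train": {"img_001.jpeg": "bucket-1/obj-42"}},
}
location = find_storage_location(directory, "dataset1", "train/img_001.jpeg")
```

Because only names and paths are stored, such a structure stays small even for data sets with tens of millions of files, matching the rationale given above.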
S13, outputting the target object from the object storage server according to the storage location;
after the storage location of the target object in the object storage server is found, the target object may be transmitted from the object storage server to the model training task.
S14, converting the target object into a corresponding file in a preset file system, wherein the preset file system is a file system on which the model training task is based.
Because the storage system is implemented on object storage while the model training task is based on a file system, to let the model training task use the data it reads directly for training, a FUSE (Filesystem in Userspace) layer may be set up in this step to convert the target object into a file in the preset file system. The preset file system is the file system on which the model training task is based. Optionally, the FUSE layer may be placed at any processing step after the target object is output from the object storage server and before the target object enters the model training task.
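A real FUSE layer would expose objects through mounted-filesystem callbacks; as a greatly simplified stand-in for the conversion step S14, the sketch below just materializes an object's bytes as a regular file under the preset file system's root. The function name, paths, and this whole approach are assumptions for illustration:

```python
# Simplified stand-in for S14: present a fetched object as an ordinary file
# under the file system the training task reads from.
import os

def materialize_object(object_name, object_bytes, mount_root):
    """Write an object's bytes to the corresponding file path under mount_root."""
    file_path = os.path.join(mount_root, *object_name.split("/"))
    os.makedirs(os.path.dirname(file_path), exist_ok=True)  # create parent dirs
    with open(file_path, "wb") as f:
        f.write(object_bytes)
    return file_path
```

In the patent's design the conversion is transparent to the training code, which is what allows training to proceed without code changes.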
The data providing method for model training provided by the embodiment of the present invention can receive a data request of a model training task, where the data request carries a data set identifier and a file identifier of a target file required by the model training task; then locate, in a hierarchical directory and according to the data set identifier and the file identifier, the storage location of a target object corresponding to the target file in an object storage server; output the target object from the object storage server according to the storage location; and convert the target object into a corresponding file in a preset file system, where the preset file system is the file system on which the model training task is based. In this way, on the one hand, the model training task can benefit from the strengths of object storage, namely high read speed and easy capacity expansion; on the other hand, converting objects in object storage into files in a file system lets the model training system use the files conveniently and efficiently for training, which effectively improves the data reading efficiency of a clustered model training service.
Furthermore, due to the data providing method for model training provided by the embodiment of the invention, decoupling of data storage and model operation is realized, the data storage does not depend on the model operation, and a user does not need to modify codes during the model training.
In an embodiment of the invention, data used for model training can be stored in an object storage server, and a series of storage systems including the object storage server can provide powerful data reading support for a model training task. For example, hierarchical catalogs provide a basis for quickly and accurately finding a target object in a large data set.
In order to obtain such a hierarchical directory, optionally, in an embodiment of the present invention, before the storage location of the target object corresponding to the target file in the object storage server is located in the hierarchical directory according to the data set identifier and the file identifier in step S12, the data providing method for model training provided by the embodiment of the present invention may further include: generating the hierarchical directory according to the object identifier of each object stored in the object storage server. For example, in one embodiment of the invention, the object name of one stored object is Image/train/n01782513_108.JPEG. This object name can be split into: Image, train, n01782513_108.JPEG, and the directory information formed from it is Image\train\n01782513_108.JPEG, i.e. the file n01782513_108.JPEG under the train subfolder of the Image folder. After all objects in the data set have been scanned, a hierarchical directory containing the storage paths of all objects is formed.
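The scan-and-split procedure above can be sketched as follows; this is an illustrative reading of the example, with `build_hierarchical_directory` as an assumed name, and each leaf simply recording the object's own name as its storage path:

```python
# Sketch of generating the hierarchical directory from object identifiers:
# split each object name on '/' and insert it into a nested mapping.

def build_hierarchical_directory(object_names):
    root = {}
    for name in object_names:
        parts = name.split("/")
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[parts[-1]] = name                # leaf records the storage path
    return root

tree = build_hierarchical_directory([
    "Image/train/n01782513_108.JPEG",
    "Image/train/n01782513_109.JPEG",
    "Image/val/n01782513_001.JPEG",
])
# tree["Image"]["train"] now holds the files under Image/train
```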
After the hierarchical directory is obtained, it can be used to quickly locate the position of the target object in the object storage server, so that the target object can be output from the object storage server and converted into a corresponding file. Further, after the target object is output from the object storage server according to the storage location and before it is converted into a corresponding file in the preset file system, the data providing method for model training provided by the embodiment of the present invention may further include: caching the target object according to a preset caching strategy. Caching the target object means storing it in a cache on the node where the model training task is deployed, so that subsequent model training can use the data in the cache directly instead of loading it from the remote end, further improving the data acquisition efficiency of the model training task.
Optionally, the preset caching strategy may cover, for example: how the cache is allocated among the training tasks; whether all objects of each data set are cached or only some of them; and, since a data set is typically large and would occupy a large amount of cache space, if only some objects of a data set are to be cached, which objects are cached preferentially, and so on. The goal is that data reading for each model training task can ultimately be supported satisfactorily.
Specifically, in an embodiment of the present invention, before the step S12 locates, according to the dataset identifier and the file identifier, a storage location of a target object corresponding to the target file in the object storage server in the hierarchical directory, the data providing method for model training provided by an embodiment of the present invention may further include: searching a target object corresponding to the data set identifier and the file identifier in a cache according to the data set identifier and the file identifier; based on this, the step S14 of converting the target object into a corresponding file in a preset file system may specifically include: and responding to the target object in the cache, and converting the target object into a corresponding file in a preset file system.
Optionally, in another embodiment of the present invention, before positioning, in the hierarchical directory, a storage location of the target object corresponding to the target file in the object storage server according to the dataset identifier and the file identifier in step S12, the data providing method for model training provided by the embodiment of the present invention may further include: searching a target object corresponding to the data set identifier and the file identifier in a cache according to the data set identifier and the file identifier; based on this, in step S12, according to the dataset identifier and the file identifier, locating, in the hierarchical directory, a storage location of the target object corresponding to the target file in the object storage server may specifically include: and responding to the fact that the target object does not exist in the cache, and positioning the storage position of the target object corresponding to the target file in the object storage server in the hierarchical directory according to the data set identification and the file identification.
For example, in an embodiment of the present invention, if the data requested by the model training task includes object-12, object-18, and object-77 in dataset2, the cache may first be checked for these three objects. If object-18 exists in the cache but object-12 and object-77 do not, object-18 can be obtained directly from the cache and converted into file-18 in the preset file system. Then the storage locations of object-12 and object-77 in the object storage server are located in the hierarchical directory, object-12 and object-77 are obtained from the object storage server, and they are converted into file-12 and file-77 under the preset file system, respectively.
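The cache-first read path of this example can be sketched as follows. `read_objects`, `fetch_from_store`, and `convert_to_file` are hypothetical stand-ins for the cache lookup, object-storage fetch, and file-conversion steps:

```python
# Sketch of the cache-first flow: serve hits from the cache, fetch misses
# from the object storage server, cache them, and convert each to a file.

def read_objects(requested, cache, fetch_from_store, convert_to_file):
    files = {}
    for key in requested:
        data = cache.get(key)
        if data is None:                  # cache miss: locate and fetch remotely
            data = fetch_from_store(key)
            cache[key] = data             # keep for later epochs and other tasks
        files[key] = convert_to_file(key, data)
    return files

# object-18 is already cached; object-12 and object-77 live in the store.
cache = {"dataset2/object-18": b"d18"}
store = {"dataset2/object-12": b"d12", "dataset2/object-77": b"d77"}
result = read_objects(
    ["dataset2/object-12", "dataset2/object-18", "dataset2/object-77"],
    cache,
    fetch_from_store=store.__getitem__,
    convert_to_file=lambda k, d: ("file-" + k.split("-")[-1], d),
)
```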
It should be noted that, on a single machine or a computer cluster on which model training tasks are deployed, multiple model training tasks may run simultaneously or in sequence. Moreover, model training is a process of training on a large amount of data with continuous iterative optimization, so each model training task performs multiple rounds of training over its data set. Therefore, for rounds after the first within the same model training task, and for multiple different model training tasks using the same data, the cached data can effectively reduce the number of times data must be fetched from the remote end, further improving the data reading efficiency of model training tasks. For example, in one embodiment of the present invention, if multiple model training tasks share the same cache and their data partially overlap, all of those tasks can use the same corresponding data in the cache.
The foregoing embodiment illustrates that, in the embodiment of the present invention, the model training task for caching the target object may be the same as or different from the model training task for searching the target object in the cache. For example, in one embodiment of the invention, the target object is cached based on a data request of a first model training task; and searching a target object corresponding to the data set identification and the file identification in a cache based on a data request of a second model training task, wherein the first model training task is different from or the same as the second model training task.
The data providing method for model training provided by the embodiment of the present invention is described in detail below with specific embodiments.
Fig. 2 is an application diagram of a data providing method for model training according to an embodiment of the present invention.
As shown in FIG. 2, the job manager is a task manager that decides on which computer node in the cluster each task is deployed and run. ws1 and ws2 are two different computer nodes in the cluster. Under the scheduling of the job manager, deep learning task DLTjob1 runs on ws1, and deep learning tasks DLTjob2, DLTjob3, and DLTjob4 run on ws2. Both ws1 and ws2 run a cache manager and a userspace file system (FUSE). The cache manager can cache the data set of each deep learning task. The data set is stored in the form of objects and is obtained from the object storage cluster of the object storage server. FUSE may retrieve an object from the cache or from the object storage server and convert it into a file in the file system for use by each model training task. When data is obtained from the object storage server, the hierarchical directory namespace can be used to speed up the data lookup, and the data in the remote object storage server is read into the computer node executing the model training task through the corresponding object storage gateway (rgw).
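The read path in FIG. 2 can be sketched as follows. `ObjectGateway` and `CacheManager` are illustrative stand-ins for the object storage gateway (rgw) and the node-local cache manager; their APIs are assumptions for this sketch, not the patent's actual interfaces.

```python
class ObjectGateway:
    """Stand-in for the object storage gateway (rgw) to the remote cluster."""
    def __init__(self, store):
        self.store = store          # simulated remote object store
        self.remote_reads = 0       # counts round-trips to the remote end

    def get(self, key):
        self.remote_reads += 1
        return self.store[key]

class CacheManager:
    """Node-local cache shared by all training tasks on one node (e.g. ws2)."""
    def __init__(self, gateway):
        self.gateway = gateway
        self.cache = {}

    def read(self, key):
        # Only the first read of a key goes through the gateway; later reads
        # by the same task (next epoch) or other tasks hit the cache.
        if key not in self.cache:
            self.cache[key] = self.gateway.get(key)
        return self.cache[key]
```

Because DLTjob2, DLTjob3, and DLTjob4 all run on ws2, repeated reads of the same object through this shared cache trigger a single remote read.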
Based on the application diagram shown in fig. 2, in an embodiment of the present invention, as shown in fig. 3, a data providing method for model training provided in an embodiment of the present invention may include:
S201, generating the hierarchical directory according to the object identifier of each object stored in the object storage server;
S202, receiving a data request of a model training task, where the data request carries a data set identifier and a file identifier of a target file required by the model training task;
S203, searching, according to the data set identifier and the file identifier, for a target object corresponding to the data set identifier and the file identifier in a cache; if the target object is not found, executing step S204; if it is found, executing step S206;
S204, in response to the target object not existing in the cache, locating, in the hierarchical directory according to the data set identifier and the file identifier, the storage location in the object storage server of the target object corresponding to the target file;
S205, outputting the target object from the object storage server according to the storage location;
S206, converting the target object into a corresponding file in a preset file system;
and S207, providing the file to the model training task, so that the model training task performs model training using the file.
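Steps S201 and S204 above can be sketched as follows, assuming an illustrative flat object key layout of `dataset/object-id` (the patent does not fix a key format, so both the layout and the `file-N` to `object-N` naming convention are assumptions of this sketch):

```python
def build_hierarchical_directory(object_keys):
    """S201: build a nested directory from the flat object identifiers."""
    root = {}
    for key in object_keys:
        dataset_id, obj_id = key.split("/", 1)
        # Each leaf records the storage location of one object.
        root.setdefault(dataset_id, {})[obj_id] = key
    return root

def locate(directory, dataset_id, file_id):
    """S204: map (data set identifier, file identifier) to a storage location."""
    obj_id = file_id.replace("file", "object")  # assumed naming convention
    return directory.get(dataset_id, {}).get(obj_id)
```

Grouping objects by data set this way means a lookup touches only the branch for the requested data set instead of scanning the flat object namespace, which is the speed-up the hierarchical directory is meant to provide.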
Correspondingly, the embodiment of the invention also provides a storage system for model training, which can effectively improve the data reading efficiency of the model training task.
As shown in fig. 4, a storage system for model training provided by an embodiment of the present invention may include:
a request receiving unit 31, configured to receive a data request of a model training task, where the data request carries a data set identifier and a file identifier of a target file required by the model training task;
a positioning unit 32, configured to position, in a hierarchical directory, a storage location of a target object in an object storage server, where the target object corresponds to the target file, according to the dataset identifier and the file identifier;
a data output unit 33 for outputting the target object from the object storage server according to the storage location;
a converting unit 34, configured to convert the target object into a corresponding file in a preset file system, where the preset file system is a file system on which the model training task is based.
The data storage system for model training provided in the embodiments of the present invention can receive a data request of a model training task, where the data request carries a data set identifier and a file identifier of a target file required by the model training task; locate, in a hierarchical directory according to the data set identifier and the file identifier, the storage location in the object storage server of a target object corresponding to the target file; output the target object from the object storage server according to the storage location; and convert the target object into a corresponding file in a preset file system, where the preset file system is the file system on which the model training task is based. In this way, on the one hand, the model training task can exploit the strengths of object storage, namely fast reading and convenient capacity expansion; on the other hand, converting objects in object storage into files in a file system allows the model training task to use the files conveniently and efficiently, effectively improving the data reading efficiency of cluster model training services.
The data storage system for model training provided by the embodiment of the invention may further include:
and the directory generation unit is used for generating the hierarchical directory according to the object identifier of each object stored in the object storage server, before the storage location in the object storage server of the target object corresponding to the target file is located in the hierarchical directory according to the data set identifier and the file identifier.
The data storage system for model training provided by the embodiment of the invention may further include:
and the caching unit is used for caching the target object according to a preset caching strategy after the target object is output from the object storage server according to the storage position and before the target object is converted into a corresponding file in a preset file system.
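The patent leaves the "preset caching strategy" open. As one possible strategy, purely for illustration, the caching unit could apply a least-recently-used (LRU) eviction policy, which suits the multi-round, multi-task reuse pattern described earlier:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: one possible 'preset caching strategy' (illustrative)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # insertion order tracks recency

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

Under this policy, objects re-read every training round stay resident while objects from finished tasks age out; other strategies (size-based, whole-data-set pinning) would fit the same caching unit interface.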
The data storage system for model training provided by the embodiment of the invention may further include:
the searching unit is used for searching the target object corresponding to the data set identifier and the file identifier in the cache according to the data set identifier and the file identifier before positioning the storage position of the target object corresponding to the target file in the hierarchical directory in the object storage server;
the converting unit 34 is specifically configured to, in response to that the target object exists in the cache, convert the target object into a corresponding file in a preset file system.
The data storage system for model training provided by the embodiment of the invention may further include:
the searching unit is used for searching the target object corresponding to the data set identifier and the file identifier in the cache according to the data set identifier and the file identifier before positioning the storage position of the target object corresponding to the target file in the hierarchical directory in the object storage server;
the positioning unit 32 may be specifically configured to, in response to that the target object does not exist in the cache, position, in a hierarchical directory, a storage location of the target object in the object storage server, where the target object corresponds to the target file, according to the dataset identifier and the file identifier.
Optionally, in the data storage system for model training provided in the embodiment of the present invention, the target object may be cached based on a data request of a first model training task; and searching a target object corresponding to the data set identification and the file identification in a cache based on a data request of a second model training task, wherein the first model training task is different from or the same as the second model training task.
In a third aspect, an embodiment of the present invention further provides an electronic device, which can effectively improve data acquisition efficiency of a model training task.
As shown in fig. 5, an electronic device provided in an embodiment of the present invention may include: the device comprises a shell 51, a processor 52, a memory 53, a circuit board 54 and a power circuit 55, wherein the circuit board 54 is arranged inside a space enclosed by the shell 51, and the processor 52 and the memory 53 are arranged on the circuit board 54; a power supply circuit 55 for supplying power to each circuit or device of the electronic apparatus; the memory 53 is used to store executable program code; the processor 52 executes a program corresponding to the executable program code by reading the executable program code stored in the memory 53, for executing the data acquisition method or the data providing method provided in any of the foregoing embodiments.
For specific execution processes of the above steps by the processor 52 and further steps executed by the processor 52 by running the executable program code, reference may be made to the description of the foregoing embodiments, and details are not described herein again.
The above electronic devices exist in a variety of forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include: smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones, among others.
(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include: PDA, MID, and UMPC devices, such as iPads.
(3) A portable entertainment device: such devices can display and play multimedia content. This category includes: audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) A server: a device that provides computing services, comprising a processor, a hard disk, memory, a system bus, and the like. A server is similar in architecture to a general-purpose computer, but has higher requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like, because it needs to provide highly reliable services.
(5) And other electronic equipment with data interaction function.
Accordingly, an embodiment of the present invention further provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs can be executed by one or more processors to implement any one of the data acquisition methods or the data providing methods provided in the foregoing embodiments, so that corresponding technical effects can also be achieved, which have been described in detail above and are not described herein again.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "comprises a" does not, without further limitation, exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
For convenience of description, the above devices are described with their functions divided into various units/modules. Of course, when implementing the present invention, the functions of the units/modules may be implemented in one or more pieces of software and/or hardware.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A data providing method for model training, comprising:
receiving a data request of a model training task, wherein the data request carries a data set identifier and a file identifier of a target file required by the model training task;
according to the data set identification and the file identification, positioning the storage position of a target object corresponding to the target file in an object storage server in a hierarchical directory;
outputting the target object from the object storage server according to the storage position;
and converting the target object into a corresponding file in a preset file system, wherein the preset file system is a file system on which the model training task is based.
2. The method of claim 1, wherein before locating, in a hierarchical directory, a storage location of a target object corresponding to the target file in an object storage server according to the dataset identifier and the file identifier, the method further comprises:
and generating the hierarchical directory according to the object identification of each object stored in the object storage server.
3. The method according to claim 1, wherein after outputting the target object from the object storage server according to the storage location and before converting the target object into a corresponding file in a preset file system, the method further comprises:
and caching the target object according to a preset caching strategy.
4. The method of claim 3, wherein before locating, in a hierarchical directory, a storage location of a target object corresponding to the target file in an object storage server according to the dataset identifier and the file identifier, the method further comprises:
searching a target object corresponding to the data set identifier and the file identifier in a cache according to the data set identifier and the file identifier;
the converting the target object into a corresponding file in a preset file system comprises:
and responding to the target object in the cache, and converting the target object into a corresponding file in a preset file system.
5. The method of claim 3, wherein before locating, in a hierarchical directory, a storage location of a target object corresponding to the target file in an object storage server according to the dataset identifier and the file identifier, the method further comprises:
searching a target object corresponding to the data set identifier and the file identifier in a cache according to the data set identifier and the file identifier;
the positioning, according to the dataset identifier and the file identifier, a storage location of a target object corresponding to the target file in an object storage server in a hierarchical directory includes:
and responding to the fact that the target object does not exist in the cache, and positioning the storage position of the target object corresponding to the target file in the object storage server in the hierarchical directory according to the data set identification and the file identification.
6. The method according to claim 4 or 5, characterized in that the target object is cached based on a data request of a first model training task; and searching a target object corresponding to the data set identification and the file identification in a cache based on a data request of a second model training task, wherein the first model training task is different from or the same as the second model training task.
7. A storage system for model training, comprising:
the device comprises a request receiving unit, a model training task processing unit and a model matching unit, wherein the request receiving unit is used for receiving a data request of a model training task, and the data request carries a data set identifier and a file identifier of a target file required by the model training task;
the positioning unit is used for positioning the storage position of a target object corresponding to the target file in the object storage server in a hierarchical directory according to the data set identifier and the file identifier;
a data output unit, configured to output the target object from the object storage server according to the storage location;
and the conversion unit is used for converting the target object into a corresponding file in a preset file system, wherein the preset file system is a file system on which the model training task is based.
8. The system of claim 7, further comprising:
and the directory generation unit is used for generating the hierarchical directory according to the object identifier of each object stored in the object storage server, before the storage location in the object storage server of the target object corresponding to the target file is located in the hierarchical directory according to the data set identifier and the file identifier.
9. An electronic device, characterized in that the electronic device comprises: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the method of any of the preceding claims 1-6.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores one or more programs which are executable by one or more processors to implement the method of any of the preceding claims 1 to 6.
CN202011609668.8A 2020-12-28 2020-12-28 Data providing method and system for model training Pending CN112749127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011609668.8A CN112749127A (en) 2020-12-28 2020-12-28 Data providing method and system for model training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011609668.8A CN112749127A (en) 2020-12-28 2020-12-28 Data providing method and system for model training

Publications (1)

Publication Number Publication Date
CN112749127A true CN112749127A (en) 2021-05-04

Family

ID=75649575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011609668.8A Pending CN112749127A (en) 2020-12-28 2020-12-28 Data providing method and system for model training

Country Status (1)

Country Link
CN (1) CN112749127A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156289A (en) * 2016-06-28 2016-11-23 北京百迈客云科技有限公司 The method of the data in a kind of read-write object storage system and device
CN107026876A (en) * 2016-01-29 2017-08-08 杭州海康威视数字技术股份有限公司 A kind of file data accesses system and method
CN107045530A (en) * 2017-01-20 2017-08-15 华中科技大学 A kind of method that object storage system is embodied as to local file system
CN108984560A (en) * 2017-06-01 2018-12-11 杭州海康威视数字技术股份有限公司 File memory method and device
CN110198334A (en) * 2018-04-19 2019-09-03 腾讯科技(深圳)有限公司 Access method, device and storage medium based on object storage service
CN111258959A (en) * 2020-01-10 2020-06-09 北京猎豹移动科技有限公司 Data acquisition method, data providing method and device
CN112085208A (en) * 2020-07-30 2020-12-15 北京聚云科技有限公司 Method and device for model training by using cloud


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114020355A (en) * 2021-11-01 2022-02-08 上海米哈游天命科技有限公司 Object loading method and device based on cache space
CN114020355B (en) * 2021-11-01 2024-01-30 上海米哈游天命科技有限公司 Object loading method and device based on cache space
CN115858473A (en) * 2023-01-29 2023-03-28 北京阿丘科技有限公司 Data interaction method and device based on training system and object storage system
CN115858473B (en) * 2023-01-29 2023-10-10 北京阿丘科技有限公司 Data interaction method and device based on training system and object storage system

Similar Documents

Publication Publication Date Title
CN107733977B (en) Cluster management method and device based on Docker
CN107609186B (en) Information processing method and device, terminal device and computer readable storage medium
CN112749127A (en) Data providing method and system for model training
CN104424225B (en) Document handling method based on document transmission process and device
CN103440243A (en) Teaching resource recommendation method and device thereof
CN112087487A (en) Model training task scheduling method and device, electronic equipment and storage medium
CN110263187A (en) Draw this recognition methods, device, storage medium and computer equipment
CN102255866A (en) Method and device for downloading data
CN111258958A (en) Data acquisition method, data providing method and device
CN109885535A (en) A kind of method and relevant apparatus of file storage
CN111831618A (en) Data writing method, data reading method, device, equipment and storage medium
CN111680489A (en) Target text matching method and device, storage medium and electronic equipment
CN111159265A (en) ETL data migration method and system
CN113688139A (en) Object storage method, gateway, device and medium
CN111258959A (en) Data acquisition method, data providing method and device
CN112085208A (en) Method and device for model training by using cloud
CN111352837A (en) Testing method of bioinformatics high-performance computing platform
CN109582347B (en) Method and device for acquiring front-end codes
CN108874495B (en) Theme resource conversion method and device and electronic equipment
CN111427917A (en) Search data processing method and related product
CN111444148A (en) Data transmission method and device based on MapReduce
CN112860412B (en) Service data processing method and device, electronic equipment and storage medium
CN114817160A (en) File decompression method and device, electronic equipment and computer readable storage medium
CN105188154B (en) A kind of method, apparatus and system being automatically brought into operation smart machine
CN108920658B (en) Mobile device desktop moving method and device and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination