CN113392863A - Method and device for acquiring machine learning training data set and terminal


Info

Publication number
CN113392863A
CN113392863A (application CN202010175419.6A)
Authority
CN
China
Prior art keywords
data set
training data
training
node server
machine learning
Legal status
Pending
Application number
CN202010175419.6A
Other languages
Chinese (zh)
Inventor
吕剑
吕旭涛
Current Assignee
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Application filed by Shenzhen Intellifusion Technologies Co Ltd
Priority to CN202010175419.6A
Publication of CN113392863A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24552 - Database cache management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application is applicable to the field of machine learning and provides a method, a device and a terminal for acquiring a machine learning training data set. The method includes: acquiring a packing instruction for training data in a machine learning training task, and sending, based on the packing instruction, a local cache query request for a training data set to a distribution node server of the machine learning training task; sending a remote cache query request for the training data set to a remote server when the training data set does not exist in the local cache of the distribution node server; obtaining, through a resource scheduler, computing power resources corresponding to the packing strategy when the training data set does not exist in the cache of the remote server; and executing the packing strategy based on the computing power resources to obtain the training data set, and distributing the training data set to the distribution node server. This effectively solves the problems of data packing and data transmission in algorithm training under large-scale data scenarios and greatly shortens machine learning algorithm training time.

Description

Method and device for acquiring machine learning training data set and terminal
Technical Field
The application belongs to the field of machine learning, and particularly relates to a method, a device and a terminal for acquiring a machine learning training data set.
Background
Machine learning has become an important part of artificial intelligence today. Data is the basic element of artificial intelligence algorithms, and its importance is self-evident: only when the data is of good quality and sufficient quantity is there a high probability of producing a high-quality model. With the rapid development of the internet and internet-of-things sensors, the means of acquiring data are increasingly rich, and the volume of data acquired is increasingly huge.
When a machine learning training task is executed, a large amount of training data is needed, and that data must be packed and distributed to the training node servers. At present, training data is mainly acquired by fetching source data from a data center when a training task is triggered, with data packing and data distribution performed on that source data.
The source data may be stored in a private storage server cluster, that is, data is stored in a distributed storage system of a private data center, or stored in a public cloud storage service, that is, data is stored in a storage service provided by a public cloud provider.
In both modes, data acquisition is inevitably affected by factors such as network transmission, hardware storage devices and device resource availability, especially in large-scale data scenarios, that is, scenarios involving millions or tens of millions of pictures, or even more than a billion data items. If a large number of users use the same data for algorithm training, or the same algorithm uses the same data set for multi-version training, the existing training data transmission mode wastes substantial computing and time resources on data packing and transmission; it cannot pack and transmit the data required by algorithm training in a timely and effective manner, and it cannot support the timely and effective transmission of data during large-scale distributed machine learning algorithm training.
Disclosure of Invention
The embodiments of the application provide a method, a device and a terminal for acquiring a machine learning training data set, aiming to solve the problems that the existing training data transmission mode wastes substantial computing and time resources during data packing and transmission, cannot pack and transmit the data required by algorithm training in a timely and effective manner, and cannot support the timely and effective transmission of data during large-scale distributed machine learning algorithm training.
A first aspect of an embodiment of the present application provides a method for acquiring a machine learning training data set, including:
acquiring a packing instruction of training data in a machine learning training task, wherein the packing instruction comprises a packing strategy of the training data;
based on the packing instruction, sending a local cache query request of a training data set to a distribution node server of the machine learning training task;
sending a remote cache query request for the training data set to a remote server if it is determined that the training data set does not exist in the local cache of the distribution node server based on the local cache query request;
obtaining, by a resource scheduler, computing power resources corresponding to the packing strategy if it is determined that the training data set does not exist in the cache of the remote server based on the remote cache query request;
and executing the packing strategy based on the computing power resources to obtain the training data set, distributing the training data set to the distribution node server, and instructing the distribution node server to cache the training data set locally.
A second aspect of the embodiments of the present application provides an apparatus for acquiring a machine learning training data set, including:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a packing instruction of training data in a machine learning training task, and the packing instruction comprises a packing strategy of the training data;
a first sending module, configured to send a local cache query request of a training data set to a distribution node server of the machine learning training task based on the packing instruction;
a second sending module, configured to send a remote cache query request for the training data set to a remote server when it is determined that the training data set does not exist in the local cache of the distribution node server based on the local cache query request;
a second obtaining module, configured to obtain, by a resource scheduler, computing power resources corresponding to the packing strategy when it is determined that the training data set does not exist in the cache of the remote server based on the remote cache query request;
and the distribution module is used for executing the packing strategy based on the computing power resources to obtain the training data set, distributing the training data set to the distribution node server, and instructing the distribution node server to cache the training data set locally.
A third aspect of the present application provides a terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when executing the computer program.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to the first aspect as described above.
A fifth aspect of the present application provides a computer program product which, when run on a terminal, causes the terminal to implement the steps of the method according to the first aspect.
As can be seen from the above, in the embodiments of the present application, a local cache query request for a training data set is sent to a distribution node server of a machine learning training task based on the packing instruction in the task; a remote cache query request for the training data set is sent to a remote server when the training data set does not exist in the local cache of the distribution node server; and when the training data set does not exist in the cache of the remote server either, computing power resources corresponding to the packing strategy are obtained through a resource scheduler, the packing strategy is executed based on those resources to obtain the training data set, the training data set is distributed to the distribution node server, and the distribution node server is instructed to cache it locally. This process establishes a multi-level way of obtaining the training data set, with the preferred sources tried first: the packed training data set is obtained from the local cache of the training task's distribution node server or from the cache of the remote server whenever possible. This reduces the frequency with which training source data must be fetched from the data center, reduces the computing and time resources lost to data packing and transmission, improves the timeliness and effectiveness of packing and transmitting the data required by algorithm training, and meets, to the greatest extent, the requirement for timely and effective data transmission during large-scale distributed machine learning algorithm training.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a first flowchart of a method for acquiring a machine learning training data set according to an embodiment of the present disclosure;
FIG. 2 is a second flowchart of a method for acquiring a machine learning training data set according to an embodiment of the present disclosure;
FIG. 3 is a diagram of a cache architecture for a training data set according to an embodiment of the present application;
FIG. 4 is a block diagram of an apparatus for acquiring a machine learning training data set according to an embodiment of the present disclosure;
fig. 5 is a structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In particular implementations, the terminals described in the embodiments of the present application include, but are not limited to, portable devices such as mobile phones, laptop computers or tablet computers having touch-sensitive surfaces (e.g., touch-screen displays and/or touch pads). It should also be understood that in some embodiments the device is not a portable communication device but a desktop computer having a touch-sensitive surface (e.g., a touch-screen display and/or a touchpad).
In the discussion that follows, a terminal that includes a display and a touch-sensitive surface is described. However, it should be understood that the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
The terminal supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a website creation application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a web browsing application, a digital music player application, and/or a digital video player application.
Various applications that may be executed on the terminal may use at least one common physical user interface device, such as a touch-sensitive surface. One or more functions of the touch-sensitive surface and corresponding information displayed on the terminal can be adjusted and/or changed between applications and/or within respective applications. In this way, a common physical architecture (e.g., touch-sensitive surface) of the terminal can support various applications with user interfaces that are intuitive and transparent to the user.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order; the execution order of each process is determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a first flowchart of a method for acquiring a machine learning training data set according to an embodiment of the present application. As shown in fig. 1, a method for acquiring a machine learning training data set includes the following steps:
step 101, obtaining a packing instruction of training data in a machine learning training task, wherein the packing instruction comprises a packing strategy of the training data.
The machine learning training task is, for example, a model training task for classifying and identifying fruits, or a model training task for identifying and matching faces, and the like.
The training data is the data used by the training nodes to execute the training task, and includes content such as training text, training pictures and the corresponding labeling results.
Here, the packing strategy is the strategy for packing the training data of the machine learning training task that currently needs to be executed. The strategy may include the configuration of the computing power resources required to execute the packing action, the configuration of the packing processing flow, the configuration of the number of processing processes and the like, each of which can be configured according to actual needs.
In a specific implementation, the packing strategy may be configured through a data set packing strategy management center. The main parameters include (an illustrative configuration sketch follows the list):
1. Computing power resource allocation, including cpu_number: the number of central processing unit (CPU) cores; mem: the memory size; gpu_number: the number of graphics processing units (GPUs); and storage: the storage space.
2. Packing flow configuration, including process_pipeline: the processing flow nodes and the serial/parallel relations among them.
3. The number of processing processes: process_number.
4. The target data set to be processed: source_dataset.
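As an illustrative sketch only, the following Python fragment collects these parameters into a single structure. The parameter names (cpu_number, mem, gpu_number, storage, process_pipeline, process_number, source_dataset) follow the list above; the PackingPolicy class itself and the example values are hypothetical and are not defined by the application.

    from dataclasses import dataclass

    @dataclass
    class PackingPolicy:
        """Hypothetical container for the packing-strategy parameters listed above."""
        cpu_number: int        # number of CPU cores allocated to the packing task
        mem: int               # memory allocation, e.g. in GB
        gpu_number: int        # number of GPUs allocated
        storage: int           # storage space reserved for the packed output, e.g. in GB
        process_pipeline: list # processing flow nodes; nesting can express serial/parallel stages
        process_number: int    # number of worker processes executing the pipeline
        source_dataset: str    # identifier of the target data set to be packed

    # Example configuration for packing a hypothetical data set "DS1".
    policy = PackingPolicy(
        cpu_number=16,
        mem=64,
        gpu_number=0,
        storage=500,
        process_pipeline=["decode", ["resize", "augment"], "serialize"],
        process_number=8,
        source_dataset="DS1",
    )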
Step 102, sending a local cache query request of a training data set to a distribution node server of the machine learning training task based on the packing instruction.
Wherein the distribution node server is assigned to perform a machine learning training task.
The training data set is the packed form of the training data required by the machine learning training task. Here, after the machine learning training task is triggered, based on the packing instruction for the training data in the task, the local cache of the distribution node server is queried first for the training data set required by the task.
Step 103, sending a remote cache query request of the training data set to a remote server under the condition that it is determined that the training data set does not exist in the local cache of the distribution node server based on the local cache query request.
When the training data set required by the machine learning training task does not exist in the cache of the distribution node server, the cache of the remote server is searched for it. The remote server stores cached training data sets, that is, original training data that has already been packed, which is distinct from the source data of the training data stored in the data center.
Searching for the training data set locally at the training node and remotely at the remote server reduces, as far as possible, the fetching of training source data from the data center and speeds up the acquisition of the packed training data set.
Step 104, obtaining computing power resources corresponding to the packing strategy through a resource scheduler under the condition that it is determined, based on the remote cache query request, that the training data set does not exist in the cache of the remote server.
When the training data set is found neither in the local search at the training node nor in the remote search at the remote server, the training task is most likely a new task that has never been executed, and the data is packed directly according to the normal data packing flow. Here, computing power resource scheduling needs to be performed by the resource scheduler.
The resource scheduler obtains the computing power resource allocation in the packing strategy and schedules the resources of the training node server cluster accordingly: through an optimal combination algorithm, it adds one or more server nodes meeting the computing power requirement to a data packing computing power resource pool, where they are responsible only for the computation of the current packing task. This exploits the scheduler's best performance, maximizes resource utilization and minimizes packing time. After the packing task is completed, the scheduled server nodes are released from the computing power resource pool by the resource scheduler so that they can participate in other data set processing tasks or other types of tasks, making full use of the computing power resources.
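The life cycle of the computing power resource pool can be sketched as follows in Python. The application only states that an optimal combination algorithm selects the nodes, so the greedy selection below, the class name ResourceScheduler and the node representation are assumptions made for illustration.

    class ResourceScheduler:
        """Minimal sketch of the pool behavior described above (assumed interfaces)."""

        def __init__(self, cluster_nodes):
            self.idle_nodes = list(cluster_nodes)  # each node: (node_id, available_cpu_cores)
            self.packing_pool = []                 # nodes dedicated to the current packing task

        def acquire(self, required_cpu):
            """Move just enough idle nodes into the data packing computing power pool."""
            acquired = 0
            while self.idle_nodes and acquired < required_cpu:
                node = self.idle_nodes.pop(0)
                self.packing_pool.append(node)
                acquired += node[1]
            if acquired < required_cpu:
                self.release()  # not enough capacity: return everything and fail
                raise RuntimeError("insufficient computing power for the packing strategy")
            return list(self.packing_pool)

        def release(self):
            """Return the pooled nodes to the cluster once the packing task completes."""
            self.idle_nodes.extend(self.packing_pool)
            self.packing_pool.clear()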
Step 105, executing the packing strategy based on the computing power resources to obtain the training data set, distributing the training data set to the distribution node server, and instructing the distribution node server to cache the training data set locally.
In each of the above-described implementation steps, the execution subject may specifically be a machine learning training task processing device, and may specifically be a data set processor included in the machine learning training task processing device.
After the training data is packed, the packed training data set is distributed to the distribution node server to be cached locally. On the one hand, the distribution node server can execute the machine learning training task based on the cached training data set; on the other hand, the next time the same or a similar training task is executed, the training node server already holds the relevant training data set locally, so the training data can be acquired rapidly.
In this process, strategies such as the data set packing strategy, the data caching strategy and the data distribution strategy improve the utilization of computing and storage resources in each node server, greatly reduce the frequency of data transmission, save bandwidth, shorten data read/write time and prolong hardware life. They also optimize the scheduling of algorithm training tasks so that each task can be matched to the training node server with the best resources. This effectively solves the problems of data packing and data transmission in algorithm training under large-scale data scenarios and greatly shortens machine learning algorithm training time.
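Putting steps 101 to 105 together, a minimal sketch of the acquisition flow might look as follows. All interfaces used here (query_local_cache, query_cache, cache_locally, execute_packing) are hypothetical names chosen for illustration and are not prescribed by the application.

    def obtain_training_dataset(packing_instruction, dist_node, remote_server, scheduler):
        """Illustrative sketch of steps 101-105 under the assumptions named above."""
        policy = packing_instruction.packing_policy

        # Step 102: query the distribution node server's local cache first.
        dataset = dist_node.query_local_cache(policy.source_dataset)
        if dataset is not None:
            return dataset

        # Step 103: on a local miss, query the remote server's cache.
        dataset = remote_server.query_cache(policy.source_dataset)
        if dataset is not None:
            dist_node.cache_locally(dataset)
            return dataset

        # Steps 104-105: no cache hit anywhere, so schedule computing power,
        # execute the packing strategy, then distribute and cache the result.
        nodes = scheduler.acquire(policy.cpu_number)
        try:
            dataset = execute_packing(policy, nodes)  # hypothetical packing routine
        finally:
            scheduler.release()
        dist_node.cache_locally(dataset)
        return dataset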
Correspondingly, after instructing the distribution node server to cache the training data set locally, the method further includes: instructing the distribution node server to perform the machine learning training task based on the training data set.
Further, as an optional implementation, after executing the packing strategy based on the computing power resources to obtain the training data set, the method further includes:
judging whether a peer node server that processes training tasks of the same type as the machine learning training task has the training data set cached locally;
if the judgment result is negative, distributing the training data set to the peer node server and instructing the peer node server to cache the training data set locally.
Here, the data distribution strategy is executed after the computing power resources have executed the packing strategy to obtain the training data set. When the training task is a new task for which no identical or similar task has ever been executed, the packed training data set corresponding to the task needs to be distributed: on the one hand to the local training node, and on the other hand to the peer node servers handling training tasks of the same type as the machine learning training task, so that those nodes can obtain the training data set directly from their local caches when they subsequently execute the same or a similar task.
In a specific implementation, a data distribution strategy can be set. The data distribution strategy determines the data distribution rules, that is, how to distribute used data to the machine learning training nodes in time, so that after the data has been used by algorithm training task A, it can be obtained promptly from the local cache of the training node when a training task B of the same type is trained.
After the training data set has been obtained by packing, or obtained from a remote server and distributed to the local storage of a training node server, or found by a local query on the training node server, the training node server executes the machine learning training task. At this point, information about the machine learning training task executed at the training node server (i.e., distribution node server) layer can be recorded, such as the task type, the execution time and the training data set corresponding to the task.
Specifically, in practical application, the data distribution strategy mainly works as follows: when a machine learning training task A1 of type L (such as fruit classification training) is initiated on training node servers node1 and node2 using data set DS1 with the caching strategy enabled, the data set is cached on node1 and node2; at the same time, it is reported that node1 and node2 have executed an L-type algorithm training task and have cached data set DS1, and data set DS1 may also be cached on the remote server.
Further, the distribution node server may be a node server in the server cluster that has already processed the machine learning training task, or has processed a training task of the same type as the machine learning training task.
In this way the corresponding training data set can, as far as possible, be found locally on the node that executes the training task, which improves the efficiency of the whole training task execution process.
For example, in a specific implementation, an algorithm training task scheduling policy may be set. A training task scheduler is used to find out whether the node server cluster has already executed algorithm training of the same broad class; if so, those node servers are preferentially selected as the distribution node servers and the above process is executed. The training task scheduler can also be used to find whether the node server cluster contains training node servers node(1…x) that have executed the same broad class of algorithm training and hold in their local caches the data set to be used by algorithm training task T; if so, task T is scheduled onto the found node(1…x), which ensures that the algorithm training task preferentially acquires its training data from the local cache of a training node.
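The scheduling preference just described can be illustrated with the following sketch; the node attributes executed_task_types and cached_datasets are assumptions, since the application does not prescribe how a node records its history.

    def schedule_training_task(task_type, dataset_id, cluster):
        """Prefer nodes that ran the same class of training and cache the data set."""
        # First preference: same task class executed and data set cached locally.
        for node in cluster:
            if task_type in node.executed_task_types and dataset_id in node.cached_datasets:
                return node
        # Second preference: same task class executed (related caches may be warm).
        for node in cluster:
            if task_type in node.executed_task_types:
                return node
        # Otherwise fall back to any available node.
        return cluster[0] if cluster else None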
According to this embodiment of the application, a local cache query request for a training data set is sent to the distribution node server of a machine learning training task based on the packing instruction in the task; a remote cache query request for the training data set is sent to a remote server when the training data set does not exist in the local cache of the distribution node server; and when the training data set does not exist in the cache of the remote server either, computing power resources corresponding to the packing strategy are obtained through a resource scheduler, the packing strategy is executed based on those resources to obtain the training data set, and the training data set is distributed to the distribution node server and cached locally there. This multi-level acquisition of the training data set reduces the frequency of fetching training source data from the data center, reduces the computing and time resources lost to data packing and transmission, improves the timeliness and effectiveness of packing and transmitting the data required by algorithm training, and meets, to the greatest extent, the requirement for timely and effective data transmission during large-scale distributed machine learning algorithm training.
The embodiment of the application also provides different implementation modes of the acquisition method of the machine learning training data set.
Referring to fig. 2, fig. 2 is a second flowchart of a method for acquiring a machine learning training data set according to an embodiment of the present application. As shown in fig. 2, a method for acquiring a machine learning training data set includes the following steps:
step 201, obtaining a packing instruction of training data in a machine learning training task, where the packing instruction includes a packing strategy of the training data.
The implementation process of this step is the same as that of step 101 in the foregoing embodiment, and is not described here again.
Step 202, based on the packing instruction, sending a local cache query request of a training data set to a current distribution node server of the machine learning training task.
Specifically, the current distribution node server here is the server node corresponding to the machine learning training task that is about to execute the task this time, and its local storage is queried first for the training data set.
Step 203, sending the local cache query request to a peer node server that has processed the machine learning training task, or has processed a training task of the same type as the machine learning training task, when it is determined based on the local cache query request that the training data set does not exist in the local cache of the current distribution node server.
When the training data set does not exist in the local cache of the current distribution node server, a cache query request can be sent to the peer node servers, so that training data sets stored locally from historical training tasks can be used as far as possible, reducing the frequency of external data reads and improving efficiency.
Correspondingly, after sending a local cache query request of a training data set to the current distribution node server of the machine learning training task based on the packing instruction, the method further includes: when it is determined, based on the local cache query request, that the training data set exists in the local cache of the current distribution node server, determining an updated use heat value according to the use heat value of the training data set in the current distribution node server and a preset value.
In a specific implementation, this may be done by adding the preset value to the use heat value of the training data set to obtain the updated use heat value. The preset value is, for example, 1, so that the use heat value of the training data set is incremented each time the data set is used.
Here, the data distribution strategy further includes distributing the training data sets used by training tasks to the remote server. A use heat value is maintained for each training data set, and the training data sets in the training node server are screened based on that value: training data sets whose use heat value is higher than a threshold can be distributed to the remote server and cached locally there, achieving hierarchical storage of the training data sets. Training data sets in the training node server whose use heat is lower than a set threshold can be cleared periodically to save storage space.
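A minimal sketch of this use-heat bookkeeping follows, assuming the preset value is 1; the thresholds and the node and remote-server interfaces (heat, read, evict, cache) are hypothetical.

    HEAT_INCREMENT = 1       # the "preset value" described above (assumed to be 1)
    PROMOTE_THRESHOLD = 10   # hypothetical heat above which a data set goes to the remote server
    EVICT_THRESHOLD = 2      # hypothetical heat below which a local data set is cleared

    def on_cache_hit(node, dataset_id, remote_server):
        """Update the use heat value of a training data set on a local cache hit."""
        node.heat[dataset_id] = node.heat.get(dataset_id, 0) + HEAT_INCREMENT
        # Hot data sets are additionally distributed to the remote server's cache.
        if node.heat[dataset_id] >= PROMOTE_THRESHOLD:
            remote_server.cache(dataset_id, node.read(dataset_id))

    def periodic_cleanup(node):
        """Periodically clear data sets whose use heat is below the set threshold."""
        for dataset_id, heat in list(node.heat.items()):
            if heat < EVICT_THRESHOLD:
                node.evict(dataset_id)
                del node.heat[dataset_id]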
Step 204, determining that the training data set does not exist in the local cache of the distribution node server under the condition that it is determined, based on the local cache query request, that the training data set does not exist in the local cache of the peer node server.
Correspondingly, after the local cache query request is sent to the peer node server that has processed the machine learning training task or has processed a training task of the same type as the machine learning training task, the method further includes:
when it is determined, based on the local cache query request, that the training data set exists in the local cache of the peer node server, distributing the training data set in the local cache of the peer node server to the current distribution node server, and instructing the current distribution node server to cache the training data set locally.
That is, the training data set is acquired from a peer node server and stored locally on the current distribution node server that executes the current training task.
Step 205, sending the remote cache query request of the training data set to a first-level cache server under the condition that it is determined that the training data set does not exist in the local cache of the distribution node server based on the local cache query request.
Here, the remote server may comprise servers at different tiers, specifically the first-level cache server and the root node level cache server described later. The two differ in the data sets they store.
The first-level cache server stores the training data sets whose use heat values fall within a set range within a preset time. The root node level cache server stores all training data sets that have ever been used.
Step 206, sending the remote cache query request to a root node level cache server if it is determined that the training data set does not exist in the cache of the first level cache server based on the remote cache query request.
This process implements hierarchical querying for the training data set on the remote server side, which makes data acquisition more convenient and improves efficiency.
Correspondingly, in the case that it is determined that the training data set does not exist in the cache of the root node level cache server based on the remote cache query request, it is determined that the training data set does not exist in the cache of the remote server.
Specifically, the above implementation process is explained further here. The cache architecture for the whole training data set is shown in fig. 3 and consists of a root node level cache server (the root level in the figure), a first-level cache server (the L2 level in the figure) and the local cache of the training node server (the L1 level in the figure). It corresponds to a caching strategy that determines the caching period, the caching rules and the amount of cached content, so that as long as a cache exists, the data to be used in machine learning algorithm training can be acquired locally. This reduces repeated data packing and frequent transmission, reduces input/output and shortens data acquisition time. When the caching strategy is implemented, the following settings may be used:
the root level stores all training data that has ever been used;
the L2 level cache stores the data used in the last N days (N configurable) according to the Least Recently Used (LRU) principle, with a capacity upper limit of M (M configurable and less than or equal to the current server storage space H), keeping the data sets whose use heat is within the Top K (the total size of the Top K data sets is less than or equal to M);
the L1 level cache of the training node server stores the data used by the current node in the last N days (N configurable) according to the LRU principle, likewise with a capacity upper limit of M (M configurable and less than or equal to the current server storage space H) and keeping the data sets whose use heat is within the Top K (the total size of the Top K data sets is less than or equal to M).
When a training task is started, the data set processor sends a cache query request to the training node server according to the configured data set ID. If the training data set exists in the cache, the path of the data cached on the training node server is returned directly, and the training node server reads the data directly to execute the training task. If the training data set does not exist in the local cache of the training node server, the data set processor sends a cache query request to the L2 level; if the training data set exists there, the data is returned from the L2 level cache to the data set processor, which downloads the cached training data set and returns it to the training node server to execute the training task.
If no result is returned when fetching the cached training data set from the L2 level nodes, the data set processor sends a request to the root level of the cache root server to determine whether a cache of the training data set exists. If it does, the data set is returned and, as before, the data set processor downloads the data and distributes it to the local storage of the training node server; if no cache exists, the data set processor executes the data packing task flow.
The main parameters of the caching strategy are: the recent caching duration N, in days; the storage capacity M, in GB; and the use heat, an integer greater than or equal to 0.
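The retention rule shared by the L1 and L2 levels can be sketched as follows; N, M and K correspond to the parameters just listed, while the dictionary representation of a cache entry is an assumption for illustration.

    import time

    def select_cache_residents(entries, n_days, capacity_m_gb, top_k):
        """Keep data sets used in the last N days with use heat in the Top K, within capacity M.

        Each entry is assumed to be a dict with keys 'id', 'last_used'
        (epoch seconds), 'heat' and 'size_gb'."""
        cutoff = time.time() - n_days * 24 * 3600
        # LRU-style recency filter: only data sets used within the last N days.
        recent = [e for e in entries if e["last_used"] >= cutoff]
        # Rank by use heat and consider at most the Top K entries.
        recent.sort(key=lambda e: e["heat"], reverse=True)
        kept, used_gb = [], 0.0
        for e in recent[:top_k]:
            # Enforce the capacity upper limit M (total kept size <= M).
            if used_gb + e["size_gb"] <= capacity_m_gb:
                kept.append(e)
                used_gb += e["size_gb"]
        return kept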
Step 207, obtaining, by a resource scheduler, a computational resource corresponding to the packing policy, when it is determined that the training data set does not exist in the cache of the remote server based on the remote cache query request.
The implementation process of this step is the same as that of step 104 in the foregoing embodiment, and is not described here again.
Step 208, executing the packing strategy based on the computing power resources to obtain the training data set, distributing the training data set to the distribution node server, and instructing the distribution node server to cache the training data set locally.
The implementation process of this step is the same as that of step 105 in the foregoing embodiment, and is not described here again.
According to this embodiment of the application, the multi-level acquisition described above is refined with peer-node queries and a two-tier remote cache: the local cache of the current distribution node server is queried first, then the caches of the peer node servers, then the first-level cache server, and then the root node level cache server; only when all of these miss are computing power resources obtained through the resource scheduler and the packing strategy executed. As in the first embodiment, this reduces the frequency of fetching training source data from the data center, reduces the computing and time resources lost to data packing and transmission, improves the timeliness and effectiveness of packing and transmitting the data required by algorithm training, and meets, to the greatest extent, the requirement for timely and effective data transmission during large-scale distributed machine learning algorithm training.
Referring to fig. 4, fig. 4 shows a device for acquiring a machine learning training data set according to an embodiment of the present application; for convenience of explanation, only the portions related to the embodiment of the present application are shown.
The apparatus 400 for acquiring a machine learning training data set comprises:
a first obtaining module 401, configured to obtain a packing instruction of training data in a machine learning training task, where the packing instruction includes a packing strategy of the training data;
a first sending module 402, configured to send a local cache query request of a training data set to a distribution node server of the machine learning training task based on the packing instruction;
a second sending module 403, configured to send a remote cache query request of the training data set to a remote server if it is determined that the training data set does not exist in the local cache of the distribution node server based on the local cache query request;
a second obtaining module 404, configured to obtain, by a resource scheduler, computing power resources corresponding to the packing strategy if it is determined that the training data set does not exist in the cache of the remote server based on the remote cache query request;
a distributing module 405, configured to execute the packing strategy based on the computing power resources, obtain the training data set, distribute the training data set to the distribution node server, and instruct the distribution node server to cache the training data set locally.
The first sending module 402 is specifically configured to:
based on the packing instruction, sending a local cache query request of a training data set to a current distribution node server of the machine learning training task;
sending the local cache query request to a peer node server that has processed the machine learning training task or has processed a training task of the same type as the machine learning training task under the condition that the training data set does not exist in the local cache of the current distribution node server based on the local cache query request;
determining that the training data set does not exist in the local cache of the distribution node server if it is determined that the training data set does not exist in the local cache of the peer node server based on the local cache query request.
The device also includes:
and the updating module is used for determining an updated use heat value according to the use heat value of the training data set in the current distribution node server and a preset value, under the condition that the training data set exists in the local cache of the current distribution node server based on the local cache query request.
The device also includes:
a reading module, configured to, when it is determined that the training data set exists in the local cache of the peer node server based on the local cache query request, distribute the training data set in the local cache of the peer node server to the current distribution node server and instruct the current distribution node server to cache the training data set locally.
The second sending module 403 is specifically configured to:
sending the remote cache query request for the training data set to a first level cache server;
sending the remote cache query request to a root node level cache server if it is determined, based on the remote cache query request, that the training data set is not present in the cache of the first level cache server;
the first-level cache server stores a training data set with a use heat value within a set range within preset time; all used training data sets are stored in the root node level cache server.
The device also includes:
the judging module is used for judging whether a peer node server that processes training tasks of the same type as the machine learning training task has the training data set cached locally, and if the judgment result is negative, distributing the training data set to the peer node server and instructing the peer node server to cache the training data set locally.
The distribution node server is a node server in the server cluster that has processed the machine learning training task or has processed a training task of the same type as the machine learning training task.
The device for acquiring a machine learning training data set provided by the embodiment of the application can carry out each process of the embodiments of the method for acquiring a machine learning training data set and achieve the same technical effect; to avoid repetition, the details are not described again here.
Fig. 5 is a structural diagram of a terminal according to an embodiment of the present application. As shown in the figure, the terminal 5 of this embodiment includes: at least one processor 50 (only one shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, the steps of any of the various method embodiments described above being implemented when the computer program 52 is executed by the processor 50.
The terminal 5 may be a desktop computer, a notebook, a palm computer, a cloud server or another computing device. The terminal 5 may include, but is not limited to, a processor 50 and a memory 51. It will be appreciated by those skilled in the art that fig. 5 is only an example of the terminal 5 and does not constitute a limitation of the terminal 5, which may include more or fewer components than shown, or combine some components, or use different components; for example, the terminal may also include input/output devices, network access devices, buses, etc.
The processor 50 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the terminal 5, such as a hard disk or a memory of the terminal 5. The memory 51 may also be an external storage device of the terminal 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the terminal 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal 5. The memory 51 is used for storing the computer program and other programs and data required by the terminal. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal and method can be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The present application realizes all or part of the processes in the method of the above embodiments, and may also be implemented by a computer program product, when the computer program product runs on a terminal, the steps in the above method embodiments may be implemented when the terminal executes the computer program product.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for acquiring a machine learning training data set is characterized by comprising the following steps:
acquiring a packing instruction of training data in a machine learning training task, wherein the packing instruction comprises a packing strategy of the training data;
based on the packing instruction, sending a local cache query request of a training data set to a distribution node server of the machine learning training task;
sending a remote cache query request for the training data set to a remote server if it is determined that the training data set does not exist in the local cache of the distribution node server based on the local cache query request;
obtaining, by a resource scheduler, a computing power resource corresponding to the packing strategy if it is determined that the training data set does not exist in the cache of the remote server based on the remote cache query request;
and executing the packing strategy based on the computing power resource to obtain the training data set, distributing the training data set to the distribution node server, and instructing the distribution node server to cache the training data set locally.
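By way of editorial illustration only (not part of the claimed subject matter), the end-to-end flow of claim 1 can be sketched as follows. Every identifier here (DistributionNode, RemoteCache, allocate_resources, pack_training_data) is a hypothetical stand-in, not a name taken from the application:

```python
class DistributionNode:
    """Stand-in for a distribution node server and its local cache."""

    def __init__(self):
        self._cache = {}

    def query_local_cache(self, dataset_id):
        return self._cache.get(dataset_id)

    def cache_locally(self, dataset_id, dataset):
        self._cache[dataset_id] = dataset


class RemoteCache:
    """Stand-in for the remote cache server."""

    def __init__(self, store=None):
        self._store = dict(store or {})

    def query(self, dataset_id):
        return self._store.get(dataset_id)


def allocate_resources(packing_strategy):
    """Stand-in for the resource scheduler: returns computing power
    matched to the packing strategy."""
    return {"workers": 4, "strategy": packing_strategy}


def pack_training_data(packing_strategy, resources):
    """Stand-in for executing the packing strategy on those resources."""
    return {"packed_with": packing_strategy, "resources": resources}


def acquire_training_dataset(instruction, node, remote):
    """Claim-1 cascade: local cache -> remote cache -> repack."""
    dataset_id = instruction["dataset_id"]
    strategy = instruction["packing_strategy"]

    # Step 1: local cache query at the distribution node server.
    dataset = node.query_local_cache(dataset_id)
    if dataset is not None:
        return dataset

    # Step 2: local miss -- remote cache query.  (Claim 1 is silent on
    # whether a remote hit is copied into the local cache; it is simply
    # returned here.)
    dataset = remote.query(dataset_id)
    if dataset is not None:
        return dataset

    # Step 3: remote miss -- obtain computing power for the packing
    # strategy, execute it, then distribute and cache the result locally.
    resources = allocate_resources(strategy)
    dataset = pack_training_data(strategy, resources)
    node.cache_locally(dataset_id, dataset)
    return dataset


if __name__ == "__main__":
    node, remote = DistributionNode(), RemoteCache()
    instr = {"dataset_id": "faces-v1", "packing_strategy": "shard-by-label"}
    print(acquire_training_dataset(instr, node, remote))  # double miss: packs
    print(acquire_training_dataset(instr, node, remote))  # now a local hit
```

On the second call the set is served from the node's local cache, which is the repeated-training case the caching scheme is designed to accelerate.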
2. The method of claim 1, wherein sending a local cache query request of a training data set to a distribution node server of the machine learning training task based on the packing instruction comprises:
based on the packing instruction, sending a local cache query request of a training data set to a current distribution node server of the machine learning training task;
sending the local cache query request to a same-level node server that has processed the machine learning training task or has processed a training task of the same type as the machine learning training task, under the condition that it is determined, based on the local cache query request, that the training data set does not exist in the local cache of the current distribution node server;
determining that the training data set does not exist in the local cache of the distribution node server if it is determined, based on the local cache query request, that the training data set does not exist in the local cache of the same-level node server.
3. The method of claim 2, wherein after sending the local cache query request of the training data set to the current distribution node server of the machine learning training task based on the packing instruction, the method further comprises:
and under the condition that it is determined, based on the local cache query request, that the training data set exists in the local cache of the current distribution node server, determining an updated usage heat value according to the usage heat value of the training data set in the current distribution node server and a preset value.
4. The method according to claim 2, wherein after sending the local cache query request to the same-level node server that has processed the machine learning training task or has processed a training task of the same type as the machine learning training task, the method further comprises:
and under the condition that it is determined, based on the local cache query request, that the training data set exists in the local cache of the same-level node server, distributing the training data set in the local cache of the same-level node server to the current distribution node server, and instructing the current distribution node server to cache the training data set locally.
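Again purely as an illustrative reading, with all names hypothetical, claims 2 to 4 refine the local lookup of claim 1 into a two-stage cascade with a usage-heat update on a hit:

```python
class LocalNode:
    """Hypothetical node server with a local cache and per-set usage
    heat values (the 'usage heat value' of claim 3)."""

    def __init__(self):
        self._cache = {}
        self.heat = {}

    def query_local_cache(self, dataset_id):
        return self._cache.get(dataset_id)

    def cache_locally(self, dataset_id, dataset):
        self._cache[dataset_id] = dataset


def query_local_caches(dataset_id, current_node, same_level_nodes, preset=1):
    """Claims 2-4: current distribution node first, then same-level nodes."""
    # Hit on the current node: raise the usage heat value by a preset
    # amount (claim 3).
    dataset = current_node.query_local_cache(dataset_id)
    if dataset is not None:
        current_node.heat[dataset_id] = (
            current_node.heat.get(dataset_id, 0) + preset
        )
        return dataset

    # Miss: forward the query to same-level node servers that handled the
    # same (type of) training task; on a hit, distribute the set to the
    # current node and cache it locally (claim 4).
    for peer in same_level_nodes:
        dataset = peer.query_local_cache(dataset_id)
        if dataset is not None:
            current_node.cache_locally(dataset_id, dataset)
            return dataset

    # Absent from every local cache (claim 2): the caller falls back to
    # the remote cache query of claim 1.
    return None
```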
5. The method of claim 1, wherein sending a remote cache query request for the training data set to a remote server comprises:
sending the remote cache query request for the training data set to a first-level cache server;
sending the remote cache query request to a root-node-level cache server if it is determined, based on the remote cache query request, that the training data set is not present in the cache of the first-level cache server;
wherein the first-level cache server stores training data sets whose usage heat values fall within a set range within a preset time period, and the root-node-level cache server stores all training data sets that have been used.
6. The method of claim 1, wherein after executing the packing strategy based on the computing power resource to obtain the training data set, the method further comprises:
judging whether the training data set is cached locally in a same-level node server that has processed a training task of the same type as the machine learning training task;
and if the judgment result is negative, distributing the training data set to the same-level node server and instructing the same-level node server to cache the training data set locally.
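Claim 6's post-packing push to peers could look like the following sketch; it assumes node objects with the same hypothetical query_local_cache/cache_locally interface used in the earlier sketches:

```python
def distribute_to_same_level(dataset_id, dataset, same_level_nodes):
    """Claim 6: after packing, push the fresh training data set to
    same-level node servers handling the same type of task, but only
    to those that do not already cache it."""
    for peer in same_level_nodes:
        if peer.query_local_cache(dataset_id) is None:
            peer.cache_locally(dataset_id, dataset)
```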
7. The acquisition method according to claim 1,
wherein the distribution node server is a node server in the server cluster that has processed the machine learning training task or has processed a training task of the same type as the machine learning training task.
8. An apparatus for acquiring a machine learning training data set, comprising:
a first acquisition module, configured to acquire a packing instruction of training data in a machine learning training task, wherein the packing instruction comprises a packing strategy of the training data;
a first sending module, configured to send a local cache query request of a training data set to a distribution node server of the machine learning training task based on the packing instruction;
a second sending module, configured to send a remote cache query request for the training data set to a remote server when it is determined that the training data set does not exist in the local cache of the distribution node server based on the local cache query request;
a second acquisition module, configured to obtain, by a resource scheduler, a computing power resource corresponding to the packing strategy when it is determined that the training data set does not exist in the cache of the remote server based on the remote cache query request;
and a distribution module, configured to execute the packing strategy based on the computing power resource to obtain the training data set, distribute the training data set to the distribution node server, and instruct the distribution node server to cache the training data set locally.
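One possible, purely illustrative decomposition mirroring the five modules of claim 8, with each module reduced to a plain callable (field names are assumptions, not identifiers from the application):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class AcquisitionApparatus:
    """Hypothetical wiring of claim 8's five modules."""
    first_acquisition: Callable[[], Dict[str, Any]]   # fetch packing instruction
    first_sending: Callable[[str], Any]               # local cache query
    second_sending: Callable[[str], Any]              # remote cache query
    second_acquisition: Callable[[str], Any]          # computing power resource
    distribution: Callable[[str, Any], Any]           # pack + distribute + cache

    def acquire(self):
        instruction = self.first_acquisition()
        dataset_id = instruction["dataset_id"]
        strategy = instruction["packing_strategy"]
        dataset = self.first_sending(dataset_id)       # local cache
        if dataset is None:
            dataset = self.second_sending(dataset_id)  # remote cache
        if dataset is None:
            resources = self.second_acquisition(strategy)
            dataset = self.distribution(strategy, resources)
        return dataset
```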
9. A terminal comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010175419.6A 2020-03-13 2020-03-13 Method and device for acquiring machine learning training data set and terminal Pending CN113392863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010175419.6A CN113392863A (en) 2020-03-13 2020-03-13 Method and device for acquiring machine learning training data set and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010175419.6A CN113392863A (en) 2020-03-13 2020-03-13 Method and device for acquiring machine learning training data set and terminal

Publications (1)

Publication Number Publication Date
CN113392863A true CN113392863A (en) 2021-09-14

Family

ID=77615985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010175419.6A Pending CN113392863A (en) 2020-03-13 2020-03-13 Method and device for acquiring machine learning training data set and terminal

Country Status (1)

Country Link
CN (1) CN113392863A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114217743B (en) * 2021-09-17 2024-05-31 支付宝(杭州)信息技术有限公司 Data access method and device for distributed graph learning architecture
WO2023098794A1 (en) * 2021-12-02 2023-06-08 华为技术有限公司 Training acceleration method and related device
CN114827783A (en) * 2022-07-01 2022-07-29 西南民族大学 Aggregation tree-based bandwidth scheduling method for cross-domain distributed machine learning
CN114827783B (en) * 2022-07-01 2022-10-14 西南民族大学 Aggregation tree-based bandwidth scheduling method for cross-domain distributed machine learning
CN115562877A (en) * 2022-11-15 2023-01-03 北京阿丘科技有限公司 Arrangement method, device and equipment of distributed computing power resources and storage medium
CN115562877B (en) * 2022-11-15 2023-03-24 北京阿丘科技有限公司 Arranging method, device and equipment of distributed computing power resources and storage medium
CN117215973A (en) * 2023-09-13 2023-12-12 之江实验室 Processing method of cache data, deep learning training method and system
CN117215973B (en) * 2023-09-13 2024-05-28 之江实验室 Processing method of cache data, deep learning training method and system

Similar Documents

Publication Publication Date Title
CN113392863A (en) Method and device for acquiring machine learning training data set and terminal
US10922316B2 (en) Using computing resources to perform database queries according to a dynamically determined query size
US10257307B1 (en) Reserved cache space in content delivery networks
KR20200027413A (en) Method, device and system for storing data
US10318346B1 (en) Prioritized scheduling of data store access requests
US20200272636A1 (en) Tiered storage for data processing
US9774676B2 (en) Storing and moving data in a distributed storage system
CN104102693A (en) Object processing method and device
US10884980B2 (en) Cognitive file and object management for distributed storage environments
WO2020019313A1 (en) Graph data updating method, system, computer readable storage medium, and device
CN111737168A (en) Cache system, cache processing method, device, equipment and medium
US20230006891A1 (en) Techniques and architectures for efficient allocation of under-utilized resources
CN112084173A (en) Data migration method and device and storage medium
US11381506B1 (en) Adaptive load balancing for distributed systems
US11297147B2 (en) Managed data export to a remote network from edge devices
EP2622499B1 (en) Techniques to support large numbers of subscribers to a real-time event
CN111857992A (en) Thread resource allocation method and device in Radosgw module
US11625273B1 (en) Changing throughput capacity to sustain throughput for accessing individual items in a database
US10067678B1 (en) Probabilistic eviction of partial aggregation results from constrained results storage
US11500931B1 (en) Using a graph representation of join history to distribute database data
US11727003B2 (en) Scaling query processing resources for efficient utilization and performance
RU2635255C2 System coherent cache with possibility of fragmentation/defragmentation
CN115604269A (en) Load balancing method and device of server, electronic equipment and storage medium
CN105323320B (en) A kind of method and device of content distribution
US11537616B1 (en) Predicting query performance for prioritizing query execution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination