WO2023051228A1 - 样例数据的处理方法、装置、设备和存储介质 (Sample data processing method, apparatus, device, and storage medium) - Google Patents

样例数据的处理方法、装置、设备和存储介质 (Sample data processing method, apparatus, device, and storage medium)

Info

Publication number
WO2023051228A1
WO2023051228A1 (PCT/CN2022/118411, CN2022118411W)
Authority
WO
WIPO (PCT)
Prior art keywords
sample data
meta information
training
Prior art date
Application number
PCT/CN2022/118411
Other languages
English (en)
French (fr)
Inventor
徐之浩
车漾
张凯
顾荣
Original Assignee
阿里巴巴(中国)有限公司
阿里云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 and 阿里云计算有限公司
Publication of WO2023051228A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • The present application relates to the field of computer technology, and in particular to a sample data processing method and apparatus, an electronic device, and a storage medium.
  • The compute-side distributed caching solution is widely used to accelerate data access: by caching the data of the storage system in the computing environment, machine learning training jobs running in that environment can obtain the required data with lower latency and higher bandwidth. Since machine learning often performs multiple rounds of training, the data cache can be reused across rounds, improving the efficiency of the machine learning training process.
  • As the datasets in use grow larger, the cache system faces greater challenges: on the one hand, to obtain the maximum data access speedup, the cache system needs to cache the entire dataset, which means it occupies a large amount of storage resources in the computing environment; on the other hand, because of the cache system's eviction policy, if the cache system cannot fully cache the entire dataset, the data access efficiency of the machine learning training job drops significantly compared with the fully cached case.
  • The embodiments of the present application provide a sample data processing method, to solve the problems that the cache system occupies a large amount of storage resources in the computing environment and that the data access efficiency of machine learning training jobs is low when the cache system cannot fully cache the entire dataset.
  • The embodiments of the present application also provide a sample data processing apparatus, an electronic device, and a storage medium, to ensure the implementation and application of the above method.
  • An embodiment of the present application discloses a sample data processing method, the method including the steps described below.
  • The meta-information sequence includes several pieces of meta-information, each used to index to corresponding sample data.
  • Optionally, increasing the preset number includes: when the time is greater than the abnormal time, counting how many times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed since the preset number was last increased; when the count is greater than a preset count, the preset number is increased.
  • Optionally, the method is applied to a Kubernetes cluster; the Kubernetes cluster is deployed with worker nodes and a distributed cache system, the worker nodes are used to execute the training task, and the distributed cache system is used to pre-store target sample data.
  • An embodiment of the present application also discloses a sample data processing apparatus, which includes:
  • a sequence acquisition module, configured to acquire a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes several pieces of meta-information, and the meta-information is used to index to corresponding sample data;
  • a sequence traversal module, configured to traverse the meta-information sequence to determine a preset number of target meta-information;
  • a task execution module, configured to pre-store the target sample data corresponding to the target meta-information while using the previously pre-stored target sample data to execute the training task;
  • a data eviction module, configured to, when the previously pre-stored target sample data has been used up, return to the step of traversing the meta-information sequence to determine a preset number of target meta-information, while evicting the previously pre-stored target sample data.
  • Optionally, the apparatus further includes: a job acquisition module, configured to acquire a machine learning job and the meta-information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training;
  • a sequence generation module, configured to randomly generate a total meta-information sequence from the meta-information corresponding to the sample data in the sample dataset;
  • a job splitting module, configured to split the total meta-information sequence into multiple meta-information sequences and split the machine learning job into multiple training tasks based on the multiple meta-information sequences, where the multiple training tasks are executed in parallel.
  • The job acquisition module is also used to acquire other machine learning jobs that use the sample dataset;
  • the job splitting module is configured to split the other machine learning jobs into multiple other training tasks based on the multiple meta-information sequences, where the other training tasks are executed synchronously with the training tasks corresponding to the same meta-information sequence.
  • Optionally, the apparatus further includes: a time determination module, configured to determine the time taken to execute the training task using the previously pre-stored target sample data;
  • the time determination module is also used to determine the shortest time historically taken to execute the training task using pre-stored target sample data;
  • a time calculation module, configured to calculate the abnormal time from the shortest time and the outlier parameter;
  • a number increasing module, configured to increase the preset number when the time is greater than the abnormal time.
  • The number increasing module includes:
  • a count determination sub-module, configured to, when the time is greater than the abnormal time, count how many times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed between the system time at which the preset number was last increased and the current system time;
  • a number increasing sub-module, configured to increase the preset number when the count is greater than a preset count.
  • Optionally, the apparatus is applied to a Kubernetes cluster; the Kubernetes cluster is deployed with worker nodes and a distributed cache system, the worker nodes are used to execute the training task, and the distributed cache system is used to pre-store target sample data.
  • Optionally, the Kubernetes cluster is deployed with a machine learning framework, and the data reading module of the machine learning framework is replaced by a Dataset Indexing Service component, so that the meta-information sequence corresponding to the training task is maintained through the Dataset Indexing Service component.
  • An embodiment of the present application also discloses an electronic device, including: a processor; and a memory storing executable code that, when executed, causes the processor to perform the sample data processing method described in one or more embodiments of the present application.
  • An embodiment of the present application also discloses one or more machine-readable media storing executable code that, when executed, causes a processor to perform the sample data processing method described in one or more embodiments of the present application.
  • Compared with the prior art, the embodiments of the present application have the following advantages:
  • In the embodiments of the present application, a training task and its corresponding meta-information sequence are acquired.
  • The meta-information sequence includes several pieces of meta-information, used to index to the corresponding sample data; the sequence is traversed to determine a preset number of target meta-information, the target sample data corresponding to the target meta-information is pre-stored, and at the same time the previously pre-stored target sample data is used to execute the training task.
  • When the previously pre-stored target sample data has been used up, the process returns to the step of determining a preset number of target meta-information, while evicting the previously pre-stored target sample data.
  • In this way, pre-storing only a small amount of the sample data required to execute the training task is enough, saving the resource usage of the cache system.
  • The sample data that will be used while executing the training task is prepared in advance according to the meta-information sequence, which avoids cache misses on sample data during task execution and removes the performance bottleneck caused by slow data access.
  • The eviction and pre-storing proceed synchronously with the execution of training tasks, and the whole process runs as a pipeline, shortening the time taken to execute training tasks.
  • In addition, the meta-information is used to randomly generate the meta-information sequence; the random ordering of the meta-information in the sequence guarantees the randomness of the sample data used while executing the training task according to the sequence, preventing the learning model obtained from the training task from overfitting.
  • Fig. 1 is a flowchart of the steps of a sample data processing method embodiment of the present application;
  • Fig. 2 is a flowchart of the steps of another sample data processing method embodiment of the present application;
  • Fig. 3 is a flowchart of the steps of an embodiment of adjusting the amount of pre-stored sample data of the present application;
  • Fig. 4 is a schematic diagram of a machine learning programming framework modification embodiment of the present application;
  • Fig. 5 is a schematic framework diagram of a Kubernetes cluster embodiment of the present application;
  • Fig. 6 is the first schematic framework diagram of a sample data processing embodiment of the present application;
  • Fig. 7 is the second schematic framework diagram of a sample data processing embodiment of the present application;
  • Fig. 8 is a structural block diagram of a sample data processing apparatus embodiment of the present application;
  • Fig. 9 is a schematic structural diagram of a device provided by an embodiment of the present application.
  • Machine learning methods mainly comprise two processes: training and inference. The machine learning training process needs to learn the correct correlations between data from large-scale datasets to acquire "experience", while the machine learning inference process judges newly arriving data based on that "experience".
  • In practice, heterogeneous computing devices such as GPUs are often used to achieve parallel acceleration of data processing.
  • In recent years, more and more heterogeneous computing devices with stronger computing power have emerged, further accelerating the machine learning training process.
  • Faster data processing places higher demands on a program's data access speed, and the compute-storage-separation architecture adopted on the cloud further limits data access speed; data access speed has therefore gradually become the main performance bottleneck of machine learning training programs.
  • To solve the above problems, the prior art proposes the following schemes:
  • Using cache technologies currently popular in the industry, distributed caching can be implemented on the compute side.
  • The distributed cache integrates the storage resources of the individual nodes to provide a larger cache pool that can hold the entire dataset required for machine learning training.
  • However, this requires a large amount of storage resources.
  • In multi-tenant scenarios in particular, multiple machine learning training jobs simultaneously need multiple different datasets, which places a greater burden on the distributed cache system.
  • The data pre-storing (prefetching) capabilities provided by the PyTorch and TensorFlow frameworks also help mitigate the performance impact caused by data access bottlenecks.
  • However, the pre-storing process depends on the machine learning training program written by the user. Before the program runs, the user cannot know in advance whether a data access bottleneck will occur, and therefore cannot configure the pre-storing function provided by the framework appropriately; a sketch of this framework-level facility follows.
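  • For reference, the following is a minimal sketch of the framework-provided pre-storing the passage refers to, using PyTorch's DataLoader; the dataset, keys, and fetch helper are illustrative assumptions, not part of this application:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RemoteSamples(Dataset):
    """Illustrative map-style dataset reading samples from remote storage."""
    def __init__(self, keys):
        self.keys = keys

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        return fetch_from_remote(self.keys[idx])

def fetch_from_remote(key):
    # Hypothetical stand-in for a remote-storage read.
    return torch.zeros(3, 224, 224)

keys = ["sample-%d" % i for i in range(1000)]  # illustrative keys

# num_workers and prefetch_factor control how many batches each worker
# pre-reads; as noted above, tuning them well requires knowing in advance
# whether data access will become the bottleneck.
loader = DataLoader(RemoteSamples(keys), batch_size=32, shuffle=True,
                    num_workers=4, prefetch_factor=2)
```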
  • To solve this, the core idea of this application is, when deploying machine learning applications in a cluster, to replace the existing data reading component by injection at the framework layer of the application container, and to manage the pre-storing and eviction of sample data through a unified service, thereby improving the running speed of the application with a smaller data footprint.
  • The Dataset Indexing Service component manages the pre-storing and eviction of sample data in the cache system according to the data access order characteristics of different applications, making efficient use of the cache.
  • The present application provides a sample data processing method and apparatus, described in detail in the following embodiments. First, the terms involved in one or more embodiments of the present application are explained.
  • Kubernetes: an open-source system for automatically deploying, scaling, and managing containerized applications.
  • Machine learning: essentially the use of data to solve computer cognition problems, such as knowledge understanding, information processing, and even prediction.
  • Remote storage system: cloud storage or a storage server, remote from the compute side, used to store training sample datasets.
  • Distributed cache: in a distributed environment or system, remotely stored data is cached on machines close to the user or application to reduce the latency of remote data transfer, so that users and applications can quickly access the data they want.
  • GPU: graphics processing unit; similar to a CPU, except that a GPU is designed to perform complex mathematical and geometric calculations, and is widely used in artificial intelligence.
  • Worker node: a computer node equipped with, for example, GPUs, or otherwise used for learning and training.
  • Meta-information: also known as metadata, data about data, mainly describing data properties, used to support functions such as indicating storage location, historical data, resource lookup, and file records. It can be regarded as an electronic catalog: to catalog data, its content or characteristics must be described and recorded, thereby assisting data retrieval.
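  • As an illustration, a meta-information record of the kind described here might carry just enough fields to locate a sample; the field names below are assumptions, not mandated by this application:

```python
# A hypothetical meta-information record: it does not contain the sample
# itself, only the properties needed to index to it in (remote) storage.
meta_info = {
    "sample_id": 3,                           # position in the dataset
    "uri": "s3://dataset/images/000003.jpg",  # storage location (assumed)
    "size_bytes": 142133,
}
```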
  • Referring to FIG. 1, a flowchart of the steps of a sample data processing method embodiment of the present application, the method includes the following steps:
  • Step 101: acquire a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes several pieces of meta-information, and the meta-information is used to index to corresponding sample data.
  • The method is applied to a Kubernetes cluster.
  • The Kubernetes cluster is deployed with machine learning frameworks such as TensorFlow and PyTorch; a Dataset Indexing Service component is introduced and automatically replaces the data reading module of the machine learning framework.
  • The Kubernetes cluster is also deployed with worker nodes and a distributed cache system.
  • The training task is scheduled onto a worker node, and the meta-information sequence corresponding to the training task is maintained through the Dataset Indexing Service component.
  • The meta-information sequence is randomly generated after the Dataset Indexing Service component obtains the meta-information corresponding to the sample data required by the training task.
  • The random ordering of the meta-information in the sequence guarantees the randomness with which sample data is used (consumed) while the training task is executed according to the sequence, preventing the learning model obtained from the training task from overfitting.
  • Step 102: traverse the meta-information sequence to determine a preset number of target meta-information.
  • When a worker node needs to consume (use) sample data to execute a training task, it requests sample data from the Dataset Indexing Service component. The component traverses the meta-information sequence in order, determines a preset number of target meta-information, and returns it to the worker node. For example, if the preset number is 2 and the meta-information sequence is 1, 3, 6, 5, 4, 5, the preset number of target meta-information determined is 1, 3; see the sketch below.
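  • As a minimal sketch (names and data layout are illustrative, not mandated by this application), the traversal can be modeled as a cursor over the task's sequence:

```python
class MetaSequence:
    """Holds one training task's randomly ordered meta-information."""
    def __init__(self, metas, preset_number=2):
        self.metas = metas            # e.g. [1, 3, 6, 5, 4, 5]
        self.cursor = 0               # prefix already returned as targets
        self.preset_number = preset_number

    def next_targets(self):
        """Traverse in order and return the next preset number of target
        meta-information; already-returned meta-information is skipped."""
        batch = self.metas[self.cursor:self.cursor + self.preset_number]
        self.cursor += len(batch)
        return batch

seq = MetaSequence([1, 3, 6, 5, 4, 5])
assert seq.next_targets() == [1, 3]   # first request
assert seq.next_targets() == [6, 5]   # next request resumes at meta 6
```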
  • Step 103: pre-store the target sample data corresponding to the target meta-information, while using the previously pre-stored target sample data to execute the training task.
  • The remote storage system stores the sample data corresponding to the meta-information in the meta-information sequence.
  • The target sample data corresponding to the target meta-information is pulled from the remote storage system and pre-stored in the distributed cache system, while the previously pre-stored target sample data is used to execute the training task.
  • For example, with the meta-information sequence 1, 3, 6, 5, 4, 5 and sample data 1 and 3 already pre-stored in the distributed cache system: when the worker node requests sample data from the Dataset Indexing Service component and the component, traversing the sequence in order, determines the target meta-information to be 6, 5, the distributed cache system begins pre-storing sample data 6 and 5, while the worker node fetches sample data 1 and 3 from the distributed cache system to execute the training task.
  • Step 104: when the previously pre-stored target sample data has been used up, return to the step of traversing the meta-information sequence to determine a preset number of target meta-information, while evicting the previously pre-stored target sample data.
  • When the previously pre-stored target sample data has been used up, the worker node again requests sample data from the Dataset Indexing Service component, which again traverses the meta-information sequence in order to determine a new preset number of target meta-information.
  • Meta-information whose corresponding sample data has already been pre-stored is not determined as target meta-information again.
  • For example, if the preset number is 2, the meta-information sequence is 1, 3, 6, 5, 4, 5, and the previously determined target meta-information was 1, 3, then the sequence is traversed again starting from meta-information 6, and the newly determined target meta-information is 6, 5.
  • Meanwhile, the previously pre-stored target sample data in the distributed cache system is evicted; the steps are put together in the sketch below.
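  • Putting steps 102-104 together, the worker-facing loop might look like the following sketch, reusing the MetaSequence sketch above; `prefetch`, `read`, `evict`, and `train_on` are assumed helpers standing in for the distributed cache and training calls, and in a real system the pre-store would run asynchronously with training:

```python
def run_training_task(seq, cache, trainer):
    """Pipelined execution: pre-store the next batch while training on the
    current one, then evict the used-up batch (steps 102-104)."""
    current = seq.next_targets()
    cache.prefetch(current)                    # warm up the first batch
    while current:
        upcoming = seq.next_targets()
        cache.prefetch(upcoming)               # step 103: pre-store next batch
        trainer.train_on(cache.read(current))  # ...while training on this one
        cache.evict(current)                   # step 104: evict used samples
        current = upcoming
```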
  • In the embodiments of the present application, a training task and its corresponding meta-information sequence are acquired.
  • The meta-information sequence includes several pieces of meta-information used to index to corresponding sample data; the sequence is traversed to determine a preset number of target meta-information, the target sample data corresponding to the target meta-information is pre-stored, and at the same time the previously pre-stored target sample data is used to execute the training task.
  • When the previously pre-stored target sample data has been used up, the process returns to the step of determining a preset number of target meta-information, while evicting the previously pre-stored target sample data. Applying the embodiments of this application, the sample data that will be used (consumed) to execute the training task is pre-stored according to the meta-information sequence, and sample data that has already been used is evicted.
  • The sample data that will be used while executing the training task is thus prepared in advance, avoiding cache misses on sample data during task execution and removing the performance bottleneck caused by slow data access.
  • The eviction and pre-storing proceed synchronously with the execution of training tasks, and the whole process runs as a pipeline, shortening the time taken to execute training tasks.
  • The meta-information sequence is randomly generated after the Dataset Indexing Service component obtains the meta-information corresponding to the sample data required by the training task.
  • The random ordering of the meta-information in the sequence guarantees the randomness with which the sample data is used (consumed) while the training task is executed according to the sequence, preventing the learning model obtained from the training task from overfitting.
  • FIG. 2 shows a flowchart of the steps of another sample data processing method embodiment of the present application, including the following steps:
  • Step 201: acquire a machine learning job and the meta-information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training.
  • The meta-information corresponding to the sample data in the sample dataset is stored in the distributed cache system.
  • The meta-information corresponding to the sample data is obtained from the distributed cache system through the Dataset Indexing Service component.
  • Step 202: randomly generate a total meta-information sequence from the meta-information corresponding to the sample data in the sample dataset.
  • The random data access order of machine learning training is the root cause of the significant drop in data access efficiency.
  • The data access order during machine learning training is completely random. If the cache system cannot cache all the data, a cache eviction policy such as LRU will evict the data least recently used in time, causing cache misses in the training's subsequent data accesses.
  • After the Dataset Indexing Service component obtains the meta-information corresponding to the sample data from the distributed cache system,
  • the meta-information is used to randomly generate the total meta-information sequence, and machine learning jobs can access (use) the sample data in the order of the meta-information in the total sequence for learning and training.
  • By randomly assembling the meta-information corresponding to the sample data into a total sequence in advance, the sample data can be accessed in that fixed order during learning and training, which guarantees the randomness of sample data usage and prevents the resulting learning model from overfitting, while also solving the problem of cache misses in subsequent sample data accesses; see the sketch below.
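  • A sketch of this up-front randomization, assuming the meta-information is held in a Python list; fixing a random order once, rather than sampling randomly at access time, is what makes the later access pattern predictable for the cache:

```python
import random

def make_total_sequence(all_metas, seed=None):
    """Shuffle once, up front; every job that shares this total sequence
    then consumes samples in the same fixed (but random) order."""
    total = list(all_metas)
    random.Random(seed).shuffle(total)
    return total
```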
  • Step 203: split the total meta-information sequence into multiple meta-information sequences, and split the machine learning job into multiple training tasks based on the multiple meta-information sequences, where the multiple training tasks are executed in parallel.
  • The total meta-information sequence is split into multiple meta-information sequences through the Dataset Indexing Service component, the machine learning job is split into multiple training tasks based on the multiple sequences, and the training tasks are scheduled onto different worker nodes for parallel execution.
  • The steps by which the worker nodes execute the training task are described above and are not repeated here.
  • Splitting the machine learning job into multiple training tasks based on multiple meta-information sequences and scheduling them onto different worker nodes for parallel execution can shorten the time the machine learning job spends on learning and training; a splitting sketch follows.
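  • Continuing the sketch above, the total sequence could be split across tasks as follows; the round-robin strategy shown is an assumption, as the application does not fix a particular splitting scheme:

```python
def split_sequence(total, num_tasks):
    """Split the total meta-information sequence into one sub-sequence per
    training task; the tasks then execute in parallel on worker nodes."""
    return [total[i::num_tasks] for i in range(num_tasks)]

# make_total_sequence is from the sketch above; 12 metas over 3 tasks.
sequences = split_sequence(make_total_sequence(range(12), seed=0), num_tasks=3)
```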
  • The method also includes: acquiring other machine learning jobs that use the sample dataset; and splitting the other machine learning jobs into multiple other training tasks based on the multiple meta-information sequences, where the other training tasks are executed synchronously with the training tasks corresponding to the same meta-information sequence.
  • Because the machine learning job and the other machine learning jobs use the same sample dataset for learning and training, they can be coordinated in a centralized manner so that
  • the machine learning job and the other machine learning jobs share the same total meta-information sequence; the other machine learning jobs can therefore be split into multiple other training tasks based on the multiple meta-information sequences, and the multiple other training tasks can be scheduled onto different worker nodes for execution.
  • A training task and other training tasks corresponding to the same meta-information sequence can be executed synchronously.
  • During execution, the training task and the other training tasks use the same sample data, so that the sample data in the distributed cache system is reused by multiple training tasks, improving cache utilization.
  • Splitting the machine learning job into multiple training tasks based on multiple meta-information sequences and scheduling them onto different worker nodes for parallel execution can shorten the time the machine learning job spends on learning and training.
  • In addition, based on the multiple meta-information sequences, the machine learning job is split into multiple training tasks and the other machine learning jobs are split into multiple other training tasks.
  • Training tasks and other training tasks corresponding to the same meta-information sequence can be executed synchronously.
  • During execution they use the same sample data, so that the sample data in the distributed cache system is reused by multiple training tasks, improving cache utilization.
  • FIG. 3 shows a flowchart of the steps of an embodiment of adjusting the amount of pre-stored sample data of the present application, including the following steps:
  • Step 301: determine the time taken to execute the training task using the previously pre-stored target sample data.
  • Step 302: determine the shortest time historically taken to execute the training task using pre-stored target sample data.
  • Step 303: calculate the abnormal time from the shortest time and the outlier parameter.
  • Step 304: when the time is greater than the abnormal time, increase the preset number.
  • The amount of sample data pre-stored in the distributed cache system is jointly determined by the speed at which sample data is used when executing training tasks and the speed at which sample data is pulled from the remote storage system.
  • If too little sample data is pre-stored, some sample data will still not be in the distributed cache system when it is used.
  • The resulting cache misses reduce sample data access efficiency, while pre-storing too much sample data occupies too many storage resources in the distributed cache system.
  • The time taken to execute the training task using the previously pre-stored preset number of target sample data is determined, together with the shortest time historically taken to execute the training task using pre-stored target sample data, and the shortest time is multiplied by the outlier parameter to obtain the abnormal time. When the time taken to execute the training task using the previously pre-stored preset number of target sample data is greater than the abnormal time, it indicates that some sample data was still not in the distributed cache system when used, and the preset number needs to be increased, that is, the amount of sample data pre-stored in the distributed cache system is increased to meet the needs of training tasks.
  • Increasing the preset number includes: when the time is greater than the abnormal time, counting how many times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed between the system time at which the preset number was last increased and the current system time; when the count is greater than a preset count, the preset number is increased.
  • For a given machine learning job, the amount of sample data consumed in each worker node request interval is fixed, and the computation logic for these training samples is fixed, so the overall computation time is also very stable and may fluctuate only within a small range. Therefore, once the amount of sample data pre-stored in the distributed cache system has increased to an appropriate amount, no adjustment is needed for a certain period of time.
  • When the time taken to execute the training task using the previously pre-stored preset number of target sample data is greater than the abnormal time, the number of times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed between the system time at which the preset number was last increased and the current system time is counted.
  • When the count is greater than the preset count, the preset number is increased, that is, the amount of sample data pre-stored in the distributed cache system is increased.
  • When the count is less than or equal to the preset count, the preset number does not need to be increased, that is, the amount of sample data pre-stored in the distributed cache system does not need to be increased.
  • t_i denotes the time interval between receiving the current request from the worker node and receiving the previous request from the worker node, that is, the time taken to execute the training task using the previously pre-stored target sample data;
  • T_i^min denotes the minimum request interval over the past period of time, that is, the shortest time historically taken to execute the training task using pre-stored target sample data;
  • α is the outlier parameter;
  • i denotes the number of times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed between the system time at which the preset number was last increased and the current system time;
  • P is a stability parameter, that is, the preset count.
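  • Reconstructed from these definitions, the adjustment rule appears to be: treat the interval as anomalous when t_i > α·T_i^min, and actually increase the preset number only when the count i since the last increase exceeds P. A sketch, where the growth step is an assumed parameter:

```python
def maybe_grow_preset_number(t_i, t_min, alpha, count, preset_count,
                             preset_number, step=1):
    """Increase the pre-store amount only when the last request interval is
    anomalous (t_i > alpha * t_min) and the traversal step has already run
    more than `preset_count` times since the last increase. The growth
    increment `step` is an assumption; the application does not fix one."""
    if t_i > alpha * t_min and count > preset_count:
        return preset_number + step, 0  # grow, and reset the stability count
    return preset_number, count
```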
  • With this judgment, the amount of sample data pre-stored in the distributed cache system can be adjusted, avoiding both pre-storing too little sample data (in which case some sample data is still not in the distributed cache system when used, and the cache misses reduce sample data access efficiency) and pre-storing too much, which occupies too many storage resources in the distributed cache system.
  • FIG. 4 shows a schematic diagram of a machine learning programming framework modification embodiment of the present application.
  • The technical solution of this application requires modifying the underlying data access logic of the machine learning programming framework (such as PyTorch or TensorFlow), while users usually use the standard version of the framework, which does not contain the code logic implementing this solution.
  • Therefore, the Service Auto Injector component (hereinafter, the Injector component) first creates a Dataset Indexing Service component (hereinafter, the Service component) in the cluster.
  • After the Service component is started, sample data is pre-stored. Next, the Injector component injects an InitContainer (a container used for initialization) into the machine learning job submitted by the user.
  • The InitContainer uses the image defined in this solution, which contains the code logic changes to the machine learning programming framework. The InitContainer is started before the machine learning job submitted by the user; when it starts, the code logic that needs to be changed is copied over the corresponding location of the user image, achieving a logic replacement that is transparent to the user. When the user-defined machine learning job starts, its data access follows the workflow below; a sketch of the injection step follows.
  • Modifying the data access logic of the machine learning framework also enables the training job to perform cache management on the distributed cache system while accessing data.
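  • The injection step can be sketched as a plain patch on a pod specification; the image name, command, and mount path below are illustrative assumptions, and in practice such a patch would typically be applied by a mutating admission webhook or similar mechanism:

```python
def inject_init_container(pod_spec):
    """Prepend an InitContainer that overwrites the framework's data
    reading module with the replacement logic before the user job starts."""
    init = {
        "name": "framework-patcher",
        "image": "example.com/dataset-indexing-patch:latest",  # assumed image
        # Copies the changed code logic into a shared volume that the user
        # container mounts over its framework installation path.
        "command": ["sh", "-c", "cp -r /patch/* /framework-override/"],
        "volumeMounts": [{"name": "override",
                          "mountPath": "/framework-override"}],
    }
    pod_spec.setdefault("initContainers", []).insert(0, init)
    return pod_spec
```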
  • FIG. 5 shows a schematic framework diagram of a Kubernetes cluster embodiment of the present application.
  • The Service component obtains the meta-information of all training sample data (sample data) in the entire dataset from the distributed cache system.
  • The order in which sample data is consumed during machine learning training should be completely random.
  • Therefore, when the Service component obtains the meta-information of the training samples, it first shuffles it to generate a random total meta-information sequence, and splits the total sequence into multiple meta-information sequences corresponding to the training tasks of multiple machine learning training Workers.
  • When any machine learning training Worker needs to consume data to execute a training task, it requests training sample data from the Service component; the Service component traverses the meta-information sequence corresponding to that Worker and returns the meta-information corresponding to training sample data that has not yet been used.
  • The machine learning training Worker reads the corresponding sample data from the distributed cache system according to the returned meta-information to execute the training task.
  • FIG. 6 shows the first schematic framework diagram of a sample data processing embodiment of the present application.
  • Sample data 1, 3, and 5 have been pre-stored in the distributed cache system.
  • After the Service component returns meta-information 3 and 5 to the machine learning training Worker, the Worker reads sample data 3 and 5 from the distributed cache system.
  • The sliding window (the dashed window) moves to the right, the new meta-information 6 and 4 move into the window, and the Service component immediately performs the pre-storing operation for sample data 6 and 4.
  • As the sliding window moves to the right, meta-information 3 and 5 move out of it.
  • Since the sample data corresponding to meta-information 3 and 5 has just been requested and is being consumed by the machine learning training Worker for model training (executing the training task), the Service component does not immediately evict sample data 3 and 5 from the distributed cache system. When the Worker requests sample data from the Service component again, this indicates that the previously requested sample data 3 and 5 has been consumed; at that point the Service component can evict sample data 3 and 5.
  • The Service component instructs the distributed cache system to evict such sample data to save the resource usage of the cache system.
  • As new meta-information moves into the sliding window, the Service component instructs the distributed cache system to pre-store the corresponding sample data from the remote storage system into the distributed cache system, so that when the machine learning training Worker needs this sample data it hits the cache, avoiding the performance bottleneck caused by slow data access.
  • The above eviction and pre-storing proceed synchronously with the Worker's model training, and the whole process runs as a pipeline, shortening the overall machine learning training time; a service-side sketch follows.
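  • Service-side, the sliding window described above can be sketched as follows; `cache` is an assumed interface, and note the deferred eviction (samples that leave the window are evicted only once the Worker's next request signals they have been consumed):

```python
class SlidingWindowManager:
    """Sketch of a window of size `window` sliding over one training
    task's meta-information sequence."""
    def __init__(self, sequence, cache, window=2):
        self.sequence, self.cache, self.window = sequence, cache, window
        self.pos = 0                 # left edge of the sliding window
        self.pending_evict = []      # left the window, still being consumed
        self.cache.prefetch(self.sequence[:window])

    def on_worker_request(self):
        # A new request implies the previously returned samples are used up,
        # so the deferred eviction can now take place.
        self.cache.evict(self.pending_evict)
        served = self.sequence[self.pos:self.pos + self.window]
        self.pos += len(served)
        # Slide the window right and immediately pre-store what entered it.
        self.cache.prefetch(self.sequence[self.pos:self.pos + self.window])
        self.pending_evict = served
        return served
```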
  • This embodiment also provides a sample data processing apparatus, applied to electronic devices such as terminal devices and servers.
  • FIG. 8 shows a structural block diagram of a sample data processing apparatus embodiment of the present application, which may specifically include the following modules:
  • a sequence acquisition module 801, configured to acquire a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes several pieces of meta-information, and the meta-information is used to index to corresponding sample data;
  • a sequence traversal module 802, configured to traverse the meta-information sequence to determine a preset number of target meta-information;
  • a task execution module 803, configured to pre-store the target sample data corresponding to the target meta-information while using the previously pre-stored target sample data to execute the training task;
  • a data eviction module 804, configured to, when the previously pre-stored target sample data has been used up, return to the step of traversing the meta-information sequence to determine a preset number of target meta-information, while evicting the previously pre-stored target sample data.
  • Optionally, the apparatus further includes: a job acquisition module, configured to acquire a machine learning job and the meta-information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training;
  • a sequence generation module, configured to randomly generate a total meta-information sequence from the meta-information corresponding to the sample data in the sample dataset;
  • a job splitting module, configured to split the total meta-information sequence into multiple meta-information sequences and split the machine learning job into multiple training tasks based on the multiple meta-information sequences, where the multiple training tasks are executed in parallel.
  • The job acquisition module is also used to acquire other machine learning jobs that use the sample dataset;
  • the job splitting module is configured to split the other machine learning jobs into multiple other training tasks based on the multiple meta-information sequences, where the other training tasks are executed synchronously with the training tasks corresponding to the same meta-information sequence.
  • Optionally, the apparatus further includes: a time determination module, configured to determine the time taken to execute the training task using the previously pre-stored target sample data;
  • the time determination module is also used to determine the shortest time historically taken to execute the training task using pre-stored target sample data;
  • a time calculation module, configured to calculate the abnormal time from the shortest time and the outlier parameter;
  • a number increasing module, configured to increase the preset number when the time is greater than the abnormal time.
  • The number increasing module includes:
  • a count determination sub-module, configured to, when the time is greater than the abnormal time, count how many times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed between the system time at which the preset number was last increased and the current system time;
  • a number increasing sub-module, configured to increase the preset number when the count is greater than a preset count.
  • Optionally, the apparatus is applied to a Kubernetes cluster; the Kubernetes cluster is deployed with worker nodes and a distributed cache system, the worker nodes are used to execute the training task, and the distributed cache system is used to pre-store target sample data.
  • An embodiment of the present application also provides a non-volatile readable storage medium in which one or more modules (programs) are stored.
  • When these modules are applied to a device, the device can execute the instructions for each method step in the embodiments of the present application.
  • Embodiments of the present application provide one or more machine-readable media on which instructions are stored; when executed by one or more processors, the instructions cause an electronic device to perform the method described in one or more of the above embodiments.
  • The electronic devices include various types of devices such as terminal devices and servers (clusters).
  • Embodiments of the present disclosure can be implemented as devices using any appropriate hardware, firmware, software, or any combination thereof in the desired configuration; such devices may include electronic devices such as terminal devices and servers (clusters).
  • FIG. 9 schematically illustrates an exemplary apparatus 900 that may be used to implement various embodiments described in this application.
  • FIG. 9 illustrates an exemplary apparatus 900 having one or more processors 902, a control module (chipset) 904 coupled to at least one of the processor(s) 902, a memory 906 coupled to the control module 904, a non-volatile memory (NVM)/storage device 908 coupled to the control module 904, one or more input/output devices 910 coupled to the control module 904, and a network interface 912 coupled to the control module 904.
  • The processor 902 may include one or more single-core or multi-core processors, and may include any combination of general-purpose processors and special-purpose processors (such as graphics processors, application processors, and baseband processors).
  • The apparatus 900 can serve as a terminal device, a server (cluster), or another type of device described in the embodiments of this application.
  • In some embodiments, the apparatus 900 may include one or more computer-readable media (e.g., the memory 906 or the NVM/storage 908) having instructions 914, and one or more processors 902 coupled with the one or more computer-readable media and configured to
  • execute the instructions 914 to implement modules that perform the actions described in this disclosure.
  • The control module 904 may include any suitable interface controller to provide any suitable interface.
  • The control module 904 may include a memory controller module to provide an interface to the memory 906.
  • The memory controller module may be a hardware module, a software module, and/or a firmware module.
  • The memory 906 may be used, for example, to load and store data and/or instructions 914 for the apparatus 900.
  • For one embodiment, the memory 906 may comprise any suitable volatile memory, such as suitable DRAM.
  • In some embodiments, the memory 906 may include double data rate type four synchronous dynamic random-access memory (DDR4 SDRAM).
  • The control module 904 may include one or more input/output controllers to provide interfaces to the NVM/storage device 908 and the input/output device(s) 910.
  • The NVM/storage 908 may be used to store data and/or instructions 914.
  • The NVM/storage 908 may include any suitable non-volatile memory (e.g., flash memory) and/or any suitable non-volatile storage device(s) (e.g., one or more hard disk drives (HDD), one or more compact disc (CD) drives, and/or one or more digital versatile disc (DVD) drives).
  • The NVM/storage device 908 may comprise a storage resource that is physically part of the device on which the apparatus 900 is installed, or it may be accessible by the device without necessarily being part of it. For example, the NVM/storage 908 may be accessed over a network via the input/output device(s) 910.
  • The input/output device(s) 910 may provide an interface for the apparatus 900 to communicate with any other suitable device, and may include communication components, audio components, sensor components, and the like.
  • The network interface 912 may provide an interface for the apparatus 900 to communicate over one or more networks, and the apparatus 900 may communicate wirelessly with one or more wireless networks according to any of one or more wireless network standards and/or protocols,
  • for example by accessing a wireless network based on communication standards such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof.
  • For one embodiment, at least one of the processor(s) 902 may be packaged together with the logic of one or more controllers of the control module 904 (e.g., the memory controller module). For one embodiment, at least one of the processor(s) 902 may be packaged together with such logic to form a system in package (SiP). For one embodiment, at least one of the processor(s) 902 may be integrated on the same die as the logic of one or more controllers of the control module 904. For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with such logic to form a system on chip (SoC).
  • In various embodiments, the apparatus 900 may be, but is not limited to, a terminal device such as a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet computer, or a netbook).
  • In various embodiments, the apparatus 900 may have more or fewer components and/or a different architecture.
  • For example, in some embodiments, the apparatus 900 includes one or more cameras, a keyboard, a liquid crystal display (LCD) screen (including a touchscreen display), a non-volatile memory port, multiple antennas, a graphics chip, an application-specific integrated circuit (ASIC), and speakers.
  • In a detection device, the main control chip can serve as the processor or control module, sensor data and location information can be stored in the memory or NVM/storage device, the sensor group can serve as the input/output device, and the communication interface can include a network interface.
  • For the apparatus embodiment, since it is basically similar to the method embodiment, the description is relatively brief; for related details, refer to the description of the method embodiment.
  • The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each procedure and/or block in the flowcharts and/or block diagrams, and combinations of procedures and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to a general-purpose computer, a special-purpose computer, an embedded processor, or a processor of other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal equipment produce an apparatus for realizing the functions specified in one or more procedures of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, which
  • realize the functions specified in one or more procedures of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A sample data processing method, apparatus, device, and storage medium, including: acquiring a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes several pieces of meta-information and the meta-information is used to index to corresponding sample data (101); traversing the meta-information sequence to determine a preset number of target meta-information (102); pre-storing the target sample data corresponding to the target meta-information while using the previously pre-stored target sample data to execute the training task (103); and, when the previously pre-stored target sample data has been used up, returning to the step of traversing the meta-information sequence to determine a preset number of target meta-information, while evicting the previously pre-stored target sample data (104). With this method, the sample data about to be used by the training task is pre-stored in advance according to the meta-information sequence and already-used sample data is evicted, so that pre-storing only a small amount of sample data is enough to meet the needs of executing the training task, saving the resource usage of the cache system.

Description

Sample data processing method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202111144871.7, entitled "样例数据的处理方法、装置、设备和存储介质" (Sample Data Processing Method, Apparatus, Device, and Storage Medium), filed on September 28, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a sample data processing method and apparatus, an electronic device, and a storage medium.
Background
In recent years, with continued research into heterogeneous computing devices, more and more such devices with ever stronger computing power have emerged, further accelerating the machine learning training process. Faster data processing, however, places higher demands on a program's data access speed, and the compute-storage-separation architecture adopted on the cloud further limits data access speed; data access speed has therefore gradually become the main performance bottleneck of machine learning training programs.
To address this problem, compute-side distributed caching is currently the widely adopted way to accelerate data access: by caching the data of the storage system in the computing environment, machine learning training jobs running in that environment can obtain the data they need with lower latency and higher bandwidth. Since machine learning usually performs multiple rounds of training, the data cache can be reused across rounds, improving the efficiency of the machine learning training process.
However, as the datasets in use grow ever larger, the cache system faces greater challenges. On the one hand, to obtain the maximum data access speedup, the cache system needs to cache the entire dataset, which means it occupies a large amount of storage resources in the computing environment. On the other hand, because of the cache system's eviction policy, if the cache system cannot fully cache the entire dataset, the data access efficiency of the machine learning training job drops significantly compared with the fully cached case.
Summary of the Invention
The embodiments of the present application provide a sample data processing method, to solve the problems that the cache system occupies a large amount of storage resources in the computing environment and that the data access efficiency of machine learning training jobs is low when the cache system cannot fully cache the entire dataset.
Correspondingly, the embodiments of the present application further provide a sample data processing apparatus, an electronic device, and a storage medium, to ensure the implementation and application of the above method.
To solve the above problems, an embodiment of the present application discloses a sample data processing method, the method including:
acquiring a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes several pieces of meta-information, and the meta-information is used to index to corresponding sample data;
traversing the meta-information sequence to determine a preset number of target meta-information;
pre-storing target sample data corresponding to the target meta-information, while using the previously pre-stored target sample data to execute the training task; and
when the previously pre-stored target sample data has been used up, returning to the step of traversing the meta-information sequence to determine a preset number of target meta-information, while evicting the previously pre-stored target sample data.
Optionally, the method further includes:
acquiring a machine learning job and the meta-information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training;
randomly generating a total meta-information sequence from the meta-information corresponding to the sample data in the sample dataset; and
splitting the total meta-information sequence into multiple meta-information sequences, and splitting the machine learning job into multiple training tasks based on the multiple meta-information sequences, where the multiple training tasks are executed in parallel.
Optionally, the method further includes:
acquiring other machine learning jobs that use the sample dataset; and
splitting the other machine learning jobs into multiple other training tasks based on the multiple meta-information sequences, where the other training tasks are executed synchronously with the training tasks corresponding to the same meta-information sequence.
Optionally, the method further includes:
determining the time taken to execute the training task using the previously pre-stored target sample data;
determining the shortest time historically taken to execute the training task using pre-stored target sample data;
calculating an abnormal time from the shortest time and an outlier parameter; and
increasing the preset number when the time is greater than the abnormal time.
Optionally, increasing the preset number when the time is greater than the abnormal time includes:
when the time is greater than the abnormal time, counting how many times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed between the system time at which the preset number was last increased and the current system time; and
increasing the preset number when the count is greater than a preset count.
Optionally, the method is applied to a Kubernetes cluster; the Kubernetes cluster is deployed with worker nodes and a distributed cache system, the worker nodes are used to execute the training task, and the distributed cache system is used to pre-store target sample data.
An embodiment of the present application further discloses a sample data processing apparatus, the apparatus including:
a sequence acquisition module, configured to acquire a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes several pieces of meta-information, and the meta-information is used to index to corresponding sample data;
a sequence traversal module, configured to traverse the meta-information sequence to determine a preset number of target meta-information;
a task execution module, configured to pre-store target sample data corresponding to the target meta-information while using the previously pre-stored target sample data to execute the training task; and
a data eviction module, configured to, when the previously pre-stored target sample data has been used up, return to the step of traversing the meta-information sequence to determine a preset number of target meta-information, while evicting the previously pre-stored target sample data.
Optionally, the apparatus further includes:
a job acquisition module, configured to acquire a machine learning job and the meta-information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training;
a sequence generation module, configured to randomly generate a total meta-information sequence from the meta-information corresponding to the sample data in the sample dataset; and
a job splitting module, configured to split the total meta-information sequence into multiple meta-information sequences and split the machine learning job into multiple training tasks based on the multiple meta-information sequences, where the multiple training tasks are executed in parallel.
Optionally:
the job acquisition module is further configured to acquire other machine learning jobs that use the sample dataset; and
the job splitting module is configured to split the other machine learning jobs into multiple other training tasks based on the multiple meta-information sequences, where the other training tasks are executed synchronously with the training tasks corresponding to the same meta-information sequence.
Optionally, the apparatus further includes:
a time determination module, configured to determine the time taken to execute the training task using the previously pre-stored target sample data;
the time determination module being further configured to determine the shortest time historically taken to execute the training task using pre-stored target sample data;
a time calculation module, configured to calculate an abnormal time from the shortest time and an outlier parameter; and
a number increasing module, configured to increase the preset number when the time is greater than the abnormal time.
Optionally, the number increasing module includes:
a count determination sub-module, configured to, when the time is greater than the abnormal time, count how many times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed between the system time at which the preset number was last increased and the current system time; and
a number increasing sub-module, configured to increase the preset number when the count is greater than a preset count.
Optionally, the apparatus is applied to a Kubernetes cluster; the Kubernetes cluster is deployed with worker nodes and a distributed cache system, the worker nodes are used to execute the training task, and the distributed cache system is used to pre-store target sample data.
Optionally, the Kubernetes cluster is deployed with a machine learning framework, and the data reading module of the machine learning framework is replaced by a Dataset Indexing Service component, so that the meta-information sequence corresponding to the training task is maintained through the Dataset Indexing Service component.
An embodiment of the present application further discloses an electronic device, including: a processor; and a memory storing executable code that, when executed, causes the processor to perform the sample data processing method as described in one or more embodiments of the present application.
An embodiment of the present application further discloses one or more machine-readable media storing executable code that, when executed, causes a processor to perform the sample data processing method as described in one or more embodiments of the present application.
Compared with the prior art, the embodiments of the present application have the following advantages:
In the embodiments of the present application, a training task and its corresponding meta-information sequence are acquired, where the meta-information sequence includes several pieces of meta-information used to index to corresponding sample data; the meta-information sequence is traversed to determine a preset number of target meta-information; the target sample data corresponding to the target meta-information is pre-stored while the previously pre-stored target sample data is used to execute the training task; and when the previously pre-stored target sample data has been used up, the process returns to the step of traversing the meta-information sequence to determine a preset number of target meta-information, while the previously pre-stored target sample data is evicted. By applying the embodiments of the present application, the sample data about to be used by the training task is pre-stored according to the ordering of the meta-information in the meta-information sequence and already-used sample data is evicted, so that pre-storing only a small amount of sample data meets the needs of executing the training task, saving the resource usage of the cache system. At the same time, pre-storing the sample data that will be used ahead of the training task according to the meta-information sequence avoids cache misses on sample data during task execution and removes the performance bottleneck caused by slow data access. In addition, eviction and pre-storing proceed synchronously with the execution of the training task, and the whole process runs as a pipeline, shortening the time taken to execute the training task.
Furthermore, after the meta-information corresponding to the sample data required by the training task is obtained, that meta-information is used to randomly generate the meta-information sequence. The random ordering of the meta-information in the sequence guarantees the randomness with which sample data is used while the training task is executed according to the sequence, preventing the learning model obtained from the training task from overfitting.
Brief Description of the Drawings
Fig. 1 is a flowchart of the steps of a sample data processing method embodiment of the present application;
Fig. 2 is a flowchart of the steps of another sample data processing method embodiment of the present application;
Fig. 3 is a flowchart of the steps of an embodiment of adjusting the amount of pre-stored sample data of the present application;
Fig. 4 is a schematic diagram of a machine learning programming framework modification embodiment of the present application;
Fig. 5 is a schematic framework diagram of a Kubernetes cluster embodiment of the present application;
Fig. 6 is the first schematic framework diagram of a sample data processing embodiment of the present application;
Fig. 7 is the second schematic framework diagram of a sample data processing embodiment of the present application;
Fig. 8 is a structural block diagram of a sample data processing apparatus embodiment of the present application;
Fig. 9 is a schematic structural diagram of a device provided by an embodiment of the present application.
Detailed Description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
In the past few years, thanks to advances in data storage and data collection technology, methods such as machine learning that learn expert knowledge from data have become a common way to solve computer cognition problems, and machine learning has demonstrated a strong ability to solve practical problems across many fields. Machine learning methods mainly comprise two processes: training and inference. The training process needs to learn the correct correlations between data from large-scale datasets to acquire "experience", while the inference process judges newly arriving data based on that "experience".
In practice, machine learning training often uses heterogeneous computing devices such as GPUs to achieve parallel acceleration of data processing. In recent years, with continued research into heterogeneous computing devices, more and more such devices with ever stronger computing power have emerged, further accelerating the machine learning training process. Faster data processing places higher demands on a program's data access speed, and the compute-storage-separation architecture adopted on the cloud further limits data access speed; data access speed has therefore gradually become the main performance bottleneck of machine learning training programs. To solve this problem, the prior art proposes the following schemes:
Using Alluxio and other cache technologies currently popular in the industry, distributed caching can be implemented on the compute side. Distributed caching integrates the storage resources of the individual nodes to provide a larger cache pool that can hold the entire dataset required for machine learning training. However, this occupies a large amount of storage resources. As datasets grow larger, more and more storage resources are required; in multi-tenant scenarios in particular, multiple machine learning training jobs simultaneously need multiple different datasets, placing an even greater burden on the distributed cache system.
The data access process can also be optimized inside machine learning programming frameworks (such as PyTorch and TensorFlow). For example, when a data access bottleneck occurs, Data Echoing improves the resource utilization of computing devices by replaying data that has already been used; however, such an optimization effectively changes the semantics of machine learning training and may affect the validity of the machine learning method.
The data pre-storing (prefetching) capabilities provided by the PyTorch and TensorFlow frameworks likewise help mitigate the performance impact of data access bottlenecks, but the pre-storing process depends on the machine learning training program written by the user. Before the program runs, the user cannot know in advance whether a data access bottleneck will occur and therefore cannot configure the pre-storing function provided by the framework appropriately.
To solve the above problems, the core idea of the present application is, when deploying machine learning applications in a cluster, to replace the existing data reading component by injection at the framework layer of the application container and to manage the pre-storing and eviction of sample data through a unified service, thereby improving the running speed of the application with a smaller data footprint. Specifically:
1. A Dataset Indexing Service component is introduced, and the data reading modules of machine learning frameworks such as TensorFlow and PyTorch are automatically replaced, to control the sample data access order of machine learning training jobs.
2. The Dataset Indexing Service component manages the pre-storing and eviction of sample data in the cache system according to the data access order characteristics of different applications, making efficient use of the cache.
3. The pre-storing behavior for sample data is controlled dynamically according to the machine learning training speed, so that the training process is affected by data access bottlenecks as little as possible.
The present application provides a sample data processing method and apparatus, described in detail in the following embodiments. First, the terms involved in one or more embodiments of the present application are explained.
Kubernetes: an open-source system for automatically deploying, scaling, and managing containerized applications.
Machine learning: essentially the use of data to solve computer cognition problems, such as knowledge understanding, information processing, and even prediction.
Remote storage system: cloud storage or a storage server, remote from the compute side, used to store training sample datasets.
Distributed cache: in a distributed environment or system, remotely stored data is cached on machines close to the user or application to reduce the latency of remote data transfer, so that users and applications can quickly access the data they want.
GPU: graphics processing unit; similar to a CPU, except that a GPU is designed to perform complex mathematical and geometric calculations, and is widely used in artificial intelligence.
Worker node: a computer node equipped with, for example, GPUs, or otherwise used for learning and training.
Meta-information: also known as metadata, data about data, mainly information describing data properties, used to support functions such as indicating storage location, historical data, resource lookup, and file records. It can be regarded as an electronic catalog: to catalog data, its content or characteristics must be described and recorded, thereby assisting data retrieval.
参照图1,是本申请的一种样例数据的处理方法实施例的步骤流程图,包括如下步骤:
步骤101,获取训练任务以及所述训练任务对应的元信息序列,所述元信息序列包括若干元信息,所述元信息用于索引到对应的样例数据。
其中,方法应用于Kubernetes集群,Kubernetes集群部署有如TensorFlow,PyTorch等机器学习框架,引入Dataset Indexing Service组件,并且自动替换机器学习框架的数据读取模块,同时Kubernetes集群部署有工作节点和分布式缓存系统。
具体地,在获取训练任务以及训练任务对应的元信息序列,将训练任务调度到工作节点中,并通过Dataset Indexing Service组件维护训练任务对应的元信息序列。元信息序列为Dataset Indexing Service组件获取到训练任务所需要的样例数据对应的元信息后随机生成,元信息序列中元信息的随机排序,保证了在按照元信息序列执行训练任务的过程中样例数据使用(消费)的随机性,以防止执行训练任务得到的学习模型过拟合。
步骤102,遍历所述元信息序列,确定出预设数量的目标元信息。
具体地,当工作节点需要消费(使用)样例数据执行训练任务时时,工作节点会向Dataset Indexing Service组件请求样例数据,Dataset Indexing Service组件依序按遍历元信息序列,确定出预设数量的目标元信息,并返回至工作节点,例如预设数量为2,元信息序列为1、3、6、5、4、5,那么确定出预设数量的目标元信息为1、3。
步骤103,预存所述目标元信息对应的目标样例数据,同时使用前一次预存的目标样例数据执行所述训练任务。
其中,远程存储系统储存有元信息序列中元信息对应的样例数据。
具体地,开始从远程存储系统拉取目标元信息对应的目标样例数据,预存在分布式缓存系统中,同时使用前一次预存的目标样例数据执行训练任务,例如,元信息序列为1、3、6、5、4、5,分布式缓存系统中已经预存有1、3样例数据,因此,,在工作节点向Dataset Indexing Service组件请求样例数据,Dataset Indexing Service组件依序按遍历元信息序列,确定的目标元信息为6、5,那么分布式缓存系统开始预存6、5样例数据,同时工作节点从分布式缓存系统获取到1、3样例数据执行训练任务。
步骤104,当所述前一次预存的目标样例数据被使用完时,返回执行所述遍历所述元信息序列,确定出预设数量的目标元信息的步骤,同时驱逐所述前一次预存的目标样例数据。
具体地,在前一次预存的目标样例数据被使用(消费)完时,工作节点会再次向Dataset Indexing Service组件请求样例数据,Dataset Indexing Service组件依序再次遍历元信息序列,确定出新的预设数量目标元信息,需要说明的是,已经预存过的样例数据对应的元信息不会在被确定为目标元信息,例如预设数量为2,元信息序列为1、3、6、5、4、5,那么前一次确定出预设数量的目标元信息为1、3,再次遍历元信息序列,从元信息6开始遍历,确定出预设数量的目标元信息为6、5。同时驱逐分布缓存系统中前一次预存的目标样例数据。
In the embodiments of the present application, a training task and a meta information sequence corresponding to the training task are acquired, where the meta information sequence includes several pieces of meta information and the meta information is used to index the corresponding sample data; the meta information sequence is traversed to determine a preset number of pieces of target meta information; the target sample data corresponding to the target meta information is pre-stored while the previously pre-stored target sample data is used to execute the training task; and when the previously pre-stored target sample data has been used up, the step of traversing the meta information sequence to determine a preset number of pieces of target meta information is executed again, while the previously pre-stored target sample data is evicted. By applying the embodiments of the present application, the sample data about to be used (consumed) by the training task is pre-stored according to the meta information sequence and the sample data that has already been used is evicted, so only a small amount of sample data needs to be pre-stored to meet the needs of the training task, without pre-storing all the sample data the training task requires, which saves the resource usage of the cache system. At the same time, pre-storing the sample data about to be used according to the meta information sequence avoids cache misses on sample data during execution of the training task, resolving the performance bottleneck caused by slow data access. In addition, the eviction and pre-storing proceed concurrently with the execution of the training task, so the whole process runs as a pipeline, shortening the time needed to execute the training task.
Furthermore, the meta information sequence is randomly generated by the Dataset Indexing Service component after it acquires the meta information corresponding to the sample data required by the training task. The random ordering of the meta information in the sequence guarantees the randomness of sample data usage (consumption) while the training task is executed according to the sequence, preventing the learning model obtained by executing the training task from overfitting.
On the basis of the above embodiments, optional embodiments are proposed. It should be noted that, for brevity, only the differences from the above embodiments are described in the optional embodiments.
In an embodiment of the present application, referring to Fig. 2, which shows a flow chart of the steps of another embodiment of a sample data processing method of the present application, the method includes the following steps:
Step 201: acquiring a machine learning job and the meta information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training.
Specifically, a machine learning job and the meta information corresponding to the sample data in the sample dataset used by the machine learning job for learning and training are acquired, where the meta information corresponding to the sample data in the sample dataset is stored in the distributed cache system, and the Dataset Indexing Service component obtains the meta information corresponding to the sample data from the distributed cache system.
Step 202: randomly generating a total meta information sequence from the meta information corresponding to the sample data in the sample dataset.
Based on performance analysis of programs that perform machine learning training on large-scale datasets, it can be found that the random data access order of machine learning training is the root cause of the significant decrease in data access efficiency. The data access order during machine learning training is completely random; if the cache system cannot cache all the data, a cache eviction policy such as LRU will evict the least recently used data, causing cache misses in the subsequent data accesses of the machine learning training.
Specifically, after the Dataset Indexing Service component obtains the meta information corresponding to the sample data from the distributed cache system, it randomly generates a total meta information sequence from that meta information, and the machine learning job can access (use) the sample data in the order of the meta information in the total sequence for learning and training.
In the embodiments of the present application, the meta information corresponding to the sample data is randomly shuffled into a total meta information sequence in advance. When the machine learning job performs learning and training, the sample data can be accessed in the order of the meta information in the total sequence, which guarantees the randomness of sample data usage during learning and training, prevents the resulting learning model from overfitting, and at the same time solves the problem of cache misses in the subsequent sample data accesses of the training.
Step 203: splitting the total meta information sequence into multiple meta information sequences, and splitting the machine learning job into multiple training tasks based on the multiple meta information sequences, where the multiple training tasks are executed in parallel.
Specifically, the Dataset Indexing Service component splits the total meta information sequence into multiple meta information sequences, splits the machine learning job into multiple training tasks based on the multiple meta information sequences, and schedules the multiple training tasks to different worker nodes for parallel execution, as shown in the sketch below. The steps performed by a worker node to execute a training task have been described above and are not repeated here.
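A minimal sketch of this splitting step is given below, assuming even, contiguous sharding; other sharding policies are equally possible and none is mandated by this application:

```python
def split_sequence(total_sequence, num_tasks):
    """Split the total meta information sequence into one sub-sequence
    per training task."""
    shard = (len(total_sequence) + num_tasks - 1) // num_tasks
    return [total_sequence[i * shard:(i + 1) * shard]
            for i in range(num_tasks)]

# e.g. split_sequence([1, 3, 6, 5, 4, 2], 2) yields [[1, 3, 6], [5, 4, 2]],
# one meta information sequence per worker node.
```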
In the embodiments of the present application, splitting the machine learning job into multiple training tasks based on multiple meta information sequences and scheduling the multiple training tasks to different worker nodes for parallel execution can shorten the time the machine learning job spends on learning and training.
In an embodiment of the present application, the method further includes: acquiring other machine learning jobs that use the sample dataset; and splitting the other machine learning jobs into multiple other training tasks based on the multiple meta information sequences, where an other training task is executed synchronously with the training task corresponding to the same meta information sequence.
Specifically, other machine learning jobs that use the sample dataset (of the machine learning job) are acquired. Because the machine learning job and the other machine learning jobs use the same sample dataset for learning and training, they can be coordinated in a centralized way to share the same total meta information sequence. Therefore, the other machine learning jobs can be split into multiple other training tasks based on the multiple meta information sequences, and the multiple other training tasks are scheduled to different worker nodes for execution. A training task and an other training task corresponding to the same meta information sequence can be executed synchronously; during execution, they use the same copy of the sample data, so the sample data in the distributed cache system is reused by multiple training tasks, improving cache utilization.
In the embodiments of the present application, splitting the machine learning job into multiple training tasks based on multiple meta information sequences and scheduling the multiple training tasks to different worker nodes for parallel execution can shorten the time the machine learning job spends on learning and training.
In addition, based on the multiple meta information sequences, the machine learning job is split into multiple training tasks and the other machine learning jobs are split into multiple other training tasks. A training task and an other training task corresponding to the same meta information sequence can be executed synchronously; during execution, they use the same copy of the sample data, so the sample data in the distributed cache system is reused by multiple training tasks, improving cache utilization.
On the basis of the above embodiments, optional embodiments are proposed. It should be noted that, for brevity, only the differences from the above embodiments are described in the optional embodiments.
In an embodiment of the present application, referring to Fig. 3, which shows a flow chart of the steps of an embodiment of adjusting the amount of pre-stored sample data of the present application, the method includes the following steps:
Step 301: determining the time spent executing the training task using the previously pre-stored target sample data.
Step 302: determining the shortest time historically spent executing the training task using pre-stored target sample data.
Step 303: calculating an anomaly time according to the shortest time and an outlier parameter.
Step 304: increasing the preset number when the time is greater than the anomaly time.
The amount of sample data pre-stored in the distributed cache system is jointly determined by the speed at which sample data is consumed by the training task and the speed at which sample data is pulled from the remote storage system. If too little sample data is pre-stored, some sample data will still not be in the distributed cache system when it is needed, and the resulting cache misses reduce sample data access efficiency; if too much sample data is pre-stored, it occupies too many storage resources in the distributed cache system.
Specifically, the time spent executing the training task using the previously pre-stored preset number of pieces of target sample data is determined, as well as the shortest time historically spent executing the training task using pre-stored target sample data. The shortest time is multiplied by the outlier parameter to obtain the anomaly time. When the time spent executing the training task using the previously pre-stored preset number of pieces of target sample data is greater than the anomaly time, this indicates that some sample data was still not in the distributed cache system when it was needed, so the preset number needs to be increased, i.e., the amount of sample data pre-stored in the distributed cache system is increased to meet the needs of executing the training task.
In an embodiment of the present application, increasing the preset number when the time is greater than the anomaly time includes: when the time is greater than the anomaly time, detecting the number of times the step of traversing the meta information sequence to determine a preset number of pieces of target meta information has been executed from the system time at which the preset number was last increased to the current system time; and increasing the preset number when that number of times is greater than a preset count.
For a machine learning job, the amount of sample data consumed between successive worker node requests is fixed, and the computation logic applied to these training samples is also fixed, so the overall computation time is stable and only small fluctuations may occur. Therefore, once the amount of sample data pre-stored in the distributed cache system has grown to a suitable amount, it does not need to be adjusted for a certain period of time.
Specifically, when the time spent executing the training task using the previously pre-stored preset number of pieces of target sample data is greater than the anomaly time, the number of times the step of traversing the meta information sequence to determine a preset number of pieces of target meta information has been executed from the system time at which the preset number was last increased to the current system time is detected. When that number of times is greater than the preset count, the preset number is increased, i.e., the amount of sample data pre-stored in the distributed cache system is increased; when that number of times is less than or equal to the preset count, the preset number does not need to be increased, i.e., the amount of sample data pre-stored in the distributed cache system does not need to be increased.
It should also be noted that the speed at which a machine learning job consumes sample data is relatively stable, so once the amount of sample data pre-stored in the distributed cache system has been scaled up to a suitable size, it usually does not need to be scaled down again.
Specifically, whether the preset number needs to be increased is determined by the following formulas:
$$t_i > \alpha \cdot T_i^{min} \qquad (1)$$
$$i > P \qquad (2)$$
where t_i denotes the time interval between the current request received from the worker node and the previous request received from the worker node, i.e., the time spent executing the training task using the previously pre-stored target sample data; T_i^min denotes the minimum of the request intervals over the past period, i.e., the shortest time historically spent executing the training task using pre-stored target sample data; α is the outlier parameter; i denotes the detected number of times the step of traversing the meta information sequence to determine a preset number of pieces of target meta information has been executed from the system time at which the preset number was last increased to the current system time; and P is the stability parameter, i.e., the preset count.
When t_i is greater than α·T_i^min and i is greater than P, the preset number is increased, i.e., the amount of sample data pre-stored in the distributed cache system is increased.
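The rule can be read as the following sketch, where the growth step and the use of a monotonic clock are assumptions made for illustration; only the two conditions t_i > α·T_i^min and i > P come from the formulas above:

```python
import time

class PrefetchController:
    """Enlarge the preset number only when the observed request interval
    is anomalously long (t_i > alpha * T_i_min) and the current preset
    number has already been kept for more than P traversal steps (i > P)."""

    def __init__(self, preset_number, alpha=2.0, p=10, grow_step=1):
        self.preset_number = preset_number
        self.alpha = alpha            # outlier parameter
        self.p = p                    # stability parameter (preset count)
        self.grow_step = grow_step    # assumed growth step
        self.t_min = None             # shortest interval seen so far
        self.i = 0                    # steps since the last increase
        self.last_request = None

    def on_worker_request(self):
        now = time.monotonic()
        if self.last_request is not None:
            t_i = now - self.last_request
            self.t_min = t_i if self.t_min is None else min(self.t_min, t_i)
            self.i += 1
            if t_i > self.alpha * self.t_min and self.i > self.p:
                self.preset_number += self.grow_step
                self.i = 0
        self.last_request = now
        return self.preset_number
```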
In the embodiments of the present application, by monitoring the time spent executing the training task using the preset number of pieces of target sample data, the amount of sample data pre-stored in the distributed cache system can be adjusted. This avoids pre-storing too little sample data, in which case some sample data would still not be in the distributed cache system when used and the resulting cache misses would reduce sample data access efficiency, as well as pre-storing too much sample data, which would occupy too many storage resources in the distributed cache system.
For a better understanding of the embodiments of the present application, the sample data processing method is illustrated below by way of example; it should be understood, however, that the embodiments of the present application are not limited thereto.
Referring to Fig. 4, which shows a schematic diagram of an embodiment of modifying a machine learning programming framework of the present application.
The technical solution of the present application requires modifying the underlying data access logic of the machine learning programming framework (e.g., PyTorch, TensorFlow), whereas users usually use the standard version of the framework, which does not contain the code logic implementing the technical solution of the present application. When a user submits a job, the Service Auto Injector component (hereinafter the Injector component) creates the Dataset Indexing Service component (hereinafter the Service component) in the cluster, and the Service component starts pre-storing sample data once it is up. Next, the Injector component injects an InitContainer (a container used for initialization work) into the machine learning job submitted by the user. The InitContainer uses the image defined by this solution, which contains the code changes to the machine learning programming framework. The InitContainer starts before the user-submitted machine learning job; when it starts, it overwrites the corresponding locations in the user image with the changed code logic, achieving a logic replacement that is transparent to the user. When the user-defined machine learning job starts, its data access process follows the workflow described below.
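For illustration, the effect of the injection can be sketched as a plain mutation of a pod specification; the container name, image, command and mount path below are placeholders, not values defined by this application:

```python
def inject_init_container(pod_spec, patch_image):
    """Prepend an InitContainer that copies the modified data-reading
    logic of the framework over the corresponding location before the
    user's training job starts (simplified, illustrative sketch)."""
    init = {
        "name": "framework-patcher",            # hypothetical name
        "image": patch_image,                   # image with patched code
        "command": ["sh", "-c", "cp -r /patch/* /target/"],
        "volumeMounts": [{"name": "patched-code", "mountPath": "/target"}],
    }
    pod_spec.setdefault("initContainers", []).insert(0, init)
    pod_spec.setdefault("volumes", []).append(
        {"name": "patched-code", "emptyDir": {}})
    return pod_spec
```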
It should be noted that, even without introducing the Dataset Indexing Service component, modifying the data access logic of the machine learning framework can likewise enable machine learning to manage the cache of the distributed cache system while accessing data.
Referring to Fig. 5, which shows a schematic framework diagram of an embodiment of a Kubernetes cluster of the present application. Before a machine learning training job starts, the Service component obtains the meta information of all the training sample data (sample data) in the entire dataset from the distributed cache system. The order in which the machine learning training process consumes sample data should be completely random to prevent the resulting machine learning model from overfitting; therefore, when the Service component obtains the meta information of the training samples, it first shuffles it to generate a random total meta information sequence, and splits the total sequence into multiple meta information sequences corresponding to the training tasks of multiple machine learning training Workers.
When any machine learning training Worker needs to consume data to execute its training task, the Worker requests training sample data from the Service component; the Service component traverses the meta information sequence corresponding to that Worker and returns the meta information corresponding to training sample data that has not yet been used.
The machine learning training Worker reads the corresponding sample data from the distributed cache system according to the returned meta information and executes the training task.
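The request path between a Worker and the Service component can be pictured with the following minimal sketch; the interface is an assumption made for illustration, not an API defined by this application:

```python
class DatasetIndexingService:
    """Keep one meta information sequence and a cursor per Worker; each
    request returns the next batch of not-yet-used meta information."""

    def __init__(self, sequences, preset_number):
        self.sequences = sequences                  # worker id -> sequence
        self.cursor = {w: 0 for w in sequences}
        self.preset_number = preset_number

    def next_batch(self, worker_id):
        seq = self.sequences[worker_id]
        pos = self.cursor[worker_id]
        batch = seq[pos:pos + self.preset_number]   # unused entries only
        self.cursor[worker_id] = pos + self.preset_number
        return batch

# Hypothetical usage:
service = DatasetIndexingService({"worker-0": [1, 3, 5, 6, 4, 2]}, 2)
service.next_batch("worker-0")   # -> [1, 3]
service.next_batch("worker-0")   # -> [5, 6]
```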
Referring to Fig. 6, which shows the first schematic framework diagram of an embodiment of sample data processing of the present application: sample data 1, 3 and 5 are already pre-stored in the distributed cache system. After the Service component returns meta information 3 and 5 to the machine learning training Worker, the Worker reads sample data 3 and 5 from the distributed cache system. Referring to Fig. 7, when the Worker then requests training samples from the Service component, the sliding window (dashed window) moves to the right and the new meta information 6 and 4 moves into the window, at which point the Service component immediately pre-stores sample data 6 and 4. As the window moves right, meta information 3 and 5 moves out of the window; since the sample data corresponding to meta information 3 and 5 has just been requested and is being consumed by the Worker for model training (executing the training task), the Service component does not immediately evict sample data 3 and 5 from the distributed cache system. When the Worker requests sample data from the Service component again, this marks that the previously requested sample data 3 and 5 has been consumed, and the Service component can then evict sample data 3 and 5.
In the embodiments of the present application, sample data that has been consumed will not be used again, so the Service component directs the distributed cache system to evict that sample data, saving the resource usage of the cache system. At the same time, when new meta information moves into the sliding window, the Service component directs the distributed cache system to pre-store the sample data corresponding to that meta information from the remote storage system into the distributed cache system, so that the machine learning training Worker hits the cache when it needs that sample data, avoiding the performance bottleneck caused by slow data access. The above eviction and pre-storing processes proceed concurrently with the model training of the machine learning training Worker, and the whole process runs as a pipeline, shortening the overall machine learning training time.
It should be noted that the method embodiments are all described as a series of action combinations for simplicity of description; however, those skilled in the art should know that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps can be performed in other orders or simultaneously. Moreover, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
On the basis of the above embodiments, this embodiment further provides a sample data processing apparatus, which is applied to electronic devices such as terminal devices and servers.
Referring to Fig. 8, which shows a structural block diagram of an embodiment of a sample data apparatus of the present application, the apparatus may specifically include the following modules:
a sequence acquisition module 801, configured to acquire a training task and a meta information sequence corresponding to the training task, where the meta information sequence includes several pieces of meta information and the meta information is used to index the corresponding sample data;
a sequence traversal module 802, configured to traverse the meta information sequence to determine a preset number of pieces of target meta information;
a task execution module 803, configured to pre-store the target sample data corresponding to the target meta information while using the previously pre-stored target sample data to execute the training task;
a data eviction module 804, configured to, when the previously pre-stored target sample data has been used up, return to the step of traversing the meta information sequence to determine a preset number of pieces of target meta information, while evicting the previously pre-stored target sample data.
In an embodiment of the present application, the apparatus further includes:
a job acquisition module, configured to acquire a machine learning job and the meta information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training;
a sequence generation module, configured to randomly generate a total meta information sequence from the meta information corresponding to the sample data in the sample dataset;
a job splitting module, configured to split the total meta information sequence into multiple meta information sequences and split the machine learning job into multiple training tasks based on the multiple meta information sequences, where the multiple training tasks are executed in parallel.
In an embodiment of the present application, the apparatus further includes:
the job acquisition module, further configured to acquire other machine learning jobs that use the sample dataset;
the job splitting module, configured to split the other machine learning jobs into multiple other training tasks based on the multiple meta information sequences, where an other training task is executed synchronously with the training task corresponding to the same meta information sequence.
In an embodiment of the present application, the apparatus further includes:
a time determination module, configured to determine the time spent executing the training task using the previously pre-stored target sample data;
the time determination module, further configured to determine the shortest time historically spent executing the training task using pre-stored target sample data;
a time calculation module, configured to calculate an anomaly time according to the shortest time and an outlier parameter;
a number increasing module, configured to increase the preset number when the time is greater than the anomaly time.
In an embodiment of the present application, the number increasing module includes:
a count determination submodule, configured to, when the time is greater than the anomaly time, detect the number of times the step of traversing the meta information sequence to determine a preset number of pieces of target meta information has been executed from the system time at which the preset number was last increased to the current system time;
a number increasing submodule, configured to increase the preset number when that number of times is greater than a preset count.
In an embodiment of the present application, the apparatus is applied to a Kubernetes cluster, the Kubernetes cluster is deployed with worker nodes and a distributed cache system, the worker nodes are used to execute the training task, and the distributed cache system is used to pre-store target sample data.
An embodiment of the present application further provides a non-volatile readable storage medium in which one or more modules (programs) are stored; when applied to a device, the one or more modules can cause the device to execute the instructions of the method steps in the embodiments of the present application.
An embodiment of the present application provides one or more machine-readable media having instructions stored thereon which, when executed by one or more processors, cause an electronic device to execute the method described in one or more of the above embodiments. In the embodiments of the present application, the electronic device includes various types of devices such as terminal devices and servers (clusters).
Embodiments of the present disclosure can be implemented as an apparatus in a desired configuration using any suitable hardware, firmware, software or any combination thereof; the apparatus may include electronic devices such as terminal devices and servers (clusters). Fig. 9 schematically shows an exemplary apparatus 900 that can be used to implement the various embodiments described in the present application.
For one embodiment, Fig. 9 shows an exemplary apparatus 900 having one or more processors 902, a control module (chipset) 904 coupled to at least one of the processor(s) 902, a memory 906 coupled to the control module 904, a non-volatile memory (NVM)/storage device 908 coupled to the control module 904, one or more input/output devices 910 coupled to the control module 904, and a network interface 912 coupled to the control module 904.
The processors 902 may include one or more single-core or multi-core processors, and may include any combination of general-purpose processors or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 900 can serve as a device such as the terminal devices and servers (clusters) described in the embodiments of the present application.
In some embodiments, the apparatus 900 may include one or more computer-readable media having instructions 914 (e.g., the memory 906 or the NVM/storage device 908), and one or more processors 902 that, in combination with the one or more computer-readable media, are configured to execute the instructions 914 so as to implement modules and thereby perform the actions described in the present disclosure.
For one embodiment, the control module 904 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 902 and/or to any suitable device or component in communication with the control module 904.
The control module 904 may include a memory controller module to provide an interface to the memory 906. The memory controller module may be a hardware module, a software module and/or a firmware module.
The memory 906 may be used, for example, to load and store data and/or instructions 914 for the apparatus 900. For one embodiment, the memory 906 may include any suitable volatile memory, e.g., suitable DRAM. In some embodiments, the memory 906 may include double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the control module 904 may include one or more input/output controllers to provide interfaces to the NVM/storage device 908 and the input/output device(s) 910.
For example, the NVM/storage device 908 may be used to store data and/or instructions 914. The NVM/storage device 908 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more hard disk drives (HDDs), one or more compact disc (CD) drives and/or one or more digital versatile disc (DVD) drives).
The NVM/storage device 908 may include storage resources that are physically part of the device on which the apparatus 900 is installed, or it may be accessible by the device without necessarily being part of the device. For example, the NVM/storage device 908 may be accessed over a network via the input/output device(s) 910.
The input/output device(s) 910 may provide interfaces for the apparatus 900 to communicate with any other suitable device, and may include communication components, audio components, sensor components, etc. The network interface 912 may provide an interface for the apparatus 900 to communicate over one or more networks; the apparatus 900 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example by accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, 5G, etc., or a combination thereof.
For one embodiment, at least one of the processor(s) 902 may be packaged together with the logic of one or more controllers (e.g., the memory controller module) of the control module 904. For one embodiment, at least one of the processor(s) 902 may be packaged together with the logic of one or more controllers of the control module 904 to form a system in package (SiP). For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with the logic of one or more controllers of the control module 904. For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with the logic of one or more controllers of the control module 904 to form a system on chip (SoC).
In various embodiments, the apparatus 900 may be, but is not limited to, a terminal device such as a server, a desktop computing device or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet computer, a netbook, etc.). In various embodiments, the apparatus 900 may have more or fewer components and/or a different architecture. For example, in some embodiments the apparatus 900 includes one or more cameras, a keyboard, a liquid crystal display (LCD) screen (including a touch-screen display), a non-volatile memory port, multiple antennas, a graphics chip, an application-specific integrated circuit (ASIC) and a speaker.
In a detection apparatus, a main control chip may be used as the processor or control module, sensor data, position information and the like are stored in the memory or the NVM/storage device, a sensor group may serve as the input/output device, and the communication interface may include the network interface.
As for the apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for relevant points, refer to the description of the method embodiments.
The embodiments in this specification are all described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments can be referred to one another.
The embodiments of the present application are described with reference to flow charts and/or block diagrams of the method, terminal device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable xxxx terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable xxxx terminal device produce an apparatus for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable xxxx terminal device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, which implements the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable xxxx terminal device, so that a series of operating steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.
Finally, it should also be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or terminal device that includes the element.
The sample data processing method and apparatus, the electronic device and the storage medium provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application according to the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (11)

  1. A sample data processing method, characterized in that the method comprises:
    acquiring a training task and a meta information sequence corresponding to the training task, the meta information sequence comprising several pieces of meta information, the meta information being used to index the corresponding sample data;
    traversing the meta information sequence to determine a preset number of pieces of target meta information;
    pre-storing the target sample data corresponding to the target meta information, while using the previously pre-stored target sample data to execute the training task;
    when the previously pre-stored target sample data has been used up, returning to the step of traversing the meta information sequence to determine a preset number of pieces of target meta information, while evicting the previously pre-stored target sample data.
  2. The method according to claim 1, characterized by further comprising:
    acquiring a machine learning job and the meta information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training;
    randomly generating a total meta information sequence from the meta information corresponding to the sample data in the sample dataset;
    splitting the total meta information sequence into multiple meta information sequences, and splitting the machine learning job into multiple training tasks based on the multiple meta information sequences, wherein the multiple training tasks are executed in parallel.
  3. The method according to claim 2, characterized by further comprising:
    acquiring other machine learning jobs that use the sample dataset;
    splitting the other machine learning jobs into multiple other training tasks based on the multiple meta information sequences, wherein an other training task is executed synchronously with the training task corresponding to the same meta information sequence.
  4. The method according to claim 1, characterized by further comprising:
    determining the time spent executing the training task using the previously pre-stored target sample data;
    determining the shortest time historically spent executing the training task using pre-stored target sample data;
    calculating an anomaly time according to the shortest time and an outlier parameter;
    increasing the preset number when the time is greater than the anomaly time.
  5. The method according to claim 4, characterized in that increasing the preset number when the time is greater than the anomaly time comprises:
    when the time is greater than the anomaly time, detecting the number of times the step of traversing the meta information sequence to determine a preset number of pieces of target meta information has been executed from the system time at which the preset number was last increased to the current system time;
    increasing the preset number when the number of times is greater than a preset count.
  6. The method according to claim 1, characterized in that the method is applied to a Kubernetes cluster, the Kubernetes cluster is deployed with worker nodes and a distributed cache system, the worker nodes are used to execute the training task, and the distributed cache system is used to pre-store target sample data.
  7. The method according to claim 6, characterized in that the Kubernetes cluster is deployed with a machine learning framework, and the data reading module of the machine learning framework is replaced by a Dataset Indexing Service component, so that the meta information sequence corresponding to the training task is maintained by the Dataset Indexing Service component.
  8. A sample data processing apparatus, characterized in that the apparatus comprises:
    a sequence acquisition module, configured to acquire a training task and a meta information sequence corresponding to the training task, the meta information sequence comprising several pieces of meta information, the meta information being used to index the corresponding sample data;
    a sequence traversal module, configured to traverse the meta information sequence to determine a preset number of pieces of target meta information;
    a task execution module, configured to pre-store the target sample data corresponding to the target meta information while using the previously pre-stored target sample data to execute the training task;
    a data eviction module, configured to, when the previously pre-stored target sample data has been used up, return to the step of traversing the meta information sequence to determine a preset number of pieces of target meta information, while evicting the previously pre-stored target sample data.
  9. The apparatus according to claim 8, characterized by further comprising:
    a job acquisition module, configured to acquire a machine learning job and the meta information corresponding to the sample data in the sample dataset that the machine learning job needs to use for learning and training;
    a sequence generation module, configured to randomly generate a total meta information sequence from the meta information corresponding to the sample data in the sample dataset;
    a job splitting module, configured to split the total meta information sequence into multiple meta information sequences and split the machine learning job into multiple training tasks based on the multiple meta information sequences, wherein the multiple training tasks are executed in parallel.
  10. An electronic device, characterized by comprising: a processor; and
    a memory having executable code stored thereon which, when executed, causes the processor to execute the sample data processing method according to one or more of claims 1-7.
  11. One or more machine-readable media having executable code stored thereon which, when executed, causes a processor to execute the sample data processing method according to one or more of claims 1-7.
PCT/CN2022/118411 2021-09-28 2022-09-13 Sample data processing method, apparatus, device and storage medium WO2023051228A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111144871.7 2021-09-28
CN202111144871.7A CN113988306A (zh) 2021-09-28 2021-09-28 Sample data processing method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2023051228A1 true WO2023051228A1 (zh) 2023-04-06

Family

ID=79737034

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118411 WO2023051228A1 (zh) 2021-09-28 2022-09-13 样例数据的处理方法、装置、设备和存储介质

Country Status (2)

Country Link
CN (1) CN113988306A (zh)
WO (1) WO2023051228A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988306A (zh) * 2021-09-28 2022-01-28 阿里巴巴(中国)有限公司 样例数据的处理方法、装置、设备和存储介质
CN114579269A (zh) * 2022-02-08 2022-06-03 阿里巴巴(中国)有限公司 任务调度方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018049563A1 (en) * 2016-09-13 2018-03-22 Huawei Technologies Co., Ltd. Systems and methods for caching
CN109286622A (zh) 2018-09-26 2019-01-29 天津理工大学 Network intrusion detection method based on a learning rule set
CN111259384A (zh) 2020-01-17 2020-06-09 中国科学院计算技术研究所 Processor transient attack defense method based on random cache invalidation
CN111527479A (zh) 2018-01-10 2020-08-11 Arm有限公司 Speculative cache storage region
CN113424144A (zh) 2019-03-12 2021-09-21 英特尔公司 Computational data storage system
CN113988306A (zh) 2021-09-28 2022-01-28 阿里巴巴(中国)有限公司 Sample data processing method, apparatus, device and storage medium


Also Published As

Publication number Publication date
CN113988306A (zh) 2022-01-28


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874615

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE