CN113988306A - Sample data processing method, device, equipment and storage medium


Info

Publication number
CN113988306A
Authority
CN
China
Prior art keywords
meta
information
sample data
training
target
Legal status
Pending
Application number
CN202111144871.7A
Other languages
Chinese (zh)
Inventor
徐之浩
车漾
张凯
顾荣
Current Assignee
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Application filed by Alibaba China Co Ltd, Alibaba Cloud Computing Ltd
Priority to CN202111144871.7A
Publication of CN113988306A
Priority to PCT/CN2022/118411 (WO2023051228A1)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning


Abstract

The embodiment of the present application provides a sample data processing method, device, equipment, and storage medium, where the method includes: acquiring a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes a plurality of pieces of meta-information, and the meta-information is used to index corresponding sample data; traversing the meta-information sequence to determine a preset amount of target meta-information; pre-storing the target sample data corresponding to the target meta-information, and executing the training task using the previously pre-stored target sample data; and when the previously pre-stored target sample data is used up, returning to the step of traversing the meta-information sequence to determine a preset amount of target meta-information, while evicting the previously pre-stored target sample data. By applying this embodiment, the sample data to be used by the training task is pre-stored according to the meta-information sequence and used sample data is evicted, so the requirements of the training task can be met with only a small amount of pre-stored sample data, saving the resource usage of the cache system.

Description

Sample data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing sample data, an electronic device, and a storage medium.
Background
In recent years, with the development of research on heterogeneous computing devices themselves, more and more heterogeneous computing devices with stronger computing power have emerged, further accelerating the machine learning training process. However, this growth in data processing capability places higher demands on a program's data access speed, and the compute-storage-separation architecture adopted on the cloud further limits that speed; data access is therefore gradually becoming the main performance bottleneck of machine learning training programs.
In order to solve the above problems, a computing-side distributed cache scheme is generally adopted at present to accelerate data access: by caching data in a storage system within the computing environment, a machine learning training job running in that environment can obtain the data it needs with lower latency and higher bandwidth. Because machine learning typically performs many rounds of training, the cached data can be reused across rounds, improving the efficiency of the machine learning training process.
However, as the data sets used become larger, caching systems face greater challenges. On the one hand, to obtain the maximum data access speed, the cache system needs to cache the entire data set, which means it occupies a large amount of storage resources in the computing environment. On the other hand, if the cache system cannot fully cache the entire data set, the data access efficiency of the machine learning training job is significantly reduced compared with the fully cached case, constrained by the cache system's eviction policy.
Disclosure of Invention
The embodiment of the present application provides a sample data processing method, aiming to solve the problems that the cache system occupies a large amount of storage resources in the computing environment and that, when the cache system cannot fully cache the data in the entire data set, the data access efficiency of machine learning training jobs is low.
Correspondingly, the embodiment of the application also provides a sample data processing device, an electronic device and a storage medium, which are used for ensuring the implementation and application of the method.
In order to solve the above problem, an embodiment of the present application discloses a method for processing sample data, where the method includes:
acquiring a training task and a meta-information sequence corresponding to the training task, wherein the meta-information sequence comprises a plurality of meta-information, and the meta-information is used for indexing corresponding sample data;
traversing the meta-information sequence to determine a preset amount of target meta-information;
pre-storing target sample data corresponding to the target meta-information, and executing the training task by using the previously pre-stored target sample data;
and when the previously pre-stored target sample data is used up, returning to execute the step of traversing the meta-information sequence to determine a preset amount of target meta-information, while evicting the previously pre-stored target sample data.
Optionally, the method further comprises:
acquiring a machine learning job and the meta-information corresponding to the sample data in the sample data set that the machine learning job needs to use for learning training;
randomly generating a total meta-information sequence using the meta-information corresponding to the sample data in the sample data set;
splitting the total meta-information sequence into a plurality of meta-information sequences, and splitting the machine learning job into a plurality of training tasks based on the plurality of meta-information sequences; wherein the plurality of training tasks are executed in parallel.
Optionally, the method further comprises:
obtaining other machine learning jobs using the sample dataset;
splitting the other machine learning jobs into a plurality of other training tasks based on the plurality of meta-information sequences; wherein the other training task and the training task corresponding to the same meta-information sequence are executed synchronously.
Optionally, the method further comprises:
determining the time taken to execute the training task using the previously pre-stored target sample data;
determining the shortest time historically taken to execute the training task using pre-stored target sample data;
calculating an anomaly time from the shortest time and an outlier parameter;
and increasing the preset amount when the time is greater than the anomaly time.
Optionally, increasing the preset amount when the time is greater than the anomaly time includes:
when the time is greater than the anomaly time, detecting how many times the step of traversing the meta-information sequence to determine a preset amount of target meta-information has been executed between the system time at which the preset amount was last increased and the current system time;
and increasing the preset amount when that count is greater than a preset count.
Optionally, the method is applied to a Kubernetes cluster, where the Kubernetes cluster is deployed with working nodes and a distributed cache system, the working nodes are used to execute the training task, and the distributed cache system is used to pre-store the target sample data.
The embodiment of the application also discloses a processing device of the sample data, the device comprises:
a sequence acquisition module, configured to acquire a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes a plurality of pieces of meta-information, and the meta-information is used to index corresponding sample data;
the sequence traversing module is used for traversing the meta-information sequence and determining a preset amount of target meta-information;
the task execution module is used for pre-storing target sample data corresponding to the target meta-information and executing the training task by using the target sample data pre-stored last time;
and a data eviction module, configured to, when the previously pre-stored target sample data is used up, return to executing the step of traversing the meta-information sequence to determine a preset amount of target meta-information, and evict the previously pre-stored target sample data.
Optionally, the apparatus further comprises:
a job acquisition module, configured to acquire a machine learning job and the meta-information corresponding to the sample data in the sample data set that the machine learning job needs to use for learning training;
a sequence generation module, configured to randomly generate a total meta-information sequence using the meta-information corresponding to the sample data in the sample data set;
a job splitting module, configured to split the total meta-information sequence into a plurality of meta-information sequences and split the machine learning job into a plurality of training tasks based on the plurality of meta-information sequences, where the plurality of training tasks are executed in parallel.
Optionally, the apparatus further comprises:
the job acquisition module, further configured to acquire other machine learning jobs that use the sample data set;
the job splitting module, further configured to split the other machine learning jobs into a plurality of other training tasks based on the plurality of meta-information sequences, where the other training task and the training task corresponding to the same meta-information sequence are executed synchronously.
Optionally, the apparatus further comprises:
a time determination module, configured to determine the time taken to execute the training task using the previously pre-stored target sample data;
the time determination module, further configured to determine the shortest time historically taken to execute the training task using pre-stored target sample data;
a time calculation module, configured to calculate an anomaly time from the shortest time and an outlier parameter;
an amount increasing module, configured to increase the preset amount when the time is greater than the anomaly time.
Optionally, the amount increasing module includes:
a count determination submodule, configured to, when the time is greater than the anomaly time, detect how many times the step of traversing the meta-information sequence to determine a preset amount of target meta-information has been executed between the system time at which the preset amount was last increased and the current system time;
and an amount increasing submodule, configured to increase the preset amount when that count is greater than a preset count.
Optionally, the apparatus is applied to a Kubernetes cluster, where the Kubernetes cluster is deployed with working nodes and a distributed cache system, the working nodes are used to execute the training task, and the distributed cache system is used to pre-store the target sample data.
Optionally, the Kubernetes cluster is deployed with a machine learning framework, and the data reading module of the machine learning framework is replaced with a Dataset Indexing Service component, so that the meta-information sequence corresponding to the training task is maintained by the Dataset Indexing Service component.
The embodiment of the present application also discloses an electronic device, comprising: a processor; and a memory having executable code stored thereon, where the executable code, when executed, causes the processor to perform the sample data processing method described in one or more of the embodiments of the present application.
The embodiment of the present application also discloses one or more machine-readable media having executable code stored thereon, where the executable code, when executed, causes a processor to perform the sample data processing method described in one or more of the embodiments of the present application.
Compared with the prior art, the embodiment of the application has the following advantages:
in the embodiment of the present application, a training task and a meta-information sequence corresponding to the training task are acquired, where the meta-information sequence includes a plurality of pieces of meta-information, and the meta-information is used to index corresponding sample data; the meta-information sequence is traversed to determine a preset amount of target meta-information; the target sample data corresponding to the target meta-information is pre-stored, and the training task is executed using the previously pre-stored target sample data; when the previously pre-stored target sample data is used up, the step of traversing the meta-information sequence to determine a preset amount of target meta-information is executed again, while the previously pre-stored target sample data is evicted. By applying this embodiment, the sample data to be used by the training task is pre-stored according to the order of the meta-information in the meta-information sequence, and used sample data is evicted; only a small amount of sample data needs to be pre-stored to meet the requirements of the training task, saving the resource usage of the cache system. Meanwhile, because the sample data to be used by the training task is pre-stored according to the meta-information sequence, cache misses during training are avoided, resolving the performance bottleneck caused by slow data access. In addition, the eviction and pre-store processes proceed synchronously with the execution of the training task, and the whole procedure runs as a pipeline, shortening the time needed to execute the training task.
In addition, after the meta-information corresponding to the sample data required by the training task is acquired, the meta-information sequence is randomly generated from that meta-information; the random ordering of the meta-information in the sequence ensures the randomness of sample data use when the training task is executed according to the sequence, preventing overfitting of the learning model obtained by executing the training task.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of a sample data processing method of the present application;
FIG. 2 is a flow chart of steps in another sample data processing method embodiment of the present application;
FIG. 3 is a flowchart illustrating steps of an embodiment of adjusting the amount of pre-stored sample data according to the present application;
FIG. 4 is a schematic diagram of a modified embodiment of a machine learning programming framework of the present application;
FIG. 5 is a schematic framework diagram of a Kubernetes cluster embodiment of the present application;
FIG. 6 is the first framework diagram of a sample data processing embodiment of the present application;
FIG. 7 is the second framework diagram of a sample data processing embodiment of the present application;
FIG. 8 is a structural block diagram of an embodiment of a sample data processing apparatus of the present application;
FIG. 9 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Benefiting from advances in data storage and data collection technologies over the past few years, machine learning, a method of learning expert knowledge from data, has become a common approach to solving computer cognition problems, and it has exhibited a strong ability to solve practical problems in various fields. The machine learning method mainly comprises two processes: training and inference. The training process learns the correct correlations within data from a large-scale data set to acquire experience, and the inference process makes judgments on newly arriving data according to that experience.
Machine learning training is often performed using heterogeneous computing devices such as GPUs to achieve parallel acceleration of data processing. In recent years, with the development of research on heterogeneous computing devices themselves, more and more heterogeneous computing devices with stronger computing power have emerged, further accelerating the machine learning training process. This acceleration of data processing places higher demands on a program's data access speed, while the compute-storage-separation architecture adopted on the cloud further limits that speed, so data access gradually becomes the main performance bottleneck of machine learning training programs. To solve these problems, the prior art proposes the following solutions:
by using the Alluxio and other caching technologies popular in the industry at present, distributed caching can be realized on a computing side, the storage resources of all nodes can be integrated by the distributed caching, a larger caching pool is provided, and the whole data set required by machine learning training is cached. But requires a large amount of memory resources. As the size of the data set is larger and larger, the required storage resources are also larger and larger, and especially in a multi-tenant scenario, multiple machine learning training jobs simultaneously require multiple different data sets, which causes a distributed cache system to face a larger burden.
Optimizing the data access process inside the machine learning programming framework (such as PyTorch or TensorFlow). Data Echoing, for example, improves the resource utilization of the computing device by replaying previously used data when a data access bottleneck occurs; however, such optimization essentially modifies the semantics of machine learning training and may affect the effectiveness of the machine learning method.
The data pre-storing (prefetching) capability provided by the PyTorch and TensorFlow frameworks also helps mitigate the performance impact of data access bottlenecks; however, the pre-store process relies on the user-written machine learning training program. Before the program runs, the user cannot predict whether a data access bottleneck will occur, so the pre-store function provided by the framework cannot be configured appropriately.
To solve the above problems, the core idea of the present application is: when a machine learning application is deployed in a cluster, the existing data reading component is replaced at the framework layer of the application container by means of injection, and unified pre-storing and eviction of sample data improves the application's running speed at a lower data-space cost. Specifically:
1. Introduce a Dataset Indexing Service component that automatically replaces the data reading module of machine learning frameworks such as TensorFlow and PyTorch and controls the order in which the machine learning training job accesses sample data.
2. The Dataset Indexing Service component performs pre-store and eviction management of the sample data in the cache system according to the data access order characteristics of different applications, using the cache efficiently.
3. Dynamically control the pre-store behavior of sample data according to the machine learning training speed, ensuring as far as possible that the machine learning training process is not affected by data access bottlenecks.
The present application provides a sample data processing method and apparatus, which are explained in detail in the following embodiments. First, the terms involved in one or more embodiments of the present application are explained.
Kubernetes: an open-source system for automatically deploying, scaling, and managing containerized applications.
Machine learning: essentially, the use of data to solve computer cognition problems, learning knowledge from data, understanding information, and even making predictions.
Remote storage system: a cloud storage or storage server, remote from the computing side, used for storing the training sample data set.
Distributed cache: in a distributed environment or system, storing remotely stored data on machines close to the user or application, to reduce the delay of remote data transmission and allow users and applications to quickly access the data they need.
GPU: graphics processor; like a CPU, but designed specifically to perform complex mathematical and geometric calculations, and very widely used in artificial-intelligence applications.
Working node: a compute node, for example one equipped with GPUs, used for learning training.
Meta-information: also called metadata, is data that describes data (data about data), mainly describing data properties, and is used to support functions such as indicating storage locations, recording history, searching resources, and keeping file records. Like an electronic catalog, it describes and collects the contents or features of the data so as to assist data retrieval.
Referring to FIG. 1, a flowchart of steps of an embodiment of a sample data processing method according to the present application is shown, including the following steps:
step 101, a training task and a meta-information sequence corresponding to the training task are obtained, where the meta-information sequence includes a plurality of meta-information, and the meta-information is used to index corresponding sample data.
The method is applied to a Kubernetes cluster. The Kubernetes cluster is deployed with machine learning frameworks such as TensorFlow and PyTorch, into which a Dataset Indexing Service component is introduced to automatically replace the framework's data reading module; the Kubernetes cluster is also deployed with working nodes and a distributed cache system.
Specifically, a training task and the meta-information sequence corresponding to the training task are obtained, the training task is scheduled onto a working node, and the meta-information sequence corresponding to the training task is maintained by the Dataset Indexing Service component. The meta-information sequence is generated randomly after the Dataset Indexing Service component acquires the meta-information corresponding to the sample data required by the training task; the random ordering of the meta-information in the sequence ensures the randomness of sample data use (consumption) when the training task is executed according to the sequence, preventing overfitting of the learning model obtained by executing the training task.
Step 102: traversing the meta-information sequence to determine a preset amount of target meta-information.
Specifically, when the working node needs to consume (use) sample data to execute the training task, it requests sample data from the Dataset Indexing Service component. The Dataset Indexing Service component determines a preset amount of target meta-information by traversing the meta-information sequence in order and returns the target meta-information to the working node. For example, if the preset amount is 2 and the meta-information sequence is 1, 3, 6, 5, 4, 5, the preset amount of target meta-information is determined to be 1 and 3.
Step 103: pre-storing the target sample data corresponding to the target meta-information, and executing the training task using the previously pre-stored target sample data.
The remote storage system stores sample data corresponding to the meta-information in the meta-information sequence.
Specifically, the target sample data corresponding to the target meta-information is pulled from the remote storage system and pre-stored in the distributed cache system, and the training task is executed using the previously pre-stored target sample data. For example, with the meta-information sequence 1, 3, 6, 5, 4, 5 and sample data 1 and 3 already pre-stored in the distributed cache system: when the working node requests sample data from the Dataset Indexing Service component, the component traverses the meta-information sequence in order and determines the target meta-information to be 6 and 5, so sample data 6 and 5 are pre-stored into the distributed cache system, while the working node acquires sample data 1 and 3 from the distributed cache system and executes the training task.
Step 104: when the previously pre-stored target sample data is used up, returning to execute the step of traversing the meta-information sequence to determine a preset amount of target meta-information, while evicting the previously pre-stored target sample data.
Specifically, when the previously pre-stored target sample data has been used (consumed), the working node requests sample data from the Dataset Indexing Service component again, and the component traverses the meta-information sequence in order again to determine a new preset amount of target meta-information. It should be noted that meta-information whose sample data has already been pre-stored is not determined as target meta-information again. For example, if the preset amount is 2 and the meta-information sequence is 1, 3, 6, 5, 4, 5, and the previous traversal determined the target meta-information to be 1 and 3, the new traversal starts from meta-information 6 and determines the target meta-information to be 6 and 5. Meanwhile, the target sample data previously pre-stored in the distributed cache system is evicted.
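To make the cycle of steps 102 to 104 concrete, the following is a minimal sketch in Python. It assumes a callable-based cache interface; the class name DatasetIndexingService, the window_size parameter (the preset amount), and the prestore/evict callables are illustrative assumptions, not names defined by this application.

```python
from collections import deque
from typing import Callable, List

class DatasetIndexingService:
    """Sliding-window pre-store/evict cycle over one meta-information sequence."""

    def __init__(self, meta_sequence: List[int], window_size: int,
                 prestore: Callable[[List[int]], None],
                 evict: Callable[[List[int]], None]) -> None:
        self.sequence = deque(meta_sequence)  # randomly ordered meta-information
        self.window_size = window_size        # the "preset amount"
        self.prestore = prestore              # pulls samples from remote storage into the cache
        self.evict = evict                    # removes samples from the distributed cache
        self.consumed: List[int] = []         # batch handed out on the previous call
        self.in_flight = self._advance()      # first window, pre-stored up front
        if self.in_flight:
            self.prestore(self.in_flight)

    def next_batch(self) -> List[int]:
        """Called by a working node whenever it needs more sample data."""
        if self.consumed:                     # the previous batch is now used up:
            self.evict(self.consumed)         # evict its sample data from the cache
        self.consumed = self.in_flight        # hand over the already-cached window
        self.in_flight = self._advance()      # slide the window forward and
        if self.in_flight:                    # pre-store the next window while the
            self.prestore(self.in_flight)     # worker trains (pipeline overlap)
        return self.consumed

    def _advance(self) -> List[int]:
        """Take up to window_size entries off the front of the sequence."""
        return [self.sequence.popleft()
                for _ in range(min(self.window_size, len(self.sequence)))]
```

On each request the already-cached window is returned while the next window is pre-stored, so eviction and pre-storing always overlap with training, as in the pipeline described above.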
In the embodiment of the present application, a training task and a meta-information sequence corresponding to the training task are acquired, where the meta-information sequence includes a plurality of pieces of meta-information, and the meta-information is used to index corresponding sample data; the meta-information sequence is traversed to determine a preset amount of target meta-information; the target sample data corresponding to the target meta-information is pre-stored, and the training task is executed using the previously pre-stored target sample data; when the previously pre-stored target sample data is used up, the step of traversing the meta-information sequence to determine a preset amount of target meta-information is executed again, while the previously pre-stored target sample data is evicted. By applying this embodiment, the sample data to be used (consumed) by the training task is pre-stored according to the meta-information sequence and used sample data is evicted; only a small amount of sample data needs to be pre-stored to meet the requirements of the training task, without pre-storing all the sample data the training task requires, saving the resource usage of the cache system. Meanwhile, because the sample data to be used by the training task is pre-stored according to the meta-information sequence, cache misses during training are avoided, resolving the performance bottleneck caused by slow data access. In addition, the eviction and pre-store processes proceed synchronously with the execution of the training task, and the whole procedure runs as a pipeline, shortening the time needed to execute the training task.
In addition, the meta-information sequence is generated randomly after the Dataset Indexing Service component acquires the meta-information corresponding to the sample data required by the training task; the random ordering of the meta-information in the sequence ensures the randomness of sample data use (consumption) when the training task is executed according to the sequence, preventing overfitting of the learning model obtained by executing the training task.
On the basis of the above embodiments, alternative embodiments are proposed, and it should be noted herein that, in order to make the description brief, only the differences from the above embodiments are described in the alternative embodiments.
In an embodiment of the present application, referring to FIG. 2, a flowchart of steps of another embodiment of the sample data processing method of the present application is shown, including the following steps:
step 201, obtaining a machine learning operation and meta information corresponding to sample data in a sample data set required to be used when the machine learning operation is used for learning training.
Specifically, a machine learning job and meta information corresponding to sample data in a sample data set required to be used when the machine learning job performs learning training are obtained, wherein the meta information corresponding to the sample data in the sample data set is stored in a distributed cache system, and the meta information corresponding to the sample data is obtained from the distributed cache system through a Dataset indicating Service component.
Step 202, randomly generating a meta-information total sequence by using meta-information corresponding to the sample data in the sample data set.
Performance analysis of programs performing machine learning training on large-scale data sets shows that the random data access order of machine learning training is the root cause of the significant drop in data access efficiency. If the cache system cannot cache all the data, a cache eviction policy such as LRU will evict data that has not been used recently, causing cache misses on subsequent data accesses.
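This effect is easy to reproduce. The sketch below, whose function and parameters are hypothetical rather than taken from the patent, simulates per-epoch shuffled access against an LRU cache that holds only part of the data set; the hit rate falls to roughly the cache-to-data-set size ratio, regardless of recency.

```python
import random
from collections import OrderedDict

def lru_hit_rate(dataset_size: int, cache_size: int, epochs: int = 5) -> float:
    """Simulate per-epoch random (shuffled) access against an LRU cache."""
    cache: OrderedDict[int, None] = OrderedDict()
    hits = accesses = 0
    for _ in range(epochs):
        order = random.sample(range(dataset_size), dataset_size)  # fresh shuffle each epoch
        for item in order:
            accesses += 1
            if item in cache:
                hits += 1
                cache.move_to_end(item)        # refresh recency on a hit
            else:
                if len(cache) >= cache_size:
                    cache.popitem(last=False)  # evict the least recently used item
                cache[item] = None
    return hits / accesses

# With a cache holding only half the data set, roughly half of the accesses
# miss, no matter how "recent" the evicted items were.
print(lru_hit_rate(dataset_size=10_000, cache_size=5_000))
```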
Specifically, after the Dataset Indexing Service component acquires the meta-information corresponding to the sample data from the distributed cache system, it uses the meta-information to randomly generate a total meta-information sequence, and the machine learning job can then access (use) sample data in the order of the meta-information in the total sequence to perform learning training.
In this embodiment, the total meta-information sequence is randomly generated in advance from the meta-information corresponding to the sample data, and during learning training the sample data can be accessed in the order of the meta-information in the total sequence. This ensures the randomness of sample data use during learning training, preventing overfitting of the obtained learning model, and at the same time resolves the problem of cache misses on subsequent sample data accesses.
Step 203: splitting the total meta-information sequence into a plurality of meta-information sequences, and splitting the machine learning job into a plurality of training tasks based on the plurality of meta-information sequences, where the plurality of training tasks are executed in parallel.
Specifically, the total meta-information sequence is split by the Dataset Indexing Service component into a plurality of meta-information sequences, the machine learning job is split into a plurality of training tasks based on the plurality of meta-information sequences, and the training tasks are scheduled onto different working nodes to be executed in parallel. The steps by which a working node executes a training task are described above and are not repeated here.
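A minimal sketch of this shuffle-then-split step is shown below; the function name and the contiguous near-equal split policy are illustrative assumptions.

```python
import random
from typing import List

def build_meta_sequences(meta_info: List[int], num_tasks: int) -> List[List[int]]:
    """Randomly generate the total meta-information sequence, then split it
    into one sub-sequence per training task (contiguous, near-equal shares)."""
    total = meta_info[:]      # copy so the caller's list is untouched
    random.shuffle(total)     # the random total meta-information sequence
    k, r = divmod(len(total), num_tasks)
    sequences, start = [], 0
    for i in range(num_tasks):
        end = start + k + (1 if i < r else 0)
        sequences.append(total[start:end])
        start = end
    return sequences

# Each sub-sequence drives one training task; the tasks run in parallel on
# different working nodes, each served by its own sliding window.
splits = build_meta_sequences(list(range(100)), num_tasks=4)
```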
In this embodiment, the machine learning job is split into a plurality of training tasks based on the plurality of meta-information sequences, and the training tasks are scheduled onto different working nodes to be executed in parallel, shortening the time the machine learning job spends on learning training.
In an embodiment of the present application, the method further includes: obtaining other machine learning jobs that use the sample data set; and splitting the other machine learning jobs into a plurality of other training tasks based on the plurality of meta-information sequences, where the other training task and the training task corresponding to the same meta-information sequence are executed synchronously.
Specifically, other machine learning jobs using the sample data set (of the machine learning job) are obtained. Because the machine learning job and the other machine learning jobs use the same sample data set during learning training, they can be coordinated centrally to share the same total meta-information sequence. The other machine learning jobs can therefore be split into a plurality of other training tasks based on the plurality of meta-information sequences, and these other training tasks are scheduled onto different working nodes for execution. An other training task and the training task corresponding to the same meta-information sequence can be executed synchronously, and during execution they use the same sample data, so the sample data in the distributed cache system is reused by multiple training tasks, improving cache utilization.
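Continuing the sketch above purely for illustration (none of these names appear in the patent), sharing works because the same splits drive both jobs: tasks paired on a sequence request identical windows in identical order, so one cached copy of each sample serves both reads before eviction.

```python
# Hypothetical pairing: task i of job A and task i of job B share splits[i].
job_a_tasks = [{"job": "A", "sequence": seq} for seq in splits]
job_b_tasks = [{"job": "B", "sequence": seq} for seq in splits]

for a, b in zip(job_a_tasks, job_b_tasks):
    # Same list object, hence the same access order: one pre-stored copy of
    # each sample in the distributed cache serves both training tasks.
    assert a["sequence"] is b["sequence"]
```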
In this embodiment, the machine learning job is split into a plurality of training tasks based on the plurality of meta-information sequences, and the training tasks are scheduled onto different working nodes to be executed in parallel, shortening the time the machine learning job spends on learning training.
In addition, based on the plurality of meta-information sequences, the machine learning job is split into a plurality of training tasks and the other machine learning jobs are split into a plurality of other training tasks; the training task and the other training task corresponding to the same meta-information sequence can be executed synchronously, and during execution they use the same sample data, so the sample data in the distributed cache system is reused by multiple training tasks, improving cache utilization.
On the basis of the above embodiments, alternative embodiments are proposed, and it should be noted herein that, in order to make the description brief, only the differences from the above embodiments are described in the alternative embodiments.
In an embodiment of the present application, referring to FIG. 3, a flowchart of steps of an embodiment of adjusting the amount of pre-stored sample data according to the present application is shown, including the following steps:
step 301: determining a time taken to perform the training task using the previously pre-stored target sample data.
Step 302: determining a minimum time it takes to perform the training task using pre-stored target sample data in history.
Step 303: and calculating to obtain abnormal time according to the shortest time and the abnormal value parameters.
Step 304: and when the time is greater than the abnormal time, increasing the preset number.
The amount of sample data pre-stored in the distributed cache system is related to the speed at which sample data is consumed when the training task is executed and to the speed at which sample data is pulled from the remote storage system. If too little sample data is pre-stored, some sample data will not yet be in the distributed cache system when it is needed, and the resulting cache misses reduce data access efficiency; if too much sample data is pre-stored, it occupies too many storage resources in the distributed cache system.
Specifically, the time taken to execute the training task using the previously pre-stored preset amount of target sample data is determined, as well as the shortest time historically taken to execute the training task using pre-stored target sample data, and the anomaly time is obtained by multiplying the shortest time by the outlier parameter. When the time taken to execute the training task using the previously pre-stored preset amount of target sample data is greater than the anomaly time, it indicates that some sample data was not yet in the distributed cache system when needed, and the preset amount needs to be increased, that is, the amount of sample data pre-stored in the distributed cache system is increased to satisfy the training task.
In an embodiment of the present application, increasing the preset amount when the time is greater than the anomaly time includes: when the time is greater than the anomaly time, detecting how many times the step of traversing the meta-information sequence to determine a preset amount of target meta-information has been executed between the system time at which the preset amount was last increased and the current system time; and increasing the preset amount when that count is greater than a preset count.
For a given machine learning job, the amount of sample data consumed in each working-node request interval is fixed, and the computation performed on the training samples is fixed, so the overall running time is stable and only small fluctuations are possible. Therefore, once the amount of sample data pre-stored in the distributed cache system has been increased to a suitable amount, no further adjustment is needed within a certain period of time.
Specifically, when the time taken to execute the training task using the previously pre-stored preset amount of target sample data is greater than the anomaly time, the number of times the step of traversing the meta-information sequence to determine a preset amount of target meta-information has been executed between the system time at which the preset amount was last increased and the current system time is detected. When that count is greater than the preset count, the preset amount is increased, that is, the amount of sample data pre-stored in the distributed cache system is increased; when the count is less than or equal to the preset count, the preset amount is not increased.
In addition, it should be noted that the speed at which a machine learning job consumes sample data is relatively stable; therefore, once the amount of sample data pre-stored in the distributed cache system has been expanded to a suitable size, it normally does not need to be reduced.
Specifically, whether the preset amount needs to be increased is judged by the following conditions:

t_i > α · T_min, and
i > P

where t_i denotes the time interval between the currently received working-node request and the previously received working-node request, that is, the time taken to execute the training task using the previously pre-stored target sample data; T_min denotes the minimum request interval over the past period, that is, the shortest time historically taken to execute the training task using pre-stored target sample data; α is the outlier parameter; i denotes the number of times the step of traversing the meta-information sequence to determine a preset amount of target meta-information has been executed between the system time at which the preset amount was last increased and the current system time; and P is the stable-value parameter, that is, the preset count.

When t_i is greater than α · T_min and i is greater than P, the preset amount is increased, that is, the amount of sample data pre-stored in the distributed cache system is increased.
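A sketch of this check in code, keeping the symbols of the formula; the class name, the timing source, and the grow-by-one policy are assumptions made for illustration.

```python
import time

class WindowAutoScaler:
    """Grow the pre-store window when worker request intervals look anomalous."""

    def __init__(self, alpha: float, stable_count: int) -> None:
        self.alpha = alpha                # outlier parameter α
        self.stable_count = stable_count  # stable-value parameter P
        self.t_min = float("inf")         # T_min: shortest observed interval
        self.last_request = None          # monotonic time of the previous request
        self.requests_since_grow = 0      # i: traversals since the last increase

    def on_worker_request(self, window_size: int) -> int:
        """Call once per working-node request; returns the (possibly grown) window."""
        now = time.monotonic()
        if self.last_request is not None:
            t_i = now - self.last_request          # t_i: current request interval
            self.t_min = min(self.t_min, t_i)      # update T_min
            self.requests_since_grow += 1
            # Grow only if the interval is anomalous (t_i > α·T_min) AND the
            # window has been stable long enough (i > P).
            if (t_i > self.alpha * self.t_min
                    and self.requests_since_grow > self.stable_count):
                window_size += 1                   # assumed grow-by-one policy
                self.requests_since_grow = 0
        self.last_request = now
        return window_size
```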
In the embodiment of the present application, by monitoring the time taken to execute the training task using the preset amount of target sample data, the amount of sample data pre-stored in the distributed cache system can be adjusted. This avoids both pre-storing too little sample data, where some sample data is not yet in the distributed cache system when needed and cache misses reduce data access efficiency, and pre-storing too much sample data, which occupies too many storage resources in the distributed cache system.
For a better understanding of the embodiments of the present application, the sample data processing method is illustrated below by example; it should be understood, however, that the embodiments of the present application are not limited thereto.
Referring to FIG. 4, a schematic diagram of a modified embodiment of the machine learning programming framework of the present application is shown. The technical solution of the present application requires modifying the underlying data access logic of the machine learning programming framework (such as PyTorch or TensorFlow), whereas users often use a standard version of the framework that does not include the code logic implementing this solution. When a user submits a job, a Service Auto Injector component (hereinafter the Injector component) creates a Dataset Indexing Service component (hereinafter the Service component) in the cluster, and the Service component starts pre-storing sample data once started. The Injector component then injects an InitContainer (a container used for initialization work) into the machine learning job submitted by the user; the InitContainer uses an image defined by this solution that contains the code-logic changes to the machine learning programming framework. The InitContainer starts before the user-submitted machine learning job, and when it starts it copies the changed code logic over the corresponding location of the user image, achieving logic replacement that is transparent to the user. When the user-defined machine learning job starts, its data access proceeds according to the workflow below.
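As an illustration of the injection step only, the sketch below patches a Pod specification expressed as a plain Python dict. The field names follow the Kubernetes Pod schema, but the image name, mount path, and container names are assumptions; the actual patched path depends on the framework inside the user image.

```python
def inject_init_container(pod_spec: dict) -> dict:
    """Add an InitContainer that overlays the framework's data-reading module."""
    shared_volume = {"name": "patched-framework", "emptyDir": {}}
    init = {
        "name": "framework-patcher",
        "image": "example.com/dataset-indexing-patch:latest",  # assumed image
        # Copy the replacement data-reading module into the shared volume.
        "command": ["sh", "-c", "cp -r /patch/* /patched-lib/"],
        "volumeMounts": [{"name": "patched-framework",
                          "mountPath": "/patched-lib"}],
    }
    pod_spec.setdefault("volumes", []).append(shared_volume)
    pod_spec.setdefault("initContainers", []).append(init)
    for container in pod_spec.get("containers", []):
        # Mount over the framework's module path so imports inside the user
        # image resolve to the replaced logic, without changing the image.
        container.setdefault("volumeMounts", []).append({
            "name": "patched-framework",
            "mountPath": "/usr/local/lib/python3.8/site-packages/torch/utils/data",  # assumed path
        })
    return pod_spec
```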
It should be noted that, even without introducing a Dataset Indexing Service component, by modifying the data access logic of the machine learning framework, the machine learning job can likewise perform cache management on the distributed cache system while accessing data.
Referring to FIG. 5, a framework schematic diagram of an embodiment of a Kubernetes cluster according to the present application is shown. Before the machine learning training job starts, the Service component acquires the meta-information of all the training sample data (sample data) in the entire data set from the distributed cache system. The order in which samples are consumed during machine learning training is completely random; to prevent overfitting of the machine learning model, when the Service component obtains the meta-information of the training samples, it first shuffles the meta-information to generate a random total meta-information sequence, and splits the total sequence into a plurality of meta-information sequences corresponding to the plurality of training tasks of the machine learning training Workers.
When any machine learning training Worker needs to consume data to execute its training task, it requests training sample data from the Service component; the Service component traverses the meta-information sequence corresponding to that Worker, and selects and returns the meta-information corresponding to unused training sample data.
The machine learning training Worker then reads the corresponding sample data from the distributed cache system according to the returned meta-information and executes the training task.
Referring to FIG. 6, one of the framework diagrams of a sample data processing embodiment of the present application is shown: sample data 1, 3, and 5 are pre-stored in the distributed cache system, and when the Service component returns meta-information 3 and 5 to the machine learning training Worker, the Worker reads sample data 3 and 5 from the distributed cache system. Referring to FIG. 7, when the machine learning training Worker next requests training samples from the Service component, the sliding window (the dashed window) shifts right and the new meta-information 6 and 4 move into the window; at this point, the Service component immediately pre-stores sample data 6 and 4. As the window shifts right, meta-information 3 and 5 move out of it; the corresponding sample data 3 and 5 have just been requested and are being consumed by the Worker for model training (executing the training task), so the Service component cannot immediately evict sample data 3 and 5 from the distributed cache system. When the Worker requests sample data from the Service component again, indicating that the previously requested sample data 3 and 5 have been consumed, the Service component can evict them.
In this embodiment, sample data that has been consumed will not be used again, so the Service component directs the distributed cache system to evict it, saving the resource usage of the cache system. Meanwhile, as new meta-information moves into the sliding window, the Service component directs the distributed cache system to pre-store the corresponding sample data from the remote storage system, so that the machine learning training Worker hits the cache whenever it needs sample data, avoiding the performance bottleneck caused by slow data access. The eviction and pre-store processes proceed synchronously with the Worker's model training, and the whole procedure runs as a pipeline, shortening the overall machine learning training time.
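Continuing the DatasetIndexingService sketch from step 104 above, the FIG. 6 and FIG. 7 walkthrough plays out as follows, with the cache tracked as a plain set purely for illustration.

```python
cache = {1}  # sample 1: left over from an earlier window (FIG. 6 shows 1, 3, 5 cached)
svc = DatasetIndexingService(
    meta_sequence=[3, 5, 6, 4], window_size=2,
    prestore=lambda batch: cache.update(batch),
    evict=lambda batch: cache.difference_update(batch),
)
# Construction pre-stores [3, 5]            -> cache == {1, 3, 5}   (FIG. 6)
print(svc.next_batch())  # returns [3, 5] to the Worker; pre-stores [6, 4]  (FIG. 7)
print(svc.next_batch())  # Worker asks again: [3, 5] evicted, [6, 4] returned
```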
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
On the basis of the above embodiments, the present embodiment further provides a sample data processing apparatus, which is applied to electronic devices such as a terminal device and a server.
Referring to FIG. 8, a structural block diagram of an embodiment of a sample data processing apparatus of the present application is shown, which may specifically include the following modules:
a sequence obtaining module 801, configured to obtain a training task and a meta-information sequence corresponding to the training task, where the meta-information sequence includes a plurality of meta-information, and the meta-information is used to index corresponding sample data;
a sequence traversal module 802, configured to traverse the meta-information sequence and determine a preset amount of target meta-information;
a task execution module 803, configured to pre-store target sample data corresponding to the target meta information, and execute the training task using the target sample data pre-stored last time;
and a data eviction module 804, configured to, when the previously pre-stored target sample data is used up, return to executing the step of traversing the meta-information sequence to determine a preset amount of target meta-information, and evict the previously pre-stored target sample data.
In an embodiment of the present application, the apparatus further includes:
a job acquisition module, configured to acquire a machine learning job and the meta-information corresponding to the sample data in the sample data set that the machine learning job needs to use for learning training;
a sequence generation module, configured to randomly generate a total meta-information sequence using the meta-information corresponding to the sample data in the sample data set;
a job splitting module, configured to split the total meta-information sequence into a plurality of meta-information sequences and split the machine learning job into a plurality of training tasks based on the plurality of meta-information sequences, where the plurality of training tasks are executed in parallel.
In an embodiment of the present application, the apparatus further includes:
the job acquisition module, further configured to acquire other machine learning jobs that use the sample data set;
the job splitting module, further configured to split the other machine learning jobs into a plurality of other training tasks based on the plurality of meta-information sequences, where the other training task and the training task corresponding to the same meta-information sequence are executed synchronously.
In an embodiment of the present application, the apparatus further includes:
a time determination module, configured to determine the time taken to execute the training task using the previously pre-stored target sample data;
the time determination module, further configured to determine the shortest time historically taken to execute the training task using pre-stored target sample data;
a time calculation module, configured to calculate an anomaly time from the shortest time and an outlier parameter;
and an amount increasing module, configured to increase the preset amount when the time is greater than the anomaly time.
In an embodiment of the present application, the amount increasing module includes:
a count determination submodule, configured to, when the time is greater than the anomaly time, detect how many times the step of traversing the meta-information sequence to determine a preset amount of target meta-information has been executed between the system time at which the preset amount was last increased and the current system time;
and an amount increasing submodule, configured to increase the preset amount when that count is greater than a preset count.
In an embodiment of the present application, the apparatus is applied to a Kubernetes cluster, where the Kubernetes cluster is deployed with working nodes and a distributed cache system, the working nodes are configured to execute the training task, and the distributed cache system is configured to pre-store the target sample data.
The present application further provides a non-transitory, readable storage medium, where one or more modules (programs) are stored, and when the one or more modules are applied to a device, the device may execute instructions (instructions) of method steps in this application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an electronic device to perform the methods as described in one or more of the above embodiments. In the embodiment of the present application, the electronic device includes various types of devices such as a terminal device and a server (cluster).
Embodiments of the present disclosure may be implemented as an apparatus, which may include electronic devices such as a terminal device or a server (cluster), using any suitable hardware, firmware, software, or any combination thereof, configured as desired. FIG. 9 schematically illustrates an example apparatus 900 that may be used to implement various embodiments described herein.
For one embodiment, FIG. 9 illustrates an example apparatus 900 having one or more processors 902, a control module (chipset) 904 coupled to at least one of the processor(s) 902, a memory 906 coupled to the control module 904, a non-volatile memory (NVM)/storage 908 coupled to the control module 904, one or more input/output devices 910 coupled to the control module 904, and a network interface 912 coupled to the control module 904.
The processor 902 may include one or more single-core or multi-core processors, and the processor 902 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 900 can be a terminal device, a server (cluster), or the like as described in this embodiment.
In some embodiments, apparatus 900 may include one or more computer-readable media (e.g., memory 906 or NVM/storage 908) having instructions 914 and one or more processors 902 in combination with the one or more computer-readable media and configured to execute instructions 914 to implement modules to perform the actions described in this disclosure.
For one embodiment, control module 904 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 902 and/or any suitable device or component in communication with control module 904.
The control module 904 may include a memory controller module to provide an interface to the memory 906. The memory controller module may be a hardware module, a software module, and/or a firmware module.
The memory 906 may be used, for example, to load and store data and/or instructions 914 for the device 900. For one embodiment, memory 906 may comprise any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 906 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the control module 904 may include one or more input/output controllers to provide an interface to the NVM/storage 908 and input/output device(s) 910.
For example, NVM/storage 908 may be used to store data and/or instructions 914. NVM/storage 908 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 908 may include storage resources that are physically part of the device on which apparatus 900 is installed, or it may be accessible by the device and need not be part of the device. For example, NVM/storage 908 may be accessible over a network via input/output device(s) 910.
Input/output device(s) 910 may provide an interface for the apparatus 900 to communicate with any other suitable devices; the input/output devices 910 may include communication components, audio components, sensor components, and so forth. The network interface 912 may provide an interface for the device 900 to communicate over one or more networks, and the device 900 may communicate wirelessly with one or more components of a wireless network in accordance with any of one or more wireless network standards and/or protocols, for example by accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 902 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of the control module 904. For one embodiment, at least one of the processor(s) 902 may be packaged together with logic for one or more controller(s) of the control module 904 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with logic for one or more controller(s) of the control module 904. For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with logic of one or more controllers of the control module 904 to form a system on a chip (SoC).
In various embodiments, the apparatus 900 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, apparatus 900 may have more or fewer components and/or different architectures. For example, in some embodiments, device 900 includes one or more cameras, keyboards, Liquid Crystal Display (LCD) screens (including touch screen displays), non-volatile memory ports, multiple antennas, graphics chips, Application Specific Integrated Circuits (ASICs), and speakers.
The device may adopt a main control chip as the processor or the control module; data such as sensor data and position information may be stored in the memory or the NVM/storage; a sensor group may serve as an input/output device; and the communication interface may include the network interface.
As the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual such relationship or order between these entities or operations. Moreover, the terms "comprise", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that comprises the element.
The foregoing has described in detail the sample data processing method and apparatus, the electronic device, and the storage medium provided by the present application. Specific examples have been used herein to explain the principles and embodiments of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (11)

1. A method for processing sample data, the method comprising:
acquiring a training task and a meta-information sequence corresponding to the training task, wherein the meta-information sequence comprises a plurality of meta-information, and the meta-information is used for indexing corresponding sample data;
traversing the meta-information sequence to determine a preset number of target meta-information;
pre-storing target sample data corresponding to the target meta-information, and executing the training task by using the previously pre-stored target sample data;
and when the target sample data pre-stored last time is used up, returning to execute the step of traversing the meta-information sequence to determine a preset number of target meta-information, and meanwhile evicting the target sample data pre-stored last time.
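For orientation only: the meta-information of claim 1 can be as small as a record that locates a sample without holding its bytes. The fields below are illustrative assumptions, not requirements of the claim.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetaInfo:
    """Indexes one sample: enough to pre-store it into the cache system,
    read it for training, and evict it afterwards."""
    sample_id: str  # unique key of the sample
    path: str       # location of the sample's bytes in backing storage
    size: int       # payload size in bytes, useful for cache accounting

# A meta-information sequence is then just an ordered list of such records;
# traversing it yields the next preset number of entries to pre-store.
sequence = [MetaInfo(f"img-{i:06d}", f"/dataset/train/img-{i:06d}.jpg", 150_000)
            for i in range(8)]
```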
2. The method of claim 1, further comprising:
obtaining a machine learning job and meta-information corresponding to sample data in a sample data set that needs to be used when the machine learning job performs learning training;
randomly generating a total sequence of meta-information from the meta-information corresponding to the sample data in the sample data set;
splitting the total sequence of meta-information into a plurality of meta-information sequences, and splitting the machine learning job into a plurality of training tasks based on the plurality of meta-information sequences; wherein the plurality of training tasks are executed in parallel.
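A minimal sketch of this splitting, under the assumption that the total sequence is divided round-robin (the claim does not fix the partitioning scheme):

```python
import random

def split_job(meta_infos, num_tasks, seed=None):
    """Randomly order the data set's meta-information into a total sequence,
    then split it into one meta-information sequence per parallel task."""
    total = list(meta_infos)
    random.Random(seed).shuffle(total)  # randomly generated total sequence
    # Round-robin split (an assumption); contiguous chunks would also work.
    return [total[i::num_tasks] for i in range(num_tasks)]
```

Because the sub-sequences are fixed once generated, other machine learning jobs over the same sample data set (claim 3 below) can be split against the same meta-information sequences, so tasks that share a sequence consume the same pre-stored samples and each sample only needs to enter the cache once.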
3. The method of claim 2, further comprising:
obtaining other machine learning jobs that use the sample data set;
splitting the other machine learning jobs into a plurality of other training tasks based on the plurality of meta-information sequences; wherein other training tasks and training tasks corresponding to the same meta-information sequence are executed synchronously.
4. The method of claim 1, further comprising:
determining the time taken to perform the training task using the previously pre-stored target sample data;
determining the shortest time historically spent executing the training task using pre-stored target sample data;
calculating an abnormal time according to the shortest time and an abnormal value parameter;
and increasing the preset number when the time is greater than the abnormal time.
5. The method of claim 4, wherein increasing the preset number when the time is greater than the abnormal time comprises:
when the time is greater than the abnormal time, detecting the number of times the step of traversing the meta-information sequence to determine a preset number of target meta-information has been executed from the system time at which the preset number was last increased to the current system time;
and increasing the preset number when the number of times is greater than a preset count.
6. The method of claim 1, wherein the method is applied to a Kubernetes cluster, a work node and a distributed cache system are deployed in the Kubernetes cluster, the work node is configured to execute the training task, and the distributed cache system is configured to pre-store the target sample data.
7. The method of claim 6, wherein a machine learning framework is deployed in the Kubernetes cluster, and a data reading module of the machine learning framework is replaced with a Dataset Indexing Service component, so that the meta-information sequence corresponding to the training task is maintained through the Dataset Indexing Service component.
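A hedged sketch of what a client of such a component could look like; the class name, the fetch_window stand-in, and the windowing shape are all assumptions rather than the component's actual API:

```python
class DatasetIndexingClient:
    """Hypothetical client for a Dataset Indexing Service that owns the
    meta-information sequence of one training task."""

    def __init__(self, fetch_window):
        # fetch_window(task_id, n) -> list of meta-information, or [] once
        # the sequence is exhausted; stands in for the real service call.
        self._fetch_window = fetch_window

    def windows(self, task_id, preset_number):
        """Yield successive windows of target meta-information for a task."""
        while True:
            window = self._fetch_window(task_id, preset_number)
            if not window:
                return
            yield window

# A framework's data loader would iterate these windows instead of listing
# files itself, leaving ordering, pre-storing, and eviction to the service.
```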
8. An apparatus for processing sample data, the apparatus comprising:
a sequence acquisition module, used for acquiring a training task and a meta-information sequence corresponding to the training task, wherein the meta-information sequence comprises a plurality of meta-information, and the meta-information is used for indexing corresponding sample data;
a sequence traversing module, used for traversing the meta-information sequence and determining a preset number of target meta-information;
the task execution module is used for pre-storing target sample data corresponding to the target meta-information and executing the training task by using the target sample data pre-stored last time;
and a data eviction module, used for, when the target sample data pre-stored last time is used up, returning to execute the step of traversing the meta-information sequence and determining a preset number of target meta-information, and evicting the target sample data pre-stored last time.
9. The apparatus of claim 8, further comprising:
a job acquisition module, used for acquiring a machine learning job and meta-information corresponding to sample data in a sample data set required to be used when the machine learning job performs learning training;
a sequence generation module, used for randomly generating a total sequence of meta-information by using the meta-information corresponding to the sample data in the sample data set;
a job splitting module for splitting the total sequence of meta-information into a plurality of meta-information sequences and splitting the machine learning job into a plurality of training tasks based on the plurality of meta-information sequences; wherein a plurality of training tasks are executed in parallel.
10. An electronic device, comprising: a processor; and
memory having stored thereon executable code which, when executed, causes the processor to perform a method of processing sample data as claimed in one or more of claims 1-7.
11. One or more machine-readable media having stored thereon executable code that, when executed, causes a processor to perform a method of processing sample data as recited in one or more of claims 1-7.
CN202111144871.7A 2021-09-28 2021-09-28 Sample data processing method, device, equipment and storage medium Pending CN113988306A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111144871.7A CN113988306A (en) 2021-09-28 2021-09-28 Sample data processing method, device, equipment and storage medium
PCT/CN2022/118411 WO2023051228A1 (en) 2021-09-28 2022-09-13 Method and apparatus for sample data processing, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111144871.7A CN113988306A (en) 2021-09-28 2021-09-28 Sample data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113988306A true CN113988306A (en) 2022-01-28

Family

ID=79737034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111144871.7A Pending CN113988306A (en) 2021-09-28 2021-09-28 Sample data processing method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113988306A (en)
WO (1) WO2023051228A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023051228A1 (en) * 2021-09-28 2023-04-06 阿里巴巴(中国)有限公司 Method and apparatus for sample data processing, and device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018049563A1 (en) * 2016-09-13 2018-03-22 Huawei Technologies Co., Ltd. Systems and methods for caching
GB2570110B (en) * 2018-01-10 2020-04-15 Advanced Risc Mach Ltd Speculative cache storage region
CN109286622B (en) * 2018-09-26 2021-04-20 天津理工大学 Network intrusion detection method based on learning rule set
EP3938882A4 (en) * 2019-03-12 2022-12-07 Intel Corporation Computational data storage systems
CN111259384B (en) * 2020-01-17 2022-06-14 中国科学院计算技术研究所 Processor transient attack defense method based on cache random invalidation
CN113988306A (en) * 2021-09-28 2022-01-28 阿里巴巴(中国)有限公司 Sample data processing method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2023051228A1 (en) 2023-04-06

Similar Documents

Publication Publication Date Title
US11467991B2 (en) Artificial intelligence-enabled management of storage media access
US10025364B2 (en) GPU power measuring method of heterogeneous multi-core system
CN1096034C (en) Multiprocessor system
US9619378B2 (en) Dynamically optimizing memory allocation across virtual machines
US9183151B2 (en) Thread cache allocation
US10133659B2 (en) Proactive memory allocation
US11275721B2 (en) Adaptive table placement in NUMA architectures
US20210247987A1 (en) Algorithm program loading method and related apparatus
CN103067425A (en) Creation method of virtual machine, management system of virtual machine and related equipment thereof
US9286199B2 (en) Modifying memory space allocation for inactive tasks
US8832414B2 (en) Dynamically determining the profitability of direct fetching in a multicore architecture
US10983773B2 (en) Technologies for translation cache management in binary translation systems
EP3178006B1 (en) Moving data between caches in a heterogeneous processor system
US20160188477A1 (en) Electronic system with data management mechanism and method of operation thereof
CN111198754B (en) Task scheduling method and device
CN115543965A (en) Cross-machine-room data processing method, device, storage medium, and program product
WO2023051228A1 (en) Method and apparatus for sample data processing, and device and storage medium
US20160253264A1 (en) Intelligent bandwidth shifting mechanism
CN105574008B (en) Task scheduling method and device applied to distributed file system
US10055350B2 (en) Controlled cache injection of incoming data
CN109981732A (en) Mass file transfer control method and device
KR20220142059A (en) In-memory Decoding Cache and Its Management Scheme for Accelerating Deep Learning Batching Process
US20200065084A1 (en) Crowdsourced api resource consumption information for integrated development environments
US20240103719A1 (en) Memory Control for Data Processing Pipeline Optimization
Suman Scheduling and Prefetching in Hadoop with Block Access Pattern Awareness and Global Memory Sharing with Load Balancing Scheme

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40067040

Country of ref document: HK