CN117011397A - Data processing method, apparatus, device, readable storage medium, and program product - Google Patents

Data processing method, apparatus, device, readable storage medium, and program product

Info

Publication number
CN117011397A
CN117011397A (application CN202211289462.0A)
Authority
CN
China
Prior art keywords
preprocessing
tensor
threads
training
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211289462.0A
Other languages
Chinese (zh)
Inventor
弓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211289462.0A priority Critical patent/CN117011397A/en
Publication of CN117011397A publication Critical patent/CN117011397A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a data processing method, apparatus, device, readable storage medium, and program product, relating to the fields of artificial intelligence, maps, and the like; application scenarios include, but are not limited to, training-sample reading in model training. The method comprises the following steps: acquiring at least one binary file corresponding to a plurality of training samples; determining an initial tensor corresponding to the at least one binary file; determining the number of preprocessing threads for the initial tensor through a preset automatic thread-adjustment mode, preprocessing the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor; and inputting the preprocessed tensor into a model to be trained, and training the model to be trained. In this way, the plurality of training samples is converted into preprocessed tensors that are input into the model to be trained, improving the reading speed of training samples in model training.

Description

Data processing method, apparatus, device, readable storage medium, and program product
Technical Field
The present application relates to the field of computer technology, and in particular, to a data processing method, apparatus, device, readable storage medium, and program product.
Background
The reading speed of training samples directly influences the model training speed. In the prior art, mainstream machine learning platforms store training samples in a distributed file system at the bottom layer; in a model training scenario, massive training samples (for example, picture files, a single picture file generally being about 100 KB) usually need to be extracted from the distributed file system to train the model. Such massive training samples occupy a large amount of space and consist of many files, so the reading speed of the training samples is low, which affects the model training speed.
Disclosure of Invention
Aiming at the shortcomings of the existing approach, the present application provides a data processing method, apparatus, device, computer-readable storage medium, and computer program product, which are intended to solve the problem of how to improve the reading speed of training samples in model training.
In a first aspect, the present application provides a data processing method, including:
acquiring at least one binary file corresponding to a plurality of training samples;
determining an initial tensor corresponding to at least one binary file;
determining the number of preprocessing threads for the initial tensor through a preset automatic thread-adjustment mode, preprocessing the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor;
And inputting the preprocessed tensor into a model to be trained, and training the model to be trained.
In one embodiment, obtaining at least one binary file corresponding to a plurality of training samples includes:
decoding the plurality of training samples to obtain a first character string corresponding to each training sample in the plurality of training samples;
creating a protocol memory block corresponding to each training sample, filling sample characteristics in a first character string corresponding to each training sample into the protocol memory block, and serializing the protocol memory block into a second character string;
at least one binary file corresponding to the training samples is generated based on each second character string.
In one embodiment, determining the initial tensor corresponding to the at least one binary file includes:
and parsing the protocol memory block corresponding to the at least one binary file through a preset parser, and determining the initial tensor corresponding to the at least one binary file.
In one embodiment, determining the number of preprocessing threads for the initial tensor through a preset automatic thread-adjustment mode includes:
the number of preprocessing threads for the initial tensor is determined based on the custom preprocessing function and the number of central processor cores.
In one embodiment, determining the number of preprocessing threads for the initial tensor based on the custom preprocessing function and the number of central processor cores comprises:
determining a value range of the number of preprocessing threads for the initial tensor based on the custom preprocessing function and the number of central processor cores, wherein the value range comprises a plurality of different numbers of preprocessing threads;
determining a preprocessing time corresponding to each different preprocessing thread number in a plurality of different preprocessing thread numbers;
and determining the number of preprocessing threads corresponding to the minimum preprocessing time among the preprocessing times as the number of preprocessing threads for the initial tensor.
In one embodiment, determining a preprocessing time for each of a plurality of different preprocessing thread numbers includes:
preprocessing the preset tensor through the preprocessing threads corresponding to each different number of preprocessing threads among the plurality of different numbers of preprocessing threads, to obtain the preprocessing time corresponding to each different number of preprocessing threads.
In one embodiment, preprocessing the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads and determining the preprocessed tensor includes:
And preprocessing the initial tensor in a parallel processing mode through at least two preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor.
In one embodiment, after preprocessing the initial tensor by the preprocessing thread corresponding to the number of preprocessing threads to determine the preprocessed tensor, the method further includes:
and storing the preprocessed tensor through a memory cache or a disk cache.
In one embodiment, the preprocessing for the initial tensor and the training for the model to be trained are performed in an overlapping manner by means of pre-reading.
In one embodiment, the preprocessing for the initial tensor and the training for the model to be trained are overlapped and performed by a pre-reading mode, which comprises:
and performing preprocessing on the initial tensor and training on the model to be trained in parallel in a pre-reading mode, wherein the preprocessing on the initial tensor is the Nth preprocessing performed by a central processing unit, the training on the model to be trained is the (n+1) th iterative training performed by a graphic processor or a tensor processor, and N is a positive integer.
In a second aspect, the present application provides a data processing apparatus comprising:
The first processing module is used for acquiring at least one binary file corresponding to the training samples;
the second processing module is used for determining an initial tensor corresponding to at least one binary file;
the third processing module is used for determining the number of preprocessing threads for the initial tensor through a preset automatic thread-adjustment mode, preprocessing the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor;
and the fourth processing module is used for inputting the preprocessed tensor into the model to be trained and training the model to be trained.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory, and a bus;
a bus for connecting the processor and the memory;
a memory for storing operation instructions;
and a processor for executing the data processing method according to the first aspect of the present application by calling an operation instruction.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program for executing the data processing method of the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the data processing method of the first aspect of the application.
The technical scheme provided by the embodiment of the application has at least the following beneficial effects:
acquiring at least one binary file corresponding to a plurality of training samples; determining an initial tensor corresponding to the at least one binary file; determining the number of preprocessing threads for the initial tensor through a preset automatic thread-adjustment mode, preprocessing the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor; and inputting the preprocessed tensor into a model to be trained, and training the model to be trained. In this way, a large number of training samples is converted into a small number of binary files, the initial tensors corresponding to those binary files are determined, the preprocessing efficiency for the initial tensor is maximized through the preset automatic thread-adjustment mode to determine the preprocessed tensor, and the preprocessed tensor is input into the model to be trained, so that the reading speed of training samples in model training is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of data processing according to an embodiment of the present application;
FIG. 4 is a schematic diagram of data processing according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of data processing according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and "comprising", when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" indicates at least one of the items it joins; e.g., "A and/or B" is implemented as "A", as "B", or as "A and B".
It will be appreciated that, in the specific embodiments of the present application, where data related to data processing is involved, user approval or consent is required when the above embodiments are applied to specific products or technologies, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The embodiments of the present application provide a data processing method implemented by a data processing system, relating to the fields of artificial intelligence, maps, and the like.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly covers computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation, and other directions.
The intelligent transportation system (Intelligent Traffic System, ITS), also called the Intelligent Transportation System, effectively and comprehensively applies advanced science and technology (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operations research, artificial intelligence, etc.) to transportation, service control, and vehicle manufacturing, strengthening the connection among vehicles, roads, and users, thereby forming a comprehensive transportation system that guarantees safety, improves efficiency, improves the environment, and saves energy.
In order to better understand and describe the schemes of the embodiments of the present application, some technical terms related to the embodiments of the present application are briefly described below.
TensorFlow: TensorFlow is a symbolic mathematical system based on dataflow programming, widely applied in the programming implementation of various machine learning algorithms.
proto file: a proto file is a message protocol (definition) file; its file name suffix is ".proto".
tfrecord format: the tfrecord format is a file storage format; for example, tens of millions of jpeg pictures can be converted into hundreds or tens of files in tfrecord format, i.e., tfrecord files.
tflecord file: tensorFlow provides tfreeord files to unify storage data; the tfrecord file is a binary file for uniformly storing image data (for example, describing pixel distribution of a picture cat by using a matrix) and a tag (the tag can represent the true class of the picture, for example, the class of the picture is cat), so that the memory can be better utilized; the tfreeord file can be quickly copied, moved, read, stored and the like in the TensorFlow; the tfreeord file includes tf.train.example protocol memory blocks (protocol buffers).
Tensor processing unit: a tensor processing unit (TPU, Tensor Processing Unit) is a custom ASIC chip that can be dedicated to machine learning workloads.
Graphics processor: a graphics processor (GPU, Graphics Processing Unit), also known as a display core, vision processor, or display chip, is a microprocessor that specializes in image- and graphics-related operations on personal computers, workstations, game consoles, and some mobile devices (e.g., tablet computers, smartphones, etc.).
LMDB: LMDB (Lightning Memory-Mapped Database) has a simple file structure, and its data can be copied and transmitted at will; access to LMDB is simple, and no separate database management process needs to run; as long as the LMDB library is referenced in the data-access code and given the file path, the data can be accessed.
epoch: one epoch indicates that all data has been sent through the network once, completing one forward-computation and back-propagation pass; one epoch may be split into multiple batches, i.e., in model training, all data is split into multiple batches, and a portion of the data enters the model at a time.
The scheme provided by the embodiment of the application relates to an artificial intelligence technology, and the technical scheme of the application is described in detail by a specific embodiment. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In order to better understand the scheme provided by the embodiment of the present application, the scheme is described below in connection with a specific application scenario.
In one embodiment, fig. 1 is a schematic diagram of a data processing system to which the embodiment of the present application is applied, and it can be understood that the data processing method provided by the embodiment of the present application may be applied, but is not limited to, to the application scenario shown in fig. 1.
In this example, as shown in FIG. 1, the architecture of the data processing system in this example may include, but is not limited to, a server 10, a terminal 20, and a database 30. Interactions between server 10, terminal 20 and database 30 may occur via network 40.
The terminal 20 sends indication information for training the model to be trained to the server 10, and the server 10 acquires at least one binary file corresponding to a plurality of training samples based on the indication information; the server 10 determines an initial tensor corresponding to the at least one binary file; the server 10 determines the number of preprocessing threads aiming at the initial tensor through a preset automatic thread adjusting mode, and preprocesses the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads to determine the preprocessed tensor; the server 10 inputs the preprocessed tensor to the model to be trained, and trains the model to be trained. The server 10 sends the training result after the training of the model to be trained to the terminal 20 through the network 40, and sends the training result after the training of the model to be trained to the database 30 for storage.
It will be appreciated that the above is only an example, and the present embodiment is not limited thereto.
The terminal includes, but is not limited to, a smartphone (such as an Android or iOS phone), a mobile phone simulator, a tablet computer, a notebook computer, a digital broadcast receiver, an MID (Mobile Internet Device), a PDA (Personal Digital Assistant), an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or a server cluster for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, and the like.
Cloud computing is a computing model that distributes computing tasks across a large resource pool of computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's perspective, resources in the cloud are infinitely expandable and can be acquired at any time, used on demand, expanded at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short, generally referred to as IaaS (Infrastructure as a Service, infrastructure as a service) platform) is established, in which multiple types of virtual resources are deployed for external clients to select for use.
According to the division of logical functions, a PaaS (Platform as a Service) layer can be deployed on the IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer can be deployed above the PaaS layer, or SaaS can be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container; SaaS is the wide variety of business software, such as web portals and SMS bulk senders. Generally, SaaS and PaaS are upper layers relative to IaaS.
The artificial intelligence cloud service is also commonly called AIaaS (AI as a Service). This is currently the mainstream service mode of artificial intelligence platforms; specifically, an AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all developers can access one or more artificial intelligence services provided by the platform through an API interface, and some senior developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own proprietary cloud artificial intelligence services.
The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, Wi-Fi, and other networks implementing wireless communication. The specific network can be determined based on the requirements of the actual application scenario and is not limited herein.
Referring to fig. 2, fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application, where the method may be performed by any electronic device, for example, may be a server or the like; as an alternative implementation, the method may be performed by a server, and for convenience of description, in the following description of some alternative embodiments, a server will be described as an example of the method execution body. As shown in fig. 2, the data processing method provided by the embodiment of the application includes the following steps:
s201, at least one binary file corresponding to a plurality of training samples is acquired.
Specifically, a training sample is, for example, a picture file of around 100 KB. The binary file may be a file in tfrecord format, i.e., a tfrecord file. A large number of training samples may be converted into a small number of binary files; for example, tens of millions of jpeg pictures are converted into hundreds or tens of tfrecord files.
S202, determining an initial tensor corresponding to at least one binary file.
Specifically, the protocol memory block corresponding to the at least one binary file is parsed, and the initial tensor corresponding to the at least one binary file is determined; the initial tensor is a matrix.
S203, determining the number of preprocessing threads for the initial tensor through a preset automatic thread-adjustment mode, preprocessing the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor.
Specifically, the automatic thread-adjustment mode is used to set the preprocessing threads reasonably, for example to set the number of preprocessing threads. Through the automatic thread-adjustment mode, the preprocessing efficiency for the initial tensor is maximized and the data waiting time in preprocessing is reduced, so that the preprocessed tensor is determined rapidly; the preprocessed tensor is input into the model to be trained, improving the reading speed of training samples in model training. The preprocessed tensor is a matrix.
S204, inputting the preprocessed tensor into a model to be trained, and training the model to be trained.
Specifically, the model to be trained is, for example, a machine learning model. The preprocessed tensor is loaded onto the accelerator running the machine learning model, i.e., the preprocessed tensor is input into the machine learning model; the accelerator is, for example, a graphics processor or a tensor processing unit.
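By way of illustration only, the following is a minimal sketch of the S201-S204 flow using TensorFlow's tf.data API; the shard file pattern, feature schema, image size, batch size, and stand-in model are assumptions for the example, not part of the embodiment.

```python
import tensorflow as tf

# S201: the binary files (tfrecord shards); the file pattern is hypothetical.
files = tf.data.Dataset.list_files("train-*.tfrecord")
dataset = tf.data.TFRecordDataset(files)

def parse_and_preprocess(record):
    # S202: parse one protocol memory block into tensors (assumed schema).
    feats = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    # S203: preprocessing (decode, resize, normalize); an assumed pipeline.
    image = tf.io.decode_jpeg(feats["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, feats["label"]

dataset = (dataset
           .map(parse_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

# S204: feed the preprocessed tensors into a stand-in model to be trained.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(dataset, epochs=1)
```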
In the embodiment of the present application, at least one binary file corresponding to a plurality of training samples is acquired; an initial tensor corresponding to the at least one binary file is determined; the number of preprocessing threads for the initial tensor is determined through a preset automatic thread-adjustment mode, the initial tensor is preprocessed through the preprocessing threads corresponding to that number, and the preprocessed tensor is determined; the preprocessed tensor is input into a model to be trained, and the model is trained. In this way, a large number of training samples is converted into a small number of binary files, the initial tensors corresponding to those binary files are determined, the preprocessing efficiency for the initial tensor is maximized through the preset automatic thread-adjustment mode to determine the preprocessed tensor, and the preprocessed tensor is input into the model to be trained, so that the reading speed of training samples in model training is improved.
In one embodiment, obtaining at least one binary file corresponding to a plurality of training samples includes steps A1-A3:
and A1, decoding the plurality of training samples to obtain a first character string corresponding to each training sample in the plurality of training samples.
Specifically, the binary file may be a file in tfrecord format, i.e., a tfrecord file. The plurality of training samples is converted into tfrecord files through a preset tfrecord script, where the tfrecord script is a format-conversion script for the dataset comprising the plurality of training samples.
A training sample is, for example, a picture file. The plurality of pictures is decoded through the preset tfrecord script, i.e., each picture is decoded and converted into a binary stream (string), i.e., a character string (the first character string); the first character string includes the features of the training sample and the label corresponding to the training sample.
And A2, creating a protocol memory block corresponding to each training sample, filling sample characteristics in a first character string corresponding to each training sample into the protocol memory block, and serializing the protocol memory block into a second character string.
Specifically, an Example object corresponding to each training sample is created through a preset proto file, where the Example object is a protocol memory block; the sample features in the first character string corresponding to each training sample are filled into the Example object, i.e., the protocol memory block; the Example object is then serialized into a character string (the second character string).
And A3, generating at least one binary file corresponding to the training samples based on the second character strings.
Specifically, the binary file may be a file in tfrecord format, i.e., a tfrecord file. Based on each second character string, a tfrecord file is generated through an interface (for example, tf.python_io.TFRecordWriter in TensorFlow).
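As a concrete illustration of steps A1-A3, the sketch below builds a tf.train.Example protocol memory block for each sample, serializes it into the second character string, and writes the result to a tfrecord file. The file names, sample list, and feature keys are assumptions; tf.io.TFRecordWriter is the current alias of the tf.python_io.TFRecordWriter interface named above.

```python
import tensorflow as tf

samples = [("cat.jpg", 0), ("dog.jpg", 1)]  # hypothetical picture files and labels

with tf.io.TFRecordWriter("shard-00000.tfrecord") as writer:  # A3: the binary file
    for path, label in samples:
        with open(path, "rb") as f:
            jpeg_bytes = f.read()  # A1: the sample as a binary stream (string)
        # A2: fill the sample features into a protocol memory block (Example).
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())  # A2: the second character string
```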
In one embodiment, determining the initial tensor corresponding to the at least one binary file includes:
and parsing the protocol memory block corresponding to the at least one binary file through a preset parser, and determining the initial tensor corresponding to the at least one binary file.
Specifically, the binary file may be a file in tfrecord format, i.e., a tfrecord file. Data is read from the tfrecord file, i.e., the initial tensor corresponding to the tfrecord file is determined. For example, the tfrecord file is stored in advance in permanent storage and is read from there; the permanent storage may be local storage such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), or remote storage such as GCS (Google Cloud Storage) or HDFS (Hadoop Distributed File System).
For example, a filename queue is generated through tf.train.string_input_producer in TensorFlow; a tf.TFRecordReader reads the queue and returns a serialized_example object; the tf.parse_single_example operation in TensorFlow is then called to parse the protocol memory block corresponding to the tfrecord file into a Tensor, i.e., the initial tensor corresponding to the tfrecord file.
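The queue-based flow just described might be sketched as follows with the TF1 compatibility API; the tfrecord file name and feature schema are assumptions.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Generate the filename queue and read one serialized example from it.
filename_queue = tf.train.string_input_producer(["shard-00000.tfrecord"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Parse the protocol memory block into tensors (assumed schema).
features = tf.parse_single_example(serialized_example, features={
    "image": tf.FixedLenFeature([], tf.string),
    "label": tf.FixedLenFeature([], tf.int64),
})
initial_tensor = tf.image.decode_jpeg(features["image"], channels=3)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(initial_tensor).shape)  # one initial tensor (a pixel matrix)
    coord.request_stop()
    coord.join(threads)
```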
In one embodiment, determining the number of preprocessing threads for the initial tensor through a preset automatic thread-adjustment mode includes:
the number of preprocessing threads for the initial tensor is determined based on the custom preprocessing function and the number of central processor cores.
Specifically, the training samples are, for example, picture files, i.e., images; the custom preprocessing function may be a user-defined function used for image decompression, image enhancement conversion (e.g., random cropping, flipping, color distortion, etc.), image rearrangement, image batching, and the like. The central processor core may be a CPU (Central Processing Unit) core of the server. Through the automatic thread-adjustment mode (autotune mode), i.e., based on the custom preprocessing function and the number of central processor cores, the optimal number of threads (the number of preprocessing threads for the initial tensor) is determined, maximizing the preprocessing efficiency for the initial tensor; the preprocessed tensor is determined and input into the model to be trained, improving the reading speed of training samples in model training.
In one embodiment, the number of preprocessing threads for the initial tensor is determined based on the custom preprocessing function and the number of central processor cores, including steps B1-B3:
And B1, determining a value range of the number of preprocessing threads for the initial tensor based on the custom preprocessing function and the number of central processor cores, wherein the value range comprises a plurality of different numbers of preprocessing threads.
Specifically, based on the custom preprocessing function and the number N of central processor cores, the value range of the number of preprocessing threads for the initial tensor is determined; the value range is 1 to N, where N is a positive integer. For example, if N is 5, the candidate numbers of preprocessing threads are 1, 2, 3, 4, and 5.
And B2, determining the preprocessing time corresponding to each different number of preprocessing threads among the plurality of different numbers of preprocessing threads.
Specifically, warm-up training is executed, and the preprocessing time corresponding to each different number of preprocessing threads is measured; if the value range of the number of preprocessing threads is 1 to N, the preprocessing times T1 to TN corresponding to the values 1 to N are measured, where N is a positive integer. For example, if N is 6, the candidate numbers of preprocessing threads are 1, 2, 3, 4, 5, and 6, corresponding to the preprocessing times T1, T2, T3, T4, T5, and T6, respectively.
And B3, determining the number of preprocessing threads corresponding to the minimum preprocessing time among the preprocessing times as the number of preprocessing threads for the initial tensor.
Specifically, the value range of the number of preprocessing threads is 1 to N, where N is a positive integer. For example, if N is 6, the values 1, 2, 3, 4, 5, and 6 of the number of preprocessing threads correspond to the preprocessing times T1, T2, T3, T4, T5, and T6, respectively; the number of preprocessing threads corresponding to the smallest of these preprocessing times, i.e., the optimal number of preprocessing threads, is determined as the number of preprocessing threads for the initial tensor. The thread number of the tf.data.Dataset.map function in TensorFlow is then set to this optimal number.
It should be noted that the tf.data API in TensorFlow provides the tf.data.Dataset.map function, which calls the user-defined function; the initial tensor is preprocessed through the user-defined function, and the preprocessed tensor is determined.
In one embodiment, determining a preprocessing time for each of a plurality of different preprocessing thread numbers includes:
preprocessing the preset tensor through the preprocessing threads corresponding to each different number of preprocessing threads among the plurality of different numbers of preprocessing threads, to obtain the preprocessing time corresponding to each different number of preprocessing threads.
Specifically, warm-up training is executed, and the preprocessing time corresponding to each different number of preprocessing threads is measured; the warm-up training comprises preprocessing a preset tensor through the preprocessing threads corresponding to each different number of preprocessing threads. The value range of the number of preprocessing threads is 1 to N, where N is a positive integer. For example, if N is 4, the candidate numbers of preprocessing threads are 1, 2, 3, and 4; warm-up training is executed, and the preprocessing times T1, T2, T3, and T4 corresponding to the values 1, 2, 3, and 4 are measured.
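A hedged sketch of this warm-up measurement follows: for each candidate thread count from 1 to the number of CPU cores, it times one pass of the preprocessing map over a preset tensor and keeps the fastest count. The preset tensor shape, benchmark size, and preprocessing body are assumptions.

```python
import os
import time

import tensorflow as tf

def preprocess(image):
    # Stand-in custom preprocessing function (a random flip, for example).
    return tf.image.random_flip_left_right(image)

# A preset tensor repeated to form a small benchmark dataset (assumed size).
warmup = tf.data.Dataset.from_tensors(tf.zeros([224, 224, 3])).repeat(256)

cores = os.cpu_count() or 1        # B1: the value range is 1..N
times = {}
for n in range(1, cores + 1):      # B2: time each candidate thread count
    start = time.perf_counter()
    for _ in warmup.map(preprocess, num_parallel_calls=n):
        pass                       # drain the pipeline so all work executes
    times[n] = time.perf_counter() - start

best = min(times, key=times.get)   # B3: the minimum-time thread count
print("number of preprocessing threads:", best)
```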
In one embodiment, preprocessing the initial tensor through the preprocessing threads corresponding to the number of preprocessing threads and determining the preprocessed tensor includes:
and preprocessing the initial tensor in a parallel processing mode through at least two preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor.
Specifically, if the pre-reading mode (prefetch) is not adopted, the accelerator (graphics processor GPU or tensor processing unit TPU) is idle while the CPU is preparing data (e.g., preprocessing the initial tensor), and the CPU is idle while the accelerator is training the model; thus, the time taken for model training is the sum of the CPU preprocessing time for the initial tensor and the accelerator training time.
If the pre-reading mode (prefetch) is adopted, then as shown in fig. 3, the function prefetch() overlaps the preprocessing for the initial tensor with the training for the model to be trained; while the accelerator is executing the Nth training step, the CPU is preparing the data of the (N+1)th step. For example, while the accelerator (GPU or TPU) is idle, the CPU is preparing the data of step 1 (Prepare 1); while the accelerator is executing the 1st training step (Train 1), the CPU is preparing the data of the 2nd step (Prepare 2); while the accelerator is executing the 2nd training step (Train 2), the CPU is preparing the data of the 3rd step (Prepare 3); and while the accelerator is executing the 3rd training step (Train 3), the CPU is preparing the data of the 4th step (Prepare 4). In this way, the single-step training time is shortened to the greatest extent, and the time required for extracting and converting data is reduced. Without the pre-reading mode, the CPU and the accelerator (GPU or TPU) are idle most of the time, so adopting the pre-reading mode significantly reduces idle time and thereby improves the reading speed of training samples in model training.
As shown in fig. 4, while the accelerator is executing the Nth training step (Train N), the CPU is preparing the data of the (N+1)th step (Prepare N+1); one Batch in Prepare N+1 corresponds to 4 Map functions, i.e., one Batch corresponds to 4 preprocessing threads. One Batch corresponds to one iterative training; for example, 1000 training samples are divided into 10 groups, and each group of 100 training samples corresponds to one Batch, i.e., one iterative training. The 4 preprocessing threads run in parallel, i.e., the initial tensor is preprocessed in a parallel processing mode by the 4 preprocessing threads, and the preprocessed tensor is determined; the parallel processing mode is a parallel map mode.
It should be noted that, through the automatic thread-adjustment mode (autotune mode), i.e., by determining the optimal number of threads (for example, the 4 preprocessing threads for the initial tensor in fig. 4) based on the custom preprocessing function and the number of central processor cores, the preprocessing efficiency for the initial tensor is maximized, the preprocessed tensor is determined, and the preprocessed tensor is input into the model to be trained, improving the reading speed of training samples in model training.
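The parallel map arrangement of fig. 4 could be expressed with tf.data as in the sketch below; the thread count of 4, the batch size of 100, and the preprocessing body are illustrative assumptions.

```python
import tensorflow as tf

NUM_PREPROCESSING_THREADS = 4  # e.g., the optimal count found by the warm-up sweep

def parse_and_preprocess(record):
    feats = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.resize(tf.io.decode_jpeg(feats["image"], channels=3), [224, 224])
    return tf.image.random_flip_left_right(image), feats["label"]

dataset = (tf.data.TFRecordDataset(["shard-00000.tfrecord"])
           # parallel map mode: 4 preprocessing threads work on one Batch
           .map(parse_and_preprocess, num_parallel_calls=NUM_PREPROCESSING_THREADS)
           .batch(100)    # one Batch corresponds to one iterative training
           .prefetch(1))  # prepare Batch N+1 while the accelerator trains on N
```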
In one embodiment, after preprocessing the initial tensor by the preprocessing thread corresponding to the number of preprocessing threads to determine the preprocessed tensor, the method further includes:
and storing the preprocessed tensor through a memory cache or a disk cache.
Specifically, the memory cache can cache, in memory, the binary stream (string) decoded and converted from a picture, the initial tensor, and the like; because the memory cache is limited by the machine's memory space, it is suitable when the amount of data is small. The disk cache can cache the binary stream (string), the initial tensor, and the like in a local disk using the LMDB storage format, and is suitable when the amount of data is large. The memory cache is preferentially selected for storing the preprocessed tensor; if memory space is insufficient, the disk cache is selected. In this way, the memory cache and the disk cache form a multi-level cache, and subsequent model training can read the preprocessed tensor from memory or disk, improving the reading speed of training samples in model training.
For example, the tfrecord file is pre-stored in permanent storage on the remote server; the permanent storage can be local storage such as an HDD or SSD, or remote storage such as GCS or HDFS. The local server reads the tfrecord file from the remote server to obtain an initial tensor A; the local server preprocesses the initial tensor A to obtain a preprocessed tensor B, and finally stores the preprocessed tensor B in the memory or disk of the local server. Subsequent model training can read the preprocessed tensor B from the memory or disk of the local server, improving the reading speed of training samples in model training.
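A minimal sketch of this multi-level cache with tf.data follows: the cache transformation keeps elements in memory when called without arguments and spills to files on disk when given a path. The decision input and the cache path are assumptions supplied by the caller.

```python
import tensorflow as tf

def cache_preprocessed(dataset, fits_in_memory, cache_path="/tmp/train_cache"):
    # The memory cache is preferred; fall back to the disk cache when the
    # memory space is insufficient (the caller supplies that judgment).
    if fits_in_memory:
        return dataset.cache()        # memory cache
    return dataset.cache(cache_path)  # disk cache

# Usage: the first pass over the dataset fills the cache; later passes
# (e.g., later epochs) read the preprocessed tensor B from the cache.
ds = cache_preprocessed(tf.data.TFRecordDataset(["shard-00000.tfrecord"]),
                        fits_in_memory=False)
```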
In one embodiment, the preprocessing for the initial tensor and the training for the model to be trained are performed in an overlapping manner by means of pre-reading.
Specifically, if the pre-reading mode (prefetch) is not adopted, the accelerator (graphics processor GPU or tensor processing unit TPU) is idle while the CPU is preparing data (e.g., preprocessing the initial tensor), and the CPU is idle while the accelerator is training the model; thus, the time taken for model training is the sum of the CPU preprocessing time for the initial tensor and the accelerator training time.
If the pre-reading mode (prefetch) is adopted, then as shown in fig. 3, the function prefetch() overlaps the preprocessing for the initial tensor with the training for the model to be trained; while the accelerator is executing the Nth training step, the CPU is preparing the data of the (N+1)th step. For example, while the accelerator (GPU or TPU) is idle, the CPU is preparing the data of step 1 (Prepare 1); while the accelerator is executing the 1st training step (Train 1), the CPU is preparing the data of the 2nd step (Prepare 2); while the accelerator is executing the 2nd training step (Train 2), the CPU is preparing the data of the 3rd step (Prepare 3); and while the accelerator is executing the 3rd training step (Train 3), the CPU is preparing the data of the 4th step (Prepare 4). In this way, the single-step training time is shortened to the greatest extent, and the time required for extracting and converting data is reduced. Without the pre-reading mode, the CPU and the accelerator (GPU or TPU) are idle most of the time, so adopting the pre-reading mode significantly reduces idle time and thereby improves the reading speed of training samples in model training.
In one embodiment, the preprocessing for the initial tensor and the training for the model to be trained are overlapped and performed by a pre-reading mode, which comprises:
And performing preprocessing for the initial tensor and training for the model to be trained in parallel in a pre-reading mode, wherein the preprocessing for the initial tensor is the Nth preprocessing performed by a central processing unit, the training for the model to be trained is the (N+1)th iterative training performed by a graphics processor or a tensor processor, and N is a positive integer.
Specifically, as shown in fig. 3, the function prefetch() overlaps the preprocessing for the initial tensor with the training for the model to be trained; the preprocessing for the initial tensor is the Nth preprocessing performed by the central processing unit (CPU), and the training for the model to be trained is the (N+1)th iterative training performed by the graphics processor (GPU) or tensor processor (TPU), i.e., while the GPU or TPU is executing the Nth training step, the CPU is preparing the data of the (N+1)th step. For example, while the accelerator (GPU or TPU) is idle, the CPU is preparing the data of step 1 (Prepare 1); while the accelerator is executing the 1st training step (Train 1), the CPU is preparing the data of the 2nd step (Prepare 2); while the accelerator is executing the 2nd training step (Train 2), the CPU is preparing the data of the 3rd step (Prepare 3); and while the accelerator is executing the 3rd training step (Train 3), the CPU is preparing the data of the 4th step (Prepare 4). In this way, the single-step training time is shortened to the greatest extent, and the time required for extracting and converting data is reduced. Without the pre-reading mode, the CPU and the accelerator (GPU or TPU) are idle most of the time, so adopting the pre-reading mode significantly reduces idle time and thereby improves the reading speed of training samples in model training.
In one embodiment, the method provided by the embodiment of the present application corresponds to an interface set comprising at least one interface; the interfaces in the interface set are universal, and through the interface set the method provided by the embodiment of the present application can be adapted to mainstream machine learning platforms.
The application of the embodiment of the application has at least the following beneficial effects:
converting a large number of training samples into a small number of binary files, determining initial tensors corresponding to the small number of binary files, maximizing preprocessing efficiency aiming at the initial tensors through a preset automatic thread adjusting mode, and determining preprocessed tensors; meanwhile, idle time is obviously reduced by adopting a pre-reading mode; inputting the preprocessed tensor into the model to be trained by adopting a multi-level cache mode; thus, the reading speed of the training sample in the model training is improved.
In order to better understand the method provided by the embodiment of the present application, the scheme of the embodiment of the present application is further described below with reference to examples of specific application scenarios.
In a specific application scenario embodiment, for example, a training sample reading scenario in model training, referring to fig. 5, a process flow of a data processing method is shown, and as shown in fig. 5, the process flow of the data processing method provided in the embodiment of the present application includes the following steps:
S501, the remote server performs training sample preparation.
Specifically, the remote server obtains training samples, such as pictures, through the public data set, the crawler data, and the like.
S502, the remote server generates tfrecord files corresponding to the training samples.
Specifically, the remote server decodes the plurality of training samples to obtain a first character string corresponding to each training sample; the remote server creates a protocol memory block corresponding to each training sample, fills the sample features in the first character string corresponding to each training sample into the protocol memory block, and serializes the protocol memory block into a second character string; the remote server generates the tfrecord files corresponding to the training samples based on each second character string.
The tfrecord files are pre-stored in the permanent storage of the remote server; the permanent storage may be local storage such as an HDD or SSD, or remote storage such as GCS or HDFS.
S503, the local server reads the tfrecord file from the remote server to obtain an initial tensor corresponding to the tfrecord file.
Specifically, the local server generates a filename queue through tf.train.string_input_producer in TensorFlow; the local server calls tf.TFRecordReader to read the queue, returning a serialized_example object; the tf.parse_single_example operation in TensorFlow is then called to parse the protocol memory block corresponding to the tfrecord file into a Tensor, i.e., the initial tensor corresponding to the tfrecord file.
For example, as shown in fig. 6, at the beginning of training in the first epoch (epoch 1), a plurality of workers (e.g., CPUs, GPUs, or TPUs) in the distributed training collectively read the data (tfrecord files) in the permanent storage of the remote server, obtaining the initial tensor.
S504, the local server preprocesses the initial tensor to obtain the preprocessed tensor.
Specifically, through the automatic thread-adjustment mode (autotune mode), i.e., based on the custom preprocessing function and the number of central processor cores, the optimal number of threads is determined, maximizing the preprocessing efficiency for the initial tensor and determining the preprocessed tensor. In the pre-reading mode (prefetch), as shown in fig. 3, the function prefetch() overlaps the preprocessing for the initial tensor with the training for the model to be trained; this shortens the single-step training time to the greatest extent and reduces the time required for extracting and converting data. Adopting the pre-reading mode therefore significantly reduces idle time, improving the reading speed of training samples in model training.
For example, as shown in fig. 6, a plurality of workers pre-process the initial tensor to obtain a pre-processed tensor.
S505, the local server stores the preprocessed tensor through a memory cache or a disk cache.
Specifically, for example, as shown in fig. 6, each worker in the plurality of workers stores the preprocessed tensor obtained by the worker in a memory or a disk of the local server; subsequent model training may read the preprocessed tensor from the memory or disk of the local server.
S506, in model training, the local server reads the preprocessed tensor from the memory or the disk, and trains the model to obtain a model training result.
Specifically, the memory cache is preferentially selected for storing the preprocessed tensor; if memory space is insufficient, the disk cache is selected. In this way, the memory cache and the disk cache form a multi-level cache, and the preprocessed tensor is read from memory or disk during model training, improving the reading speed of training samples in model training.
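The memory-first, disk-fallback choice of S505-S506 might be automated as in the sketch below; the free-memory probe (Linux-only) and the footprint estimate are assumptions for illustration, not part of the embodiment.

```python
import os

import tensorflow as tf

def select_cache(dataset, approx_tensor_bytes, disk_path="/local/disk/train_cache"):
    # Prefer the memory cache; fall back to the disk cache when the estimated
    # footprint of the preprocessed tensors exceeds the free memory.
    free_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    if approx_tensor_bytes < free_bytes:
        return dataset.cache()        # memory cache (S505, preferred)
    return dataset.cache(disk_path)   # disk cache (S505, fallback)
```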
S507, the local server sends the model training result to the remote server, and the remote server stores the model training result.
The application of the embodiment of the application has at least the following beneficial effects:
converting a large number of training samples into a small number of tfrecord files, determining initial tensors corresponding to the small number of tfrecord files, maximizing preprocessing efficiency aiming at the initial tensors through a preset automatic thread adjusting mode, and determining preprocessed tensors; meanwhile, idle time is obviously reduced by adopting a pre-reading mode; inputting the preprocessed tensor into the model to be trained by adopting a multi-level cache mode; thus, the reading speed of the training sample in the model training is improved.
The embodiment of the present application further provides a data processing apparatus, and a schematic structural diagram of the data processing apparatus is shown in fig. 7, where the data processing apparatus 70 includes a first processing module 701, a second processing module 702, a third processing module 703, and a fourth processing module 704.
A first processing module 701, configured to obtain at least one binary file corresponding to a plurality of training samples;
a second processing module 702, configured to determine an initial tensor corresponding to at least one binary file;
a third processing module 703, configured to determine, by using a preset automatic thread adjustment manner, a number of preprocessing threads for the initial tensor, and perform preprocessing on the initial tensor by using a preprocessing thread corresponding to the number of preprocessing threads, to determine a preprocessed tensor;
the fourth processing module 704 is configured to input the preprocessed tensor to a model to be trained, and train the model to be trained.
In one embodiment, the first processing module 701 is specifically configured to:
decoding the plurality of training samples to obtain a first character string corresponding to each training sample in the plurality of training samples;
creating a protocol memory block corresponding to each training sample, filling sample characteristics in a first character string corresponding to each training sample into the protocol memory block, and serializing the protocol memory block into a second character string;
generating at least one binary file corresponding to the plurality of training samples based on each second character string.
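A minimal sketch of this conversion using TensorFlow's Example protocol buffer as the protocol memory block; the feature keys and the (raw_bytes, label) sample layout are illustrative assumptions:

```python
import tensorflow as tf

def write_tfrecord(samples, path="train.tfrecord"):
    # samples: iterable of (raw_bytes, label) pairs decoded from the training data.
    with tf.io.TFRecordWriter(path) as writer:
        for raw_bytes, label in samples:
            # Fill the protocol memory block (tf.train.Example) with the sample features.
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[raw_bytes])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())  # the serialized second character string
```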
In one embodiment, the second processing module 702 is specifically configured to:
and analyzing the protocol memory block corresponding to the at least one binary file through a preset analyzer, and determining an initial tensor corresponding to the at least one binary file.
In one embodiment, the third processing module 703 is specifically configured to:
the number of preprocessing threads for the initial tensor is determined based on the custom preprocessing function and the number of central processor cores.
In one embodiment, the third processing module 703 is specifically configured to:
determining a value range of the number of preprocessing threads for the initial tensor based on the number of the custom preprocessing functions and the number of the central processor cores, wherein the value range comprises a plurality of different numbers of preprocessing threads;
determining a preprocessing time corresponding to each different preprocessing thread number in a plurality of different preprocessing thread numbers;
and determining the number of preprocessing threads corresponding to the minimum preprocessing time in the preprocessing times as the number of preprocessing threads aiming at the initial tensor.
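A hedged sketch of this search; the candidate range derived from the function count and core count, and the benchmark_once helper that times one preprocessing pass with a given thread count, are illustrative assumptions rather than the patent's concrete procedure:

```python
import multiprocessing

def pick_num_threads(num_preprocess_fns, benchmark_once):
    """Return the candidate thread count with the minimum measured preprocessing time."""
    num_cores = multiprocessing.cpu_count()
    # Assumed value range: from one thread per preprocessing function up to
    # one thread per function on every core.
    candidates = range(num_preprocess_fns, num_preprocess_fns * num_cores + 1)
    timings = {n: benchmark_once(n) for n in candidates}  # preprocessing time per count
    return min(timings, key=timings.get)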
In one embodiment, the third processing module 703 is specifically configured to:
preprocessing a preset tensor through the preprocessing threads corresponding to each different number of preprocessing threads in the plurality of different numbers of preprocessing threads, to obtain the preprocessing time corresponding to each different number of preprocessing threads.
In one embodiment, the third processing module 703 is specifically configured to:
preprocessing the initial tensor in a parallel processing mode through at least two preprocessing threads corresponding to the number of preprocessing threads, and determining the preprocessed tensor.
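In tf.data terms, this parallel preprocessing reduces to a fixed num_parallel_calls; the dataset and the stand-in preprocessing below are illustrative assumptions:

```python
import tensorflow as tf

num_threads = 4  # thread count chosen by the search above; illustrative value

# At least two map workers transform elements of the dataset concurrently.
dataset = (tf.data.Dataset.range(1000)
           .map(lambda x: tf.cast(x, tf.float32) * 2.0,  # stand-in preprocessing
                num_parallel_calls=num_threads))
```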
In one embodiment, the third processing module 703 is further configured to:
and storing the preprocessed tensor through a memory cache or a disk cache.
In one embodiment, the third processing module 703 and the fourth processing module 704 are further configured to:
and performing preprocessing on the initial tensor and training on the model to be trained in an overlapping manner by a pre-reading mode.
In one embodiment, the third processing module 703 and the fourth processing module 704 are specifically configured to:
the preprocessing for the initial tensor and the training for the model to be trained are executed in parallel in the pre-reading mode, wherein the preprocessing for the initial tensor is the Nth preprocessing performed by the third processing module 703 through the central processor, the training for the model to be trained is the (N+1)th iterative training performed by the fourth processing module 704 through the graphics processor or the tensor processor, and N is a positive integer.
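A minimal, self-contained sketch of this overlap using tf.data's prefetch; the dataset, the preprocessing, and the train_step stand-in below are illustrative assumptions, not the patent's concrete pipeline:

```python
import tensorflow as tf

def preprocess(batch):
    return tf.cast(batch, tf.float32) / 255.0   # runs on CPU threads

@tf.function
def train_step(batch):
    return tf.reduce_mean(batch)                # stand-in for one training iteration on GPU/TPU

dataset = (tf.data.Dataset.range(1024)
           .batch(32)
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .prefetch(1))                        # keep one preprocessed batch ahead

for batch in dataset:   # while this step trains, the CPU already prepares the next batch
    train_step(batch)
```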
Applying the embodiments of the present application yields at least the following beneficial effects:
acquiring at least one binary file corresponding to a plurality of training samples; determining an initial tensor corresponding to the at least one binary file; determining the number of preprocessing threads for the initial tensor through a preset automatic thread adjustment mode, preprocessing the initial tensor through the preprocessing threads corresponding to that number, and determining the preprocessed tensor; and inputting the preprocessed tensor into a model to be trained and training it. In this way, a large number of training samples are converted into a small number of binary files, the initial tensors corresponding to these files are determined, the preset automatic thread adjustment mode maximizes the preprocessing efficiency for the initial tensor, and the preprocessed tensor is determined and input into the model to be trained, thereby improving the reading speed of training samples in model training.
The embodiment of the application also provides an electronic device, a schematic structural diagram of which is shown in fig. 8. The electronic device 4000 shown in fig. 8 includes a processor 4001 and a memory 4003, where the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that transfers information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean there is only one bus or one type of bus.
Memory 4003 may be, but is not limited to, ROM (Read-Only Memory) or another type of static storage device capable of storing static information and instructions, RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, EEPROM (Electrically Erasable Programmable Read-Only Memory), CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store a computer program and can be read by a computer.
The memory 4003 is used for storing the computer program that executes an embodiment of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiment.
The electronic device includes, but is not limited to, a server and the like.
Applying the embodiments of the present application yields at least the following beneficial effects:
Acquiring at least one binary file corresponding to a plurality of training samples; determining an initial tensor corresponding to the at least one binary file; determining the number of preprocessing threads for the initial tensor through a preset automatic thread adjustment mode, preprocessing the initial tensor through the preprocessing threads corresponding to that number, and determining the preprocessed tensor; and inputting the preprocessed tensor into a model to be trained and training it. In this way, a large number of training samples are converted into a small number of binary files, the initial tensors corresponding to these files are determined, the preset automatic thread adjustment mode maximizes the preprocessing efficiency for the initial tensor, and the preprocessed tensor is determined and input into the model to be trained, thereby improving the reading speed of training samples in model training.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The embodiment of the application also provides a computer program product, which comprises a computer program; when the computer program is executed by a processor, it implements the steps and corresponding content of the foregoing method embodiment.
Based on the same principle as the method provided by the embodiments of the present application, the embodiments of the present application also provide a computer program product or a computer program, which comprises computer instructions stored in a computer-readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the method provided in any of the alternative embodiments of the application described above.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the operation steps with arrows, the order in which these steps are performed is not limited to the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios the steps in the flowcharts may be performed in other orders as required. Furthermore, depending on the actual implementation scenario, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages. Some or all of these sub-steps or stages may be performed at the same time, or each may be performed at a different time; where the execution times differ, the execution order of the sub-steps or stages can be flexibly configured as required, and this is not limited by the embodiments of the present application.
The foregoing is merely an optional implementation of some implementation scenarios of the present application. It should be noted that, for those skilled in the art, adopting other similar implementations based on the technical idea of the present application, without departing from that technical idea, also falls within the protection scope of the embodiments of the present application.

Claims (14)

1. A method of data processing, comprising:
acquiring at least one binary file corresponding to a plurality of training samples;
determining an initial tensor corresponding to the at least one binary file;
determining the number of preprocessing threads aiming at the initial tensor through a preset automatic thread adjusting mode, preprocessing the initial tensor through the preprocessing threads corresponding to the number of the preprocessing threads, and determining the preprocessed tensor;
inputting the preprocessed tensor into a model to be trained, and training the model to be trained.
2. The method of claim 1, wherein the obtaining at least one binary file corresponding to the plurality of training samples comprises:
decoding a plurality of training samples to obtain a first character string corresponding to each training sample in the plurality of training samples;
creating a protocol memory block corresponding to each training sample, filling sample characteristics in a first character string corresponding to each training sample into the protocol memory block, and serializing the protocol memory block into a second character string;
and generating at least one binary file corresponding to the training samples based on each second character string.
3. The method of claim 1, wherein the determining the initial tensor corresponding to the at least one binary file comprises:
and analyzing the protocol memory block corresponding to the at least one binary file through a preset analyzer, and determining an initial tensor corresponding to the at least one binary file.
4. The method of claim 1, wherein the determining the number of preprocessing threads for the initial tensor through a preset automatic thread adjusting mode comprises:
the number of preprocessing threads for the initial tensor is determined based on the custom preprocessing function and the number of central processor cores.
5. The method of claim 4, wherein determining the number of preprocessing threads for the initial tensor based on the custom preprocessing function and the number of central processor cores comprises:
determining a value range of the number of preprocessing threads for the initial tensor based on the number of custom preprocessing functions and the number of central processor cores, the value range including a plurality of different numbers of preprocessing threads;
determining a preprocessing time corresponding to each different preprocessing thread number in the plurality of different preprocessing thread numbers;
and determining the number of preprocessing threads corresponding to the minimum preprocessing time in the preprocessing times as the number of preprocessing threads aiming at the initial tensor.
6. The method of claim 5, wherein determining a preprocessing time for each different number of preprocessing threads in the plurality of different numbers of preprocessing threads comprises:
and preprocessing a preset tensor through the preprocessing threads corresponding to each different number of preprocessing threads in the plurality of different numbers of preprocessing threads, to obtain a preprocessing time corresponding to each different number of preprocessing threads.
7. The method according to claim 1, wherein the preprocessing the initial tensor by the preprocessing thread corresponding to the number of preprocessing threads, determining the preprocessed tensor includes:
and preprocessing the initial tensor by a parallel processing mode of at least two preprocessing threads corresponding to the number of the preprocessing threads, and determining the preprocessed tensor.
8. The method of claim 1, wherein after the preprocessing the initial tensor by the preprocessing thread corresponding to the number of preprocessing threads, determining a preprocessed tensor, further comprises:
and storing the preprocessed tensor through a memory cache or a disk cache.
9. The method as recited in claim 1, further comprising:
and performing preprocessing on the initial tensor and training on the model to be trained in an overlapping manner by a pre-reading mode.
10. The method according to claim 9, wherein the overlapping execution of the preprocessing for the initial tensor and the training for the model to be trained by pre-reading means comprises:
and performing the preprocessing on the initial tensor and the training on the model to be trained in parallel in a pre-reading mode, wherein the preprocessing of the initial tensor is the Nth preprocessing performed by a central processor, the training of the model to be trained is the (N+1)th iterative training performed by a graphics processor or a tensor processor, and N is a positive integer.
11. A data processing apparatus, comprising:
the first processing module is used for acquiring at least one binary file corresponding to the training samples;
the second processing module is used for determining an initial tensor corresponding to the at least one binary file;
the third processing module is used for determining the number of preprocessing threads aiming at the initial tensor through a preset automatic thread adjusting mode, preprocessing the initial tensor through the preprocessing threads corresponding to the number of the preprocessing threads, and determining the preprocessed tensor;
and the fourth processing module is used for inputting the preprocessed tensor into a model to be trained and training the model to be trained.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-10.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-10.
14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-10.
CN202211289462.0A 2022-10-20 2022-10-20 Data processing method, apparatus, device, readable storage medium, and program product Pending CN117011397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211289462.0A CN117011397A (en) 2022-10-20 2022-10-20 Data processing method, apparatus, device, readable storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211289462.0A CN117011397A (en) 2022-10-20 2022-10-20 Data processing method, apparatus, device, readable storage medium, and program product

Publications (1)

Publication Number Publication Date
CN117011397A 2023-11-07

Family

ID=88564204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211289462.0A Pending CN117011397A (en) 2022-10-20 2022-10-20 Data processing method, apparatus, device, readable storage medium, and program product

Country Status (1)

Country Link
CN (1) CN117011397A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117472595A (en) * 2023-12-27 2024-01-30 苏州元脑智能科技有限公司 Resource allocation method, device, vehicle, electronic equipment and storage medium
CN117472595B (en) * 2023-12-27 2024-03-22 苏州元脑智能科技有限公司 Resource allocation method, device, vehicle, electronic equipment and storage medium
CN117575887A (en) * 2024-01-17 2024-02-20 东方空间(江苏)航天动力有限公司 Telemetry data processing method, device and storage medium
CN117575887B (en) * 2024-01-17 2024-03-12 东方空间(江苏)航天动力有限公司 Telemetry data processing method, device and storage medium

Similar Documents

Publication Publication Date Title
US11061731B2 (en) Method, device and computer readable medium for scheduling dedicated processing resource
CN117011397A (en) Data processing method, apparatus, device, readable storage medium, and program product
CN108235116B (en) Feature propagation method and apparatus, electronic device, and medium
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN104850388A (en) Method and apparatus for drafting webpage
CN110750664A (en) Picture display method and device
CN108268936B (en) Method and apparatus for storing convolutional neural networks
CN110609677A (en) WebGL-based graph drawing method, device and system
CN108595211B (en) Method and apparatus for outputting data
CN115982491A (en) Page updating method and device, electronic equipment and computer readable storage medium
US10897514B2 (en) Methods, devices, and computer program products for processing target data
CN115357663A (en) Data synchronization method, system and device based on incremental data synchronization component
CN111680799A (en) Method and apparatus for processing model parameters
CN110795143A (en) Method, apparatus, computing device, and medium for processing functional module
CN116483584B (en) GPU task processing method and device, electronic equipment and storage medium
US11429317B2 (en) Method, apparatus and computer program product for storing data
US20200202479A1 (en) Method and Apparatus for Processing a Video Frame
CN112230956A (en) Artificial intelligence model updating method, system, electronic equipment and storage medium
CN111580883A (en) Application program starting method, device, computer system and medium
CN116434218A (en) Check identification method, device, equipment and medium suitable for mobile terminal
US20170269893A1 (en) Remote rendering of locally displayed content
CN114786069A (en) Video generation method, device, medium and electronic equipment
CN113010666A (en) Abstract generation method, device, computer system and readable storage medium
CN111131354B (en) Method and apparatus for generating information
CN113761416A (en) Request processing method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination