CN110427222A - Data load method, device, electronic equipment and storage medium - Google Patents

Data load method, device, electronic equipment and storage medium

Info

Publication number
CN110427222A
CN110427222A (application number CN201910551038.0A)
Authority
CN
China
Prior art keywords
characteristic information
data
coding
training
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910551038.0A
Other languages
Chinese (zh)
Inventor
舒承椿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910551038.0A
Publication of CN110427222A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/445: Program loading or initiating
    • G06F 9/44521: Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to a data loading method and apparatus, an electronic device, and a storage medium, and relates to the field of computer technology. It solves the problem that training data loads slowly when data loading relies on high-complexity OP operators. The disclosed method includes: when preparing training data, determining, according to feature extraction configuration information, at least one piece of feature information from at least one sample included in a raw data set used to train a deep learning model; encoding the at least one piece of feature information according to the coding type of the at least one piece of feature information; when loading training data, determining the coding type of the feature information by parsing the encoded feature information, and marking the feature information according to the coding type; and generating, according to the marked feature information, a data stream for training the deep learning model. Because the present disclosure completes the feature extraction of the training data before the training data is loaded, it does not depend on OP operators and is simple to operate.

Description

Data load method, device, electronic equipment and storage medium
Technical field
The present disclosure relates to the field of computer technology, and in particular to a data loading method and apparatus, an electronic device, and a storage medium.
Background technique
The training process of a deep learning model is divided into four interrelated stages: training data set preparation, training data set loading, model training, and model saving. Training data set preparation includes data cleaning, alignment, filtering, and transformation; training data set loading directly provides a data stream to the model training program; model training generally uses hardware such as GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays) to accelerate training; finally, after model training produces the optimal parameter set, the result is stored on disk or in a distributed file system to be supplied to subsequent online prediction services.
One current training data loading scheme, taking the tfrecord+queue mode as an example, is a training data loading method provided by the tensorflow system whose IO (Input/Output) throughput is higher than that of the feed_dict mode. As shown in Fig. 1, with this scheme the user first converts a raw data set (for example the ImageNet image classification data set) into the binary tfrecord format; the training loading program then reads the tfrecord data with multiple threads using queue techniques such as FIFO Queue (First In First Out Queue) and feeds the data to the model training program by means of IO OPs (operations). However, during training data loading, feature extraction on the training data is implemented with OP operators. This approach requires a complicated compilation procedure to obtain the OP operators, so that the OP operators have high complexity, which increases the computational complexity of feature extraction and the training data loading time.
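For readers unfamiliar with that pipeline, the following is a minimal sketch of the tfrecord+queue loading style described above, written against the TensorFlow 1.x queue API; the feature names and shapes are assumptions chosen only for illustration, and the in-graph decode step stands in for the OP-based feature transforms whose cost the disclosure targets.

```python
import tensorflow as tf  # assuming a TensorFlow 1.x environment with the queue-based API

# Multithreaded readers pull serialized records from a filename queue.
filename_queue = tf.train.string_input_producer(["train.tfrecord"])
_, serialized = tf.TFRecordReader().read(filename_queue)

# Every step below is an OP executed inside the training graph for each record.
features = tf.parse_single_example(
    serialized,
    features={"image": tf.FixedLenFeature([], tf.string),
              "label": tf.FixedLenFeature([], tf.int64)})
image = tf.reshape(tf.decode_raw(features["image"], tf.uint8), [224, 224, 3])

# FIFO/shuffle queues assemble mini-batches and feed them to the model via IO OPs.
images, labels = tf.train.shuffle_batch(
    [image, features["label"]], batch_size=32, capacity=1024, min_after_dequeue=256)
```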
In summary, the OP operators used during training data loading in the related art require a complicated compilation procedure and have high complexity, which increases the computational complexity and the training data loading time.
Summary of the invention
The present disclosure provides a data loading method, apparatus, and system, at least to solve the problem in the related art that the OP operators used when loading training data require a complicated compilation procedure and have high complexity, increasing the computational complexity and the training data loading time. The technical solution of the disclosure is as follows:
According to a first aspect of the embodiments of the present disclosure, a data loading method is provided, comprising:
when preparing training data, determining, according to feature extraction configuration information, at least one piece of feature information from at least one sample included in a raw data set used to train a deep learning model;
encoding the at least one piece of feature information according to the coding type of the at least one piece of feature information;
when loading training data, determining the coding type of the feature information by parsing the encoded feature information, and marking the feature information according to the coding type;
generating, according to the marked feature information, a data stream for training the deep learning model.
In a possible implementation, the step of determining at least one piece of feature information, according to the feature extraction configuration information, from the at least one sample included in the raw data set used to train the deep learning model includes:
dividing the data in the raw data set into multiple portions of data, wherein each portion of data includes at least one sample;
performing the following processing on each portion of data in parallel:
selecting at least one piece of feature information from the at least one sample included in one portion of data according to preconfigured feature extraction information, and taking the feature information selected from the same portion of data as one feature set.
In a possible implementation, the step of encoding the at least one piece of feature information according to the coding type of the at least one piece of feature information includes:
performing the following processing on each feature set in parallel:
determining the coding type of the at least one piece of feature information in the feature set;
encoding the at least one piece of feature information according to the coding type of the at least one piece of feature information.
In a possible implementation, the step of determining the coding type of the at least one piece of feature information in the feature set includes:
determining the coding type corresponding to the at least one piece of feature information in the feature set according to the correspondence between feature information and coding types in the preconfigured feature extraction information.
In a possible implementation, before the step of determining the coding type of the feature information by parsing the encoded feature information, the method further includes:
storing the encoded feature sets into corresponding storage paths, wherein the storage paths corresponding to any two encoded feature sets are different.
In a possible implementation, the step of determining the coding type of the feature information by parsing the encoded feature information includes:
when loading training data, allocating different storage paths to different training data loading processes, and calling multiple training data loading processes in parallel to determine the coding type of the feature information, wherein each time a training data loading process is called to determine the coding type of the feature information, the following processing is performed:
using the called training data loading process, reading the encoded feature information from the feature set under the storage path allocated to that training data loading process, parsing the read encoded feature information, and determining the coding type of the encoded feature information.
In a possible implementation, the step of generating, according to the marked feature information, the data stream for training the deep learning model includes:
determining an order among the marked feature information belonging to different samples according to sample time, and combining the marked feature information in that order to generate the data stream, wherein the sample time is the time at which a sample is generated; or
randomly determining an order among the marked feature information belonging to different samples, and combining the marked feature information in that order to generate the data stream.
In a possible implementation, after the step of generating, according to the marked feature information, the data stream for training the deep learning model, the method further includes:
before training the deep learning model, dividing the data stream into multiple sub-data streams, and selecting at least one sub-data stream from the multiple sub-data streams;
storing the at least one sub-data stream into a device used to train the deep learning model.
According to a second aspect of the embodiments of the present disclosure, a data loading apparatus is provided, comprising:
a feature extraction unit configured to, when preparing training data, determine at least one piece of feature information, according to feature extraction configuration information, from at least one sample included in a raw data set used to train a deep learning model;
an encoding unit configured to encode the at least one piece of feature information according to the coding type of the at least one piece of feature information;
a parsing unit configured to, when loading training data, determine the coding type of the feature information by parsing the encoded feature information, and mark the feature information according to the coding type;
a generation unit configured to generate, according to the marked feature information, a data stream for training the deep learning model.
In a possible implementation, the feature extraction unit is specifically configured to:
divide the data in the raw data set into multiple portions of data, wherein each portion of data includes at least one sample;
perform the following processing on each portion of data in parallel:
select at least one piece of feature information from the at least one sample included in one portion of data according to the preconfigured feature extraction information, and take the feature information selected from the same portion of data as one feature set.
In a possible implementation, the encoding unit is specifically configured to:
perform the following processing on each feature set in parallel:
determine the coding type of the at least one piece of feature information in the feature set;
encode the at least one piece of feature information according to the coding type of the at least one piece of feature information.
In a possible implementation, the encoding unit is specifically configured to:
determine the coding type corresponding to the at least one piece of feature information in the feature set according to the correspondence between feature information and coding types in the preconfigured feature extraction information.
In a possible implementation, the data loading apparatus further includes:
a storage unit configured to store the encoded feature sets into corresponding storage paths, wherein the storage paths corresponding to any two encoded feature sets are different.
In a possible implementation, the parsing unit is specifically configured to:
when loading training data, allocate different storage paths to different data loading processes, and call multiple data loading processes in parallel to determine the coding type of the feature information, wherein each time a data loading process is called to determine the coding type of the feature information, the following processing is performed:
using the called data loading process, read the encoded feature information from the feature set under the storage path allocated to that data loading process, parse the read encoded feature information, and determine the coding type of the encoded feature information.
In a possible implementation, the generation unit is specifically configured to:
determine an order among the marked feature information belonging to different samples according to sample time, and combine the marked feature information in that order to generate the data stream, wherein the sample time is the time at which a sample is generated; or
randomly determine an order among the marked feature information belonging to different samples, and combine the marked feature information in that order to generate the data stream.
In a possible implementation, the data loading apparatus further includes:
a dividing unit configured to, before training the deep learning model, divide the data stream into multiple sub-data streams, and select at least one sub-data stream from the multiple sub-data streams;
and to store the at least one sub-data stream into a device used to train the deep learning model.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, comprising: a memory for storing executable instructions;
and a processor for reading and executing the executable instructions stored in the memory, so as to implement the data loading method described in any one of the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, a non-volatile storage medium is provided, wherein when instructions in the storage medium are executed by a processor of a data loading apparatus, the data loading apparatus is enabled to perform the data loading method described in any one of the first aspect of the embodiments of the present disclosure.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, wherein when the computer program product runs on an electronic device, the electronic device is caused to perform the method of the above first aspect of the embodiments of the present disclosure and any method that the first aspect may relate to.
The technical solution provided by the embodiments of the present disclosure brings at least the following beneficial effects:
Compared with the existing practice of performing feature extraction with OP operators, when the embodiments of the present disclosure perform feature extraction on training data, the feature information is selected directly from the raw data set according to the preconfigured feature extraction information, without depending on OP operators, which reduces the computational complexity of the feature extraction process; moreover, feature extraction is carried out during the training data preparation process before training data loading, and during training data loading the encoded feature information is directly parsed to generate the data stream, which increases the speed of training data loading.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification, show embodiments consistent with the present disclosure, serve together with the specification to explain the principles of the present disclosure, and do not constitute an improper limitation of the present disclosure.
Fig. 1 is a schematic diagram of training data loading in the tfrecord+queue mode according to an exemplary embodiment;
Fig. 2 is a schematic diagram of a method for preparing and loading training data according to an exemplary embodiment;
Fig. 3 is a schematic diagram of a training data preparation and loading apparatus according to an exemplary embodiment;
Fig. 4 is a schematic diagram of another training data preparation and loading apparatus according to an exemplary embodiment;
Fig. 5 is a schematic diagram of a distributed feature extraction module according to an exemplary embodiment;
Fig. 6 is a schematic diagram of a data parsing and combination module according to an exemplary embodiment;
Fig. 7 is a schematic diagram of a data prefetching and caching module according to an exemplary embodiment;
Fig. 8 is a flowchart of a complete training data loading method according to an exemplary embodiment;
Fig. 9 is a block diagram of a first data loading apparatus according to an exemplary embodiment;
Fig. 10 is a block diagram of a second data loading apparatus according to an exemplary embodiment.
Specific embodiment
In order to enable those of ordinary skill in the art to better understand the technical solution of the present disclosure, the technical solution in the embodiments of the present disclosure is described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of the present disclosure are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; on the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure, as detailed in the appended claims.
Some terms appearing in the text are explained below:
1. The term "and/or" in the embodiments of the present disclosure describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
2. Regarding the term "training data" in the embodiments of the present disclosure: in the fields of deep learning and machine learning, the training of a model is driven by training data, and the algorithm then finds, in a given function set, the parameters that best fit the training data. Training data is the input of the model, and the output of the model is the fitted optimal parameter set.
3. The term "OP (operation) operator" in the embodiments of the present disclosure is a concept of deep learning training programs written with tensorflow, the deep learning framework provided by Google. It represents a special operation operator that is often used to read training data and perform various transformations on the data. An OP operator usually has corresponding GPU (Graphics Processing Unit) and CPU (Central Processing Unit) versions, corresponding to differently optimized versions for the two kinds of computing units.
4. The term "GPU" in the embodiments of the present disclosure is a microprocessor for image computation; because it has thousands of cores, it is also commonly used for deep learning model training.
5. The term "feature extraction" in the embodiments of the present disclosure refers to selecting the most statistically meaningful feature data (or feature information, etc.) from a data set or feature set, wherein the data included in the data set may have different features.
6. The term "standard input and output" in the embodiments of the present disclosure refers to the three standard files that are usually opened automatically when a shell command line is executed, namely the standard input file (stdin), which usually corresponds to the terminal keyboard, and the standard output file (stdout) and the standard error output file (stderr), both of which correspond to the terminal screen. A process obtains input data from the standard input file, outputs normal output data to the standard output file, and sends error information to the standard error file.
7. The term "tensorflow" in the embodiments of the present disclosure is an open-source software library for high-performance numerical computation.
8. The term "ImageNet" in the embodiments of the present disclosure is a large visualization database for research on visual object recognition software.
9. The term "tfrecord" in the embodiments of the present disclosure is a binary storage format used by the tensorflow deep learning framework to store extracted features.
10. The term "feed_dict" in the embodiments of the present disclosure is a data structure for storing the correspondence between the placeholder variables defined in a tensorflow program and the tensors assigned to them.
The application scenarios described in the embodiments of the present disclosure are intended to illustrate the technical solution of the embodiments of the present disclosure more clearly and do not constitute a limitation of the technical solution provided by the embodiments of the present disclosure. Those of ordinary skill in the art will know that, as new application scenarios appear, the technical solution provided by the embodiments of the present disclosure is equally applicable to similar technical problems. In the description of the present disclosure, unless otherwise indicated, "multiple" means two or more.
Deep learning technology is widely used in various fields of the Internet, including natural language processing, web search, commodity recommendation, Internet advertising, and the like. Compared with traditional machine learning methods, it has higher precision and better generalization.
Generally, the more training data there is, the better the effect of the deep learning model. One of the challenges of deep learning model training is how to quickly use massive data to train a model with higher precision.
Taking the advertisement recommendation business scenario as an example, an advertising system generates massive logs every day, usually on the order of 10 TB (terabytes). A deep learning training model needs to read feature data and label data from one week of logs. If the training model is updated daily, the ability to process 1 GB (gigabyte) of training data per second, or even several times that, is required, which usually exceeds the IO processing capability of Gigabit Ethernet or of a generic training data loading program.
In the related art, in the training data preparation stage, data features are extracted into common data formats such as numpy, tfrecord, or csv (Comma-Separated Values), and the training data set is then stored on a local disk or on HDFS (Hadoop Distributed File System, a distributed storage file system). The data loading apparatus reads the data locally or remotely and then inputs it to the training program in the feed_dict manner or by means of IO OPs.
Here, the numpy system is an open-source numerical computation extension for Python. This tool can be used to store and process large matrices and is much more efficient than Python's own nested list structure (which can also be used to represent matrices).
The ways in which common data loading apparatuses are implemented include tensorflow's feed_dict, tfrecord+queue, the tensorflow dataset interface, the pytorch data loader (an important data reading interface in PyTorch), and the like.
Considering that there are many problems in the related art in the preparation and loading of training data before a deep learning model is trained, for example: in the training data preparation process the data conversion tool is a stand-alone program, so it is relatively difficult to parallelize feature extraction over massive data, feature extraction takes a long time, which affects timely updating of the training model, and scalability is lacking; furthermore, when data is loaded, the custom OP operators used are complicated to write and the loading speed is difficult to optimize: although tensorflow OPs can be optimized for GPU, this depends heavily on the hardware principles of the GPU and on multithreading, has a very steep learning curve, and is not easy to optimize.
The embodiments of the present disclosure therefore move the feature extraction part of the training data processing forward into the training data preparation stage. As shown in Fig. 2, Fig. 2 is a flowchart of a training data preparation method according to an exemplary embodiment, which includes the following steps:
S21: when preparing training data, determining, according to feature extraction configuration information, at least one piece of feature information from at least one sample included in a raw data set used to train a deep learning model;
S22: encoding the at least one piece of feature information according to the coding type of the at least one piece of feature information;
S23: when loading training data, determining the coding type of the feature information by parsing the encoded feature information, and marking the feature information according to the coding type;
S24: generating, according to the marked feature information, a data stream for training the deep learning model.
Compared with the existing practice of performing feature extraction with OP operators, when the embodiments of the present disclosure perform feature extraction on training data, the feature information is selected directly from the raw data set according to the preconfigured feature extraction information, without depending on OP operators, which reduces the computational complexity of the feature extraction process; moreover, feature extraction is carried out during the training data preparation process before training data loading, and during training data loading the encoded feature information is directly parsed to generate the data stream, which increases the speed of training data loading.
In the embodiments of the present disclosure, the preconfigured feature extraction information may be a marker file that contains the feature information fields to be selected and the coding type of each piece of feature information, wherein there is a correspondence between coding types and feature information. As for the correspondence between preconfigured feature extraction information and samples, one sample may correspond to one piece of feature extraction information, or all samples in the raw data set may correspond to the same feature extraction information.
Take the case where all samples in the raw data set correspond to the same feature extraction information as an example, and let the raw data set be statistics, collected over one week on a certain short-video APP (Application), of advertisements shown to different users. If one sample is regarded as one piece of data (a sample corresponds to many statistics, for example one log record), then the features corresponding to these statistics include: the type of the advertisement, the duration of the advertisement, the theme of the advertisement, the connotation of the advertisement, the slogan of the advertisement, the image of the advertisement, the user's age, the user's gender, the user's login time, whether the user clicked the advertisement, and so on. The feature extraction information indicates which fields in this data are needed and which are not, so that the needed feature information is extracted through the preset feature extraction information; for example, the preset feature extraction information may be: the user's age, the type of the advertisement, and whether the user clicked the advertisement.
Suppose the data included in sample 1 are: game (type of the advertisement), 2 minutes (duration of the advertisement), 20 years old (user's age), yes (the user clicked the advertisement), and male (user's gender); then the feature information selected according to the above feature extraction information is: game, 20, yes.
Suppose the data included in sample 2 are: education (type of the advertisement), 1 minute (duration of the advertisement), 10 years old (user's age), no (the user did not click the advertisement), and female (user's gender); then the feature information selected according to the above feature extraction information is: education, 10, no.
It should be noted that the forms of preset feature extraction information cited in the embodiments of the present disclosure are merely illustrative; feature extraction information in any form is applicable to the embodiments of the present disclosure.
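To make the example above concrete, the following is a minimal Python sketch of what the preconfigured feature extraction information and its application to one raw sample might look like; the field names and the dictionary format are assumptions for illustration only, not a format prescribed by the disclosure.

```python
# Hypothetical feature extraction information: which fields to keep from each raw
# sample and which coding type each kept field corresponds to.
FEATURE_EXTRACTION_CONFIG = {
    "user_age":   "dense",
    "ad_type":    "sparse",
    "ad_clicked": "binary_label",
}

def extract_features(sample):
    """Select only the configured fields from one raw sample (e.g. one log record)."""
    return {name: sample[name] for name in FEATURE_EXTRACTION_CONFIG if name in sample}

sample_1 = {"ad_type": "game", "ad_duration_min": 2, "user_age": 20,
            "ad_clicked": True, "user_gender": "male"}
print(extract_features(sample_1))  # {'user_age': 20, 'ad_type': 'game', 'ad_clicked': True}
```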
In the embodiments of the present disclosure, the coding type corresponding to the feature information selected from a sample is determined according to the preset feature extraction information, and the selected feature information is then encoded with the determined coding type.
The coding types include, but are not limited to, some or all of the following:
dense coding, sparse coding, and binary label coding.
In the embodiments of the present disclosure, sparse coding corresponds to dense coding. Sparse coding can generally be applied to user IDs (identifications), advertisement IDs, and the like, while dense coding can generally be applied to user age, examination scores, and the like. Binary label coding means representing label data in binary form; for example, if the label data is whether the user clicked the advertisement, then "yes" can be encoded as 1 and "no" as 0.
For example, the correspondence between coding types and feature information in the preconfigured feature extraction information is: user age corresponds to dense coding, advertisement type corresponds to sparse coding, and whether the user clicked the advertisement corresponds to binary label coding.
Suppose the advertisement type is game; after sparse coding it can be represented as 0000000000000001. Suppose the advertisement type is education; after sparse coding it can be represented as 0010000000000000. Here the advertisement type is represented with 16 bits under sparse coding, each bit corresponds to one advertisement type, and a value of 1 on a certain bit indicates that the type of the advertisement is the advertisement type corresponding to that bit.
Suppose the user's age is 10; after dense coding it can be represented as 0001010. A user age of 16 can be represented as 0010000 after dense coding, a user age of 32 as 0100000, and a user age of 64 as 1000000.
In a possible implementation, an existing coding mode, such as a tool provided by Google, can also be used to convert the selected feature information into binary.
It should be noted that the feature information coding types cited in the embodiments of the present disclosure are merely illustrative; any feature information coding type is applicable to the embodiments of the present disclosure.
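Under the assumption that the 16-bit sparse code, the 7-bit dense code, and the 0/1 label code behave exactly as in the worked example above, the encoding step could be sketched as follows; the vocabulary and bit positions are hypothetical.

```python
AD_TYPE_BIT = {"game": 15, "education": 2}  # hypothetical bit position per ad type (16-bit code)

def encode_sparse(ad_type):
    bits = ["0"] * 16
    bits[AD_TYPE_BIT[ad_type]] = "1"         # one bit per advertisement type
    return "".join(bits)                      # "game" -> "0000000000000001"

def encode_dense(age):
    return format(age, "07b")                 # 10 -> "0001010", 64 -> "1000000"

def encode_binary_label(clicked):
    return "1" if clicked else "0"

CODERS = {"sparse": encode_sparse, "dense": encode_dense, "binary_label": encode_binary_label}

def encode_features(features, config):
    """Encode each selected feature with the coding type the configuration assigns to it."""
    return {name: CODERS[config[name]](value) for name, value in features.items()}
```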
In a possible implementation, considering that parallelization can increase the speed of feature extraction, distributed feature extraction can be used. Specifically, the data in the raw data set is divided into multiple portions of data, wherein each portion of data includes at least one sample; the following processing is performed on each portion of data in parallel: at least one piece of feature information is selected, according to the preconfigured feature extraction information, from the at least one sample included in one portion of data, and the feature information selected from the same portion of data is taken as one feature set; the following processing is then performed on each feature set in parallel: the coding type of the at least one piece of feature information in the feature set is determined, and the at least one piece of feature information is encoded according to that coding type. Feature extraction is performed simultaneously by multiple compute nodes. For example, the data in the raw data set is divided into 100 portions and handled by 10 compute nodes; nodes 1 to 10 simultaneously select feature information from the 1st to 10th portions of data respectively, and whenever a node finishes its portion it can pick one portion from the remaining unprocessed data and continue selecting feature information.
Here, taking the feature information selected from the same portion of data as one feature set means, for example: one portion of data contains 2 samples, 5 pieces of feature information are selected from sample 1 and 5 pieces from sample 2, and these 10 pieces of feature information are taken as one feature set.
Likewise, distributed processing can also be used when the feature information is encoded, with multiple compute nodes encoding simultaneously.
In the above manner, the time-consuming feature extraction operation is executed in a distributed way, which improves the training data loading speed. It is also easy to scale: when more raw data needs to be processed, more compute nodes can be allocated directly to obtain a faster data preprocessing speed, and using more data reading nodes can linearly increase the training data loading speed.
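A single-machine approximation of the distributed selection and encoding described above might look as follows; a multiprocessing pool stands in for the compute nodes, and the imported helpers and configuration are the hypothetical ones from the earlier sketches.

```python
from multiprocessing import Pool

# Hypothetical module collecting the helpers sketched earlier in this description.
from feature_sketch import FEATURE_EXTRACTION_CONFIG, extract_features, encode_features

def process_portion(portion):
    """One compute node: select and encode the feature information of every sample
    in its portion, returning one feature set per portion."""
    return [encode_features(extract_features(sample), FEATURE_EXTRACTION_CONFIG)
            for sample in portion]

def split_into_portions(samples, n_portions):
    # Round-robin split so that every portion contains at least one sample.
    return [samples[i::n_portions] for i in range(n_portions)]

def distributed_feature_extraction(samples, n_portions=100, n_workers=10):
    portions = split_into_portions(samples, n_portions)
    with Pool(processes=n_workers) as pool:   # 10 workers stand in for 10 compute nodes
        return pool.map(process_portion, portions)
```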
In the embodiments of the present disclosure, the encoded feature information is stored after the feature information has been encoded; when training data is loaded, the encoded feature information is read and parsed to generate the data stream for training the deep learning model.
In a possible implementation, after the feature information is encoded, the feature information in the encoded feature sets can be compressed to obtain compressed feature sets, and the compressed feature sets are stored in a distributed manner, that is, the compressed feature sets are stored into corresponding storage paths, with any two compressed feature sets stored under different storage paths.
The above manner uses data compression technology to optimize the data path and improves the speed of the entire training data loading.
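A sketch of the compression and distributed-storage step, assuming the zstandard Python bindings are available and using a local directory as a stand-in for an HDFS mount; the path layout is an assumption.

```python
import pickle

import zstandard as zstd  # assuming the zstandard bindings (pip package "zstandard")

def store_feature_set(feature_set, portion_index, base_dir="/mnt/hdfs/features"):
    """Compress one encoded feature set and write it to a path no other set uses."""
    payload = zstd.ZstdCompressor().compress(pickle.dumps(feature_set))
    path = f"{base_dir}/part-{portion_index:05d}.zst"
    with open(path, "wb") as f:
        f.write(payload)
    return path
```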
Considering that the IO throughput of training data loading is not high, loading data with the multithreaded FIFO Queue+OP mode has speed bottlenecks and cannot make full use of the IO throughput of the network bandwidth or the disk, in particular when data transformation operations are performed, so the overall training data loading speed still has room for improvement.
Therefore, in the embodiments of the present disclosure, multi-process reading can be used when training data is loaded. Specifically, when loading training data, different storage paths are allocated to different data loading processes, and multiple data loading processes are called in parallel to determine the coding type of the feature information, wherein each time a data loading process is called to determine the coding type of the feature information, the following processing is performed: using the called data loading process, the encoded feature information is read from the feature set under the storage path allocated to that data loading process, the read encoded feature information is parsed, and the coding type of the encoded feature information is determined.
In the above manner, the training data stored in the distributed file system is read with multiple processes, which overcomes the performance problems caused by the global lock in multithreaded implementations and improves the speed of training data loading.
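A sketch of the multi-process reading and parsing described above; each process owns exactly one storage path, and `infer_coding_type` is a placeholder whose rule is only an assumption based on the earlier 16-bit/7-bit/1-bit example codes.

```python
import pickle
from multiprocessing import Process, Queue

import zstandard as zstd

def infer_coding_type(value):
    # Placeholder: a real parser would recognize the coding type from the binary layout.
    return "dense" if len(value) == 7 else "binary_label" if len(value) == 1 else "sparse"

def load_worker(store_path, out_queue):
    """One data loading process: read, decompress, parse, and mark the feature
    information found under the single storage path allocated to this process."""
    with open(store_path, "rb") as f:
        feature_set = pickle.loads(zstd.ZstdDecompressor().decompress(f.read()))
    for encoded in feature_set:
        marked = {name: (infer_coding_type(value), value) for name, value in encoded.items()}
        out_queue.put(marked)

def start_loaders(store_paths, queue_size=1024):
    out_queue = Queue(maxsize=queue_size)
    workers = [Process(target=load_worker, args=(path, out_queue), daemon=True)
               for path in store_paths]       # one distinct storage path per process
    for w in workers:
        w.start()
    return out_queue, workers
```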
In a possible implementation, if the encoded feature information was compressed during training data preparation, then when the feature information is parsed, the compressed feature information first needs to be decompressed; the decompressed feature information is then parsed to determine its coding type.
In a possible implementation, in the embodiments of the present disclosure, after the coding type of the feature information has been determined by parsing and the feature information has been marked, there are several ways to generate the data stream; two of them are given below:
Generation mode one: determining the order among the marked feature information belonging to different samples according to sample time, and combining the marked feature information in that order to generate the data stream, wherein the sample time is the time at which a sample is generated.
In the embodiments of the present disclosure, the time at which a sample is generated differs between application scenarios. For example, in the advertisement recommendation business scenario, the time at which a sample is generated can be the time at which the advertisement was pushed to the user; in a stock information collection scenario, it can be the data collection date of the stock; in image processing scenarios, the sample generation times are generally guaranteed to be independent and identically distributed.
Taking the advertisement recommendation business scenario as an example, if the sample time of sample 1 is the earliest, followed in turn by sample 2 and sample 3, then when the data stream is generated, the marked feature information belonging to sample 1 is located at the very front of the data stream, and the marked feature information belonging to sample 3 is located at the end of the data stream.
Generation mode two: randomly determining the order among the marked feature information belonging to different samples, and combining the marked feature information in that order to generate the data stream.
Specifically, to guarantee that the samples are independent and identically distributed with respect to each other, the order among the marked feature information belonging to different samples is determined randomly, and the marked feature information is combined in that order to generate an independent and identically distributed random data stream.
For example, for 5 samples, suppose the randomly determined order is samples 1, 3, 5, 2, 4; then when the data stream is generated, the marked feature information of sample 1 is located at the very front of the data stream, followed in turn by the marked feature information belonging to sample 3, the marked feature information belonging to sample 5, the marked feature information belonging to sample 2, and the marked feature information belonging to sample 4.
It should be noted that the ways of generating the data stream cited in the embodiments of the present disclosure are merely illustrative; any way of generating a data stream is applicable to the embodiments of the present disclosure.
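The two generation modes can be sketched as follows, assuming each piece of marked feature information is paired with its sample time; which mode to use depends on whether the scenario needs a time-ordered or an approximately independent and identically distributed stream.

```python
import random

def build_data_stream(marked_samples, by_sample_time=True):
    """marked_samples: list of (sample_time, marked_feature_info) pairs."""
    if by_sample_time:
        ordered = sorted(marked_samples, key=lambda item: item[0])  # mode one: time order
    else:
        ordered = list(marked_samples)
        random.shuffle(ordered)                                     # mode two: random order
    return [features for _, features in ordered]
```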
In a possible implementation, after the data stream is generated and before the deep learning model is trained, the feature information can be prefetched and cached: the data stream is divided into multiple mini-batch data streams, at least one mini-batch data stream is selected from the multiple mini-batch data streams, and the at least one mini-batch data stream is stored into the device used to train the deep learning model. Prefetching and caching the training data in the stage closest to model training saves the time the model training program spends reading data.
For example, the marked data (all of the following feature information has been marked) is: sample 1 feature information 1, sample 1 feature information 2, sample 3 feature information 1, sample 3 feature information 2, sample 5 feature information 1, sample 5 feature information 2, sample 2 feature information 1, sample 2 feature information 2, sample 4 feature information 1, sample 4 feature information 2. Then sample 3 feature information 1, sample 3 feature information 2, sample 5 feature information 1, and sample 5 feature information 2 can be combined at any time to form one mini-batch data stream, sample 1 feature information 1 and sample 1 feature information 2 can be combined to form another mini-batch data stream, and the two mini-batch data streams are cached in advance into the hardware cache area of the GPU. In this way a high-speed training data stream can be provided for hardware resources such as the GPU, so that the hardware resources are fully utilized and model training is accelerated.
Because in the embodiments of the present disclosure the training features and label data to be used are packed into highly compressed mini-batches, the subsequent storage, transmission, and reading of the training data are accelerated.
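A sketch of prefetching mini-batches and caching them in GPU memory ahead of the training loop; PyTorch is used here purely to illustrate the CPU-to-GPU copy, and the batch layout (a list of dense float vectors) is an assumption.

```python
import queue
import threading

import torch  # illustration only; a TensorFlow-based trainer would cache batches similarly

def prefetch_to_gpu(mini_batches, depth=4):
    """Keep up to `depth` mini-batches already resident on the GPU ahead of training."""
    buffered = queue.Queue(maxsize=depth)

    def producer():
        for batch in mini_batches:
            gpu_batch = torch.tensor(batch, dtype=torch.float32).to("cuda", non_blocking=True)
            buffered.put(gpu_batch)
        buffered.put(None)                      # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffered.get()
        if item is None:
            return
        yield item
```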
In a possible implementation, a data loading apparatus provided by the embodiments of the present disclosure, as shown in Fig. 3, divides the training data preparation and loading process of a model into a remote distributed computation and storage part, and a reading, parsing, and caching part directly related to the model training platform. The apparatus separates out feature extraction and executes it in a distributed and parallelized manner: distributed feature extraction is performed on the raw data set, and the extraction result is stored in a distributed storage system or file system (such as HDFS). The model training platform reads the extraction result with multiple processes, parses and combines it, prefetches it, and caches it directly in training devices such as GPUs.
The apparatus can be divided into 4 modules: a distributed feature extraction module, a multi-process training data reading module, a data parsing and combination module, and a data prefetching and caching module. Fig. 4 shows a data loading apparatus provided by an embodiment of the present disclosure.
The functions and principles of the components of the apparatus described in the embodiments of the present disclosure are as follows:
1. Distributed feature extraction module.
In the training process of a deep learning model, there is generally some feature extraction work, such as image cropping, flipping, and noise-adding preprocessing, or, for another example, operations such as feature discretization and transformation in an advertisement prediction model. These operations are generally time-consuming and easily affect the loading speed of the training data.
In the apparatus described in the embodiments of the present disclosure, this time-consuming feature extraction work is isolated and implemented in the feature extraction module. The module as a whole can run on a distributed architecture, such as a Hadoop or Spark cluster, and can therefore use many data nodes (compute nodes) to complete the feature extraction work cooperatively.
Here, Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the underlying details of distribution, making full use of the power of a cluster for high-speed computation and storage. Spark is an open-source cluster computing environment similar to Hadoop, but there are also some differences between the two; these useful differences make Spark perform better on certain workloads. Spark enables in-memory distributed data sets and, in addition to providing interactive queries, it can also optimize iterative workloads.
As shown in Fig. 5, the main components of this module include: (1.a) a feature extractor; (1.b) a distributed submission and execution tool; (1.c) a data binary generation program; (1.d) a data compression program; and (1.e) a distributed data storage tool.
One possible implementation is: the feature extraction module (1.a) is implemented as a program that takes standard input (stdin) as input data and standard output (stdout) as result output, and is then run on a Hadoop cluster in the Hadoop Streaming manner using the distributed submission and execution tool (1.b). Another implementation is that the feature extractor is submitted directly to a yarn cluster in the yarn client manner. The result of feature extraction is binary data (1.c); the generated result is compressed with a compression algorithm (1.d) (such as Facebook's zstandard compression algorithm) and then stored in a distributed file system (such as the HDFS file system) using the storage tool (1.e).
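A minimal sketch of the first implementation, i.e. the feature extractor (1.a) as a Hadoop Streaming mapper that reads raw samples from stdin and writes extracted fields to stdout; the JSON line format and field names are assumptions.

```python
#!/usr/bin/env python
# Hadoop Streaming mapper sketch: one raw sample per input line, one extracted
# feature record per output line.
import json
import sys

WANTED_FIELDS = ("user_age", "ad_type", "ad_clicked")  # hypothetical configuration

for line in sys.stdin:
    sample = json.loads(line)
    extracted = {name: sample[name] for name in WANTED_FIELDS if name in sample}
    sys.stdout.write(json.dumps(extracted) + "\n")
```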
Here, Apache Hadoop YARN (Yet Another Resource Negotiator) is a new kind of Hadoop resource manager. It is a universal resource management system that can provide unified resource management and scheduling for upper-layer applications, and its introduction has brought great benefits to clusters in terms of utilization, unified resource management, and data sharing.
2. Multi-process data reading module.
Different from general training data loading programs, which use multithreading, the apparatus proposed by the embodiments of the present disclosure loads data with multi-process technology. In the technical solution of this apparatus, the multi-process implementation gives each reading entity greater independence and concurrency, so a nearly linear speed-up of the training data loading speed can be obtained by increasing the number of processes. In contrast, the concurrency of multithreading is often affected by defects of the system implementation (such as the global lock of python, a computer programming language), which may cause the speed-up ratio to decrease as the number of threads increases, and may even lead to the phenomenon that adding more threads lowers the overall reading speed.
3. Data parsing and combination module.
As shown in Fig. 6, this module includes two parts: data parsing (3.a) and data combination (3.b).
The data parsing (3.a) part performs parsing operations on the data read by the multiple processes, which specifically includes decompressing the data, parsing the decompressed binary data, and reading the sparse and dense feature data and the label data therein.
The data combination (3.b) part then combines the parsed result data to form a data stream that meets the requirements. Examples of data stream requirements include an independent and identically distributed random data stream, or a data stream that follows time order, such as the two data stream generation modes mentioned in the above embodiments.
Because in the embodiments of the present disclosure multiple data samples of the same mini-batch are concentrated into one finally saved piece of data, the training data of multiple mini-batches can subsequently be prefetched and cached directly.
4. Data prefetching and caching module.
As shown in Fig. 7, this module includes two parts: prefetching (4.a) and caching (4.b). Data prefetching (4.a) usually prepares the data streams of several mini-batches in advance from the previous module, so that when the model training needs input data, the data can be obtained immediately without waiting for network or disk IO, saving training data loading time.
Data caching (4.b) usually puts the prefetched small amount of data (such as one batch) into the hardware cache area of the GPU, which reduces the time for copying data from the CPU to the GPU, thereby accelerating the training data loading process.
The technical solution of the embodiments of the present disclosure addresses the training data loading problem of existing deep learning model training and, by combining distributed computation, data compression, multi-process reading, and efficient queue techniques, proposes an apparatus for accelerating the loading of model training data. Because in the embodiments of the present disclosure the time-consuming feature extraction and transformation operations are packaged into a program that can run distributed and in parallel and are submitted to a distributed platform for execution, and the training data stored in the distributed file system is read with multiple processes, the performance problems caused by the global lock in multithreaded implementations are overcome.
Through the above scheme, the embodiments of the present disclosure execute the time-consuming operations in a distributed manner and use data compression technology to optimize the data path, so the speed of the entire training data loading is high. In practice, the training data loading speed can make full use of the network card bandwidth; in a Gigabit Ethernet environment it can reach 400,000 samples per second, improving throughput by about 10 times and improving data IO throughput. It is also easy to scale: when more raw data needs to be processed, more compute nodes can be allocated directly to obtain a faster data preprocessing speed, and using more data reading nodes can linearly increase the training data loading speed. In addition, the embodiments of the present disclosure make it easy to customize feature extraction and transformation: for different feature extraction requirements, the feature extractor (1.a) can be modified directly, and the implementation is not bound to a particular programming language or library, such as C, C++, JAVA, etc.
Fig. 8 is a flowchart of a complete training data loading method according to an exemplary embodiment, which specifically includes the following steps:
S801: when preparing the training data, selecting, in a distributed processing manner and according to preconfigured feature extraction information, multiple pieces of feature information from multiple samples included in the raw data set used to train the deep learning model;
S802: determining the coding types of the multiple pieces of feature information according to the correspondence between feature information and coding types included in the preconfigured feature extraction information;
S803: encoding the selected multiple pieces of feature information according to the determined coding types of the multiple pieces of feature information;
S804: compressing the multiple pieces of encoded feature information to obtain multiple pieces of compressed feature information;
S805: storing the compressed feature information in a distributed manner;
S806: when loading training data, reading the multiple pieces of compressed feature information in distributed storage by means of multi-process reading;
S807: decompressing the multiple pieces of compressed feature information read by the multiple processes;
S808: parsing the multiple pieces of decompressed feature information to determine the coding type of each piece of feature information and marking it;
S809: determining the order among the marked feature information belonging to different samples according to sample time, and combining the marked feature information in that order to generate a data stream;
S810: before training the deep learning model, dividing the data stream into multiple mini-batch data streams, and selecting at least one mini-batch data stream from the multiple mini-batch data streams;
S811: storing the at least one mini-batch data stream into the device used to train the deep learning model.
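Read together, steps S801 to S811 amount to the following end-to-end sketch, composed from the hypothetical helpers introduced in the earlier sketches; the sequential glue, the batch size of 256, and the omission of per-sample times and coding-type marking are simplifications of this illustration only, not part of the disclosed method.

```python
import pickle

import zstandard as zstd

# Hypothetical module collecting the helpers sketched earlier in this description.
from feature_sketch import (distributed_feature_extraction, store_feature_set,
                            build_data_stream, prefetch_to_gpu)

def prepare_training_data(raw_samples):
    """S801-S805: distributed selection and encoding, then compression and storage."""
    feature_sets = distributed_feature_extraction(raw_samples)
    return [store_feature_set(fs, i) for i, fs in enumerate(feature_sets)]

def load_training_data(store_paths, train_step, batch_size=256):
    """S806-S811, sketched sequentially; the multi-process variant is shown earlier."""
    marked = []
    for path in store_paths:                                        # S806-S807
        with open(path, "rb") as f:
            feature_set = pickle.loads(zstd.ZstdDecompressor().decompress(f.read()))
        marked.extend((None, encoded) for encoded in feature_set)   # S808: marking elided here
    stream = build_data_stream(marked, by_sample_time=False)        # S809
    batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]  # S810
    for gpu_batch in prefetch_to_gpu(batches):                      # S811
        train_step(gpu_batch)
```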
For the problem that training data loads slowly during deep learning model training, the embodiments of the present disclosure also propose an apparatus for improving the training data loading speed. It supports both sparse and dense feature data; it supports distributed execution, is easily scaled to run on more machines, and can handle larger data sets; the number of training data loading processes can be adjusted according to the size of the data volume to obtain higher IO throughput; and the feature extraction logic of the data does not depend on an OP implementation, is easy to implement, and can implement complicated feature extraction.
Fig. 9 is a block diagram of a data loading apparatus according to an exemplary embodiment. Referring to Fig. 9, the apparatus includes a feature extraction unit 900, an encoding unit 901, a parsing unit 902, and a generation unit 903.
Feature extraction unit 900, be configured as execute when preparing training data, according to feature extraction configuration information from At least one characteristic information is determined at least one sample that the raw data set of training deep learning model includes;
Coding unit 901 is configured as executing the type of coding according to characteristic information described at least one at least one The characteristic information is encoded;
Resolution unit 902 is configured as executing when loading training data, by solving the characteristic information after coding Analysis determines the type of coding of the characteristic information, and the characteristic information is marked according to the type of coding;
Generation unit 903 is configured as generating the number for training deep learning model according to the characteristic information after label According to stream.
In one possible implementation, the feature extraction unit 900 is specifically configured to:
divide the data in the raw data set into multiple data portions, where each portion includes at least one sample; and
execute the following processing for each data portion in parallel:
select at least one piece of characteristic information from the at least one sample included in the data portion according to the preconfigured feature extraction information, and take the characteristic information selected from the same data portion as one feature set (a minimal sketch of this parallel extraction follows).
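Below is a small, hypothetical sketch of this per-portion, parallel extraction, under the assumption that the raw data set is an in-memory list of sample dictionaries; extract_features, process_portion, and the ProcessPoolExecutor choice are illustrative only.

```python
# Hypothetical sketch of the feature extraction unit (900): the raw data set is split
# into portions and each portion is processed in parallel into one feature set.
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def extract_features(sample, feature_config):
    # Select the configured characteristic information from one sample (assumed helper).
    return {name: sample[name] for name in feature_config}

def process_portion(portion, feature_config):
    # All characteristic information selected from the same portion forms one feature set.
    return [extract_features(sample, feature_config) for sample in portion]

def build_feature_sets(raw_dataset, feature_config, split_count=8):
    # Divide the raw data set into portions, each containing at least one sample.
    portions = [raw_dataset[i::split_count] for i in range(split_count)]
    portions = [p for p in portions if p]
    with ProcessPoolExecutor() as pool:          # process each portion in parallel
        worker = partial(process_portion, feature_config=feature_config)
        return list(pool.map(worker, portions))
```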
In one possible implementation, the coding unit 901 is specifically configured to:
execute the following processing for each feature set in parallel:
determine the coding type of the at least one piece of characteristic information in the feature set; and
encode the at least one piece of characteristic information according to its coding type.
In one possible implementation, the coding unit 901 is specifically configured to:
determine the coding type corresponding to the at least one piece of characteristic information in the feature set according to the correspondence between characteristic information and coding types in the preconfigured feature extraction information (a minimal sketch follows).
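A small, hypothetical sketch of this coding-type lookup and per-feature-set encoding is given below; CODING_TYPES and the struct/JSON encoders are assumptions, not the coding types of this disclosure.

```python
# Hypothetical sketch of the coding unit (901): look up each feature's coding type in
# the preconfigured feature extraction information and encode it accordingly.
import json
import struct

CODING_TYPES = {"user_id": "int64", "score": "float32", "tags": "json"}  # assumed correspondence

def encode_value(value, coding_type):
    if coding_type == "int64":
        return struct.pack("<q", value)        # little-endian 64-bit integer
    if coding_type == "float32":
        return struct.pack("<f", value)        # little-endian 32-bit float
    return json.dumps(value).encode("utf-8")   # fallback: JSON-encoded bytes

def encode_feature_set(feature_set):
    encoded = []
    for name, value in feature_set.items():
        coding_type = CODING_TYPES[name]       # correspondence: characteristic information -> coding type
        encoded.append((name, coding_type, encode_value(value, coding_type)))
    return encoded
```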
In one possible implementation, the data loading device further includes:
a storage unit 904 configured to store each encoded feature set into a corresponding storage path, where any two encoded feature sets correspond to different storage paths.
In one possible implementation, the parsing unit 902 is specifically configured to:
when loading training data, allocate different storage paths to different data loading processes and call multiple data loading processes in parallel to determine the coding type of the characteristic information, where each call of a data loading process to determine the coding type of the characteristic information performs the following procedure:
use the called data loading process to read the encoded characteristic information from the feature set under the storage path allocated to that process, parse the read encoded characteristic information, and determine the coding type of the encoded characteristic information (a minimal sketch follows).
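A minimal sketch of this per-path storage and path-per-process loading is given below; the directory layout, the pickle format, and the multiprocessing.Pool usage are assumptions for illustration.

```python
# Hypothetical sketch of the storage unit (904) and parsing unit (902): every encoded
# feature set gets its own storage path, and each loading process parses one path.
import os
import pickle
from multiprocessing import Pool

def store_feature_set(encoded_feature_set, index, root="feature_sets"):
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, f"set_{index:05d}.bin")    # distinct path per feature set
    with open(path, "wb") as f:
        pickle.dump(encoded_feature_set, f)
    return path

def load_and_parse(path):
    # One loading process reads the feature set under its allocated path and recovers
    # each feature's coding type from the stored (name, coding_type, bytes) triples.
    with open(path, "rb") as f:
        triples = pickle.load(f)
    return [{"name": n, "coding_type": t, "encoded": b} for n, t, b in triples]

def parallel_load(paths, workers=4):
    with Pool(workers) as pool:                          # one allocated path per process call
        return pool.map(load_and_parse, paths)
```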
In one possible implementation, the generation unit 903 is specifically configured to:
determine the order among the marked pieces of characteristic information belonging to different samples according to sample time, and combine the marked characteristic information in that order to generate the data stream, where the sample time is the time at which a sample was generated; or
determine the order among the marked pieces of characteristic information belonging to different samples at random, and combine the marked characteristic information in that order to generate the data stream (a minimal sketch of both options follows).
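The two ordering options can be sketched as follows; the record layout with a sample_time field is an assumption for illustration.

```python
# Hypothetical sketch of the generation unit (903): order the marked characteristic
# information by sample time or at random, then chain it into one data stream.
import random

def generate_stream(marked_features, by_sample_time=True):
    if by_sample_time:
        ordered = sorted(marked_features, key=lambda f: f["sample_time"])
    else:
        ordered = list(marked_features)
        random.shuffle(ordered)
    for feature in ordered:            # the ordered combination forms the data stream
        yield feature
```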
In one possible implementation, the data loading device further includes:
a dividing unit 905 configured to, before training the deep learning model, divide the data stream into multiple sub-data streams, select at least one sub-data stream from the multiple sub-data streams, and
store the at least one sub-data stream into the device used for training the deep learning model.
Regarding the device in the above embodiments, the specific manner in which each unit performs its operations has been described in detail in the embodiments of the related method, and is not elaborated here.
Fig. 10 is a block diagram of a device 1000 for training-data loading shown according to an exemplary embodiment. The device includes:
a processor 1010; and
a memory 1020 for storing instructions executable by the processor 1010;
where the processor 1010 is configured to execute the instructions so as to implement the data loading method in the embodiments of the present disclosure.
In an exemplary embodiment, a non-volatile storage medium including instructions is also provided, for example the memory 1020 including instructions, and the above instructions can be executed by the processor 1010 of the device 1000 to complete the above method. Optionally, the storage medium may be a non-transitory computer-readable storage medium; for example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The embodiments of the present disclosure also provide a computer program product. When the computer program product runs on an electronic device, the electronic device is caused to execute and implement any one of the above data loading methods of the embodiments of the present disclosure, or any method that any one of the data loading methods may involve.
Those skilled in the art, after considering the specification and practicing the invention disclosed here, will readily think of other embodiments of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include common knowledge or conventional techniques in the art not documented in the disclosure. The description and examples are to be considered as illustrative only, and the true scope and spirit of the disclosure are indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data loading method, characterized by comprising:
when preparing training data, determining at least one piece of characteristic information, according to feature extraction configuration information, from at least one sample included in a raw data set used to train a deep learning model;
encoding the at least one piece of characteristic information according to the coding type of the at least one piece of characteristic information;
when loading training data, determining the coding type of the characteristic information by parsing the encoded characteristic information, and marking the characteristic information according to the coding type; and
generating, according to the marked characteristic information, a data stream used for training the deep learning model.
2. The data loading method according to claim 1, characterized in that the step of determining at least one piece of characteristic information from at least one sample included in the raw data set used to train the deep learning model according to the feature extraction configuration information comprises:
dividing the data in the raw data set into multiple data portions, where each portion includes at least one sample; and
executing the following processing for each data portion in parallel:
selecting at least one piece of characteristic information from the at least one sample included in the data portion according to preconfigured feature extraction information, and taking the characteristic information selected from the same data portion as one feature set.
3. The data loading method according to claim 2, characterized in that the step of encoding the at least one piece of characteristic information according to the coding type of the at least one piece of characteristic information comprises:
executing the following processing for each feature set in parallel:
determining the coding type of the at least one piece of characteristic information in the feature set; and
encoding the at least one piece of characteristic information according to its coding type.
4. The data loading method according to claim 3, characterized in that the step of determining the coding type of the at least one piece of characteristic information in the feature set comprises:
determining the coding type corresponding to the at least one piece of characteristic information in the feature set according to the correspondence between characteristic information and coding types in the preconfigured feature extraction information.
5. The data loading method according to claim 3, characterized in that, before the step of determining the coding type of the characteristic information by parsing the encoded characteristic information, the method further comprises:
storing each encoded feature set into a corresponding storage path, where any two encoded feature sets correspond to different storage paths.
6. The data loading method according to claim 5, characterized in that the step of determining the coding type of the characteristic information by parsing the encoded characteristic information comprises:
when loading training data, allocating different storage paths to different training-data loading processes and calling multiple training-data loading processes in parallel to determine the coding type of the characteristic information, where each call of a training-data loading process to determine the coding type of the characteristic information performs the following procedure:
using the called training-data loading process to read the encoded characteristic information from the feature set under the storage path allocated to that process, parsing the read encoded characteristic information, and determining the coding type of the encoded characteristic information.
7. The data loading method according to any one of claims 1 to 6, characterized in that the step of generating, according to the marked characteristic information, the data stream used for training the deep learning model comprises:
determining the order among the marked pieces of characteristic information belonging to different samples according to sample time, and combining the marked characteristic information in that order to generate the data stream, where the sample time is the time at which a sample was generated; or
determining the order among the marked pieces of characteristic information belonging to different samples at random, and combining the marked characteristic information in that order to generate the data stream.
8. A data loading device, characterized by comprising:
a feature extraction unit configured to, when preparing training data, determine at least one piece of characteristic information, according to feature extraction configuration information, from at least one sample included in a raw data set used to train a deep learning model;
a coding unit configured to encode the at least one piece of characteristic information according to the coding type of the at least one piece of characteristic information;
a parsing unit configured to, when loading training data, determine the coding type of the characteristic information by parsing the encoded characteristic information, and mark the characteristic information according to the coding type; and
a generation unit configured to generate, according to the marked characteristic information, a data stream used for training the deep learning model.
9. An electronic device, characterized by comprising:
a processor; and
a memory for storing instructions executable by the processor;
where the processor is configured to execute the instructions so as to implement the data loading method according to any one of claims 1 to 7.
10. A storage medium, characterized in that, when instructions in the storage medium are executed by a processor of a training-data loading electronic device, the training-data loading electronic device is enabled to perform the data loading method according to any one of claims 1 to 7.