CN111258965A - Data acquisition method and device, electronic equipment and storage medium - Google Patents

Data acquisition method and device, electronic equipment and storage medium

Info

Publication number
CN111258965A
CN111258965A (application number CN202010030600.8A)
Authority
CN
China
Prior art keywords
file
target
kernel
page cache
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010030600.8A
Other languages
Chinese (zh)
Other versions
CN111258965B (en)
Inventor
余虹建
李锦丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Juyuncube Technology Co ltd
Original Assignee
Shell Internet Beijing Security Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shell Internet Beijing Security Technology Co Ltd
Priority to CN202010030600.8A
Publication of CN111258965A
Application granted
Publication of CN111258965B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose a data acquisition method and device, an electronic device, and a storage medium in the field of computer technology, which can effectively improve the speed at which training data is acquired during model training. The data acquisition method comprises the following steps: determining the size relationship between the data storage space required by a training data set and the remaining memory space; when the data storage space is larger than the remaining memory space, selecting at least one file in the training data set as a target file according to a preset policy; and after the target file is read for the first time, retaining the target file in the kernel page cache so that it can be obtained from the page cache when it is read again. The method is suitable for machine-learning model training.

Description

Data acquisition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data acquisition method and apparatus, an electronic device, and a storage medium.
Background
In recent years, artificial intelligence technology has been applied ever more widely in industry and daily life. Machine learning, an important branch of artificial intelligence, can derive a reasonably good mathematical model from a large amount of training data and thereby imitate human reasoning.
However, because the amount of data required for model training is huge, often on the order of tens of millions of files, the speed at which training data can be read becomes an important factor in model-training efficiency.
For the problem of slow training-data reads in model training, no effective solution exists in the related art.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data acquisition method, an apparatus, an electronic device, and a storage medium, which can effectively improve the acquisition speed of training data in model training.
In a first aspect, an embodiment of the present invention provides a data acquisition method, including:
determining the size relation between the data storage space required by the training data set and the residual space of the memory;
under the condition that the data storage space is larger than the residual memory space, selecting at least one file in the training data set as a target file according to a preset strategy;
and after the target file is read for the first time, keeping the target file in a page cache of a kernel so as to obtain the target file from the page cache of the kernel when the target file is read again in the future.
Optionally, the preset policy includes:
taking the file with the file size smaller than a first threshold value in the training data set as the target file;
alternatively,
and determining the target files according to the file sizes of the files in the training data set and the residual memory space, so that the number of the target files is larger than a second threshold value, and/or the residual memory space is smaller than a third threshold value after the target files are kept in a page cache of a kernel.
Optionally, after the target file is read for the first time, the step of retaining the target file in the page cache of the kernel includes:
reading a first file from the training dataset;
determining whether the first file is the target file read for the first time;
and under the condition that the first file is the target file read for the first time, adding a preset mark to the first file, so that the virtual file system VFS keeps the first file in a page cache of a kernel according to the preset mark.
Optionally, after the target file is retained in the page cache of the kernel, the method further includes:
receiving an instruction to read a second file from the training dataset;
searching the second file in a page cache of a kernel;
under the condition that the second file is found, obtaining the second file from the page cache of the kernel so as to perform model training by using the second file;
and under the condition that the second file is not found, obtaining the second file from a remote server so as to perform model training by using the second file.
Optionally, the method further includes:
caching the training data set in a local hard disk;
and under the condition that the second file is not found, obtaining the second file from a local hard disk so as to perform model training by using the second file.
Optionally, before determining the size relationship between the data storage space required by the training data set and the remaining memory space, the method further includes: and clearing the memory.
In a second aspect, an embodiment of the present invention further provides a data acquisition apparatus, including:
the determining unit is used for determining the size relationship between the data storage space required by the training data set and the residual memory space;
the selection unit is used for selecting at least one file in the training data set as a target file according to a preset strategy under the condition that the data storage space is larger than the residual memory space;
and the reservation unit is used for reserving the target file in a page cache of a kernel after the target file is read for the first time so as to obtain the target file from the page cache of the kernel when the target file is read again in the future.
Optionally, the preset policy includes:
taking the file with the file size smaller than a first threshold value in the training data set as the target file;
alternatively,
and determining the target files according to the file sizes of the files in the training data set and the residual memory space, so that the number of the target files is larger than a second threshold value, and/or the residual memory space is smaller than a third threshold value after the target files are kept in a page cache of a kernel.
Optionally, the reservation unit includes:
a reading module for reading a first file from the training dataset;
the determining module is used for determining whether the first file is the target file read for the first time;
and the adding module is used for adding a preset mark to the first file under the condition that the first file is the target file read for the first time, so that the virtual file system VFS can keep the first file in a page cache of a kernel according to the preset mark.
Optionally, the apparatus further comprises:
a receiving unit, configured to receive an instruction to read a second file from the training data set after the target file is retained in a page cache of a kernel;
the searching unit is used for searching the second file in a page cache of a kernel;
the obtaining unit is used for obtaining the second file from the page cache of the kernel under the condition that the second file is found so as to perform model training by utilizing the second file; and under the condition that the second file is not found, obtaining the second file from a remote server so as to perform model training by using the second file.
Optionally, the apparatus further comprises:
the hard disk cache unit is used for caching the training data set in a local hard disk;
the obtaining unit is further configured to obtain the second file from a local hard disk under the condition that the second file is not found, so as to perform model training by using the second file.
Optionally, the apparatus further includes a clearing unit, configured to clear the memory before determining a size relationship between a data storage space required by the training data set and a remaining memory space.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, and is used for executing any data acquisition method provided by the embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement any one of the data acquisition methods provided by the embodiments of the present invention.
The data acquisition method, device, electronic device and storage medium provided by the embodiments of the invention can determine the size relationship between the data storage space required by a training data set and the remaining memory space, select at least one file in the training data set as a target file according to a preset policy when the data storage space is larger than the remaining memory space, and retain the target file in the kernel page cache after it is read for the first time, so that the target file can be obtained from the page cache when it is read again. Thus, when the remaining memory space cannot hold all the data of the training data set, whether data stays in the cache is no longer decided by the default file read frequency, under which a file would almost never be hit because its prior read frequency is too low. Instead, the default caching rule is actively overridden, and at least one target file is selected from the training data set for caching according to the preset policy, which effectively improves the cache hit rate and thereby the speed at which training data is acquired during model training.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a data acquisition method provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a data acquisition apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a reservation unit in the data acquisition apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data acquisition apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data acquisition apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a data acquisition apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In machine learning, model training requires on the one hand a computer with strong computing power and on the other hand enough data samples for the computer to learn from. Model training is a process of training on a large amount of data with continuous iterative optimization. During this process, the server reads data from the data set in batches according to certain rules for model training; after all data in the data set have been read for one round, the model parameters may be adjusted, and a second round of reading and model training follows. This may repeat dozens or hundreds of times, or more. Because the amount of data required for model training is huge, a single data set can often comprise tens of millions of files, so the interval between two reads of any given file is long and each file's read frequency stays low. Consequently, a caching mechanism that retains files according to read frequency cannot exploit the advantages of the cache, and across many rounds of data reading and model training the external storage device is accessed frequently to obtain training data, making data acquisition slow.
Therefore, embodiments of the present invention provide a data acquisition method, an apparatus, an electronic device, and a storage medium, which can effectively improve the acquisition speed of training data in model training.
In a first aspect, embodiments of the present invention provide a data acquisition method, which can effectively improve the acquisition speed of training data in model training.
As shown in fig. 1, a data acquisition method provided by an embodiment of the present invention may include:
s11, determining the size relation between the data storage space required by the training data set and the residual space of the memory;
In particular, the model training server may read the training data file by file. All training data required for one model training task may form a data set. The amount of data required for model training is huge; a data set may often comprise tens of millions of files, and the storage space it occupies is correspondingly large.
Optionally, before reading the training data, the model training server may obtain the data size of the data set for this training run by reading header information or through other interaction, that is, the size of the data storage space the data set needs to occupy, for example 132G or 60G. Knowing the data storage space required by the training data set, it can be compared with the computer's current remaining memory space. For example, if the current remaining memory space is 20G and the training data set requires 35G, the required data storage space is larger than the remaining space the memory can provide, i.e., the training data set cannot be stored entirely in memory.
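The comparison in this step can be sketched in a few lines. The function name and the POSIX `sysconf` query below are illustrative assumptions, not part of the patent:

```python
import os

def dataset_fits_in_memory(dataset_bytes, available_bytes=None):
    """Return True if the training data set fits entirely in memory."""
    if available_bytes is None:
        # Query currently available physical memory (POSIX/Linux only).
        available_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return dataset_bytes <= available_bytes

G = 1024 ** 3
# The example from the text: a 35G data set against 20G of free memory.
print(dataset_fits_in_memory(35 * G, 20 * G))  # False: partial caching needed
```

When the check returns False, the cache-scheduling path of steps S12 and S13 is taken.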
S12, under the condition that the data storage space is larger than the residual memory space, selecting at least one file in the training data set as a target file according to a preset strategy;
In this step, when the data storage space is larger than the remaining memory space, the default strategy of caching files according to read frequency is actively overridden, and one or more files in the training data set may be selected as target files according to the preset policy.
Optionally, the preset policy may be set according to specific needs, as long as some of the files in the training data set can be retained in the cache without being evicted for having too low a read frequency, which would make the cache hit rate too low; the embodiments of the invention do not limit this. For example, the preset policy may select files under a preset path in the training data set as target files, or files with preset file names, or files whose sizes fall within a preset interval, and so on.
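A minimal sketch of the size-based variant of the preset policy follows; the function name and the dict standing in for the data set's file listing are hypothetical:

```python
M = 1024 ** 2

def select_targets_by_size(file_sizes, first_threshold):
    """Preset-policy sketch: every file smaller than the first threshold
    becomes a target file. `file_sizes` maps file name -> size in bytes."""
    return [name for name, size in file_sizes.items() if size < first_threshold]

dataset = {"a.jpg": 2 * M, "b.jpg": 48 * M, "c.bin": 300 * M}
print(select_targets_by_size(dataset, 50 * M))  # ['a.jpg', 'b.jpg']
```

The path-based and name-based variants mentioned in the text would only change the predicate inside the comprehension.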
S13, after the target file is read for the first time, the target file is kept in the page cache of the kernel, so that the target file can be obtained from the page cache of the kernel when the target file is read again in the future.
In this step, when the data storage space is larger than the remaining memory space, the default strategy of caching files according to read frequency is actively overridden, and the target file selected from the training data set is retained in the kernel's page cache.
The data acquisition method provided by the embodiments of the invention can determine the size relationship between the data storage space required by a training data set and the remaining memory space, select at least one file in the training data set as a target file according to a preset policy when the data storage space is larger than the remaining memory space, and retain the target file in the kernel page cache after it is read for the first time, so that the target file can be obtained from the page cache when it is read again. Thus, when the remaining memory space cannot hold all the data of the training data set, whether data stays in the cache is no longer decided by the default file read frequency, under which a file would almost never be hit because its prior read frequency is too low. Instead, the default caching rule is actively overridden, and at least one target file is selected from the training data set for caching according to the preset policy, which effectively improves the cache hit rate and thereby the speed at which training data is acquired during model training.
Optionally, in step S11, the relationship between the data storage space required by the training data set and the remaining memory space falls into two cases. In the first case, the data storage space is less than or equal to the remaining memory space; the training data set can then be loaded entirely into memory, each read of training data only needs to access memory, and no cache scheduling is triggered. In the second case, the data storage space is larger than the remaining memory space, so only part of the training data set can be cached in memory, and cache scheduling is triggered.
Optionally, in order to cache more files, in an embodiment of the present invention, before determining the size relationship between the data storage space required by the training data set and the remaining memory space in step S11, the data obtaining method provided in the embodiment of the present invention may further include: and clearing the memory.
When the data storage space is larger than the remaining memory space, in order to avoid files being cached solely according to their read frequency, the embodiments of the invention may actively intervene in cache scheduling. Specifically, at least one file may be selected as a target file from the training data set according to the preset policy in step S12, so that the target file can be cached in step S13.
In an embodiment of the present invention, model training requires N rounds of reading over the data set, that is, all data in the data set are read at least N times. To minimize the communication time spent reading data, the preset policy may include taking files in the training data set whose size is smaller than a first threshold as target files. Selecting smaller files as cache targets does not occupy too much cache space on the one hand, and on the other hand allows a large number of files to be cached, effectively reducing the probability of having to read files from external storage and thus improving data acquisition speed. The first threshold can be set and adjusted flexibly as needed, for example 10M or 50M.
Optionally, in another embodiment of the present invention, the preset policy may also include: determining the target files according to the file sizes of the files in the training data set and the remaining memory space, so that the number of target files is larger than a second threshold, and/or the remaining memory space is smaller than a third threshold after the target files are retained in the kernel page cache. That is, in this embodiment, the size of each file in the training data set and the remaining memory space may be obtained in advance, and as many files as possible are selected as target files for caching, and/or the memory space left over after the target files are cached is made as small as possible.
For example, in an embodiment of the present invention, if the remaining memory space is 9.5G and the training data set contains 100 files of 50M, 20 files of 200M, and 30 files of 1G, then the 100 files of 50M (5G) and the 20 files of 200M (4G) may be cached, caching 120 files in total; after caching, the remaining memory space is 9.5G - 5G - 4G = 0.5G. Optionally, in another embodiment of the present invention, 90 files of 50M (4.5G), 20 files of 200M (4G), and one 1G file (1G) may be cached instead, caching 111 files in total, with remaining memory space 9.5G - 4.5G - 4G - 1G = 0G.
In the embodiments of the invention, the limited remaining memory space is used to cache as many files as possible. For each read operation, if the data is not cached, extra communication time is needed to read it from the external storage device; if it is cached, that communication time is saved and the data is read directly from the cache, improving data acquisition speed.
In order to minimize unnecessary data communication time, in an embodiment of the present invention, the preset policy for selecting target files may be: the smaller each target file, the better, and the more target files, the better. For example, with a cache margin of 200G, the 50 smallest files in the training data set may be selected as target files, just filling the 200G margin. The selected target files may form a file-name list for lookup.
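The "smallest files first, as many as possible" policy described above can be sketched as a greedy selection. The function and file names are illustrative, and 1G is taken as 1000M to match the arithmetic in the text:

```python
def select_targets_greedy(file_sizes, cache_budget):
    """Greedy form of the preset policy: take the smallest files first until
    the cache budget is exhausted, maximising the number of cached files.
    Returns (target names, remaining budget)."""
    targets, remaining = [], cache_budget
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if size > remaining:
            break  # sorted ascending: every later file is at least this large
        targets.append(name)
        remaining -= size
    return targets, remaining

# The 9.5G example from the text, in megabytes (1G = 1000M as in the text):
sizes = {f"small{i}": 50 for i in range(100)}
sizes.update({f"mid{i}": 200 for i in range(20)})
sizes.update({f"big{i}": 1000 for i in range(30)})
targets, left = select_targets_greedy(sizes, 9500)
print(len(targets), left)  # 120 files cached, 500M of budget left
```

This reproduces the first variant of the text's example (120 cached files, 0.5G remaining); the second variant (111 files, 0G remaining) would need a knapsack-style search rather than a pure greedy pass.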
After the target file is selected, the target file may be cached in step S13. Optionally, in step S13, after the target file is read for the first time, the step of keeping the target file in the page cache of the kernel may specifically include:
reading a first file from the training dataset;
determining whether the first file is the target file read for the first time;
and under the condition that the first file is the target file read for the first time, adding a preset mark to the first file, so that the virtual file system VFS keeps the first file in a page cache of a kernel according to the preset mark.
Specifically, in the Linux system, the file caching mechanism is managed directly by the virtual file system (VFS). To override the default policy of caching files according to read frequency, in an embodiment of the present invention, when the first file is the target file being read for the first time, a preset mark may be added to the first file, so that the VFS retains the first file in the kernel page cache according to that mark.
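The patent does not disclose which kernel interface implements the "preset mark". As a rough userspace approximation on Linux, one can at least hint the kernel to load a file's pages into the page cache via `posix_fadvise`; unlike the described VFS mark, this hint is advisory only and does not pin pages against later eviction:

```python
import os

def mark_for_page_cache(path):
    """Userspace approximation of the patent's "preset mark": ask the
    kernel to read the whole file into the page cache.

    POSIX_FADV_WILLNEED requests readahead of the given range; it does
    NOT guarantee the pages stay resident, so this is only a sketch of
    the intent, not the patented mechanism."""
    fd = os.open(path, os.O_RDONLY)
    try:
        length = os.fstat(fd).st_size
        os.posix_fadvise(fd, 0, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

A hard in-kernel retention guarantee would instead require something like `mlock` on a mapping of the file, or the kernel-side marking the patent describes.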
Further, after the target file is retained in the page cache of the kernel, the data obtaining method provided in the embodiment of the present invention may further include:
receiving an instruction to read a second file from the training dataset;
searching the second file in a page cache of a kernel;
under the condition that the second file is found, obtaining the second file from the page cache of the kernel so as to perform model training by using the second file;
and under the condition that the second file is not found, obtaining the second file from a remote server so as to perform model training by using the second file.
For example, in one embodiment of the present invention, assume a data set has fifteen million pictures and requires 100 rounds of data input and model training, i.e., the data set is input repeatedly 100 times. In each round of training, the picture data may be read in batches, and the amount of data read per batch can be set as needed. During the first round over the fifteen million pictures, every batch is being read for the first time, so there is no question of hitting the kernel page cache; but this round must prepare the cache for the following 99 rounds of training. That is, during this round, the pictures with the smallest data sizes may be retained in the kernel page cache according to the preset policy. From the second round onward, the kernel page cache is searched first, and any file not found there is read from the remote server.
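The multi-round read pattern above can be sketched as follows. The dict standing in for the kernel page cache and the `fetch_remote` callable are illustrative stand-ins, not the patented kernel mechanism:

```python
def run_training_reads(files, targets, rounds, fetch_remote):
    """Multi-round read sketch: round 1 fetches every file remotely and
    retains the target files in a dict simulating the kernel page cache;
    later rounds hit the cache for targets and go remote for the rest.
    Returns the total number of remote reads."""
    page_cache = {}
    remote_reads = 0
    for _ in range(rounds):
        for name in files:
            if name in page_cache:
                continue  # cache hit: no network round trip
            data = fetch_remote(name)
            remote_reads += 1
            if name in targets:
                page_cache[name] = data  # retained after its first read
    return remote_reads

# 5 files, 2 of them targets, 3 rounds: 5 + 3 + 3 = 11 remote reads
# instead of 15 with no caching at all.
reads = run_training_reads(["a", "b", "c", "d", "e"], {"a", "b"}, 3,
                           lambda name: b"data")
print(reads)  # 11
```

With the patent's numbers (100 rounds, millions of files), the saving from the 99 cached rounds dominates the one-time cost of the first round.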
Optionally, in an embodiment of the present invention, in order to further save communication time for data reading, the data obtaining method provided in the embodiment of the present invention may further include:
caching the training data set in a local hard disk;
and under the condition that the second file is not found, obtaining the second file from a local hard disk so as to perform model training by using the second file.
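With the local hard-disk copy added, the read path becomes a three-tier lookup: kernel page cache, then local disk, then remote server. A sketch under the same illustrative stand-ins as above:

```python
import os

def read_file_tiered(name, page_cache, local_dir, fetch_remote):
    """Tiered read sketch: simulated kernel page cache first, then the
    local hard-disk copy of the data set, then the remote server.
    `page_cache`, `local_dir` and `fetch_remote` are hypothetical names."""
    if name in page_cache:
        return page_cache[name]           # fastest tier: memory
    local_path = os.path.join(local_dir, name)
    if os.path.exists(local_path):        # middle tier: local hard disk
        with open(local_path, "rb") as f:
            return f.read()
    return fetch_remote(name)             # slowest tier: remote server
```

Each tier only pays the cost of the next one on a miss, which is why caching the data set locally further reduces communication time.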
In a second aspect, an embodiment of the present invention further provides a data acquisition apparatus, which can effectively improve the acquisition speed of training data in model training.
As shown in fig. 2, a data acquisition apparatus provided by an embodiment of the present invention may include:
a determining unit 21, configured to determine a size relationship between a data storage space required by the training data set and a remaining memory space;
a selecting unit 22, configured to select at least one file in the training data set as a target file according to a preset policy when the data storage space is greater than the remaining memory space;
a retaining unit 23, configured to retain the target file in a page cache of a kernel after the target file is read for the first time, so as to obtain the target file from the page cache of the kernel when the target file is read again in the future.
The data acquisition device provided by the embodiments of the invention can determine the size relationship between the data storage space required by a training data set and the remaining memory space, select at least one file in the training data set as a target file according to a preset policy when the data storage space is larger than the remaining memory space, and retain the target file in the kernel page cache after it is read for the first time, so that the target file can be obtained from the page cache when it is read again. Thus, when the remaining memory space cannot hold all the data of the training data set, whether data stays in the cache is no longer decided by the default file read frequency, under which a file would almost never be hit because its prior read frequency is too low. Instead, the default caching rule is actively overridden, and at least one target file is selected from the training data set for caching according to the preset policy, which effectively improves the cache hit rate and thereby the speed at which training data is acquired during model training.
Optionally, the preset policy may include:
taking the file with the file size smaller than a first threshold value in the training data set as the target file;
alternatively,
and determining the target files according to the file sizes of the files in the training data set and the residual memory space, so that the number of the target files is larger than a second threshold value, and/or the residual memory space is smaller than a third threshold value after the target files are kept in a page cache of a kernel.
Alternatively, as shown in fig. 3, the retaining unit 23 may include:
a reading module 231, configured to read a first file from the training dataset;
a determining module 232, configured to determine whether the first file is the target file read for the first time;
an adding module 233, configured to add a preset mark to the first file when the first file is the target file read for the first time, so that the virtual file system VFS retains the first file in a page cache of a kernel according to the preset mark.
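The patent's "preset mark" is applied at the VFS layer inside the kernel. From userspace, the closest widely available approximation is an advisory hint via `posix_fadvise`; the sketch below (helper name and structure are assumptions, not the patented mechanism) marks a target file on its first read so the kernel is encouraged to keep its pages cached:

```python
import os

def read_and_pin(path, pinned, target_paths):
    """Read a file; on the first read of a target file, hint the kernel
    to keep its pages in the page cache.

    pinned: set of paths already marked (mutated in place).
    target_paths: paths selected by the preset policy.
    Note: POSIX_FADV_WILLNEED is advisory only -- unlike the patent's
    kernel-level preset mark, it does not guarantee retention.
    """
    with open(path, "rb") as f:
        data = f.read()
        if path in target_paths and path not in pinned:
            os.posix_fadvise(f.fileno(), 0, len(data),
                             os.POSIX_FADV_WILLNEED)
            pinned.add(path)
    return data
```

A kernel-level implementation, as described here, would instead check the mark in the VFS read path and exempt the marked pages from normal page-cache eviction.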
Optionally, as shown in fig. 4, the data acquiring apparatus provided in the embodiment of the present invention may further include:
a receiving unit 24, configured to receive an instruction to read a second file from the training data set after the target file is retained in a page cache of a kernel;
a searching unit 25, configured to search for the second file in a page cache of a kernel;
an obtaining unit 26, configured to obtain the second file from the page cache of the kernel when the second file is found, so as to perform model training by using the second file; and under the condition that the second file is not found, obtaining the second file from a remote server so as to perform model training by using the second file.
Optionally, as shown in fig. 5, the data acquiring apparatus provided in the embodiment of the present invention may further include:
a hard disk cache unit 27, configured to cache the training data set in a local hard disk;
the obtaining unit 26 is further configured to obtain the second file from a local hard disk under the condition that the second file is not found, so as to perform model training by using the second file.
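The lookup order of the receiving, searching and obtaining units (page cache, then local hard disk, then remote server) can be modeled as a simple tiered read path. In this sketch the tiers are plain dicts and a callback (all names hypothetical); in the actual device the first tier is the kernel's page cache:

```python
def fetch_file(path, page_cache, local_disk, fetch_remote):
    """Tiered read path for a training file.

    page_cache, local_disk: dicts standing in for the kernel page cache
    and the local hard-disk copy of the training data set.
    fetch_remote: callback invoked only on a cold miss.
    """
    if path in page_cache:      # fastest: already resident in memory
        return page_cache[path]
    if path in local_disk:      # miss in memory, hit on the local disk
        return local_disk[path]
    return fetch_remote(path)   # cold miss: go to the remote server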
Optionally, as shown in fig. 6, the data acquiring apparatus according to the embodiment of the present invention may further include an emptying unit 28, configured to empty the memory before determining a size relationship between the data storage space required by the training data set and the remaining memory space.
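On Linux, the emptying unit's "clear the memory" step can be approximated by dropping the kernel caches before measuring the remaining memory. Since writing to `/proc/sys/vm/drop_caches` requires root, the helper below (a hypothetical sketch, not the patented mechanism) only builds the command rather than executing it:

```python
def drop_caches_command(level=3):
    """Return the shell command that clears Linux caches.

    level 1 drops the page cache, level 2 drops dentries and inodes,
    level 3 drops both. `sync` runs first so dirty pages are flushed
    to disk before clean pages are dropped.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    return f"sync && echo {level} > /proc/sys/vm/drop_caches"
```

Clearing the caches first gives the determining unit an accurate view of the memory actually available for retaining target files.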
In a third aspect, an embodiment of the present invention further provides an electronic device, which can effectively improve the acquisition speed of training data in model training.
As shown in fig. 7, an electronic device provided in an embodiment of the present invention may include: a housing 51, a processor 52, a memory 53, a circuit board 54 and a power supply circuit 55, wherein the circuit board 54 is arranged inside the space enclosed by the housing 51, and the processor 52 and the memory 53 are arranged on the circuit board 54; the power supply circuit 55 supplies power to each circuit or device of the electronic apparatus; the memory 53 stores executable program code; and the processor 52 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 53, so as to execute the data acquisition method provided by any of the foregoing embodiments.
For specific execution processes of the above steps by the processor 52 and further steps executed by the processor 52 by running the executable program code, reference may be made to the description of the foregoing embodiments, and details are not described herein again.
The above electronic devices exist in a variety of forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include smart phones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones.
(2) An ultra-mobile personal computer device: such equipment belongs to the category of personal computers, has computing and processing functions, and generally also has mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., the iPad.
(3) A portable entertainment device: such devices can display and play multimedia content. This type of device includes audio and video players (e.g., the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) A server: a device that provides computing services. A server comprises a processor, a hard disk, memory, a system bus, and the like; its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements on processing capacity, stability, reliability, security, scalability, and manageability.
(5) Other electronic devices with a data interaction function.
Accordingly, an embodiment of the present invention further provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs can be executed by one or more processors to implement any one of the data acquisition methods provided in the foregoing embodiments, so that corresponding technical effects can also be achieved, which have been described in detail above and are not described herein again.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
For convenience of description, the above devices are described separately in terms of functional units/modules. Of course, when implementing the invention, the functionality of the units/modules may be realized in one or more pieces of software and/or hardware.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of data acquisition, comprising:
determining the size relationship between the data storage space required by the training data set and the remaining memory space;
under the condition that the data storage space is larger than the residual memory space, selecting at least one file in the training data set as a target file according to a preset strategy;
and after the target file is read for the first time, keeping the target file in a page cache of a kernel so as to obtain the target file from the page cache of the kernel when the target file is read again in the future.
2. The method of claim 1, wherein the preset policy comprises:
taking the file with the file size smaller than a first threshold value in the training data set as the target file;
or,
determining the target files according to the file sizes of the files in the training data set and the remaining memory space, so that the number of the target files is larger than a second threshold value, and/or the remaining memory space after the target files are kept in a page cache of a kernel is smaller than a third threshold value.
3. The method of claim 1, wherein the saving the target file in a page cache of a kernel after the target file is read for the first time comprises:
reading a first file from the training dataset;
determining whether the first file is the target file read for the first time;
and under the condition that the first file is the target file read for the first time, adding a preset mark to the first file, so that the virtual file system VFS keeps the first file in a page cache of a kernel according to the preset mark.
4. The method of claim 1, wherein after the target file is retained in a page cache of a kernel, the method further comprises:
receiving an instruction to read a second file from the training dataset;
searching the second file in a page cache of a kernel;
under the condition that the second file is found, obtaining the second file from the page cache of the kernel so as to perform model training by using the second file;
and under the condition that the second file is not found, obtaining the second file from a remote server so as to perform model training by using the second file.
5. The method of claim 4, further comprising:
caching the training data set in a local hard disk;
and under the condition that the second file is not found, obtaining the second file from a local hard disk so as to perform model training by using the second file.
6. The method of any of claims 1 to 5, wherein prior to determining the size relationship between the data storage space required for the training data set and the remaining memory space, the method further comprises: and clearing the memory.
7. A data acquisition apparatus, comprising:
the determining unit is used for determining the size relationship between the data storage space required by the training data set and the residual memory space;
the selection unit is used for selecting at least one file in the training data set as a target file according to a preset strategy under the condition that the data storage space is larger than the residual memory space;
and the reservation unit is used for reserving the target file in a page cache of a kernel after the target file is read for the first time so as to obtain the target file from the page cache of the kernel when the target file is read again in the future.
8. The apparatus of claim 7, wherein the preset policy comprises:
taking the file with the file size smaller than a first threshold value in the training data set as the target file;
or,
determining the target files according to the file sizes of the files in the training data set and the remaining memory space, so that the number of the target files is larger than a second threshold value, and/or the remaining memory space after the target files are kept in a page cache of a kernel is smaller than a third threshold value.
9. The apparatus of claim 7, wherein the reservation unit comprises:
a reading module for reading a first file from the training dataset;
the determining module is used for determining whether the first file is the target file read for the first time;
and the adding module is used for adding a preset mark to the first file under the condition that the first file is the target file read for the first time, so that the virtual file system VFS can keep the first file in a page cache of a kernel according to the preset mark.
10. The apparatus of claim 7, further comprising:
a receiving unit, configured to receive an instruction to read a second file from the training data set after the target file is retained in a page cache of a kernel;
the searching unit is used for searching the second file in a page cache of a kernel;
the obtaining unit is used for obtaining the second file from the page cache of the kernel under the condition that the second file is found so as to perform model training by utilizing the second file; and under the condition that the second file is not found, obtaining the second file from a remote server so as to perform model training by using the second file.
CN202010030600.8A 2020-01-10 2020-01-10 Data acquisition method and device, electronic equipment and storage medium Active CN111258965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010030600.8A CN111258965B (en) 2020-01-10 2020-01-10 Data acquisition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010030600.8A CN111258965B (en) 2020-01-10 2020-01-10 Data acquisition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111258965A true CN111258965A (en) 2020-06-09
CN111258965B CN111258965B (en) 2024-03-08

Family

ID=70950382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010030600.8A Active CN111258965B (en) 2020-01-10 2020-01-10 Data acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111258965B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966410A (en) * 2020-07-31 2020-11-20 龙芯中科技术有限公司 Startup processing method and device, electronic equipment and storage medium
CN112084017A (en) * 2020-07-30 2020-12-15 北京聚云科技有限公司 Memory management method and device, electronic equipment and storage medium
CN112783843A (en) * 2020-12-31 2021-05-11 北京聚云科技有限公司 Data reading method and device and electronic equipment
CN112905325A (en) * 2021-02-10 2021-06-04 山东英信计算机技术有限公司 Method, system and medium for distributed data cache accelerated training
CN113521745A (en) * 2021-06-17 2021-10-22 广州三七极耀网络科技有限公司 Data storage method, device and equipment of AI model training framework of FPS game

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180074956A1 (en) * 2016-09-09 2018-03-15 Alibaba Group Holding Limited Method, apparatus, and electronic device for modifying memory data of a virtual machine
CN108170376A (en) * 2017-12-21 2018-06-15 上海新案数字科技有限公司 The method and system that storage card is read and write
CN109992522A (en) * 2017-12-29 2019-07-09 广东欧珀移动通信有限公司 Application processing method and device, electronic equipment, computer readable storage medium
CN110516817A (en) * 2019-09-03 2019-11-29 北京华捷艾米科技有限公司 A kind of model training data load method and device


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084017A (en) * 2020-07-30 2020-12-15 北京聚云科技有限公司 Memory management method and device, electronic equipment and storage medium
CN112084017B (en) * 2020-07-30 2024-04-19 北京聚云科技有限公司 Memory management method and device, electronic equipment and storage medium
CN111966410A (en) * 2020-07-31 2020-11-20 龙芯中科技术有限公司 Startup processing method and device, electronic equipment and storage medium
CN111966410B (en) * 2020-07-31 2023-11-14 龙芯中科技术股份有限公司 Start-up processing method and device, electronic equipment and storage medium
CN112783843A (en) * 2020-12-31 2021-05-11 北京聚云科技有限公司 Data reading method and device and electronic equipment
CN112905325A (en) * 2021-02-10 2021-06-04 山东英信计算机技术有限公司 Method, system and medium for distributed data cache accelerated training
CN113521745A (en) * 2021-06-17 2021-10-22 广州三七极耀网络科技有限公司 Data storage method, device and equipment of AI model training framework of FPS game
CN113521745B (en) * 2021-06-17 2024-01-09 广州三七极耀网络科技有限公司 Data storage method, device and equipment of AI model training architecture of FPS game

Also Published As

Publication number Publication date
CN111258965B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111258965A (en) Data acquisition method and device, electronic equipment and storage medium
CN104808952B (en) data cache method and device
CN107391108B (en) Notification bar information correction method and device and electronic equipment
CN112087487B (en) Scheduling method and device of model training task, electronic equipment and storage medium
US20170308546A1 (en) File storage method and electronic device
CN112084017B (en) Memory management method and device, electronic equipment and storage medium
CN112473144A (en) Game resource data processing method and device
CN110652728A (en) Game resource management method and device, electronic equipment and storage medium
CN114372297A (en) Method and device for verifying file integrity based on message digest algorithm
CN111240843A (en) Data acquisition method and device, electronic equipment and storage medium
CN111737166B (en) Data object processing method, device and equipment
CN112085208A (en) Method and device for model training by using cloud
US10082956B2 (en) Method and apparatus for downloading data including a progress bar indicating progress of downloading
CN107291543B (en) Application processing method and device, storage medium and terminal
CN110633148A (en) System operation optimization method and device, electronic equipment and storage medium
CN112036133B (en) File storage method and device, electronic equipment and storage medium
CN111659125A (en) Game-based friend recommendation method and device and computer-readable storage medium
CN112612519B (en) Instruction fetching method and device, electronic equipment and storage medium
CN115048908A (en) Method and device for generating text directory
CN113946604A (en) Staged go teaching method and device, electronic equipment and storage medium
CN112035804A (en) Method and device for inserting watermark identification into document page, electronic equipment and storage medium
CN106528525B (en) Method and device for identifying cheating on ranking list
CN112100553A (en) Webpage configuration method and device, electronic equipment and storage medium
CN112036132A (en) Document header and footer editing method and device and electronic equipment
CN111753218A (en) Hotspot knowledge determination method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201023

Address after: Room 91, 5 / F, building 5, yard 30, Shixing street, Shijingshan District, Beijing 100041

Applicant after: Beijing juyuncube Technology Co.,Ltd.

Address before: 100041 Beijing, Shijingshan District Xing Xing street, building 30, No. 3, building 2, A-0071

Applicant before: Beijing Cheetah Mobile Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant