CN111258965B - Data acquisition method and device, electronic equipment and storage medium - Google Patents

Data acquisition method and device, electronic equipment and storage medium

Info

Publication number
CN111258965B
CN111258965B (application CN202010030600.8A)
Authority
CN
China
Prior art keywords
file
target
memory
training
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010030600.8A
Other languages
Chinese (zh)
Other versions
CN111258965A (en)
Inventor
余虹建
李锦丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Juyuncube Technology Co ltd
Original Assignee
Beijing Juyuncube Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Juyuncube Technology Co ltd filed Critical Beijing Juyuncube Technology Co ltd
Priority to CN202010030600.8A priority Critical patent/CN111258965B/en
Publication of CN111258965A publication Critical patent/CN111258965A/en
Application granted granted Critical
Publication of CN111258965B publication Critical patent/CN111258965B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files

Abstract

The embodiments of the invention disclose a data acquisition method, a data acquisition device, an electronic device and a storage medium, relate to the field of computer technology, and can effectively improve the speed at which training data is acquired in model training. The data acquisition method includes the following steps: determining the relative sizes of the storage space required by a training data set and the remaining memory space; when the required storage space is larger than the remaining memory space, selecting at least one file from the training data set as a target file according to a preset policy; and after the target file has been read for the first time, retaining the target file in the page cache of the kernel, so that subsequent reads of the target file are served from the page cache. The method is suitable for model training in machine learning.

Description

Data acquisition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data acquisition method, a data acquisition device, an electronic device, and a storage medium.
Background
In recent years, artificial intelligence technology has been increasingly used in industry and life. Machine learning is an important branch in the field of artificial intelligence, and can obtain an ideal mathematical model through a large amount of training data so as to simulate human thinking.
However, since the amount of data required for model training is enormous, often in the tens of millions of files, the reading speed of training data becomes an important factor affecting the model training efficiency.
No effective solution has yet been proposed in the related art for the slow reading of training data in model training.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a data acquisition method, apparatus, electronic device, and storage medium, which can effectively improve the acquisition speed of training data in model training.
In a first aspect, an embodiment of the present invention provides a data acquisition method, including:
determining the relative sizes of the storage space required by a training data set and the remaining memory space;
when the required storage space is larger than the remaining memory space, selecting at least one file from the training data set as a target file according to a preset policy;
and after the target file has been read for the first time, retaining the target file in the page cache of the kernel, so that subsequent reads of the target file are served from the page cache of the kernel.
Optionally, the preset policy includes:
taking a file with the file size smaller than a first threshold value in the training data set as the target file;
or,
and determining the target files according to the file sizes of the files in the training data set and the remaining memory space, such that the number of target files is larger than a second threshold, and/or the memory space remaining after the target files are retained in the page cache of the kernel is smaller than a third threshold.
Optionally, the retaining the target file in the page cache of the kernel after the target file is read for the first time includes:
reading a first file from the training dataset;
determining whether the first file is the target file read for the first time;
and adding a preset mark for the first file under the condition that the first file is the target file read for the first time, so that the virtual file system VFS keeps the first file in a page cache of a kernel according to the preset mark.
Optionally, after the target file is retained in the page cache of the kernel, the method further includes:
receiving an instruction for reading a second file from the training data set;
searching the second file in a page cache of the kernel;
under the condition that the second file is found, the second file is obtained from a page cache of the kernel so as to perform model training by using the second file;
and under the condition that the second file is not found, acquiring the second file from a remote server to perform model training by using the second file.
Optionally, the method further comprises:
caching the training data set on a local hard disk;
and under the condition that the second file is not found, acquiring the second file from a local hard disk so as to perform model training by using the second file.
Optionally, before determining the size relationship between the data storage space required by the training data set and the remaining memory space, the method further includes: clearing the memory.
In a second aspect, an embodiment of the present invention further provides a data acquisition apparatus, including:
the determining unit is used for determining the size relation between the data storage space required by the training data set and the memory residual space;
the selecting unit is used for selecting at least one file in the training data set as a target file according to a preset strategy under the condition that the data storage space is larger than the memory residual space;
and the retaining unit is used for retaining the target file in the page cache of the kernel after the target file is read for the first time, so that the target file can be acquired from the page cache of the kernel when the target file is read again in the future.
Optionally, the preset policy includes:
taking a file with the file size smaller than a first threshold value in the training data set as the target file;
or,
and determining the target files according to the file sizes of the files in the training data set and the memory residual space, so that the number of the target files is larger than a second threshold value, and/or the memory residual space is smaller than a third threshold value after the target files are reserved in a page cache of a kernel.
Optionally, the reservation unit includes:
the reading module is used for reading the first file from the training data set;
the determining module is used for determining whether the first file is the target file read for the first time;
and the adding module is used for adding a preset mark for the first file under the condition that the first file is the target file read for the first time, so that the virtual file system VFS can keep the first file in the page cache of the kernel according to the preset mark.
Optionally, the apparatus further includes:
a receiving unit, configured to receive an instruction of reading a second file from the training dataset after the target file is retained in a page buffer of a kernel;
the searching unit is used for searching the second file in the page cache of the kernel;
the acquisition unit is used for acquiring the second file from the page cache of the kernel under the condition that the second file is found so as to perform model training by using the second file; and under the condition that the second file is not found, acquiring the second file from a remote server to perform model training by using the second file.
Optionally, the apparatus further includes:
the hard disk caching unit is used for caching the training data set in a local hard disk;
the obtaining unit is further configured to obtain the second file from the local hard disk, so as to perform model training by using the second file, where the second file is not found.
Optionally, the apparatus further comprises a flushing unit for flushing the memory before determining the size relation between the data storage space required by the training data set and the remaining memory space.
In a third aspect, embodiments of the present invention further provide an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing any one of the data acquisition methods provided by the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing one or more programs executable by one or more processors to implement any of the data acquisition methods provided by the embodiments of the present invention.
According to the data acquisition method, device, electronic equipment and storage medium provided by the embodiments of the invention, the relative sizes of the storage space required by the training data set and the remaining memory space can be determined; when the required storage space is larger than the remaining memory space, at least one file is selected from the training data set as a target file according to a preset policy, and after the target file has been read for the first time it is retained in the kernel's page cache, so that subsequent reads of the target file are served from the page cache. When the remaining memory cannot hold the entire training data set, the default policy decides whether data stays in the cache according to file read frequency; because each file in the data set is read only once per training round, every file's read frequency stays too low for the file to remain cached, and reads almost never hit the cache. By actively intervening in this default caching rule and selecting at least one target file from the training data set for caching according to the preset policy, the cache hit rate is effectively improved, and with it the speed at which training data is acquired during model training.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a data acquisition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data acquisition device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a configuration of a retention unit in a data acquisition device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another structure of a data acquisition device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data acquisition device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another structure of a data acquisition device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In machine learning, model training requires, on the one hand, a computer with powerful computing capability and, on the other hand, enough data samples for the computer to learn from. Model training is a process of training on a large amount of data and iteratively optimizing the model. In this process, the server reads data from the data set in batches according to a certain rule for model training; after all the data in the data set has been read for one round, the model parameters are adjusted, and then the data set is read again for the next round of training. The number of rounds may be tens, hundreds, or even more. Because the amount of data required for model training is large, a data set can often include tens of millions of files, so the time interval between two successive reads of the same file is long, and the read frequency of each file stays at a low level. Consequently, a caching mechanism that caches files according to their read frequency cannot exploit the advantage of caching: throughout the multiple rounds of data reading and model training, the external storage device is accessed frequently to obtain the training data, and data acquisition is slow.
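The failure mode described above can be reproduced with a small simulation (an illustrative sketch, not part of the patent; all names are invented): when an LRU-style cache is smaller than a data set that is read cyclically round after round, every file is evicted before its next read, so the hit rate collapses to zero.

```python
from collections import OrderedDict

def lru_hit_rate(num_files, cache_size, rounds):
    """Simulate cyclic epoch reads of a data set through an LRU cache
    and return the fraction of reads served from the cache."""
    cache = OrderedDict()
    hits = reads = 0
    for _ in range(rounds):
        for f in range(num_files):
            reads += 1
            if f in cache:
                hits += 1
                cache.move_to_end(f)  # refresh recency on a hit
            else:
                cache[f] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return hits / reads

# Cache smaller than the data set: cyclic reads never hit.
print(lru_hit_rate(num_files=10, cache_size=5, rounds=3))   # → 0.0
# Cache large enough: only the first round misses.
print(lru_hit_rate(num_files=10, cache_size=10, rounds=3))
```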
Therefore, the embodiment of the invention provides a data acquisition method, a data acquisition device, electronic equipment and a storage medium, which can effectively improve the acquisition speed of training data in model training.
In a first aspect, an embodiment of the present invention provides a data acquisition method, which can effectively improve the speed of acquiring training data in model training.
As shown in fig. 1, the data acquisition method provided by the embodiment of the present invention may include:
s11, determining the size relation between the data storage space required by the training data set and the residual memory space;
specifically, the model training server may read the training data in a read file manner. All training data required for a model training task may form a data set (data set). The amount of data required for model training is enormous, alternatively, a data set can often include tens of millions of files, and the storage space occupied by a data set is also quite large.
Optionally, before reading the training data, the model training server may learn the size of the data set used in the current training, that is, the amount of storage space the data set needs to occupy (for example 132G or 60G), by reading file header information or through other interaction. Once the storage space required by the training data set is known, it can be compared with the computer's current remaining memory space. For example, if the current remaining memory space is 20G and the training data set requires 35G of storage space, then the required storage space is greater than the space the memory can currently provide, i.e., the training data set cannot be held in memory in its entirety.
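A minimal sketch of this comparison in step S11 (illustrative only; `parse_size`, the unit handling, and the function names are assumptions, not part of the patent):

```python
def parse_size(text):
    """Parse a human-readable size such as '35G' or '512M' into bytes."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    if text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)

def dataset_fits_in_memory(dataset_size, free_memory):
    """Step S11: compare the storage space the training data set needs
    with the memory space currently available."""
    return parse_size(dataset_size) <= parse_size(free_memory)

# The example from the description: a 35G data set cannot be fully
# held in 20G of free memory, so cache scheduling is triggered.
print(dataset_fits_in_memory("35G", "20G"))    # → False
print(dataset_fits_in_memory("132G", "140G"))  # → True
```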
S12, selecting at least one file as a target file in the training data set according to a preset strategy under the condition that the data storage space is larger than the memory residual space;
in this step, under the condition that the data storage space is larger than the remaining memory space, active intervention is performed on a default file cache policy according to the reading frequency, and one file or a plurality of files can be selected as target files in the training data set according to the preset policy.
Optionally, the preset policy may be set according to specific needs, as long as it keeps a portion of the files in the training data set in the cache so that they are not evicted, and the cache hit rate does not collapse, merely because their read frequency is low. For example, the preset policy may select files under a preset path in the training data set as target files, or files with preset file names, or files whose sizes fall within a preset interval, and so on.
S13, after the target file is read for the first time, the target file is reserved in the page cache of the kernel, so that the target file is acquired from the page cache of the kernel when the target file is read again in the future.
In this step, when the required storage space is larger than the remaining memory space, the default read-frequency-based file caching policy is actively overridden, and the target files selected from the training data set are retained in the page cache (page cache) of the kernel.
According to the data acquisition method provided by the embodiment of the invention, the relative sizes of the storage space required by the training data set and the remaining memory space can be determined; when the required storage space is larger than the remaining memory space, at least one file is selected from the training data set as a target file according to a preset policy, and after the target file has been read for the first time it is retained in the kernel's page cache, so that subsequent reads of the target file are served from the page cache. When the remaining memory cannot hold the entire training data set, the default policy decides whether data stays in the cache according to file read frequency; because each file in the data set is read only once per training round, every file's read frequency stays too low for the file to remain cached, and reads almost never hit the cache. By actively intervening in this default caching rule and selecting at least one target file from the training data set for caching according to the preset policy, the cache hit rate is effectively improved, and with it the speed at which training data is acquired during model training.
Optionally, in step S11, the size relationship between the data storage space required for training the data set and the remaining memory space may include two cases: in the first case, the data storage space is smaller than or equal to the memory residual space, at this time, the training data set can be completely loaded into the memory, and only needs to be read from the memory each time when the training data is read, at this time, the cache scheduling operation is not triggered; in the second case, the data storage space is larger than the remaining memory space, and only part of the data of the training data set can be cached in the memory, so that the cache scheduling operation can be triggered.
Optionally, in order to enable more files to be cached, in one embodiment of the present invention, before determining the size relationship between the data storage space required by the training data set and the remaining memory space in step S11, the data acquisition method provided by the embodiment of the present invention may further include: clearing the memory.
When the data storage space is larger than the memory residual space, in order to avoid the situation of caching the file according to the frequency of reading the file, in the embodiment of the invention, active intervention can be performed on the cache scheduling. Specifically, at least one file may be selected as a target file in the training dataset according to a preset policy in step S12, so as to cache the target file in step S13.
In the embodiment of the present invention, when model training is performed, N rounds of reading operations are performed on the data in the data set, that is, all the data in the data set is read at least N times. In order to minimize the communication time spent reading data, in one embodiment of the present invention, the preset policy may include taking files whose size is smaller than a first threshold in the training data set as the target files. Selecting files with smaller sizes as the target files to cache means that, on the one hand, not too much cache space is occupied and, on the other hand, a large number of files can be cached, which effectively reduces the probability of reading files from external storage and improves the data acquisition speed. The first threshold can be set and adjusted flexibly as needed, for example 10M or 50M.
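A minimal sketch of this first preset policy (the function name and the dictionary representation of the data set are invented for illustration):

```python
def select_small_files(files, first_threshold):
    """First preset policy: every file whose size is below the first
    threshold becomes a target file to be kept in the page cache.
    `files` maps file name -> size in bytes."""
    return {name for name, size in files.items() if size < first_threshold}

M = 1024**2
dataset = {"a.jpg": 2 * M, "b.jpg": 8 * M, "c.bin": 120 * M, "d.bin": 600 * M}
# With a first threshold of 10M, only the two small pictures qualify.
print(sorted(select_small_files(dataset, 10 * M)))  # → ['a.jpg', 'b.jpg']
```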
Optionally, in another embodiment of the present invention, the preset policy may also include: determining the target files according to the file sizes of the files in the training data set and the remaining memory space, such that the number of target files is larger than a second threshold, and/or the memory space remaining after the target files are retained in the page cache of the kernel is smaller than a third threshold. That is, in this embodiment, the sizes of the files in the training data set and the size of the remaining memory space may be known in advance, and as many files as possible are selected as target files for caching, and/or the memory space left over after the target files are cached is made as small as possible.
For example, in one embodiment of the present invention, if the remaining memory space is 9.5G and the training data set contains 100 files of 50M, 20 files of 200M, and 30 files of 1G, then the 100 files of 50M (5G) and the 20 files of 200M (4G) may be selected for caching, so that 120 files are cached and the memory space remaining afterwards is 9.5G - 5G - 4G = 0.5G. Alternatively, in another embodiment of the present invention, 90 files of 50M (4.5G), 20 files of 200M (4G), and one file of 1G may be cached, so that 111 files in total are cached and the remaining memory space is 9.5G - 4.5G - 4G - 1G = 0G.
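The smallest-first selection behind this worked example might be sketched as follows (an assumed greedy pass, not the patent's own algorithm; sizes are in megabytes with 1G = 1000M to match the arithmetic above):

```python
def fill_cache_greedily(sizes_mb, free_mb):
    """Pick target files smallest-first until no further file fits,
    maximising the number of files kept in the page cache."""
    chosen_mb, used = [], 0
    for size in sorted(sizes_mb):
        if used + size <= free_mb:
            chosen_mb.append(size)
            used += size
    return chosen_mb, free_mb - used

# Worked example from the description: 9.5G (9500M) of free memory,
# 100 files of 50M, 20 files of 200M, 30 files of 1G.
sizes = [50] * 100 + [200] * 20 + [1000] * 30
chosen, left = fill_cache_greedily(sizes, 9500)
print(len(chosen), left)  # → 120 files chosen, 500M (0.5G) left
```

Note that the second combination in the text, which fills memory to exactly 0G, is a subset-sum style choice rather than the output of this simple greedy pass.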
In the embodiment of the invention, more files are cached by using the residual space of the limited memory as much as possible, and for each data reading operation, if the data is not cached, additional communication time is required to be spent on reading the data in the external storage device, and if the data is cached, the communication time can be saved, and the data can be directly read from the cache, thereby improving the data acquisition speed.
In order to minimize unnecessary data communication time, in one embodiment of the present invention, the preset policy for selecting target files may be: prefer the files with the smallest sizes, so that the number of target files is as large as possible. For example, if the cache headroom is 200G, the 50 smallest files in the training data set may be selected as target files so that the 200G of cache headroom is just filled. The selected target files may form a list of file names for querying.
After the target file is selected, the target file may be cached in step S13. Optionally, in step S13, after the target file is read for the first time, the retaining the target file in the page cache of the kernel may specifically include:
reading a first file from the training dataset;
determining whether the first file is the target file read for the first time;
and adding a preset mark for the first file under the condition that the first file is the target file read for the first time, so that the virtual file system VFS keeps the first file in a page cache of a kernel according to the preset mark.
Specifically, in a Linux system, the file caching mechanism is managed directly by the virtual file system (VFS). In order to bypass the default policy of caching files according to read frequency, in an embodiment of the present invention, when the first file is a target file being read for the first time, a preset mark may be added to the first file, so that the VFS keeps the first file in the page cache of the kernel according to the preset mark.
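The intended effect of the preset mark can be modelled in user space as a cache whose pinned entries are exempt from eviction. This is purely illustrative: the real mechanism lives inside the Linux VFS, and this toy class and its FIFO eviction are assumptions, not part of the patent.

```python
class PinnedPageCache:
    """Toy model of the marking mechanism: files carrying the preset
    mark are pinned and never evicted; unmarked files are evicted
    oldest-first when the cache is over capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pinned = set()
        self.entries = []

    def read(self, name, mark=False):
        if name in self.entries:
            return "hit"
        if mark:                      # first read of a target file
            self.pinned.add(name)
        self.entries.append(name)
        while len(self.entries) > self.capacity:
            victims = [f for f in self.entries if f not in self.pinned]
            if not victims:
                break                 # everything pinned: nothing evictable
            self.entries.remove(victims[0])
        return "miss"

cache = PinnedPageCache(capacity=2)
cache.read("target.bin", mark=True)   # first read: cached and pinned
cache.read("big1.bin")
cache.read("big2.bin")                # evicts big1.bin, not target.bin
print(cache.read("target.bin"))       # → hit
```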
Further, after the target file is retained in the page cache of the kernel, the data acquisition method provided by the embodiment of the invention may further include:
receiving an instruction for reading a second file from the training data set;
searching the second file in a page cache of the kernel;
under the condition that the second file is found, the second file is obtained from a page cache of the kernel so as to perform model training by using the second file;
and under the condition that the second file is not found, acquiring the second file from a remote server to perform model training by using the second file.
For example, in one embodiment of the present invention, assume a data set of fifteen million pictures that is used for 100 rounds of data input and model training, i.e., the data set is fed in repeatedly for 100 rounds. In each training round, the picture data can be read in batches, and the amount of data read per batch may be set as needed. When the fifteen million pictures are read in the first round, each batch of data is being read for the first time, so there is no question of hitting the kernel's page cache; but the first round of reading must prepare for the remaining 99 rounds of training. That is, during the first round of reading, a batch of pictures with the smallest data volume can be retained in the page cache of the kernel according to the preset policy. From the second round of reading the fifteen million pictures onward, the page cache of the kernel is searched first, and if a picture is not in the page cache, it is read from the remote server.
Optionally, in an embodiment of the present invention, in order to further save communication time of data reading, the data acquisition method provided by the embodiment of the present invention may further include:
caching the training data set on a local hard disk;
and under the condition that the second file is not found, acquiring the second file from a local hard disk so as to perform model training by using the second file.
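The resulting tiered read path (kernel page cache, then the local hard disk copy of the data set, then the remote server) might be sketched as follows; the function name and the dictionary stand-ins are assumptions for illustration:

```python
def read_file(name, page_cache, local_disk, fetch_remote):
    """Tiered read path: try the kernel page cache first, then the
    local hard disk copy of the data set, and only then the remote
    server. `fetch_remote` stands in for the network call."""
    if name in page_cache:
        return page_cache[name], "page cache"
    if name in local_disk:
        return local_disk[name], "local disk"
    return fetch_remote(name), "remote server"

page_cache = {"small.jpg": b"..."}
local_disk = {"small.jpg": b"...", "big.bin": b"..."}
_, src = read_file("small.jpg", page_cache, local_disk, lambda n: b"")
print(src)  # → page cache
_, src = read_file("big.bin", page_cache, local_disk, lambda n: b"")
print(src)  # → local disk
_, src = read_file("huge.bin", page_cache, local_disk, lambda n: b"")
print(src)  # → remote server
```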
In a second aspect, an embodiment of the present invention further provides a data acquisition apparatus, which can effectively improve the speed of acquiring training data in model training.
As shown in fig. 2, the data acquisition device provided by the embodiment of the present invention may include:
a determining unit 21, configured to determine a size relationship between a data storage space required for training the data set and a remaining memory space;
a selecting unit 22, configured to select at least one file in the training dataset as a target file according to a preset policy when the data storage space is larger than the remaining memory space;
and a retaining unit 23, configured to retain the target file in the page cache of the kernel after the target file is read for the first time, so as to obtain the target file from the page cache of the kernel when the target file is read again in the future.
According to the data acquisition device provided by the embodiment of the invention, the relative sizes of the storage space required by the training data set and the remaining memory space can be determined; when the required storage space is larger than the remaining memory space, at least one file is selected from the training data set as a target file according to a preset policy, and after the target file has been read for the first time it is retained in the kernel's page cache, so that subsequent reads of the target file are served from the page cache. When the remaining memory cannot hold the entire training data set, the default policy decides whether data stays in the cache according to file read frequency; because each file in the data set is read only once per training round, every file's read frequency stays too low for the file to remain cached, and reads almost never hit the cache. By actively intervening in this default caching rule and selecting at least one target file from the training data set for caching according to the preset policy, the cache hit rate is effectively improved, and with it the speed at which training data is acquired during model training.
Optionally, the preset policy may include:
taking a file with the file size smaller than a first threshold value in the training data set as the target file;
or,
and determining the target files according to the file sizes of the files in the training data set and the memory residual space, so that the number of the target files is larger than a second threshold value, and/or the memory residual space is smaller than a third threshold value after the target files are reserved in a page cache of a kernel.
Alternatively, as shown in fig. 3, the reservation unit 23 may include:
a reading module 231 for reading the first file from the training dataset;
a determining module 232, configured to determine whether the first file is the target file read for the first time;
and an adding module 233, configured to add a preset flag to the first file if the first file is the target file read for the first time, so that the virtual file system VFS retains the first file in the page cache of the kernel according to the preset flag.
Optionally, as shown in fig. 4, the data acquisition device provided by the embodiment of the present invention may further include:
a receiving unit 24 for receiving an instruction to read a second file from the training dataset after the target file is held in the page buffer of the kernel;
a searching unit 25, configured to search the page cache of the kernel for the second file;
an obtaining unit 26, configured to obtain the second file from the page cache of the kernel, so as to perform model training by using the second file, where the second file is found; and under the condition that the second file is not found, acquiring the second file from a remote server to perform model training by using the second file.
Optionally, as shown in fig. 5, the data acquisition device provided by the embodiment of the present invention may further include:
a hard disk buffer unit 27, configured to buffer the training data set in a local hard disk;
the obtaining unit 26 is further configured to, when the second file is not found in the page cache, obtain the second file from the local hard disk so as to perform model training with the second file.
Optionally, as shown in fig. 6, the data acquisition device provided by the embodiment of the present invention may further include a flushing unit 28, configured to flush the memory before determining the size relationship between the data storage space and the memory remaining space required for training the data set.
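Determining the size relation between the data set's storage footprint and the remaining memory (ideally after the flushing unit has emptied the caches) might be done on Linux by reading `MemAvailable` from `/proc/meminfo`. The sketch below parses meminfo-style text; reading the real file and the exact comparison rule are assumptions, not specified in the patent.

```python
def mem_available_bytes(meminfo_text):
    """Parse 'MemAvailable' (reported in kB) from /proc/meminfo-style
    text. On Linux the text would come from open('/proc/meminfo').read()."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable not found")


def dataset_fits_in_memory(dataset_bytes, meminfo_text):
    """The size relation: does the data set's storage footprint fit
    within the remaining memory?"""
    return dataset_bytes <= mem_available_bytes(meminfo_text)
```

When this returns `False`, the selecting unit's preset strategy would be applied to choose target files for caching.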
In a third aspect, an embodiment of the present invention further provides an electronic device, which can effectively improve an acquisition speed of training data in model training.
As shown in fig. 7, an electronic device provided by an embodiment of the present invention may include: a housing 51, a processor 52, a memory 53, a circuit board 54, and a power supply circuit 55, wherein the circuit board 54 is arranged in a space enclosed by the housing 51, and the processor 52 and the memory 53 are arranged on the circuit board 54; the power supply circuit 55 is used for supplying power to each circuit or device of the electronic apparatus; the memory 53 is used for storing executable program code; and the processor 52 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 53, so as to execute the data acquisition method provided in any of the foregoing embodiments.
The specific implementation of the above steps by the processor 52 and the further implementation of the steps by the processor 52 through the execution of the executable program code may be referred to the description of the foregoing embodiments, and will not be repeated here.
Such electronic devices exist in a variety of forms including, but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include: smart phones (e.g., iPhone), multimedia phones, feature phones, low-end phones, etc.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include: audio and video players (e.g., iPod), handheld game consoles, electronic book readers, smart toys, and portable car navigation devices.
(4) Servers: a server includes a processor, hard disk, memory, system bus, etc., similar in architecture to a general-purpose computer, but since it must provide highly reliable services, the requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like are high.
(5) Other electronic devices with data interaction functions.
Accordingly, embodiments of the present invention further provide a computer readable storage medium storing one or more programs executable by one or more processors to implement any one of the data acquisition methods provided in the foregoing embodiments, so that corresponding technical effects can be achieved, which have been described in detail above and are not repeated herein.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses mainly on its differences from the others. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively brief, and reference may be made to the description of the method embodiments where relevant.
For convenience of description, the above apparatus is described as being functionally divided into various units/modules, respectively. Of course, the functions of the various elements/modules may be implemented in the same piece or pieces of software and/or hardware when implementing the present invention.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (14)

1. A method of data acquisition, comprising:
determining the size relation between the data storage space required by the training data set and the residual memory space;
under the condition that the data storage space is larger than the memory residual space, selecting at least one file as a target file in the training data set according to a preset strategy;
reading a first file from the training dataset after the target file is read for the first time;
determining whether the first file is the target file read for the first time;
and under the condition that the first file is the target file read for the first time, adding a preset mark to the first file, so that the virtual file system VFS retains the target file in a page cache of a kernel according to the preset mark, whereby the target file is acquired from the page cache of the kernel when the target file is read again later; wherein the preset strategy ensures that whether the target file is retained in the cache is not determined according to the reading frequency.
2. The method of claim 1, wherein the preset policy comprises:
and taking the file with the file size smaller than a first threshold value in the training data set as the target file.
3. The method of claim 1, wherein the preset policy comprises:
and determining the target files according to the file sizes of the files in the training data set and the memory residual space, so that the number of the target files is larger than a second threshold value, and/or the memory residual space is smaller than a third threshold value after the target files are reserved in a page cache of a kernel.
4. The method of claim 1, wherein after the target file is retained in the page cache of the kernel, the method further comprises:
receiving an instruction for reading a second file from the training data set;
searching the second file in a page cache of the kernel;
under the condition that the second file is found, the second file is obtained from a page cache of the kernel so as to perform model training by using the second file;
and under the condition that the second file is not found, acquiring the second file from a remote server to perform model training by using the second file.
5. The method as recited in claim 4, further comprising:
caching the training data set on a local hard disk;
and under the condition that the second file is not found, acquiring the second file from a local hard disk so as to perform model training by using the second file.
6. The method according to any one of claims 1 to 5, wherein prior to determining the size relationship of the data storage space required for the training data set and the memory remaining space, the method further comprises: and (5) emptying the memory.
7. A data acquisition device, comprising:
the determining unit is used for determining the size relation between the data storage space required by the training data set and the memory residual space;
the selecting unit is used for selecting at least one file in the training data set as a target file according to a preset strategy under the condition that the data storage space is larger than the memory residual space;
a reservation unit, configured to read a first file from the training data set after the target file is read for the first time; determine whether the first file is the target file read for the first time; and under the condition that the first file is the target file read for the first time, add a preset mark to the first file, so that the virtual file system VFS retains the first file in a page cache of a kernel according to the preset mark, whereby the target file is acquired from the page cache of the kernel when the target file is read again later; wherein the preset strategy ensures that whether the target file is retained in the cache is not determined according to the reading frequency.
8. The apparatus of claim 7, wherein the preset policy comprises:
and taking the file with the file size smaller than a first threshold value in the training data set as the target file.
9. The apparatus of claim 7, wherein the preset policy comprises:
and determining the target files according to the file sizes of the files in the training data set and the memory residual space, so that the number of the target files is larger than a second threshold value, and/or the memory residual space is smaller than a third threshold value after the target files are reserved in a page cache of a kernel.
10. The apparatus as recited in claim 7, further comprising:
a receiving unit, configured to receive an instruction to read a second file from the training data set after the target file is retained in a page cache of a kernel;
the searching unit is used for searching the second file in the page cache of the kernel;
the acquisition unit is used for acquiring the second file from the page cache of the kernel under the condition that the second file is found so as to perform model training by using the second file; and under the condition that the second file is not found, acquiring the second file from a remote server to perform model training by using the second file.
11. The apparatus as recited in claim 10, further comprising:
the hard disk caching unit is used for caching the training data set in a local hard disk;
the obtaining unit is further configured to obtain the second file from the local hard disk, so as to perform model training by using the second file, where the second file is not found.
12. The apparatus according to any of claims 7 to 11, further comprising a flushing unit for flushing the memory before determining the size relation of the data storage space required for the training data set and the memory remaining space.
13. An electronic device, comprising: a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is arranged in a space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the data acquisition method of any one of claims 1-6.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs executable by one or more processors to implement the data acquisition method of any one of the preceding claims 1 to 6.
CN202010030600.8A 2020-01-10 2020-01-10 Data acquisition method and device, electronic equipment and storage medium Active CN111258965B (en)


Publications (2)

Publication Number Publication Date
CN111258965A CN111258965A (en) 2020-06-09
CN111258965B true CN111258965B (en) 2024-03-08





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20201023
Address after: Room 91, 5/F, Building 5, Yard 30, Shixing Street, Shijingshan District, Beijing 100041
Applicant after: Beijing juyuncube Technology Co.,Ltd.
Address before: 100041 Beijing, Shijingshan District Xing Xing street, building 30, No. 3, building 2, A-0071
Applicant before: Beijing Cheetah Mobile Technology Co.,Ltd.
GR01 Patent grant