CN113065663A - Data access method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113065663A
CN113065663A
Authority
CN
China
Prior art keywords
training data
target
data set
array
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110089281.2A
Other languages
Chinese (zh)
Inventor
唐晶
廖阔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110089281.2A
Publication of CN113065663A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a data access method, apparatus, device, and storage medium. A training data set for training a target deep learning model is obtained, a target index file is determined according to the training data set, and each piece of training data in the training data set is randomly accessed based on its index. On this basis, operations such as randomly ordering the training data are converted into operations on the index information: in each round of training the target deep learning model, the order of the index information in the target index file is randomly shuffled to obtain a target index file in which the index information is arranged in a second order, and the target training data corresponding to the target index information are read from the training data set in the second order and input into the target deep learning model for that round of training. Therefore, only the small-sized index information and the small amount of training data required by each training step need to be kept in memory, which significantly reduces memory usage while keeping training data reads efficient.

Description

Data access method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to a data access method, apparatus, device, and storage medium.
Background
In recent years, deep learning techniques have developed rapidly and have been shown to approach, and in some fields exceed, human performance in areas such as natural language processing, computer vision, and speech recognition. Training a deep learning model to high precision and good performance is inseparable from massive training data.
At present, deep learning models are mainly trained based on the stochastic gradient descent algorithm. In most model training scenarios, the whole training data set is first read into memory, the training data in the training data set are randomly shuffled before each round of training, and a small portion of the shuffled training data is read in order and input into the model.
However, as the scale of data used for model training grows, this approach becomes increasingly infeasible: training data of tens or even hundreds of GB may not fit entirely in memory. In addition, each step of model training actually uses only a small part of the whole training data set, so the effective utilization of the memory is low and a large amount of memory is wasted.
Disclosure of Invention
In order to solve the above technical problems, the application provides a data access method, apparatus, device, and storage medium, which ensure that each round of training reads only the portion of the data it needs and avoid the large waste of memory caused by loading the whole training data set. By randomly shuffling the order of the index information in the target index file in each round of training, random access to each piece of training data in the training data set based on its index is achieved; on this basis, operations such as randomly ordering the training data can be converted into operations on the index information, so that only the small-sized index information and the small amount of training data needed at each training step have to be kept in memory.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a data access method, where the method includes:
acquiring a training data set for training a target deep learning model;
determining a target index file according to the training data set, wherein the target index file records index information of each piece of training data in the training data set in a first order, and the index information indicates the position of the training data in the training data set;
in each round of training the target deep learning model, randomly shuffling the order of the index information in the target index file to obtain a target index file in which the index information is arranged in a second order, wherein the first order is different from the second order;
according to the second order, determining target index information belonging to the same reading batch from the target index file;
and for each reading batch, reading the target training data corresponding to the target index information from the training data set and inputting it into the target deep learning model, so as to perform that round of training on the target deep learning model.
In a second aspect, an embodiment of the present application provides a data access apparatus, where the apparatus includes an obtaining unit, a determining unit, a shuffling unit, and a reading unit:
the acquisition unit is used for acquiring a training data set used for training a target deep learning model;
the determining unit is configured to determine a target index file according to the training data set, where the target index file records index information of each piece of training data in the training data set according to a first order, and the index information indicates a position of the training data in the training data set;
the shuffling unit is configured to randomly shuffle the order of the index information in the target index file in each round of training the target deep learning model, to obtain a target index file in which the index information is arranged in a second order, wherein the first order is different from the second order;
the determining unit is further configured to determine, according to the second order, target index information belonging to the same reading batch from the target index file;
the reading unit is configured to, for each reading batch, read target training data corresponding to the target index information from the training data set and input the target deep learning model, so as to perform the round of training on the target deep learning model.
In a third aspect, an embodiment of the present application provides an apparatus for data access, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the first aspect according to instructions in the program code.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program code for executing the method of the first aspect.
According to the above technical solution, in deep learning, multiple rounds of training are usually performed over the whole training data set, and the method provides fast access to the training data set. Specifically, a training data set for training a target deep learning model is obtained, and a target index file is determined according to the training data set; the target index file records index information of each piece of training data in the training data set in a first order, and the index information indicates the position of the training data in the training data set. In each round of training the target deep learning model, the order of the index information in the target index file is randomly shuffled to obtain a target index file in which the index information is arranged in a second order, the first order being different from the second order. Target index information belonging to the same reading batch is determined from the target index file according to the second order, so that for each reading batch the target training data corresponding to the target index information are read from the training data set and input into the target deep learning model to perform that round of training. This ensures that each round of training reads only the portion of the data it needs and avoids the large waste of memory caused by loading the whole training data set.
By randomly shuffling the order of the index information in the target index file in each round of training, random access to each piece of training data in the training data set based on its index is achieved; on this basis, operations such as randomly ordering the training data can be converted into operations on the index information, so that only the small-sized index information and the small amount of training data needed at each training step have to be kept in memory.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of an algorithm of a data access method provided in the related art;
fig. 2 is a schematic view of an application scenario of a data access method according to an embodiment of the present application;
fig. 3 is a flowchart of a data access method provided in an embodiment of the present application;
FIG. 4 is a flowchart of an algorithm of a data access method according to an embodiment of the present application;
fig. 5 is an overall framework diagram of a data access method provided in an embodiment of the present application;
FIG. 6 is a flowchart of an algorithm for reading training data via an index lookup module according to an embodiment of the present application;
FIG. 7 is a flowchart of an algorithm for constructing a target index file according to an embodiment of the present application;
fig. 8 is a flowchart of a data access method provided in an embodiment of the present application;
fig. 9 is a block diagram of a data access device according to an embodiment of the present application;
fig. 10 is a structural diagram of a terminal device according to an embodiment of the present application;
fig. 11 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In some related technologies, the whole training data set is read into memory, then the training data in the training data set are randomly shuffled before each round of training, and a small portion of the shuffled training data is read in order and input into the model, as shown in fig. 1.
Here e denotes the current training round, counting from 0; when e = 0, the first round of training starts, and E denotes the total number of training rounds. When training starts, i.e., when e = 0 (see S101), the entire training data set is read into memory (see S102). If e < the total number of training rounds E (see S103), training is not yet complete, so the training data in the whole training data set are randomly shuffled (see S104), a small portion of the shuffled training data is read in order and input into the model (see S105), e += 1 is executed (see S106), and S103 is executed again.
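The related-art flow of fig. 1 can be sketched roughly as follows. This is a minimal illustration only, not the patent's implementation; the dataset path, `model` object, and its `train_step` method are hypothetical:

```python
import random

def train_in_memory(dataset_path, model, total_epochs, batch_size):
    # S102: read the ENTIRE training data set into memory (the bottleneck).
    with open(dataset_path, "r", encoding="utf-8") as f:
        data = f.readlines()
    for epoch in range(total_epochs):          # S103: e < E
        random.shuffle(data)                   # S104: shuffle all records in memory
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]     # S105: read a small portion in order
            model.train_step(batch)            # hypothetical training call (not in the patent)
```

Note that `data` holds every record for the whole run even though each step touches only one small batch, which is exactly the memory waste described above.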
However, as the scale of data used for model training increases, an excessively large training data set may not fit entirely in memory. In addition, each step of model training actually uses only a small part of the whole training data set, so the effective utilization of the memory is low and a large amount of memory is wasted.
Therefore, the embodiments of the present application provide a data access method, apparatus, device, and storage medium, which ensure that each round of training reads only the portion of the data it needs and avoid the large waste of memory caused by loading the whole training data set. By randomly shuffling the order of the index information in the target index file in each round of training, random access to each piece of training data in the training data set based on its index is achieved; on this basis, operations such as randomly ordering the training data can be converted into operations on the index information, so that only the small-sized index information and the small amount of training data needed at each training step have to be kept in memory.
It should be noted that the methods and systems provided herein relate to the field of Artificial Intelligence (AI). AI is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
The data access method provided by the embodiments of the application is mainly and widely applicable to machine learning/deep learning scenarios; for example, it can serve as a training acceleration component that speeds up the training of deep models and saves users' time.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
The method provided by the embodiments of the present application may further involve blockchain technology; for example, in the data access method disclosed in the present application, the training data set used in the deep learning process may be stored on a blockchain.
It should be noted that the method may be applied to a data processing device, and the data processing device may be a terminal device, and the terminal device may be, for example, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like, but is not limited thereto.
The data processing device may also be a server, which may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services.
Of course, the data processing device may also be a combination of a terminal device and a server, that is, the method may be executed cooperatively by the terminal device and the server, which may be connected directly or indirectly through wired or wireless communication; this is not limited herein.
In order to facilitate understanding of the technical solution of the present application, the data access method provided in the embodiment of the present application is described below with reference to a terminal device as an example in combination with an actual application scenario.
Referring to fig. 2, fig. 2 is a schematic view of an application scenario of the data access method provided in the embodiment of the present application, where the application scenario may include a terminal device 201 and a server 202. The terminal device 201 may be configured to execute the data access method provided in the embodiment of the present application, and the server 202 may store a training data set.
Various training data can be stored on the server 202, and when the terminal device 201 needs to train a certain model, for example, a target deep learning model, the terminal device 201 can obtain a training data set for training the target deep learning model from the server 202.
In the training process of the target deep learning model, multiple rounds of training are usually performed over the whole training data set, and the order in which the training data are read should differ between rounds to ensure the accuracy of model training. The present application converts operations such as randomly ordering the training data into operations on index information, so the terminal device 201 needs to determine the target index file according to the training data set. The target index file records index information of each piece of training data in the training data set in a first order, and the index information indicates the position of the training data in the training data set.
Then, in each round of training the target deep learning model, the terminal device 201 randomly shuffles the order of the index information in the target index file to obtain a target index file in which the index information is arranged in a second order, the first order being different from the second order; by shuffling the order of the index information, the training data are read in a different order each time they are read based on the index information. The terminal device 201 then determines, according to the second order, the target index information belonging to the same reading batch from the target index file, so that the index information is divided into batches, and each time reads the target training data corresponding to the target index information from the training data set and inputs it into the target deep learning model, so as to perform that round of training on the target deep learning model.
Next, a data access method provided in an embodiment of the present application will be described in detail with reference to the accompanying drawings, taking the data processing device as a terminal device as an example.
Referring to fig. 3, fig. 3 shows a flow chart of a data access method, the method comprising:
s301, a training data set used for training the target deep learning model is obtained.
When the terminal device needs to train a certain model, for example, a target deep learning model, the terminal device may obtain a training data set for training the target deep learning model from the server.
The training data set for training the target deep learning model may comprise one or more data sets.
S302, determining a target index file according to the training data set.
In the training process of the target deep learning model, multiple rounds of training are usually performed over the whole training data set, and the order in which the training data are read should differ between rounds to ensure the accuracy of model training. The present application converts operations such as randomly ordering the training data into operations on index information, so the terminal device needs to determine the target index file according to the training data set.
The target index file records index information of each piece of training data in the training data set in a first order, and the index information indicates the position of the training data in the training data set.
In one possible implementation, the index information is represented by the byte offset of the training data in the training data set; the byte offset may be the position, in bytes, of the first byte of the training data within the training data set. For example, suppose the training data set includes a plurality of pieces of training data, and the first piece occupies bytes 1-20 of the training data set; its first byte is byte 1 of the set, so its byte offset is 1. The second piece occupies bytes 21-50; its first byte is byte 21 of the set, so its byte offset is 21, and so on.
To ensure that a complete piece of training data is read each time training data is read according to the index information, the index information may also carry the line feed character or the length of the corresponding training data, so that the end position of the training data within the training data set is known when the training data is read.
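The index described above can be sketched as a single scan over a newline-delimited data set that records each record's byte offset and byte length. This is only an illustration under the assumption that one line is one record (and offsets here are 0-based, as is conventional for file positions, whereas the illustration in the text counts bytes from 1); the patent's index file may store the information in other forms:

```python
def build_index(dataset_path):
    """Scan a newline-delimited training data set once and record, for each
    piece of training data, an (offset, length) pair in bytes. The length
    includes the trailing newline so a complete record can be read back."""
    index = []
    with open(dataset_path, "rb") as f:
        offset = 0
        for line in f:
            index.append((offset, len(line)))
            offset += len(line)
    return index
```

The offsets are computed cumulatively rather than via `f.tell()`, because `tell()` is unreliable inside a buffered line-iteration loop.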
It should be noted that, in this embodiment, the target index file may already exist or may not yet be constructed; accordingly, the target index file may be determined in multiple ways. That is, in one possible implementation, determining the target index file according to the training data set may be searching an index file storage area for the target index file corresponding to the training data set, i.e., determining whether the target index file already exists (see S402 in fig. 4), where e = 0 in S401 indicates that training has just started, i.e., the first round of training. If the target index file is not found in the index file storage area, a target index file corresponding to the training data set is constructed (see S403 in fig. 4). If the target index file is found in the index file storage area, it is loaded directly (see S404 in fig. 4).
It should be noted that, in a case that a target index file corresponding to a training data set needs to be constructed, the method provided by the embodiment of the present application may include two parts, namely a data preprocessing process and a model training process, and an overall framework diagram of the method is shown in fig. 5. In the data preprocessing process, the index generation module 501 may generate a target index file corresponding to a training data set, where the input is the training data set and the output is the target index file corresponding to the training data set. In the model training process, target training data corresponding to subsequently determined target index information can be searched by the index searching module 502 and sequentially input into the target deep learning model.
S303, in each round of training the target deep learning model, randomly shuffle the order of the index information in the target index file to obtain a target index file in which the index information is arranged in a second order.
In each round of training the target deep learning model, the order of the index information in the target index file can be randomly shuffled to obtain a target index file in which the index information is arranged in a second order, replacing operations such as the random ordering of training data in the related art. The first order is different from the second order.
For example, suppose the index information in the target index file determined in S302 is ordered as index X, index Y, index Z, … (the first order), where index X is the index information corresponding to the first piece of training data, index Y corresponds to the second piece, and index Z corresponds to the third piece. After the order of the index information in the target index file is randomly shuffled, if the resulting second order is index Z, index X, index Y, …, the reading order of the training data changes when the training data is read.
In the actual algorithm, e in fig. 4 denotes the current training round, counting from 0; when e = 0, the first round of training starts, and E denotes the total number of training rounds. After training starts, i.e., when e = 0 (see S401), and after the target index file has been obtained via S402 or S403, it is determined whether e is less than the total number of training rounds E (see S405); if so, training is not yet complete, so the index information in the target index file is randomly shuffled (see S406).
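The per-round shuffle of S406 amounts to permuting the list of index entries rather than the data itself. A minimal sketch (the `index_entries` list and optional `rng` parameter are illustrative assumptions, not the patent's interface):

```python
import random

def shuffle_index(index_entries, rng=None):
    """Randomly permute a copy of the index entries: the returned list is
    the second order, while the input list (the first order) is untouched,
    so the on-disk target index file can be reused in later rounds."""
    rng = rng or random.Random()
    second_order = list(index_entries)
    rng.shuffle(second_order)
    return second_order
```

Only the small index entries are moved in memory; the training data set on disk is never copied or reordered.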
S304, determine the target index information belonging to the same reading batch from the target index file according to the second order.
S305, for each reading batch, reading target training data corresponding to the target index information from the training data set and inputting the target deep learning model, so as to perform the round of training on the target deep learning model.
Each training step uses only a small part of the whole training data set, so the target index information belonging to the same reading batch can be determined from the target index file according to the needs of each training step; index information belonging to the same reading batch indicates that the corresponding training data will be used in the same training step. For example, several adjacent pieces of index information may be grouped together. In this way, the corresponding training data can be read from the training data set according to the target index information and input into the target deep learning model for training (see S407 in fig. 4).
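Grouping adjacent index information into reading batches, as described above, can be sketched as follows (the function name and parameters are illustrative, not from the patent):

```python
def make_reading_batches(second_order, batch_size):
    """Group adjacent index entries (in the second, shuffled order) into
    reading batches; each batch names the target training data that one
    training step will consume."""
    return [second_order[i:i + batch_size]
            for i in range(0, len(second_order), batch_size)]
```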
Since the next training round may follow, e += 1 is executed (see S408 in fig. 4), i.e., the current training round number is increased by 1 to enter the next round, and S405 is executed again. For example, if the current training round number e is 0, then after e += 1, e is 1, and in S405, e = 1 is compared with the total number of training rounds E.
It should be noted that, in this embodiment, the shuffled index information may be sequentially input into the index lookup module 502 shown in fig. 5; the index lookup module 502 finds and returns the content of the target training data record corresponding to the target index in the training data set, and the returned target training data record can then be input into the target deep learning model for training.
For example, if the determined target index information is index Z, index X, index Y, then the target training data output by the index lookup module 502 are the third, first, and second pieces of training data, respectively, and they are input into the target deep learning model in that order.
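The behavior of the index lookup module can be sketched as a seek-and-read per target index. This is an illustrative sketch under the assumption that each index entry is an (offset, length) pair in bytes, not the patent's actual implementation:

```python
def lookup_records(dataset_path, target_index_info):
    """Sketch of the index lookup module (502): for one reading batch, seek
    to each target index's byte offset and read exactly one record, so only
    the records of the current batch ever reside in memory."""
    records = []
    with open(dataset_path, "rb") as f:
        for offset, length in target_index_info:
            f.seek(offset)               # random access by byte offset
            records.append(f.read(length))
    return records
```

Passing the entries in the shuffled second order (e.g. index Z, index X, index Y) returns the records in that same order, which is what makes the index shuffle equivalent to shuffling the data.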
According to the above technical solution, in deep learning, multiple rounds of training are usually performed over the whole training data set, and the method provides fast access to the training data set. Specifically, a training data set for training a target deep learning model is obtained, and a target index file is determined according to the training data set; the target index file records index information of each piece of training data in the training data set in a first order, and the index information indicates the position of the training data in the training data set. In each round of training the target deep learning model, the order of the index information in the target index file is randomly shuffled to obtain a target index file in which the index information is arranged in a second order, the first order being different from the second order. Target index information belonging to the same reading batch is determined from the target index file according to the second order, so that for each reading batch the target training data corresponding to the target index information are read from the training data set and input into the target deep learning model to perform that round of training. This ensures that each round of training reads only the portion of the data it needs and avoids the large waste of memory caused by loading the whole training data set.
By randomly scrambling the order of the index information in the target index file in each round of training, random access to each piece of training data in the training data set is achieved based on the index. Operations such as random reordering of the training data can thereby be converted into operations on the index information, so that only the small-sized index information and the small amount of training data needed at each training step need to be kept in memory.
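As an illustrative sketch (not the literal implementation from the embodiment), scrambling the index information rather than the data itself can be expressed in Python; here `offsets` is a hypothetical in-memory list standing in for the index information in the target index file:

```python
import random

def epoch_batches(offsets, batch_size, seed=None):
    """Yield reading batches of index information in a freshly shuffled order.

    Only the small index list is shuffled and held in memory; the training
    records themselves stay on disk until a batch is actually read.
    """
    order = list(offsets)               # copy, so the first (on-disk) order is untouched
    random.Random(seed).shuffle(order)  # the second order differs from the first
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

# Example: 6 records' byte offsets, read in batches of 2 index entries.
batches = list(epoch_batches([0, 120, 250, 400, 512, 700], 2, seed=42))
```

Every offset appears in exactly one batch, so each round still covers the whole data set, just in a new order.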
Compared with the prior art in which the whole training data set is randomly divided into several small data sets and only one small data set is loaded into memory at a time, the method provided by the embodiment of the present application can alleviate the problems of large memory occupation and low memory utilization to a certain extent. At the same time, whereas that prior art cannot randomly scramble all of the data, the present method achieves random scrambling both within each small data set and among the small data sets.
Meanwhile, compared with the prior art in which the whole training data set is randomly scrambled and a copy of the training data set with a different scrambled order is created in each round, the method provided by the embodiment of the present application scrambles only the index information; even if a copy is stored, it is a copy of the target index file, so the file volume is greatly reduced. In addition, the method only needs to traverse the training data set once, when the target index file is generated; the generated target index file can then be stored on disk, and when the same training data set needs to be read later, only the saved target index file needs to be loaded, without traversing the training data set again. Since the index file is small compared with the whole training data set, the reading speed of the training data set can be significantly improved.
Next, for the case where the target index file needs to be created, the specific manner of creating it is described in detail.
In this embodiment, there may be one or more training data sets for training the target deep learning model, and the manner of generating the target index file differs slightly depending on the number of training data sets.
If there is one training data set, one possible implementation of constructing the corresponding target index file is to traverse the training data set, record the position of each piece of training data in the training data set, and construct the target index file according to these positions.
The index information can be represented by the byte offset of the training data in the training data set; that is, the position of each piece of training data can be represented by its byte offset within the training data set. The byte offsets of all the training data form an array A[i], where i denotes the i-th piece of training data. Concretely, an incrementing integer variable i is initialized to 0, each piece of training data in the training data set is traversed in turn, and A[i] is assigned the byte offset of that piece of training data within the training data set.
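A minimal Python sketch of this single-data-set traversal, assuming one training record per line (an assumption made here for illustration only), might be:

```python
import os
import tempfile

def build_offset_index(path):
    """Traverse a record-per-line training data set once, recording the
    byte offset of each record: the array A[i] described in the text."""
    A = []
    with open(path, "rb") as f:
        offset = f.tell()          # byte offset of the record about to be read
        line = f.readline()
        while line:
            A.append(offset)
            offset = f.tell()
            line = f.readline()
    return A

# Demo with a throwaway data set of three records.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
A = build_offset_index(tmp.name)
with open(tmp.name, "rb") as f:
    f.seek(A[2])                   # random access to the third record via the index
    record = f.readline()
os.unlink(tmp.name)
```

Traversal happens exactly once; afterwards any record can be fetched by a single seek to its stored offset.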
If there are a plurality of training data sets, one possible implementation of constructing the corresponding target index file is as follows: traverse the plurality of training data sets respectively, record the position of each piece of training data in its corresponding training data set, and construct a first array; sequentially open each training data set and record the corresponding file handles in a second array; construct a third array from the position subscripts, in the second array, of the file handles corresponding to the training data sets; and construct the target index file from the first array, the second array, and the third array.
The first array is the above-mentioned array A[i]; the byte offset recorded in A[i] is the byte offset of the i-th piece of training data within its corresponding training data set. The second array may be denoted C[i]. Suppose M denotes the number of training data set files to be read, N_j denotes the number of pieces of training data contained in the j-th training data set (j ∈ {1, …, M}), and N denotes the total number of pieces of training data contained in all the training data sets, i.e.,

N = N_1 + N_2 + … + N_M.

Then the length of the array C[i] is M and the length of the array A[i] is N. The third array may be denoted B[i]; B[i] is assigned the position subscript, in the array C[i], of the file handle of the training data set containing the i-th piece of training data, and i is incremented by 1 after each piece of training data is traversed. After all training data sets have been traversed, (A[i], B[i]) can be taken as the index information of the i-th piece of training data; finally, the array A[i], the array B[i], and the array C[i] are stored in the index file for reading in subsequent use. This process needs to traverse all training data records once, so its time complexity is O(N), where O(·) denotes the time complexity of the algorithm and N is the total number of pieces of training data contained in all the training data sets; that is, the time complexity grows linearly with the total number of training data pieces.
In this case, since there are a plurality of training data sets and the first array records, for each piece of training data, only its byte offset within its corresponding training data set, reading the correct training data according to the first array requires first determining, from the plurality of training data sets, the training data set in which the training data to be read is located; the training data can then be read according to its byte offset within that training data set. Therefore, in a possible implementation, reading the target training data corresponding to the target index information from the training data set and inputting it into the target deep learning model is done by determining, according to the third array and the second array, the target training data set in which the target training data is located, and then, according to the first array, reading the target training data corresponding to the target index information from the target training data set and inputting it into the target deep learning model.
Referring to fig. 6, which shows an algorithm flowchart for reading the training data by the index lookup module: the index information (A[i], B[i]) is input to the index lookup module (see S601 in fig. 6), and the output is the corresponding target training data. The index lookup module first finds the file handle C[B[i]] of the training data set in which the training data to be read is located, that is, determines the target training data set in which the target training data is located (see S602 in fig. 6), which may be implemented by handle = C[B[i]]. Then, according to the byte offset A[i] within the target data set, the file pointer corresponding to the target data set is moved to the beginning of the target training data (the position indicated by the byte offset, see S603 in fig. 6), which may be implemented by the fseek operation, e.g., fseek(handle, A[i]). A complete piece of target training data is then read from that position (see S604 in fig. 6) and returned (see S605 in fig. 6). In the above process, the lookup of the file handle and the fseek operation are both random access operations with time complexity O(1); O(1) denotes a fixed constant, indicating that the time complexity is independent of the number of pieces of training data.
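A possible Python rendering of the fig. 6 lookup, using `seek` in place of the C-style fseek operation and again assuming one record per line, could be:

```python
import os
import tempfile

def read_record(offset, handle_idx, C):
    """Look up one record from index entry (A[i], B[i]).

    C[B[i]] gives the file handle of the target training data set; seek()
    moves its pointer to the byte offset A[i].  Both steps are O(1)
    random-access operations."""
    handle = C[handle_idx]
    handle.seek(offset)
    return handle.readline()

# Demo: two tiny data set files and one lookup with a known index entry.
paths = []
for text in (b"alpha\nbeta\n", b"gamma\n"):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(text)
        paths.append(tmp.name)
C = [open(p, "rb") for p in paths]
rec = read_record(6, 0, C)    # A[i] = 6, B[i] = 0 selects the second record of file 0
for h in C:
    h.close()
for p in paths:
    os.unlink(p)
```

The cost per lookup stays constant no matter how large the data sets grow, which is what makes random scrambling at the index level cheap.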
For the scenario with only one training data set, i.e., M = 1, only the array A[i] may be used, operating directly on the file handle of the single training data set.
In a possible implementation, when constructing the target index file, a hash table may be constructed with the file path of each training data set as the key and the position subscript, in the second array, of the file handle corresponding to that training data set as the value. The hash table may be represented by an array H; the hash table H stores the mapping from a file path to the position subscript of a file handle. The target index file can then be constructed from the first array, the second array, the third array, and the hash table; that is, the array A[i], the array B[i], and the hash table H are stored in the target index file.
In this embodiment, after the target index file has been constructed, if the training data needs to be read again later, the constructed target index file can be reused: only the saved array A[i], array B[i], and hash table H need to be read, without re-determining the target index file. However, since the file handles stored in the array C[i] may have changed, the array C[i] needs to be reconstructed. When reconstructing the array C[i], it must be ensured that, for the file path of each training data set, the position subscript in the array C[i] equals the value recorded in the hash table H. Therefore, when the training data is read again and input into the target deep learning model, the second array can be reconstructed according to the hash table.
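A hedged sketch of this reconstruction step in Python, where `H` is the hash table mapping file paths to position subscripts and `rebuild_handles` is an illustrative name, might be:

```python
import os
import tempfile

def rebuild_handles(H):
    """Reconstruct the second array C after reloading a saved index.

    H maps file path -> position subscript, so each reopened handle must
    land at the subscript that was recorded when the index was built;
    otherwise the saved B[i] entries would point at the wrong files."""
    C = [None] * len(H)
    for path, idx in H.items():
        C[idx] = open(path, "rb")
    return C

# Demo: two throwaway files registered in a hash table H.
paths = []
for _ in range(2):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        paths.append(tmp.name)
H = {paths[0]: 1, paths[1]: 0}   # deliberately not in creation order
C = rebuild_handles(H)
ok = C[1].name == paths[0] and C[0].name == paths[1]
for h in C:
    h.close()
for p in paths:
    os.unlink(p)
```

Because only H (and not the stale handles) is persisted, the index file remains valid across processes and machine restarts.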
Suppose the array A[j], the array B[j], the array C[j], and the hash table H need to be constructed when constructing the target index file. As before, let M denote the number of training data set files to be read, N_j denote the number of pieces of training data contained in the j-th training data set (j ∈ {1, …, M}), and N denote the total number of pieces of training data contained in all the training data sets, i.e., N = N_1 + N_2 + … + N_M.
the algorithm flow for constructing the target index file can be seen in fig. 7: initializing an array A [ j ]]Array B [ j ]]Array C [ j ]]And hash table H (see S701); inputting a path array D of the training data set and the number M of the training data sets (see S702); starting from j being 0 (see S703), determining whether j is smaller than M (see S704), if so, opening the training data set, and recording a file handle corresponding to the training data set in the array C [ j [ ]]In (see S705), the process may be through C [ j ]]=open(D[j]) And (5) realizing. The hash table H stores a mapping of the file path to the location subscript of the file handle, which may be via H [ D [ j ]]]J implementation. J is increased by 1, i.e., j + ═ 1 (see S706). If the determination result in S704 is no, i is 0 and j is 0 (see S707), to determine the array a [ i ═ 0 ═ j (see S707)]And array B [ i ]]. Determine if j is less than M (see S708), and if so, determine array C [ j ]]If it has not been traversed (see S709), and if not, the byte offset of the training data in the corresponding training data set is determined (see S710), the process may be passed through A [ i ]]=ftell(C[j]) According to the implementation, the ftell function is used for returning the position of a file pointer in the current training data set. B [ i ]]The file handle assigned to the current training data set is in array C [ i ]]Position subscript (see S711). Move cj]The file pointer of (c) points to the end of the next training data (see S712). i + ═ 1 (see S713), j + ═ 1 (see S714), and the array a [ i ═ 1-]Array B [ i ]]And the hash table H is saved to the target index file (see S715). If the determination result of S708 is negative, directly execute S715; if the determination result of S709 is no, S714 is directly performed.
It should be noted that, in this embodiment, the form of the target index file may be expanded or simplified according to the characteristics of the training data and the actual scene needs.
For example, if only one training data set needs to be read, only the array A[i] needs to be constructed and stored, and the array B[i], the array C[i], and the hash table H can be omitted. If the training data sets are read in the same order each time, only the arrays A[i], B[i], and C[i] need to be constructed and stored, without the hash table H. If the target index file does not need to be stored for subsequent use and is only used within a single training run, the hash table H need not be constructed, and the file handles of the training data sets can be stored directly in the array C[i]. In addition, instead of position subscripts, the file paths of the training data sets may be stored directly in the array B[i], with a hash table H storing the mapping from a training data set's file path directly to its file handle.
Next, the data access method provided in the embodiment of the present application is described in detail with reference to an actual application scenario. In this deep-learning-related scenario, the method serves as a training acceleration component: when a model trained in a deep learning manner, such as the target deep learning model, is to be trained, the training data required for training can be read based on this method. Taking the case where there is one training data set as an example, and referring to fig. 8, the method includes:
s801, the terminal device obtains a training data set used for training a target deep learning model.
S802, the terminal device determines whether a target index file corresponding to the training data set exists, if so, S803 is executed, and if not, S804 is executed.
And S803, the terminal equipment loads the target index file.
S804, the terminal device calls an index generation module to create a target index file.
When the target index file is created, an array A[i] can be constructed and stored in the target index file, where i denotes the i-th piece of training data.
S805, in each round of training the target deep learning model, the terminal device randomly scrambles the sequence of the index information in the target index file to obtain a target index file in which the index information is arranged according to a second sequence.
S806, the terminal device selects a plurality of adjacent index information as the target index information of the same reading batch according to the second sequence.
And S807, reading target training data corresponding to the target index information from the training data set, and inputting the target training data into a target deep learning model to complete training.
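The steps S801 to S807 above can be sketched end to end in Python; the single-file, record-per-line setup and the `train_step` callback are illustrative assumptions, not the embodiment's literal interface:

```python
import os
import random
import tempfile

def train(path, epochs, batch_size, train_step):
    """End-to-end sketch of S801-S807 for a single line-delimited data set.
    `train_step` stands in for feeding one batch to the model."""
    # S801/S804: traverse the data set once and build the index A[i].
    A = []
    with open(path, "rb") as f:
        off = f.tell()
        while f.readline():
            A.append(off)
            off = f.tell()
    with open(path, "rb") as f:
        for _ in range(epochs):
            order = A[:]                      # S805: shuffle index info only
            random.shuffle(order)
            for s in range(0, len(order), batch_size):   # S806: adjacent entries form a batch
                batch = []
                for offset in order[s:s + batch_size]:   # S807: seek + read each record
                    f.seek(offset)
                    batch.append(f.readline())
                train_step(batch)

# Demo: count how many records each epoch feeds to the "model".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"r1\nr2\nr3\nr4\n")
seen = []
train(tmp.name, epochs=2, batch_size=2, train_step=lambda b: seen.extend(b))
os.unlink(tmp.name)
```

At no point is the whole data set resident in memory: only the offset list and the current batch of records.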
Based on the data access method provided by the foregoing embodiment, an embodiment of the present application provides a data access apparatus, which includes an obtaining unit 901, a determining unit 902, a scrambling unit 903, and a reading unit 904, as shown in fig. 9:
the acquiring unit 901 is configured to acquire a training data set used for training a target deep learning model;
the determining unit 902 is configured to determine, according to the training data set, a target index file, where the target index file records, according to a first order, index information of each piece of training data in the training data set, and the index information indicates a position of the training data in the training data set;
the scrambling unit 903 is configured to randomly scramble the order of the index information in the target index file in each round of training the target deep learning model, so as to obtain a target index file in which the index information is arranged according to a second order, where the first order is different from the second order;
the determining unit 902 is further configured to determine, according to the second order, target index information belonging to the same reading batch from a target index file;
the reading unit 904 is configured to, for each reading batch, read the target training data corresponding to the target index information from the training data set and input it into the target deep learning model, so as to perform the current round of training on the target deep learning model.
In an implementation manner, the determining unit 902 is configured to:
searching a target index file corresponding to the training data set in an index file storage area;
and if the target index file is not found in the index file storage area, constructing a target index file corresponding to the training data set.
In an implementation manner, if the training data set includes one, the determining unit 902 is configured to:
traversing the training data set, and recording the position of each piece of training data in the training data set;
and constructing the target index file according to the position.
In an implementation manner, if the training data set includes a plurality of training data sets, the determining unit 902 is configured to:
traversing the training data sets respectively, recording the position of each piece of training data in the corresponding training data set, and constructing a first array;
sequentially opening each training data set, and recording file handles corresponding to the training data sets in a second array;
constructing a third array according to the position index of the file handle corresponding to each training data set in the second array;
and constructing the target index file according to the first array, the second array and the third array.
In one implementation, the reading unit 904 is configured to:
determining a target training data set in which target training data are located from the plurality of training data sets according to the third array and the second array;
and according to the first array, reading target training data corresponding to the target index information from the target training data set and inputting the target training data into the target deep learning model.
In an implementation manner, the determining unit 902 is further configured to:
taking the file path of each training data set as a key, and taking the position index of the file handle corresponding to each training data set in the second array as a value to construct a hash table;
the constructing the target index file according to the first array, the second array and the third array comprises:
and constructing the target index file according to the first array, the second array, the third array and the hash table.
In an implementation manner, the determining unit 902 is further configured to:
and when the training data is read again and input into the target deep learning model, reconstructing the second array according to the hash table.
In one implementation, the reading unit 904 is further configured to:
and if the target index file is found in the index file storage area, loading the target index file.
The embodiment of the present application further provides a device for data access. The device may be a data processing device configured to execute the data access method, and may be a terminal device; the following takes the terminal device being a smartphone as an example:
fig. 10 is a block diagram illustrating a partial structure of a smartphone related to a terminal device provided in an embodiment of the present application. Referring to fig. 10, the smart phone includes: radio Frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. The input unit 1030 may include a touch panel 1031 and other input devices 1032, the display unit 1040 may include a display panel 1041, and the audio circuit 1060 may include a speaker 1061 and a microphone 1062. Those skilled in the art will appreciate that the smartphone configuration shown in fig. 10 is not intended to be limiting and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The memory 1020 may be used to store software programs and modules, and the processor 1080 executes various functional applications and data processing of the smart phone by operating the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 1020 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 1080 is a control center of the smartphone, connects various parts of the entire smartphone through various interfaces and lines, and executes various functions and processes data of the smartphone by running or executing software programs and/or modules stored in the memory 1020 and calling data stored in the memory 1020, thereby integrally monitoring the smartphone. Optionally, processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which handles primarily the operating system, user interfaces, applications, etc., and a modem processor, which handles primarily the wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 1080.
In this embodiment, the processor 1080 in the terminal device may perform the following steps:
acquiring a training data set for training a target deep learning model;
determining a target index file according to the training data set, wherein the target index file records index information of each piece of training data in the training data set according to a first sequence, and the index information represents the position of the training data in the training data set;
in each round of training the target deep learning model, randomly scrambling the sequence of the index information in the target index file to obtain a target index file in which the index information is arranged according to a second sequence, wherein the first sequence is different from the second sequence;
according to the second sequence, determining target index information belonging to the same reading batch from the target index file;
and for each reading batch, reading target training data corresponding to the target index information from the training data set and inputting the target training data into the target deep learning model, so as to perform the current round of training on the target deep learning model.
Referring to fig. 11, fig. 11 is a block diagram of a server 1100 provided in this embodiment. The server 1100 may vary considerably with configuration or performance, and may include one or more Central Processing Units (CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing an application program 1142 or data 1144. The memory 1132 and the storage media 1130 may be transient storage or persistent storage. The program stored on a storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 1122 may be configured to communicate with the storage medium 1130 and execute, on the server 1100, the series of instruction operations stored in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
In this embodiment, the steps implemented by the server may be implemented based on the structure of the server described in fig. 11.
According to an aspect of the present application, there is provided a computer-readable storage medium for storing program code for executing the data access method described in the foregoing embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the embodiment.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of data access, the method comprising:
acquiring a training data set for training a target deep learning model;
determining a target index file according to the training data set, wherein the target index file records index information of each piece of training data in the training data set according to a first sequence, and the index information represents the position of the training data in the training data set;
in each round of training the target deep learning model, randomly scrambling the sequence of the index information in the target index file to obtain a target index file in which the index information is arranged according to a second sequence, wherein the first sequence is different from the second sequence;
according to the second sequence, determining target index information belonging to the same reading batch from the target index file;
and for each reading batch, reading target training data corresponding to the target index information from the training data set and inputting the target training data into the target deep learning model, so as to perform the current round of training on the target deep learning model.
2. The method of claim 1, wherein determining a target index file from the training data set comprises:
searching a target index file corresponding to the training data set in an index file storage area;
and if the target index file is not found in the index file storage area, constructing a target index file corresponding to the training data set.
3. The method according to claim 2, wherein if the training data set includes one, the constructing the target index file corresponding to the training data set includes:
traversing the training data set, and recording the position of each piece of training data in the training data set;
and constructing the target index file according to the position.
4. The method according to claim 2, wherein if the training data set includes a plurality of training data sets, the constructing the target index file corresponding to the training data set includes:
traversing the training data sets respectively, recording the position of each piece of training data in the corresponding training data set, and constructing a first array;
sequentially opening each training data set, and recording file handles corresponding to the training data sets in a second array;
constructing a third array according to the position index of the file handle corresponding to each training data set in the second array;
and constructing the target index file according to the first array, the second array and the third array.
5. The method according to claim 4, wherein the reading of the target training data corresponding to the target index information from the training data set is input into the target deep learning model, and comprises:
determining a target training data set in which target training data are located from the plurality of training data sets according to the third array and the second array;
and according to the first array, reading target training data corresponding to the target index information from the target training data set and inputting the target training data into the target deep learning model.
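One way the three arrays of claims 4–5 could fit together (an illustrative reading, not the patent's code): the first array holds each record's byte offset within its own data set, the second array holds one open handle per data set, and the third array maps every record to its handle's position index in the second array. `io.BytesIO` objects stand in for real file handles, and the fixed record size is an assumption.

```python
import io

REC = 4  # fixed record size, an assumption to keep the sketch short

def build_multi_index(blobs):
    """Claim 4: build the first (offsets), second (handles), third (handle-index)
    arrays over several training data sets given as in-memory byte blobs."""
    first, second, third = [], [], []
    for handle_idx, blob in enumerate(blobs):
        second.append(io.BytesIO(blob))       # stands in for an open file handle
        for offset in range(0, len(blob), REC):
            first.append(offset)              # position inside its own data set
            third.append(handle_idx)          # where this set's handle lives
    return first, second, third

def read_sample(i, first, second, third):
    """Claim 5: the third array picks the target data set's handle out of the
    second array; the first array gives the offset to seek to within it."""
    handle = second[third[i]]
    handle.seek(first[i])
    return handle.read(REC)

blobs = [b"aaaabbbb", b"ccccddddeeee"]        # two "training data sets"
first, second, third = build_multi_index(blobs)
sample = read_sample(3, first, second, third) # 4th record across all sets
```

With this layout, a global record number is all the shuffler needs to hand out; the arrays resolve it to the right file and offset.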
6. The method of claim 4, further comprising:
taking the file path of each training data set as a key, and taking the position index of the file handle corresponding to each training data set in the second array as a value to construct a hash table;
the constructing the target index file according to the first array, the second array and the third array comprises:
and constructing the target index file according to the first array, the second array, the third array and the hash table.
7. The method of claim 6, further comprising:
and when the training data is read again and input into the target deep learning model, reconstructing the second array according to the hash table.
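Claims 6–7 can be sketched together: a hash table keyed by file path records where each data set's handle sits in the second array, so that when the training data must be read again (for instance after the handles have been closed), the second array can be rebuilt in the correct order from the hash table alone. The paths and file contents below are hypothetical stand-ins.

```python
import io

paths = ["/data/set_a.bin", "/data/set_b.bin"]       # hypothetical file paths
fake_files = {paths[0]: b"aaaa", paths[1]: b"bbbb"}  # stand-ins for files on disk

# Claim 6: file path as key, the handle's position index in the second array as value.
second = [io.BytesIO(fake_files[p]) for p in paths]
hash_table = {p: i for i, p in enumerate(paths)}

def rebuild_second(hash_table, opener):
    """Claim 7: reopen every data set and place each new handle back at the
    position index recorded for it in the hash table."""
    new_second = [None] * len(hash_table)
    for path, idx in hash_table.items():
        new_second[idx] = opener(path)
    return new_second

rebuilt = rebuild_second(hash_table, lambda p: io.BytesIO(fake_files[p]))
```

Because the third array stores only position indices, it remains valid across the rebuild; only the handles themselves are refreshed.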
8. The method of claim 2, further comprising:
and if the target index file is found in the index file storage area, loading the target index file.
9. The method of any of claims 1-8, wherein the index information is represented by a byte offset of the training data in the training data set.
10. A data access apparatus, characterized in that the apparatus comprises an acquisition unit, a determination unit, a shuffling unit, and a reading unit:
the acquisition unit is configured to acquire a training data set used for training a target deep learning model;
the determining unit is configured to determine a target index file according to the training data set, where the target index file records index information of each piece of training data in the training data set in a first order, and the index information indicates a position of the training data in the training data set;
the shuffling unit is configured to, in each round of training the target deep learning model, randomly shuffle the order of the index information in the target index file to obtain a target index file in which the index information is arranged in a second order, wherein the first order is different from the second order;
the determining unit is further configured to determine, according to the second order, target index information belonging to the same reading batch from the target index file;
the reading unit is configured to, for each reading batch, read target training data corresponding to the target index information from the training data set and input the target training data into the target deep learning model, so as to perform the current round of training on the target deep learning model.
11. The apparatus of claim 10, wherein the determining unit is configured to:
searching a target index file corresponding to the training data set in an index file storage area;
and if the target index file is not found in the index file storage area, constructing a target index file corresponding to the training data set.
12. The apparatus according to claim 11, wherein if there is one training data set, the determining unit is configured to:
traversing the training data set, and recording the position of each piece of training data in the training data set;
and constructing the target index file according to the position.
13. The apparatus of claim 11, wherein if there are a plurality of training data sets, the determining unit is configured to:
traversing the training data sets respectively, recording the position of each piece of training data in the corresponding training data set, and constructing a first array;
sequentially opening each training data set, and recording file handles corresponding to the training data sets in a second array;
constructing a third array according to the position index of the file handle corresponding to each training data set in the second array;
and constructing the target index file according to the first array, the second array and the third array.
14. An apparatus for data access, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-9 according to instructions in the program code.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program code for performing the method of any of claims 1-9.
CN202110089281.2A 2021-01-22 2021-01-22 Data access method, device, equipment and storage medium Pending CN113065663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110089281.2A CN113065663A (en) 2021-01-22 2021-01-22 Data access method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110089281.2A CN113065663A (en) 2021-01-22 2021-01-22 Data access method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113065663A true CN113065663A (en) 2021-07-02

Family

ID=76558685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110089281.2A Pending CN113065663A (en) 2021-01-22 2021-01-22 Data access method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113065663A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023230767A1 (en) * 2022-05-30 2023-12-07 华为技术有限公司 Model training system, model training method, training device and training node
CN116363457A (en) * 2023-03-17 2023-06-30 阿里云计算有限公司 Task processing, image classification and data processing method of task processing model
CN116363457B (en) * 2023-03-17 2024-04-30 阿里云计算有限公司 Task processing, image classification and data processing method of task processing model

Similar Documents

Publication Publication Date Title
CN109101620B (en) Similarity calculation method, clustering method, device, storage medium and electronic equipment
CN107391549B (en) Artificial intelligence based news recall method, device, equipment and storage medium
CN109857744B (en) Sparse tensor calculation method, device, equipment and storage medium
CN112037792B (en) Voice recognition method and device, electronic equipment and storage medium
CN113065663A (en) Data access method, device, equipment and storage medium
CN110209809B (en) Text clustering method and device, storage medium and electronic device
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN111563192A (en) Entity alignment method and device, electronic equipment and storage medium
WO2020207410A1 (en) Data compression method, electronic device, and storage medium
CN112687266B (en) Speech recognition method, device, computer equipment and storage medium
CN113869420A (en) Text recommendation method based on comparative learning and related equipment
CN114529741A (en) Picture duplicate removal method and device and electronic equipment
CN111222399B (en) Method and device for identifying object identification information in image and storage medium
US11250080B2 (en) Method, apparatus, storage medium and electronic device for establishing question and answer system
CN116976428A (en) Model training method, device, equipment and storage medium
CN109697083B (en) Fixed-point acceleration method and device for data, electronic equipment and storage medium
CN114332550A (en) Model training method, system, storage medium and terminal equipment
CN111767419B (en) Picture searching method, device, equipment and computer readable storage medium
CN113407702B (en) Employee cooperation relationship intensity quantization method, system, computer and storage medium
CN114648679A (en) Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium
CN114648650A (en) Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium
CN113761152A (en) Question-answer model training method, device, equipment and storage medium
CN113705589A (en) Data processing method, device and equipment
CN115113855A (en) Audio data processing method and device, electronic equipment and storage medium
CN111797984A (en) Quantification and hardware acceleration method and device for multitask neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40048376

Country of ref document: HK