CN116432778B - Data processing method and device, storage medium and electronic equipment - Google Patents

Data processing method and device, storage medium and electronic equipment

Info

Publication number
CN116432778B
CN116432778B (application number CN202310695188.5A)
Authority
CN
China
Prior art keywords
tensor
cost
partial
time
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310695188.5A
Other languages
Chinese (zh)
Other versions
CN116432778A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202310695188.5A
Publication of CN116432778A
Application granted
Publication of CN116432778B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The specification discloses a data processing method and apparatus, a storage medium, and an electronic device, in which the tensors of a model stored in the video memory are determined during model training. A first cost of recomputing each tensor is then determined; the time of the next access to each tensor is estimated, the transmission time for transferring each tensor to another device is determined, and a second cost is derived from the access time and the transmission time. The final cost of evicting each tensor is determined from its first cost (recomputation) and second cost (transfer). Finally, target tensors are selected from the tensors currently stored in the video memory according to their final costs and evicted. The method keeps the occupancy of the video memory within a reasonable range throughout, so that the GPU can complete the training of the model. Because each tensor is evicted by whichever method has the smaller time cost, the time overhead of eviction is reduced and the efficiency of model training is improved.

Description

Data processing method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, a storage medium, and an electronic device.
Background
With the development of technology, artificial intelligence attracts increasing attention. At present, models can be built for the actual conditions of different scenarios to solve the problems arising in them; because a model has a wide range of application and can quickly produce accurate prediction results, models are used in a great variety of scenarios.
Generally, training a large model requires a graphics processing unit (GPU), and the tensors of the model (the model's own parameters and the model's output data) are stored in the GPU's video memory. However, a complex model has a large number of parameters and generates a large amount of data during training, so the video memory cannot store all of the model's data. How to handle the tensors in the video memory so as to use the video memory rationally is therefore a difficult problem.
Based on this, the present specification provides a method of data processing.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, storage medium, and electronic device, so as to at least partially solve the foregoing problems of the prior art.
The technical scheme adopted in the specification is as follows:
the present specification provides a method of data processing, the method comprising:
determining at least part of tensors of the model stored in a video memory in the training process of the model;
determining a first cost for recalculating the at least partial tensor; and determining a second cost of transferring the at least partial tensor;
and selecting a target tensor from the at least partial tensor currently stored in the video memory according to the first cost and the second cost of the at least partial tensor, and evicting the target tensor.
Optionally, determining the second cost of transferring the at least partial tensor specifically includes:
estimating the access time of the next access to the at least partial tensor, and determining the transmission time required for transferring the at least partial tensor;
and determining a second cost for transferring the at least partial tensor according to the access time and the transmission time.
Optionally, the method further comprises:
determining reference access time of at least part of tensors in the process of completing one round of iterative training by the model;
estimating the access time interval of the at least partial tensor according to the reference access time of the at least partial tensor;
estimating the access time of the next access to the at least partial tensor specifically includes:
and estimating the access time of the next access to the at least partial tensor according to the access time interval of the at least partial tensor.
Optionally, determining the transmission time required for transferring the at least partial tensor specifically includes:
determining a first time to transmit the at least partial tensor to a target device over a bus and determining a second time for the target device to return the at least partial tensor over the bus;
determining a queuing time when the at least partial tensor is transmitted over the bus;
and determining the transmission time required for transferring the at least partial tensor according to the first time, the second time and the queuing waiting time.
Optionally, determining the second cost of transferring the at least partial tensor specifically includes:
determining the time interval between the access time and the current time;
determining a second cost of transferring the at least partial tensor according to a difference between the transmission time and the time interval; wherein the difference is proportional to the second cost.
Optionally, selecting a target tensor from the at least partial tensors currently stored in the video memory and evicting the target tensor specifically includes:
determining a final cost of evicting the at least partial tensor;
and selecting a target tensor from the at least partial tensors currently stored in the video memory according to the final cost and evicting the target tensor.
Optionally, determining the final cost of evicting the at least partial tensor specifically includes:
and taking the smallest of the first cost and the second cost of the at least partial tensor as the final cost of evicting the at least partial tensor.
Optionally, before evicting the target tensor, the method further comprises:
and determining that the occupancy rate of the video memory is larger than a preset threshold value.
Optionally, selecting a target tensor from the at least partial tensors currently stored in the video memory specifically includes:
and sequentially selecting tensors as target tensors in ascending order of the final cost of the at least partial tensors currently stored in the video memory.
The present specification provides an apparatus for data processing, comprising:
the tensor determining module is used for determining at least part of tensors of the model stored in the video memory in the training process of the model;
a cost calculation module for determining a first cost for recalculating the at least partial tensor; and determining a second cost of transferring the at least partial tensor;
and the tensor eviction module is used for selecting a target tensor from the at least partial tensor currently stored in the video memory according to the first cost and the second cost of the at least partial tensor and evicting the target tensor.
Optionally, the cost calculation module is specifically configured to estimate an access time of accessing the at least partial tensor next time, determine a transmission time required for transferring the at least partial tensor, and determine a second cost for transferring the at least partial tensor according to the access time and the transmission time.
Optionally, the tensor determining module is further configured to determine reference access time of each tensor in a process that the model completes one round of iterative training; estimating the access time interval of the at least partial tensor according to the reference access time of the at least partial tensor;
the cost calculation module is specifically configured to estimate, according to the access time interval of the at least partial tensor, an access time when the at least partial tensor is accessed next time.
Optionally, the cost calculation module is specifically configured to determine a first time when the at least part of the tensor is transmitted to the target device through the bus, and determine a second time when the target device returns each tensor through the bus; determining a queuing time when the at least partial tensor is transmitted over the bus; and determining the transmission time required for transferring the at least partial tensor according to the first time, the second time and the queuing waiting time.
Optionally, the cost calculation module is specifically configured to determine a time interval between the access time and the current time; determining a second cost of transferring the at least partial tensor according to a difference between the transmission time and the time interval; wherein the difference is proportional to the second cost.
Optionally, the tensor eviction module is specifically configured to determine a final cost of evicting the at least partial tensor, and to select a target tensor from the at least partial tensors currently stored in the video memory according to the final cost and evict the target tensor.
Optionally, the tensor eviction module is specifically configured to use a minimum cost of the first cost and the second cost of the at least partial tensor as a final cost of evicting the at least partial tensor.
Optionally, the apparatus further comprises an occupancy determination module;
the occupancy rate determining module is specifically configured to determine that the occupancy rate of the video memory is greater than a preset threshold.
Optionally, the tensor eviction module is specifically configured to sequentially select tensors as target tensors in ascending order of the final cost of the at least partial tensors currently stored in the video memory.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the method of data processing described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method of data processing as described above when executing the program.
At least one of the technical solutions adopted in the present specification can achieve the following beneficial effects:
in the data processing method provided in the present specification, at least some tensors of a model stored in the video memory are determined during model training. The first cost of recomputing each tensor is then calculated; the time of the next access to each tensor is estimated, the transmission time for transferring each tensor to another device is determined, and the second cost is determined from the access time and the transmission time. A final cost of evicting each tensor is determined from the first cost and the second cost, and finally each tensor is evicted according to its final cost.
As can be seen from the above method, the final cost is obtained by calculating, for each tensor in the video memory, the first cost of recomputing it and the second cost of transferring it; the tensors currently stored in the video memory can then be evicted according to the final cost. This keeps the occupancy of the video memory within a reasonable range, so that the GPU can continue training the model. Moreover, when tensors in the video memory are evicted, the recomputation and transfer approaches are combined: the time cost of each approach is calculated for every tensor, and each tensor is evicted by the approach with the smaller time cost, which reduces the time overhead of eviction and improves model training efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and constitute a part of the specification, illustrate exemplary embodiments of the present specification and, together with the description, serve to explain the specification; they are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a flow chart of a method of data processing in the present specification;
FIG. 2 is a schematic diagram of an apparatus for data processing provided in the present specification;
FIG. 3 is a schematic diagram of the electronic device corresponding to FIG. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
FIG. 1 is a flow chart of the method of data processing provided in the present specification, which specifically includes the following steps:
S100: during the training of the model, the tensors of the model stored in the video memory are determined.
Generally, a GPU has a greater advantage than a central processing unit (CPU) in parallel computing and compute-intensive tasks, because it contains more arithmetic logic units. The GPU may therefore be used to perform model training so as to complete the training more efficiently. However, during model training the GPU's video memory cannot hold all of the model's parameters and the model's outputs; based on this, the present specification provides a data processing method that can free video memory and improve model training efficiency.
In one or more embodiments of the present description, the GPU may first obtain at least some tensors of the model stored in the video memory, in order to calculate, in a subsequent step, the cost of evicting each tensor among them.
In the training of a machine learning model, a tensor is in effect a multidimensional array used to represent high-dimensional matrices and vectors. In TensorFlow, which is commonly used for training models, a tensor is represented as an n-dimensional array of a basic data type; that is, in TensorFlow all data are n-dimensional arrays, called tensors. In one or more embodiments of the present description, the input data or output data of each node of a network layer in the model may be regarded as one tensor.
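As a minimal illustration (using NumPy rather than any particular training framework; the variable name and shape are invented for the example), a tensor in this sense is simply an n-dimensional array:

```python
import numpy as np

# The output of one network-layer node for a batch of 32 RGB images,
# viewed as a single 4-dimensional tensor (batch, height, width, channels).
layer_output = np.zeros((32, 224, 224, 3), dtype=np.float32)
print(layer_output.ndim, layer_output.shape)  # 4 (32, 224, 224, 3)
```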
S102: determining a first cost for recalculating the at least partial tensor; and determining a second cost of transferring the at least partial tensor.
In one or more embodiments of the present description, the GPU may first determine a first cost to recalculate each tensor. In particular, the time used to recalculate each tensor at the current time may be taken as the first cost.
Recomputation means releasing a tensor from the video memory and recomputing it when it is used again. Transferring means moving a tensor to another device (such as a CPU) and reading it back from that device when it is used again.
When recomputing a tensor, the recomputation may be based on the method of setting gradient checkpoints, or other methods may be used; the present specification is not limited in this respect.
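A minimal sketch of how the first cost might be estimated, assuming per-op times were recorded during profiling. The names `op_time`, `inputs`, and `resident` are hypothetical structures, and the recursive term for evicted inputs is an assumption not spelled out in the text (it also assumes the dependency graph is acyclic):

```python
def first_cost(tensor_id, op_time, inputs, resident):
    """Estimated seconds to recompute `tensor_id` at the current moment.

    op_time:  tensor id -> measured seconds of the op that produced it
    inputs:   tensor id -> ids of the tensors that op reads
    resident: set of tensor ids currently present in video memory
    """
    cost = op_time[tensor_id]
    for dep in inputs.get(tensor_id, ()):
        if dep not in resident:  # an evicted input must itself be recomputed first
            cost += first_cost(dep, op_time, inputs, resident)
    return cost
```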
Then, when the video memory is released, the GPU may also transfer each tensor in the video memory to the target device for storage, so the GPU may determine a second cost of transferring each tensor.
In one or more embodiments of the present disclosure, the GPU may first estimate the access time of the next access to each tensor and determine the transmission time required to transfer it, in order to determine the second cost of transferring the tensor based on the access time and the transmission time.
Since the model is trained iteratively on different batches of samples, the GPU's accesses to each tensor of the model are regular in time. Before estimating the time of the next access to each tensor, the GPU may determine a reference access time for each tensor of the model during one round of iterative training, and thereby estimate the access time interval of each tensor. The reference access time is the time of the first access to each tensor during one round of training. For example: the model starts training at 7:30:00, tensor A is obtained at 7:30:05, tensor B at 7:30:12, and tensor C at 7:30:13, so the reference access times of tensors A, B, and C are 7:30:05, 7:30:12, and 7:30:13, respectively; if tensor A is accessed again at 7:30:19, the access time interval of tensor A can be determined to be 14 seconds.
The GPU may estimate the time of the next access to each tensor from its access time interval. Continuing the above example, when the first access time of tensor A is 7:30:05 and the access time interval is 14 seconds, the subsequent access times of tensor A are 7:30:19, 7:30:33, 7:30:47, and so on. Assuming the current time is 7:30:39, the next access to tensor A occurs at 7:30:47.
It should be noted that a tensor may be accessed many times within one training iteration, and the intervals between accesses are not necessarily equal. Each access interval of the tensor in the different phases of one iteration may therefore be determined, and the access times of the tensor in subsequent iterations estimated from those per-phase intervals. In the ideal case, the access intervals of a tensor are all equal; the access period of the tensor can then be determined from its reference access time, and its access times estimated from that period.
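A minimal sketch of the ideal periodic case described above; the function and its arguments are hypothetical, with times given in seconds since the start of training:

```python
def next_access_time(reference_time, interval, now):
    """Roll the reference access forward by whole intervals until it passes `now`."""
    if now <= reference_time:
        return reference_time
    periods_elapsed = int((now - reference_time) // interval) + 1
    return reference_time + periods_elapsed * interval

# The example from the text: tensor A is first accessed 5 s after 7:30:00,
# with a 14 s interval; at current time 7:30:39 the next access is 7:30:47.
print(next_access_time(reference_time=5, interval=14, now=39))  # 47
```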
When the GPU determines the transmission time required to transfer each tensor, it may determine a first time for transmitting the tensor to the target device over the bus and a second time for the target device to return the tensor to the GPU. Moreover, because computation on the GPU is parallel, multiple tensors may be queued waiting to be transferred to the target device, so the queuing time of each tensor is also counted when the tensor is transferred. The sum of the first time, the second time, and the queuing time may then be taken as the transmission time required to transfer the tensor.
The target device may be a CPU; that is, the GPU may transfer each tensor in the video memory into the cache of the CPU, with data transmitted between the GPU and the CPU over a PCIe bus.
In one or more embodiments of the present description, the GPU may determine a first time for transmitting each tensor to the CPU over the PCIe bus and a second time for the CPU to return each tensor over the PCIe bus. The queuing time of each tensor is the time it spends queued for transmission on the PCIe bus.
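A simple sketch of this accounting, assuming sizes in bytes and a nominal bus bandwidth (the 16 GB/s figure is an assumption for illustration, not a value from the text):

```python
def transmission_time(tensor_bytes, queued_bytes, bus_bandwidth=16e9):
    """Seconds to evict a tensor by transfer: copy out, copy back, plus bus queuing."""
    first_time = tensor_bytes / bus_bandwidth    # GPU -> CPU over PCIe
    second_time = tensor_bytes / bus_bandwidth   # CPU -> GPU over PCIe
    queuing_time = queued_bytes / bus_bandwidth  # data already waiting on the bus
    return first_time + second_time + queuing_time
```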
Finally, the GPU may determine a second cost for transferring each tensor according to the transfer time required for transferring each tensor and the estimated access time for accessing each tensor next time.
Specifically, the GPU may determine the time interval between the access time and the current time, and determine the second cost from the difference between the transmission time and that interval: the larger the difference, the larger the second cost of transferring the tensor. When the transmission time is not greater than the interval, evicting the tensor by transfer at the current time adds no delay to re-acquiring it at its next access time. Since transfer adds no extra time overhead in this case, a preset minimum cost may be used as the second cost; that is, when the transmission time is not greater than the interval, the second cost equals the preset minimum cost. The preset minimum cost may be zero or negative; the specification does not limit it, as long as it is smaller than the smallest possible second cost (the smallest second cost being greater than zero).
For example: if the transmission time is 5 s and the interval between the current time and the access time is 7 s, the second cost may be the preset minimum cost. That is, if tensor X is evicted at the current time, transferring it to another device and acquiring it again takes 5 s, which is less than the 7 s interval until the next access to tensor X; evicting tensor X by transfer therefore has no impact on the timing of its next use.
Conversely, when the transmission time is greater than the interval, evicting the tensor by transfer delays its next use. For example: if the transmission time is 3 s and the interval between the current time and the access time is 2 s, the tensor would originally be used 2 s from now but is only available 3 s from now, adding a time overhead of 1 s.
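A minimal sketch of this rule; `PRESET_MIN_COST` is the preset minimum cost, set to zero here purely as an assumption (the text allows zero or a negative value):

```python
PRESET_MIN_COST = 0.0

def second_cost(transmission_time, access_time, now):
    interval = access_time - now  # time until the tensor is needed again
    if transmission_time <= interval:
        # The transfer is fully hidden behind the idle interval: no extra delay.
        return PRESET_MIN_COST
    return transmission_time - interval  # delay added to the tensor's next access

print(second_cost(5, access_time=7, now=0))  # 0.0, the 5 s / 7 s example
print(second_cost(3, access_time=2, now=0))  # 1,   the 3 s / 2 s example
```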
In one or more embodiments of the present disclosure, the time cost of recomputing each tensor and the time cost of transferring it are used as the costs. However, besides time, the computing resources consumed by recomputation and by transfer may also be treated as costs, and the time cost and the resource consumption may be combined by a weighted calculation.
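A sketch of such a weighted combination; the weight values are arbitrary assumptions, since the text only says the two terms may be weighted:

```python
def combined_cost(time_cost, resource_cost, w_time=0.7, w_resource=0.3):
    # Blend the time overhead with the computing-resource consumption.
    return w_time * time_cost + w_resource * resource_cost
```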
S104: and selecting a target tensor from the at least partial tensor currently stored in the video memory according to the first cost and the second cost of the at least partial tensor, and expelling the target tensor.
In this specification, to improve the training efficiency of the model, after the first cost and the second cost of each tensor have been calculated, the server may determine, based on those costs, which tensors to evict from the video memory and by which strategy.
In one or more embodiments of the present description, before selecting a target tensor from the video memory for eviction, the GPU may first determine whether the occupancy of the video memory is greater than a preset threshold. If it is, the GPU may select target tensors from the tensors currently stored in the video memory according to the first cost and the second cost of each tensor and evict them.
In one or more embodiments of the present description, to improve the efficiency of model training, i.e., to reduce the time overhead between evicting a tensor and re-acquiring it, the GPU may determine a final cost for each tensor from its first cost and second cost, so that the server can select target tensors from those currently stored in the video memory and evict them based on the final cost. For example: the GPU may select, in ascending order of final cost, a specified number of tensors with the smallest final costs as target tensors and evict them, so that the occupancy of the video memory falls below the preset threshold. Alternatively, the GPU may repeatedly select the tensor with the smallest final cost as the target tensor and evict it until the occupancy of the video memory is no greater than the preset threshold; a minimal sketch of this loop follows.
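In this sketch, `memory` is a hypothetical wrapper whose `occupancy`, `resident`, and `evict` methods stand in for the GPU's actual memory-management interface:

```python
def evict_until_below(memory, threshold):
    # Repeatedly evict the resident tensor with the smallest final cost
    # until video-memory occupancy is no greater than the threshold.
    while memory.occupancy() > threshold and memory.resident():
        victim = min(memory.resident(), key=lambda t: t.final_cost)
        memory.evict(victim)
```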
In addition, in one or more embodiments of the present disclosure, when target tensors are selected from the at least partial tensors currently stored in the video memory according to the first cost and the second cost, the tensors may be sorted separately by first cost and by second cost, and target tensors then selected and evicted based on the sorted order. Specifically, working through the tensors sorted by second cost in ascending order, the GPU may determine for each tensor whether it has already been evicted. If so, it proceeds to the next tensor; if not, it examines the tensor's first cost: if the first cost is smaller than the second cost, the tensor is evicted by recomputation, and if the first cost is not smaller than the second cost, the tensor is evicted by transfer, as in the sketch below.
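The per-tensor decision reduces to a comparison of the two costs; a one-line sketch, with `first_cost` and `second_cost` as attributes of a hypothetical tensor record:

```python
def eviction_mode(tensor):
    # Recompute when strictly cheaper; otherwise evict by transfer.
    return "recompute" if tensor.first_cost < tensor.second_cost else "transfer"
```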
Because the time cost of transfer can be hidden in the ideal case, that is, when the interval between the current time and the tensor's next access covers the transmission time of transferring the tensor at the current time, transferring can be considered to require no extra time overhead. In one or more embodiments of the present disclosure, releasing video memory by transfer may therefore be preferred, i.e., releasing according to the tensors sorted by second cost; if bus bandwidth resources are tight, recomputation is used instead, i.e., releasing according to the tensors sorted by first cost.
With the data processing method provided above on the basis of FIG. 1, the occupancy of the video memory can be kept within a reasonable range, and during model training the video memory can be used rationally so that the GPU can complete the training. When tensors in the video memory are evicted, eviction is based neither on recomputation alone nor on transfer alone: the two approaches are combined, the time cost of recomputing each tensor and the time cost of transferring it are calculated, and each tensor is evicted by the approach with the smaller time cost. This reduces the time overhead from evicting a tensor in the video memory to re-acquiring it, and improves model training efficiency.
Further, in step S104, to further reduce the time overhead, the GPU may use a min-heap to manage the first cost and the second cost (i.e., the final cost) of each tensor; it may also use a queue organized in ascending order of the final cost of each tensor, evicting the tensors with the smallest final costs at the head of the queue, or construct a tree structure to manage the final costs. When the specified number of target tensors with the smallest final costs are to be selected from the tensors for eviction, they can then be found quickly, which improves lookup efficiency.
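A sketch of the min-heap variant using Python's standard `heapq` module; the entry layout is an assumption:

```python
import heapq

cost_heap = []  # entries are (final_cost, tensor_id); heapq keeps the minimum on top

def track(tensor_id, final_cost):
    heapq.heappush(cost_heap, (final_cost, tensor_id))  # O(log n)

def pop_cheapest():
    final_cost, tensor_id = heapq.heappop(cost_heap)    # O(log n)
    return tensor_id
```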
Further, in step S102, when the second cost is determined from the transmission time and the time interval between the access time and the current time, the transfer cost of a tensor may be taken as the negative of its time benefit, i.e., the second cost may be expressed by the formula: second cost = -(time interval - transmission time). The transmission time is the sum of the time to transfer the tensor from the current device (the GPU) to the target device (e.g., the CPU) and the time to transfer it back again, where each single transfer time is obtained by dividing the data size of the tensor by the bandwidth of the bus (PCIe) between the two devices (CPU-GPU). The GPU may also record the bus queuing time, obtained by dividing the data size of the tensors waiting to be transferred by the bus bandwidth, and count it together with the transmission time.
The second cost is negative when the time interval from the current time to the tensor's next memory access is greater than the transmission time for transferring the tensor. In other words, if the interval before the next access to the tensor can mask its transmission time, the cost is negative, and the larger the difference, the smaller the cost. In this way, tensors whose transfer introduces no delay can be evicted preferentially, avoiding the extra time consumed by recomputation.
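The signed formulation as a sketch; unlike the clamped variant shown earlier, the cost goes negative when the interval masks the transfer:

```python
def signed_second_cost(transmission_time, interval):
    # Negative (a benefit) when the idle interval hides the round trip;
    # the larger the slack, the smaller (more attractive) the cost.
    return -(interval - transmission_time)
```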
In the above method, recomputing a tensor and transferring a tensor are considered together: the recomputation cost and the transfer cost of each tensor are pooled and sorted from smallest to largest, which determines whether the tensor is recomputed or transferred, and tensors are evicted by the selected method in turn until the occupancy of the video memory is no greater than the preset threshold, reducing the time overhead and improving model training efficiency.
Based on the above data processing method, the embodiment of the present disclosure further provides a schematic diagram of an apparatus for data processing, as shown in FIG. 2.
Fig. 2 is a schematic diagram of an apparatus for data processing according to an embodiment of the present disclosure, where the apparatus includes:
a tensor determination module 200, configured to determine at least a part of tensors of the model stored in the video memory during training of the model;
a cost calculation module 202 for determining a first cost for recalculating the at least partial tensor; and determining a second cost of transferring the at least partial tensor;
and the tensor eviction module 204 is configured to select a target tensor from the at least partial tensor currently stored in the video memory according to the first cost and the second cost of the at least partial tensor, and evict the target tensor.
Optionally, the cost calculation module 202 is specifically configured to estimate an access time of the next access to the at least partial tensor, determine a transmission time required for transferring the at least partial tensor, and determine a second cost for transferring the at least partial tensor according to the access time and the transmission time.
Optionally, the tensor determining module 200 is further configured to determine a reference access time of each tensor in a process of completing one round of iterative training by the model; estimating the access time interval of the at least partial tensor according to the reference access time of the at least partial tensor;
the cost calculation module 202 is specifically configured to estimate, according to the access time interval of the at least partial tensor, an access time of the next access to the at least partial tensor.
Optionally, the cost calculation module 202 is specifically configured to determine a first time for transmitting the at least partial tensor to a target device through a bus, and determine a second time for the target device to return each tensor through the bus; determining a queuing time when the at least partial tensor is transmitted over the bus; and determining the transmission time required for transferring the at least partial tensor according to the first time, the second time and the queuing waiting time.
Optionally, the cost calculation module 202 is specifically configured to determine a time interval between the access time and the current time; determining a second cost of transferring the at least partial tensor according to a difference between the transmission time and the time interval; wherein the difference is proportional to the second cost.
Optionally, the tensor eviction module 204 is specifically configured to determine a final cost of evicting the at least partial tensor, and to select a target tensor from the at least partial tensors currently stored in the video memory according to the final cost and evict the target tensor.
Optionally, the tensor eviction module 204 is specifically configured to use a minimum cost of the first cost and the second cost of the at least partial tensor as a final cost of evicting the at least partial tensor.
Optionally, the apparatus further comprises an occupancy determination module 206;
the occupancy rate determining module 206 is specifically configured to determine that the occupancy rate of the video memory is greater than a preset threshold.
Optionally, the tensor eviction module 204 is specifically configured to sequentially select tensors as target tensors in ascending order of the final cost of the at least partial tensors currently stored in the video memory.
The embodiments of the present specification also provide a computer-readable storage medium storing a computer program usable for executing the method of data processing described above.
Based on the data processing method described above, the embodiment of the present disclosure further proposes a schematic block diagram of the electronic device shown in FIG. 3. At the hardware level, as shown in FIG. 3, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and of course may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it to implement the data processing method described above.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded by the present description; that is, the execution subject of the processing flows described is not limited to logic units, and may also be hardware or a logic device.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can readily be obtained merely by lightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component; or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present application.

Claims (9)

1. A method of data processing, the method comprising:
determining at least part of tensors of the model stored in a video memory in the training process of the model;
determining a first cost for recalculating the at least partial tensor; determining reference access time of at least part of tensors in the process of completing one round of iterative training by the model;
estimating the access time interval of the at least partial tensor according to the reference access time of the at least partial tensor;
estimating the access time of the next access to the at least partial tensor according to the access time interval of the at least partial tensor;
determining a transmission time required to transfer the at least partial tensor;
determining a second cost of transferring the at least partial tensor according to the access time and the transmission time;
taking the smallest of the first cost and the second cost of the at least partial tensor as the final cost of evicting the at least partial tensor; and selecting a target tensor from the at least partial tensors currently stored in the video memory according to the final cost and evicting the target tensor.
2. The method according to claim 1, wherein determining the transmission time required for transferring the at least partial tensor, in particular, comprises:
determining a first time to transmit the at least partial tensor to a target device over a bus and determining a second time for the target device to return the at least partial tensor over the bus;
determining a queuing time when the at least partial tensor is transmitted over the bus;
and determining the transmission time required for transferring the at least partial tensor according to the first time, the second time and the queuing waiting time.
3. The method of claim 1, wherein determining the second cost of transferring the at least partial tensor, in particular, comprises:
determining the time interval between the access time and the current time;
determining a second cost of transferring the at least partial tensor according to a difference between the transmission time and the time interval; wherein the difference is proportional to the second cost.
4. The method of claim 1, wherein selecting and evicting a target tensor from the at least partial tensors currently stored by the video memory, comprises:
for each tensor, if the first cost corresponding to the tensor is smaller than the second cost, the tensor is evicted by adopting a recalculation mode, and if the first cost corresponding to the tensor is not smaller than the second cost, the tensor is evicted by adopting a transfer mode.
5. The method of claim 1, wherein prior to evicting the target tensor, the method further comprises:
and determining that the occupancy rate of the video memory is larger than a preset threshold value.
6. The method of claim 1, wherein selecting a target tensor from the at least partial tensors currently stored by the video memory, specifically comprises:
and sequentially selecting tensors as target tensors in ascending order of the final cost of the at least partial tensors currently stored in the video memory.
7. An apparatus for data processing, the apparatus comprising:
the tensor determining module is used for determining at least part of tensors of the model stored in the video memory in the training process of the model;
a cost calculation module for determining a first cost for recalculating the at least partial tensor; determining reference access time of at least part of tensors in the process of completing one round of iterative training by the model; estimating the access time interval of the at least partial tensor according to the reference access time of the at least partial tensor; estimating the access time of the next access to the at least partial tensor according to the access time interval of the at least partial tensor; determining a transmission time required to transfer the at least partial tensor; determining a second cost of transferring the at least partial tensor according to the access time and the transmission time;
a tensor eviction module, configured to take the smallest of the first cost and the second cost of the at least partial tensor as the final cost of evicting the at least partial tensor, and to select a target tensor from the at least partial tensors currently stored in the video memory according to the final cost and evict the target tensor.
8. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding claims 1-6 when executing the program.
CN202310695188.5A 2023-06-12 2023-06-12 Data processing method and device, storage medium and electronic equipment Active CN116432778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310695188.5A CN116432778B (en) 2023-06-12 2023-06-12 Data processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310695188.5A CN116432778B (en) 2023-06-12 2023-06-12 Data processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116432778A (en) 2023-07-14
CN116432778B (en) 2023-09-19

Family

ID=87089413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310695188.5A Active CN116432778B (en) 2023-06-12 2023-06-12 Data processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116432778B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130693B (en) * 2023-10-26 2024-02-13 之江实验室 Tensor unloading method, tensor unloading device, computer equipment and storage medium
CN117522669B (en) * 2024-01-08 2024-03-26 之江实验室 Method, device, medium and equipment for optimizing internal memory of graphic processor

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method
CN112882830A (en) * 2021-02-03 2021-06-01 北京迈格威科技有限公司 Video memory management method, video memory management device, model training device, electronic equipment and storage medium
CN115168041A (en) * 2022-07-18 2022-10-11 北京一流科技有限公司 Memory release decision system and method supporting reverse dynamic recalculation
CN116107754A (en) * 2023-02-24 2023-05-12 华中科技大学 Memory management method and system for deep neural network
CN116167461A (en) * 2023-04-21 2023-05-26 之江实验室 Model training method and device, storage medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11468365B2 (en) * 2019-09-30 2022-10-11 Amazon Technologies, Inc. GPU code injection to summarize machine learning training data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method
CN112882830A (en) * 2021-02-03 2021-06-01 北京迈格威科技有限公司 Video memory management method, video memory management device, model training device, electronic equipment and storage medium
CN115168041A (en) * 2022-07-18 2022-10-11 北京一流科技有限公司 Memory release decision system and method supporting reverse dynamic recalculation
CN116107754A (en) * 2023-02-24 2023-05-12 华中科技大学 Memory management method and system for deep neural network
CN116167461A (en) * 2023-04-21 2023-05-26 之江实验室 Model training method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN116432778A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
US11907830B2 (en) Neural network architecture using control logic determining convolution operation sequence
CN116432778B (en) Data processing method and device, storage medium and electronic equipment
KR102572757B1 (en) Modifying machine learning models to improve locality
WO2020089759A1 (en) Artificial intelligence-enabled management of storage media access
GB2560600A (en) Nueral Network Hardware
GB2568102A (en) Exploiting sparsity in a neural network
CN115981870B (en) Data processing method and device, storage medium and electronic equipment
CN116822657B (en) Method and device for accelerating model training, storage medium and electronic equipment
CN117312394B (en) Data access method and device, storage medium and electronic equipment
CN116382599B (en) Distributed cluster-oriented task execution method, device, medium and equipment
CN116150563B (en) Service execution method and device, storage medium and electronic equipment
CN116091895B (en) Model training method and device oriented to multitask knowledge fusion
CN116384505A (en) Data processing method and device, storage medium and electronic equipment
CN116415103B (en) Data processing method, device, storage medium and electronic equipment
CN117522669B (en) Method, device, medium and equipment for optimizing internal memory of graphic processor
CN116126750B (en) Data processing method and device based on hardware characteristics
CN111324793A (en) Method and device for controlling operation of storing data of region of interest
CN116340004A (en) Task execution method and device, storage medium and electronic equipment
CN116909744A (en) Thread pool parameter adjusting method and device, storage medium and electronic equipment
CN117891600A (en) Task scheduling method and device, storage medium and electronic equipment
CN117909371A (en) Model training method and device, storage medium and electronic equipment
CN117424827A (en) Communication method and device based on distributed deep learning cache system
CN117499492A (en) Data processing method, device and equipment
CN116204584A (en) Method and device for writing log, readable storage medium and electronic equipment
CN117035123A (en) Node communication method, storage medium and device in parallel training

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant