CN114912570A - Method, device and equipment for accelerating neural network model optimization and readable medium - Google Patents

Method, device and equipment for accelerating neural network model optimization and readable medium Download PDF

Info

Publication number
CN114912570A
Authority
CN
China
Prior art keywords
model
data
neural network
gpu
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111343930.3A
Other languages
Chinese (zh)
Inventor
张树鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111343930.3A priority Critical patent/CN114912570A/en
Publication of CN114912570A publication Critical patent/CN114912570A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method, an apparatus, computer equipment and a medium for accelerating neural network model optimization, wherein the method comprises the following steps: performing data division, data communication, data updating and data scheduling through a GPU to realize parallel processing of data in a neural network algorithm; realizing neural network model parallelization through the GPU by using redundancy of neural network parameters, horizontal layer division or vertical interlayer division; and improving the parallelization efficiency of the neural network through a GPU integration model aggregation strategy. The method analyzes the problem from the perspectives of model parallelism and data parallelism, addresses it by optimizing the framework and algorithm, improving GPU utilization and parallelizing GPU model aggregation, selects a suitable optimization framework and algorithm, performs model aggregation with a GPU model aggregation method, and improves neural network parallelization efficiency.

Description

Method, device and equipment for accelerating neural network model optimization and readable medium
Technical Field
The invention relates to the field of computer technology, and in particular to a method, apparatus, device and readable medium for accelerating recurrent neural network model optimization based on GPU parallelism.
Background
Deep learning and big data technology are now closely tied to all kinds of industries, but because of the enormous volume of data being generated, traditional data analysis methods cannot meet the requirements of processing data that is huge in volume, multi-sourced, heterogeneous in structure and rapidly changing, whereas deep neural networks have strong capabilities for feature extraction and summarization, resource integration, heterogeneous data processing and dynamic capture. At the same time, the huge amount of data provides powerful resource support for training deep neural networks.
At present, recurrent neural networks for deep learning are widely applied in many fields and perform well on a variety of problems, but in deep learning training on massive data the network models being designed grow ever larger and deeper, which greatly lengthens training time, raises cost and lowers timeliness. How to use GPU-parallel batch training to address these problems requires further research.
Disclosure of Invention
In view of this, an objective of the embodiments of the present invention is to provide a method for accelerating neural network model optimization. The method analyzes the problem from the perspectives of model parallelism and data parallelism, addresses it by optimizing the framework and algorithm, improving GPU utilization and parallelizing GPU model aggregation, selects a suitable optimization framework and algorithm, performs model aggregation with a GPU model aggregation method, and improves neural network parallelization efficiency.
In view of the foregoing, an aspect of the embodiments of the present invention provides a method for accelerating optimization of a neural network model. The method comprises the steps of carrying out data division, data communication, data updating and data scheduling through a GPU to realize parallel processing of data in a neural network algorithm; realizing neural network model parallelization by using redundancy, horizontal layer division or vertical interlayer division of neural network parameters through a GPU; and improving the parallelization efficiency of the neural network through a GPU integration model aggregation strategy.
In some embodiments, the parallel processing of data comprises: a plurality of small execution units in the neural network algorithm form a serial module, and data are processed in different computing units.
In some embodiments, the data partitioning is a segmentation operation performed on the data set used for training the model; the data communication means that, while the data blocks produced by the split execute in parallel, the data in each thread is computed on different processing units, and the communication capability of the hardware is used so that data on different processors can access one another; the data updating applies a reasonable shared-memory access mechanism to the temporarily stored data to solve the problem of data synchronization across multiple threads; and the data scheduling uses a just-in-time thread scheduling strategy to handle the situation where different data blocks need to be processed by different algorithms.
In some embodiments, realizing neural network model parallelization through the GPU by using redundancy of the neural network parameters, horizontal layer division or vertical interlayer division comprises: the hierarchical structure of the neural network allows random division to be realized by using the redundancy of the neural network parameters, horizontal layer division or vertical interlayer division, finally achieving model parallelism.
In some embodiments, the GPU integration model aggregation strategy comprises: reducing the number of parameters of the integrated model by using a compression method; and performing model aggregation by using a model summation aggregation method to obtain an integrated model.
In some embodiments, reducing the parameter quantity of the integration model using the compression method comprises: training a local sub-model on a local node by using local data; the local nodes communicate with each other through the network to obtain the sub-model trained in the previous step, model integration is carried out on the server node, local data are predicted by using the integrated model, and prediction information is stored; and compressing the local submodel on the local node by using a model compression method and combining the predicted value of the integrated model in the previous step to obtain an integrated model with the same parameter scale as the local submodel, and outputting a result.
In some embodiments, performing model aggregation using the model summation aggregation method to obtain the integrated model comprises: each computing node training its own model using data parallelism; or, after the model parameters are updated, weighting the models at the server node by selecting suitable aggregation logic and aggregating them to obtain the integrated model.
In another aspect of the embodiments of the present invention, an apparatus for accelerating neural network model optimization is also provided. The apparatus includes a first module, a second module and a third module. The first module is configured to improve the learning efficiency of a neural network algorithm through GPU data parallelism; the second module is configured to improve the learning efficiency of the neural network algorithm through GPU model parallelism; and the third module is configured to improve GPU utilization and neural network parallelization efficiency through a GPU integration model aggregation strategy.
In some embodiments, the first module is further configured to form a serial module by a plurality of small execution units in the neural network algorithm, and process data in different calculation units.
In some embodiments, the data partitioning is a segmentation operation performed on the data set used for model training; the data communication means that, while the data blocks produced by the split execute in parallel, the data in each thread is computed on different processing units, and the communication capability of the hardware is used so that data on different processors can access one another; the data updating applies a reasonable shared-memory access mechanism to the temporarily stored data to solve the problem of data synchronization across multiple threads; and the data scheduling uses a just-in-time thread scheduling strategy to handle the situation where different data blocks need to be processed by different algorithms.
In some embodiments, the second module is further configured to realize neural network model parallelization through the GPU by using redundancy of the neural network parameters, horizontal layer division or vertical interlayer division: the hierarchical structure of the neural network allows random division to be realized by using the redundancy of the neural network parameters, horizontal layer division or vertical interlayer division, finally achieving model parallelism.
In some embodiments, the third module is further configured to reduce the number of parameters of the integrated model by using a compression method, and to perform model aggregation by using a model summation aggregation method to obtain an integrated model.
In some embodiments, the third module is further configured to train out a local sub-model using the local data on the local node; the local nodes communicate with each other through the network to obtain the sub-model trained in the previous step, model integration is carried out on the server node, local data are predicted by using the integrated model, and prediction information is stored; and compressing the local submodel on the local node by using a model compression method and combining the predicted value of the integrated model in the previous step to obtain an integrated model with the same parameter scale as the local submodel, and outputting a result.
In some embodiments, the third module is further configured to perform model aggregation using the model summation aggregation method to obtain the integrated model, including: each computing node training its own model using data parallelism; or, after the model parameters are updated, weighting the models at the server node by selecting suitable aggregation logic and aggregating them to obtain the integrated model.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing steps of the method comprising: performing data division, data communication, data updating and data scheduling through a GPU to realize parallel processing of data in a neural network algorithm; realizing neural network model parallelization by using redundancy, horizontal layer division or vertical interlayer division of neural network parameters through a GPU; and improving the parallelization efficiency of the neural network through a GPU integration model aggregation strategy.
In some embodiments, the parallel processing of data comprises: a plurality of small execution units in the neural network algorithm form a serial module, and data are processed in different calculation units.
In some embodiments, the data partitioning is a segmentation operation performed on the data set used for model training; the data communication means that, while the data blocks produced by the split execute in parallel, the data in each thread is computed on different processing units, and the communication capability of the hardware is used so that data on different processors can access one another; the data updating applies a reasonable shared-memory access mechanism to the temporarily stored data to solve the problem of data synchronization across multiple threads; and the data scheduling uses a just-in-time thread scheduling strategy to handle the situation where different data blocks need to be processed by different algorithms.
In some embodiments, realizing neural network model parallelization through the GPU by using redundancy of the neural network parameters, horizontal layer division or vertical interlayer division comprises: the hierarchical structure of the neural network allows random division to be realized by using the redundancy of the neural network parameters, horizontal layer division or vertical interlayer division, finally achieving model parallelism.
In some embodiments, the GPU integration model aggregation strategy comprises: reducing the number of parameters of the integrated model by using a compression method; and performing model aggregation by using a model summation aggregation method to obtain an integrated model.
In some embodiments, reducing the parameter quantity of the integration model using the compression method comprises: training a local sub-model on a local node by using local data; the local nodes communicate with each other through the network to obtain the sub-model trained in the previous step, model integration is carried out on the server node, local data are predicted by using the integrated model, and prediction information is stored; and compressing the local submodel on the local node by using a model compression method and combining the predicted value of the integrated model in the previous step to obtain an integrated model with the same parameter scale as the local submodel, and outputting a result.
In some embodiments, performing model aggregation using the model summation aggregation method to obtain an integrated model comprises: each computing node training its own model using data parallelism; or, after the model parameters are updated, weighting the models at the server node by selecting suitable aggregation logic and aggregating them to obtain the integrated model.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored that, when executed by a processor, implements the above method steps.
The invention has at least the following beneficial technical effects:
the GPU-based method for optimizing a parallel recurrent neural network model improves the parallel training efficiency of the recurrent neural network and provides an integrated-strategy parallel optimization model. The characteristics and applications of various recurrent neural networks and other types of neural networks are studied extensively, and a sub-network that balances training efficiency and fitting accuracy is finally selected to serve as the integrated model, so that the amount of computation is relatively small and the model converges quickly. In the direction of GPU parallelization of the recurrent neural network, parallel training is realized from the two angles of data parallelism and model parallelism. Meanwhile, a reasonable gradient descent method and optimizer are used for optimization, and the GPU is used to optimize the transfer of model parameters, improving parallel computing capability. Reused data is then computed with shared memory, making full use of the abundant computing resources of the GPU and finding a balance point between GPU utilization and model parallelism.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a schematic diagram of an embodiment of a method for accelerating neural network model optimization provided by the present invention;
FIG. 2 is a schematic diagram of an embodiment of an apparatus for accelerating neural network model optimization according to the present invention;
FIG. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention;
FIG. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that have the same name but are not identical; "first" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and this will not be repeated in the following embodiments.
In view of the above objects, a first aspect of the embodiments of the present invention proposes an embodiment of a method for accelerating neural network model optimization. Fig. 1 is a schematic diagram illustrating an embodiment of the method for accelerating neural network model optimization provided by the present invention. As shown in fig. 1, the method for accelerating neural network model optimization according to the embodiment of the present invention includes the following steps:
001. performing data division, data communication, data updating and data scheduling through a GPU to realize parallel processing of data in a neural network algorithm;
002. realizing neural network model parallelization through a GPU by using redundancy of neural network parameters, horizontal layer division or vertical interlayer division; and
003. improving the parallelization efficiency of the neural network through a GPU integration model aggregation strategy.
In this embodiment, much of the research work concerns the Recurrent Neural Network (RNN), a neural network that captures dynamic information in serialized data through the periodic connections of its hidden-layer nodes and performs regression or classification on sequence-type data. Compared with a conventional neural network, the recurrent neural network differs in that it also has interconnections between neurons: a feedback connection is added to the hidden layer, that is, the input of the RNN hidden layer at the next time step includes both the hidden-layer output at the current time step and the information provided by the input-layer neurons at the next time step. This mechanism lets the RNN retain information from all previous time steps through the recurrent feedback connection, giving the RNN a memory function. The RNN can resolve the mapping between input and output sequences over short spans using contextual information, but as the network and the time span keep growing recursively, the influence of earlier inputs on the hidden layer's output gradually decays, causing vanishing and exploding gradients. RNN models are therefore seldom used directly in applications, because they may suffer from gradient explosion or gradient vanishing.
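By way of illustration only, the recurrence described above can be sketched minimally in NumPy as follows; the weight names and sizes are arbitrary choices for the example and are not taken from the patent.

```python
import numpy as np

def rnn_forward(x_seq, h0, W_xh, W_hh, b_h):
    """Plain RNN recurrence: each hidden state depends on the previous one."""
    h = h0
    states = []
    for x_t in x_seq:                                # time steps must run sequentially
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)     # feedback through W_hh provides the memory
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(20, 4))                     # 20 time steps, 4 input features
W_xh = rng.normal(scale=0.1, size=(8, 4))            # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(8, 8))            # hidden-to-hidden (feedback) weights
hidden_states = rnn_forward(x_seq, np.zeros(8), W_xh, W_hh, np.zeros(8))
```

The sequential dependence visible in the loop is exactly what limits the RNN's parallelism and, over long spans, contributes to vanishing or exploding gradients.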
A Temporal Convolutional Network (TCN) is a newer algorithm that can be used for time-series prediction. TCNs perform well in tasks such as Natural Language Processing (NLP) and time-series prediction, and are used partly because they also parallelize well. The RNN is slow because the network reads and parses only one word or character of the input text at a time, so the deep neural network must wait for the previous word to be processed before processing the next one; this means RNNs cannot do the kind of massive parallel processing that CNNs can. The advantages of the TCN are parallelism, a flexible receptive field, stable gradients and a lower memory footprint during model training; its drawbacks are that it may not adapt strongly to transfer learning and that it is a unidirectional structure whose applications remain to be verified.
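For comparison, the following is a minimal sketch of the dilated causal convolutions from which TCNs are typically built (shown in PyTorch purely as an illustration; the channel counts and dilations are arbitrary and not specified by the patent).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """One dilated causal convolution layer, the basic TCN building block."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation      # pad only the past side
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))      # causal: no access to future time steps
        return torch.relu(self.conv(x))

# stacking dilations 1, 2, 4 grows the receptive field exponentially
tcn = nn.Sequential(
    CausalDilatedConv1d(1, 16, dilation=1),
    CausalDilatedConv1d(16, 16, dilation=2),
    CausalDilatedConv1d(16, 16, dilation=4),
)
y = tcn(torch.randn(8, 1, 128))               # all 128 time steps are computed at once
```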
This patent sets forth a GPU parallelization method, analyzes it from the perspectives of model parallelism and data parallelism, and sets out the specific operating steps. The problem is addressed from the aspects of optimizing the framework and algorithm, improving GPU utilization and parallelizing GPU model aggregation: a suitable optimization framework and algorithm are selected, the data and models are optimized by exploiting some deep features of the TensorFlow framework to improve GPU utilization, and finally a GPU model aggregation method is used for model aggregation, improving neural network parallelization efficiency.
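As one plausible illustration of using TensorFlow's built-in facilities for multi-GPU data parallelism (the patent does not specify which TensorFlow features are used), a sketch with tf.distribute.MirroredStrategy might look like the following; the model and data here are placeholders, not the network described above.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # replicate the model on each local GPU
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                             # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.SimpleRNN(64, input_shape=(None, 4)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# the global batch is split across replicas and gradients are averaged automatically
x = tf.random.normal((1024, 20, 4))
y = tf.random.normal((1024, 1))
model.fit(x, y, batch_size=128, epochs=1)
```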
In some embodiments of the invention, the parallel processing of data comprises: a plurality of small execution units in the neural network algorithm form a serial module, and data are processed in different calculation units.
A plurality of small execution units in the neural network algorithm form a serial module, and the data are processed on different computing units. Data parallelization means dividing the data in the whole computation into small data blocks and constructing the computation according to the model's requirements so that the operation can run on many small data blocks, allowing the data to be processed in parallel. Realizing data parallelism involves four steps: data division, data communication, data updating and data scheduling. The data in data parallelism refers to the data in the neural network algorithm, and GPU parallelism can be used to improve the learning efficiency of the neural network algorithm.
In some embodiments of the present invention, the data partitioning is a segmentation operation performed on the data set used for model training; the data communication means that, while the data blocks produced by the split execute in parallel, the data in each thread is computed on different processing units, and the communication capability of the hardware is used so that data on different processors can access one another; the data updating applies a reasonable shared-memory access mechanism to the temporarily stored data to solve the problem of data synchronization across multiple threads; and the data scheduling uses a just-in-time thread scheduling strategy to handle the situation where different data blocks need to be processed by different algorithms.
In this embodiment, data division is performed on the data set that model training needs to use. There are two common division methods, by data sample and by data dimension. Sample division generally uses methods such as random sampling and shuffled splitting, while division by data dimension has to be designed according to the characteristics of the data set and the actual situation. A reasonable data division method balances the degree of parallelism of the algorithm against hardware utilization; the goal is to use the hardware sensibly to raise the degree of parallelism without losing accuracy. For data communication, after the split, the data in each thread is computed on different processing units while the parallelizable data blocks execute in parallel; the communication between these pieces of data has to be considered, and the communication capability of the hardware is used so that data on different processors can access one another and the communication operations between different data are completed. For data updating, some data is stored temporarily during parallel computation: if the data in different threads does not need to share storage, that is, the data is completely independent, then a simple function suffices to synchronize the threads; in most cases, however, the same storage unit may hold data from different threads, updating it directly could produce wrong results, and a reasonable shared-memory access mechanism is used to solve the problem of data synchronization across multiple threads. For data scheduling, after the data is divided, different data blocks may need different algorithms to process them; this is handled with a just-in-time thread scheduling strategy: if the data in different threads has a static dependency relationship, static data scheduling is performed, and if it has a dynamic dependency relationship, dynamic data scheduling is performed.
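A minimal sketch of sample-wise data division by random shuffling, as described above, could look like the following; the shard count and array shapes are illustrative assumptions only.

```python
import numpy as np

def partition_by_sample(X, y, num_workers, seed=0):
    """Random-shuffle split of the training set by sample, one shard per worker/GPU."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))                 # shuffle before splitting
    shards = np.array_split(order, num_workers)     # near-equal shards
    return [(X[idx], y[idx]) for idx in shards]

X = np.random.rand(10000, 32)                       # toy training samples
y = np.random.rand(10000, 1)
worker_shards = partition_by_sample(X, y, num_workers=4)
```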
In some embodiments of the invention, realizing neural network model parallelization through the GPU by using redundancy of the neural network parameters, horizontal layer division or vertical interlayer division comprises: the hierarchical structure of the neural network allows random division to be realized by using the redundancy of the neural network parameters, horizontal layer division or vertical interlayer division, finally achieving model parallelism.
In this embodiment, the models built for machine learning tasks are currently large in scale, and processing such a model on a single computing device affects model performance. In this case the model needs to be divided, which can also be understood as dividing the computation task; the division is handled differently for a linear model, whose variables are separable, and for a nonlinear model, whose variables are strongly interdependent.
The neural network is a strongly nonlinear model, and the dependencies between its parameters are tighter than in a linear model, so parallelization cannot be achieved by simple division, and the techniques available to linear models, such as relying on a global intermediate variable, cannot deliver efficient model parallelism. However, the hierarchical structure of the neural network still offers some ideas for model parallelization: random division can be realized by using the redundancy of the neural network parameters, horizontal layer division or vertical interlayer division, finally achieving the goal of model parallelism.
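A sketch of vertical inter-layer division across two GPUs is shown below; it assumes at least two visible CUDA devices, and the layer sizes are arbitrary placeholders rather than anything prescribed by the patent.

```python
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    """Split a network between two GPUs at a layer boundary (inter-layer division)."""
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(256, 512), nn.ReLU()).to("cuda:0")
        self.part1 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))       # first layers run on GPU 0
        return self.part1(x.to("cuda:1"))    # activations move to GPU 1 for the rest

model = TwoDeviceNet()
out = model(torch.randn(64, 256))            # requires two GPUs to run
```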
In some embodiments of the invention, the GPU integration model aggregation strategy comprises: reducing the number of parameters of the integrated model by using a compression method; and performing model aggregation by using a model summation aggregation method to obtain an integrated model. Reducing the number of parameters of the integrated model with the compression method includes: training a local sub-model on each local node with local data; the local nodes communicating with each other over the network to obtain the sub-models trained in the previous step, carrying out model integration on the server node, predicting the local data with the integrated model, and storing the prediction information; and compressing the local sub-model on the local node with a model compression method, combined with the integrated model's predictions from the previous step, to obtain an integrated model with the same parameter scale as a local sub-model, and outputting the result. Performing model aggregation with the model summation aggregation method to obtain the integrated model comprises: each computing node training its own model using data parallelism; or, after the model parameters are updated, weighting the models at the server node by selecting suitable aggregation logic and aggregating them to obtain the integrated model.
In this embodiment, the GPU integration model aggregation strategy is described. In neural network training, the loss is a non-convex function of the model parameters but usually a convex function of the model output, and experiments show that weighting or averaging the outputs of the local models gives better results than the prediction of any single local model.
The integrated model is a method of weighting or averaging the models' output values; it can improve robustness when the model is applied to data, and experimental results show improved model performance. The accuracy of the integrated model is improved to a certain extent, but the number of parameters of a deep recurrent neural network is huge: the parameters after integration can grow to several times those of a single model, and since model aggregation may be performed many times during iterative training, the model parameters can explode during training and the risk of overfitting increases. A compression method is used to reduce the parameters of the integrated model to solve this problem without degrading the model's performance.
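The output-level weighting described above can be sketched as follows; the weights and prediction shapes are illustrative assumptions.

```python
import numpy as np

def aggregate_outputs(predictions, weights=None):
    """Weighted average of the local models' outputs: the integrated model's prediction."""
    preds = np.stack(predictions)                        # (num_models, num_samples, ...)
    if weights is None:
        weights = np.full(len(preds), 1.0 / len(preds))  # plain averaging by default
    return np.tensordot(weights, preds, axes=1)

local_preds = [np.random.rand(100, 1) for _ in range(3)]     # three local sub-models
ensemble_pred = aggregate_outputs(local_preds, weights=[0.5, 0.3, 0.2])
```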
The method mainly comprises three steps. First, a local sub-model is trained on each local node using local data. Second, the local nodes communicate with each other over the network to obtain the sub-models trained in the previous step, model integration is carried out on the server node, the local data is predicted with the integrated model, and the prediction information is stored. Third, the local sub-model is compressed on the local node using model compression methods such as knowledge screening and knowledge distillation, combined with the integrated model's predictions from the previous step, finally yielding an integrated model with the same parameter scale as a local sub-model, and the result is output. The knowledge distillation method uses the integrated model to predict the samples, obtains and stores the corresponding labels, and then selects a sub-model to retrain on the training data together with the stored labels to obtain the model parameters; in other words, the rules in the integrated model are transferred into the sub-model.
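One common form of the knowledge distillation step mentioned above is sketched below as a hypothetical PyTorch training step, not the patent's exact procedure: the sub-model is trained against both the stored ensemble predictions and the true labels.

```python
import torch.nn.functional as F

def distill_step(student, optimizer, x, ensemble_logits, hard_labels, T=2.0, alpha=0.5):
    """One step that transfers the integrated model's rules into a single sub-model."""
    student_logits = student(x)
    # soft targets: match the ensemble's temperature-scaled predictions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(ensemble_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, hard_labels)   # match the stored/true labels
    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```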
The model summation aggregation method mainly uses data parallelism: after a computing node has trained its own model, or after the model parameters have been updated, the models are weighted at the server node by selecting suitable aggregation logic, and model aggregation is carried out to obtain the integrated model. A synchronous stochastic gradient descent method (SSGD) is used, which replaces the average of the parameters with the average of the gradients; the update rule is as follows:
w_{t+1} = w_t - (η/K) · Σ_{k=1}^{K} ∇L_k(w_t),
where w_t denotes the shared model parameters at iteration t, η the learning rate, K the number of parallel machines, and ∇L_k(w_t) the mini-batch gradient computed on machine k.
with K machines running in parallel, the mini-batch stochastic gradient descent affects the optimization process, so the learning rate in SSGD needs to be adjusted appropriately to balance the change introduced by mini-batch stochastic gradient descent.
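A minimal NumPy sketch of this synchronous update, assuming K workers have each computed a mini-batch gradient of the shared parameters, is as follows; the parameter dimension and learning rate are illustrative.

```python
import numpy as np

def ssgd_step(w, local_gradients, lr):
    """Synchronous SGD: average the K workers' gradients, then take one global step."""
    g_avg = np.mean(local_gradients, axis=0)     # average of gradients, not of parameters
    return w - lr * g_avg

w = np.zeros(10)                                 # shared parameters
grads = [np.random.randn(10) for _ in range(4)]  # K = 4 local mini-batch gradients
w = ssgd_step(w, grads, lr=0.01)                 # lr may need rescaling with K, as noted above
```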
The invention adopts multiple optimization methods on a GPU platform, proposes an integrated optimization model, realizes optimization from the two directions of data parallelism and model parallelism, builds a recurrent neural network prediction model, reduces the complexity of the neural network computation, optimizes the training process, and increases the parallelism of the prediction model. Meanwhile, GPU utilization is improved by making full use of the memory. It is further shown that the proposed method can accelerate model training.
It should be particularly noted that the steps in the embodiments of the method for accelerating neural network model optimization described above can be intersected, replaced, added and deleted with respect to one another; therefore, methods for accelerating neural network model optimization obtained by such reasonable permutations and combinations should also fall within the scope of the present invention, and the scope of the present invention should not be limited to the embodiments.
In view of the above, a second aspect of the embodiments of the present invention provides an apparatus for accelerating neural network model optimization. Fig. 2 is a schematic diagram illustrating an embodiment of the apparatus for accelerating neural network model optimization provided by the present invention. As shown in fig. 2, the apparatus for accelerating neural network model optimization according to the embodiment of the present invention includes the following modules: a first module 011, configured to improve the learning efficiency of a neural network algorithm through GPU data parallelism; a second module 012, configured to improve the learning efficiency of the neural network algorithm through GPU model parallelism; and a third module 013, configured to improve GPU utilization and neural network parallelization efficiency through a GPU integration model aggregation strategy.
In some embodiments of the invention, the first module 011 is further configured to: a plurality of small execution units in the neural network algorithm form a serial module, and data are processed in different calculation units.
In some embodiments of the present invention, the second module 012 is further configured to realize neural network model parallelization through the GPU by using redundancy of the neural network parameters, horizontal layer division or vertical interlayer division: the hierarchical structure of the neural network allows random division to be realized by using the redundancy of the neural network parameters, horizontal layer division or vertical interlayer division, finally achieving model parallelism.
In some embodiments of the invention, the third module 013 is further configured to: reduce the number of parameters of the integrated model by using a compression method; and perform model aggregation by using a model summation aggregation method to obtain an integrated model.
In some embodiments of the invention, the third module 013 is further configured to: training a local sub-model on a local node by using local data; the local nodes communicate with each other through the network to obtain the sub-model trained in the previous step, model integration is carried out on the server node, local data are predicted by using the integrated model, and prediction information is stored; and compressing the local submodel on the local node by using a model compression method and combining the predicted value of the integrated model in the previous step to obtain an integrated model with the same parameter scale as the local submodel, and outputting a result.
In some embodiments of the invention, the third module 013 is further configured to perform model aggregation by using the model summation aggregation method to obtain the integrated model: each computing node trains its own model using data parallelism; or, after the model parameters are updated, the models are weighted at the server node by selecting suitable aggregation logic and aggregated to obtain the integrated model.
Wherein the first module 011 implements data parallelism. The second module 012 implements model parallelism. The third module 013 improves GPU utilization.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in fig. 3, the computer device according to the embodiment of the present invention includes: at least one processor 021; and a memory 022, the memory 022 storing computer instructions 023 executable on the processor, the instructions, when executed by the processor, implementing the steps of the method comprising: performing data division, data communication, data updating and data scheduling through a GPU to realize parallel processing of data in a neural network algorithm; realizing neural network model parallelization through the GPU by using redundancy of neural network parameters, horizontal layer division or vertical interlayer division; and improving the parallelization efficiency of the neural network through a GPU integration model aggregation strategy.
The invention also provides a computer readable storage medium. FIG. 4 is a schematic diagram illustrating an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer readable storage medium 031 stores a computer program 032 which, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as one of ordinary skill in the art will appreciate, all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware, and the program of the method for accelerating neural network model optimization can be stored in a computer-readable storage medium; when executed, the program can include the processes of the method embodiments described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. Which when executed by a processor performs the above-described functions as defined in the method disclosed by an embodiment of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. As used herein, magnetic and optical disks include Compact Disks (CDs), laser disks, optical disks, Digital Versatile Disks (DVDs), floppy disks, blu-ray disks where disks usually reproduce data magnetically, while optical disks reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit or scope of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method for accelerating neural network model optimization, comprising the steps of:
performing data division, data communication, data updating and data scheduling through a GPU to realize parallel processing of data in a neural network algorithm;
realizing neural network model parallelization by using redundancy, horizontal layer division or vertical interlayer division of neural network parameters through a GPU; and
improving the parallelization efficiency of the neural network through a GPU integration model aggregation strategy.
2. The method of accelerating neural network model optimization of claim 1, wherein the parallel processing of data comprises: a plurality of small execution units in the neural network algorithm form a serial module, and data are processed in different calculation units.
3. The method of accelerating neural network model optimization of claim 2,
the data division performs a segmentation operation on the data set required for model training;
the data communication means that, while the data blocks produced by the split execute in parallel, the data in each thread is computed on different processing units, and the communication capability of the hardware is used so that data on different processors can access one another;
the data updating applies a reasonable shared-memory access mechanism to the temporarily stored data to solve the problem of data synchronization across multiple threads; and
the data scheduling uses a just-in-time thread scheduling strategy to handle the situation where different data blocks need to be processed by different algorithms.
4. The method of accelerating neural network model optimization according to claim 1, wherein realizing neural network model parallelization through the GPU by using redundancy of the neural network parameters, horizontal layer division or vertical interlayer division comprises: the hierarchical structure of the neural network allows random division to be realized by using the redundancy of the neural network parameters, horizontal layer division or vertical interlayer division, finally achieving model parallelism.
5. The method of accelerating neural network model optimization of claim 1,
the GPU integration model aggregation strategy comprises the following steps:
reducing the parameter quantity of the integrated model by using a compression method; and
performing model aggregation by using a model summation aggregation method to obtain an integrated model.
6. The method of accelerating neural network model optimization of claim 5,
reducing the parameter quantity of the integrated model using the compression method includes:
training a local sub-model on a local node by using local data;
the local nodes communicate with each other through the network to obtain the sub-model trained in the previous step, model integration is carried out on the server node, local data are predicted by using the integrated model, and prediction information is stored; and
and compressing the local submodel on the local node by using a model compression method and combining the predicted value of the integrated model in the previous step to obtain an integrated model with the same parameter scale as the local submodel, and outputting a result.
7. The method of accelerating neural network model optimization of claim 5,
the method for carrying out model polymerization by using a model addition polymerization method to obtain the integrated model comprises the following steps: training a model of the user at a computing node by using data parallelism; or after the model parameters are updated, the model is weighted by selecting proper aggregation logic at the server node, and the model aggregation is carried out to obtain the integrated model.
8. An apparatus for accelerating neural network model optimization, comprising:
the first module is configured to improve the learning efficiency of a neural network algorithm through GPU data parallelism;
the second module is configured to improve the learning efficiency of the neural network algorithm through GPU model parallelism; and
and the third module is configured to improve GPU utilization rate and neural network parallelization efficiency through a GPU integration model aggregation strategy.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111343930.3A 2021-11-14 2021-11-14 Method, device and equipment for accelerating neural network model optimization and readable medium Withdrawn CN114912570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111343930.3A CN114912570A (en) 2021-11-14 2021-11-14 Method, device and equipment for accelerating neural network model optimization and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111343930.3A CN114912570A (en) 2021-11-14 2021-11-14 Method, device and equipment for accelerating neural network model optimization and readable medium

Publications (1)

Publication Number Publication Date
CN114912570A true CN114912570A (en) 2022-08-16

Family

ID=82763190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111343930.3A Withdrawn CN114912570A (en) 2021-11-14 2021-11-14 Method, device and equipment for accelerating neural network model optimization and readable medium

Country Status (1)

Country Link
CN (1) CN114912570A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116962176A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Data processing method, device and system of distributed cluster and storage medium
CN116962176B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 Data processing method, device and system of distributed cluster and storage medium

Similar Documents

Publication Publication Date Title
Gad Pygad: An intuitive genetic algorithm python library
KR102173555B1 (en) Machine learning-based network model building method and apparatus
CN109032671B (en) Distributed deep learning method and system based on data parallel strategy
CN113064879B (en) Database parameter adjusting method and device and computer readable storage medium
CN111612134B (en) Neural network structure searching method and device, electronic equipment and storage medium
US20180018555A1 (en) System and method for building artificial neural network architectures
CN111768004A (en) Model self-adaption method and system based on intelligent computing framework
WO2023150912A1 (en) Operator scheduling operation time comparison method and device, and storage medium
Bhamidi et al. Change point detection in network models: Preferential attachment and long range dependence
CN115437795A (en) Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
Marszałek et al. Fully flexible parallel merge sort for multicore architectures
CN114912570A (en) Method, device and equipment for accelerating neural network model optimization and readable medium
Woolsey et al. Coded elastic computing on machines with heterogeneous storage and computation speed
CN110009048B (en) Method and equipment for constructing neural network model
Lin et al. An effective binary artificial bee colony algorithm for maximum set k-covering problem
Govada et al. Distributed multi-class rule based classification using RIPPER
CN116432125B (en) Code Classification Method Based on Hash Algorithm
CN115599918B (en) Graph enhancement-based mutual learning text classification method and system
CN116663523A (en) Semantic text similarity calculation method for multi-angle enhanced network
CN115794357A (en) Device and method for automatically building multi-task network
CN113222134B (en) Brain-like computing system, method and computer readable storage medium
Robles et al. Evolutionary parallel and gradually distributed lateral tuning of fuzzy rule-based systems
CN112771545A (en) Automatic searching method and device for precision and decomposition rank of recurrent neural network
Zhang et al. An Attention-Enhanced Edge-Cloud Collaborative Framework for Multi-Task Application
CN117114055B (en) FPGA binary neural network acceleration method for industrial application scene

Legal Events

Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (Application publication date: 20220816)