CN114154616A - RNN parallel model and implementation method and system thereof on multi-core CPU - Google Patents
- Publication number
- CN114154616A (application number CN202111204314.XA)
- Authority
- CN
- China
- Prior art keywords
- rnn
- parallel
- layer
- model
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/045—Combinations of networks
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/08—Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an RNN parallel model together with a method and a system for implementing it on a multi-core CPU. Intra-layer and inter-layer parallel optimization is performed on the multi-layer RNN structure: the original input sequence is divided into a number of minimal subsequences, and the divided multi-layer RNN acts on each minimal subsequence. The inter-layer parallel method analyzes the data dependencies within the multi-layer sub-RNN acting on each minimal subsequence and parallelizes the recurrent units of different layers and different time steps. On the first level of the multi-core CPU, the multi-level RNN parallel model is mapped by model parallelism within a core group; on the second level, it is mapped by data parallelism between core groups. The characteristics of the multi-core CPU architecture are thus fully exploited, and parallel training of the multi-level RNN parallel model on the multi-core CPU is realized. The invention makes full use of the multi-core processor architecture, realizes a finer-grained parallel scheme for recurrent neural networks, and accelerates training of the network.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and parallel computing, and particularly relates to an RNN parallel model and a method and a system for realizing the RNN parallel model on a multi-core CPU.
Background
With the rapid development of deep learning, recurrent neural networks (RNNs) have been widely applied to natural language processing and time-series tasks, including artificial-intelligence applications such as sentiment classification, machine translation, intelligent question answering, and sequence prediction. An RNN captures the order information of an input sequence and passes information from one time step to the next, modeling correlations in sequence data over time. Because the original RNN suffers from vanishing and exploding gradients, long short-term memory networks (LSTM) and gated recurrent units (GRU) are currently the most widely used recurrent units and achieve good performance on a variety of tasks. As data volumes grow and networks become deeper, feed-forward and convolutional neural networks (CNNs) can be parallelized successfully because they do not maintain any internal state linking past and future data. During RNN inference and training, however, data dependencies make parallelization difficult, so training an RNN takes a great deal of time, which limits academic research and industrial application.
To address this problem, some studies replace the RNN with a CNN for natural language processing and time-series tasks. Although such approaches exploit the parallel acceleration of CNNs, a conventional CNN cannot capture the order information of a sequence, so accuracy suffers to some degree. Other studies improve RNN training speed by improving the recurrent unit, e.g. the simple recurrent unit (SRU) and the minimal gated unit (MGU); these simplify the gating mechanism and reduce the number of trainable parameters, achieving good speedups, but do not change the connection pattern of the conventional recurrent neural network. Research that does change the connection pattern, such as the sliced recurrent neural network (SRNN), achieves parallelism by dividing the sequence into multiple subsequences and introduces extra parameters to obtain high-level feature information across the divided sequences; it is faster than the connection pattern of the conventional RNN model without changing the gating mechanism of the recurrent unit, but the SRNN extracts features from each minimal subsequence with only a single-layer RNN. For more complex multivariate sequence data, a multi-layer RNN model extracts more features and achieves better results than a single-layer RNN model; however, data dependencies exist both between different time steps within the same layer and between different layers within the same time step, since each layer needs the output of the previous layer to perform further feature extraction.
Therefore, it is of great significance to parallelize and optimize the conventional multi-layer RNN model and to distribute its sequence data and network model in a way that makes full use of the characteristics of the multi-core CPU architecture.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the defects in the prior art, an RNN parallel model and a method and a system for implementing it on a multi-core CPU. The multi-level RNN parallel model performs intra-layer and inter-layer multi-level parallel optimization of the traditional multi-layer RNN structure and maps the multi-level parallel scheme onto a multi-core processor. Through a fine-grained data and model partitioning method, the characteristics of the multi-core CPU architecture are fully exploited, a finer-grained parallel scheme for the recurrent neural network is realized, and training of the network is accelerated.
The invention adopts the following technical scheme:
a method for realizing RNN parallel model on multi-core CPU, using SRNN segmentation method for layer parallel of multi-layer RNN parallel model, dividing original input sequence into many minimum subsequences, and acting segmented multi-layer RNN on each minimum subsequence; analyzing the dependency relationship of data in the multi-layer sub-RNN acting on each minimum subsequence by adopting an interlayer parallel method, and parallelizing the cycle units of different layers and different time steps to obtain a multi-layer RNN parallel model; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is that the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating the multi-level sub-RNN used for the minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for carrying out data parallel calculation between the core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallel between core groups, each core group obtains the output of the multi-level sub-RNN through calculation, the output is used as the input of an additional circulation unit to perform data parallel calculation between the core groups, and parallel training of the multi-level RNN parallel model on the multi-core CPU is achieved.
Specifically, the intra-layer parallel SRNN segmentation method comprises the following steps:
The input sequence X is divided into n subsequences of equal length l = T/n. Each subsequence is again divided into n equal-length subsequences, and this operation is repeated k times in total, yielding n^k minimal subsequences, each of length l = T/n^k. An extra recurrent unit is introduced to obtain high-level feature information among the divided subsequences, the SRNN model is constructed, and the divided H-layer sub-RNNs extract the features of each minimal subsequence.
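The recursive division just described can be sketched in a few lines of Python (the function name `split_sequence` and the use of plain lists are illustrative assumptions, not from the patent):

```python
def split_sequence(seq, n, k):
    """Recursively split `seq` into n equal parts, k times over,
    yielding n**k minimal subsequences of length len(seq) // n**k."""
    parts = [seq]
    for _ in range(k):
        parts = [p[i * len(p) // n:(i + 1) * len(p) // n]
                 for p in parts for i in range(n)]
    return parts

# A sequence of T = 16 time steps split with n = 2, k = 2
subseqs = split_sequence(list(range(16)), n=2, k=2)
assert len(subseqs) == 2 ** 2             # n**k minimal subsequences
assert all(len(s) == 4 for s in subseqs)  # each of length T / n**k
```

Each of the resulting minimal subsequences is then handled by its own divided multi-layer sub-RNN, independently of the others.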
Further, when the SRNN network is constructed, the time complexity is H·T/n^k + n·k, where H·T/n^k represents the computation time required by the multi-layer RNN acting on the smallest subsequence and n·k represents the computation time required by the additionally added recurrent units.
Specifically, parallelizing different layers and different time steps includes:
For the intra-layer part of the multi-layer RNN parallel model, the multi-layer RNN is divided along the time-step dimension according to the size of the minimal subsequence, the divided multi-layer RNN extracts the features of each minimal subsequence, and an additional recurrent unit is introduced to obtain the high-level feature information among the divided sequences. For the inter-layer part, each multi-layer sub-RNN acting on a minimal subsequence is parallelized between layers.
Further, when the H-layer sub-RNN extracts the features of a minimal subsequence, the calculation result h_t^h of the t-th time step of the h-th layer serves as input to the unit at the same time step of the next layer, h_t^(h+1), and to the unit at the next time step of the same layer, h_(t+1)^h. The computational complexity of the multi-layer sub-RNN acting on the smallest subsequence thus becomes H + l - 1.
Further, after the multi-layer RNN acting on the minimal subsequence is parallelized according to the data dependencies of different layers and different time steps: when the layer-1 recurrent unit in time step 1, h_1^1, finishes its calculation, h_1^1 serves as the input of the layer-2 unit in time step 1, h_1^2, and of the layer-1 unit in time step 2, h_2^1. Once their input conditions are satisfied, h_1^2 and h_2^1 are computed simultaneously; their outputs in turn serve as the inputs of h_1^3, h_2^2 and h_3^1, which are computed in parallel, where h_1^3 is the layer-3 unit in time step 1, h_2^2 the layer-2 unit in time step 2, and h_3^1 the layer-1 unit in time step 3.
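The dependency pattern described above is a wavefront (anti-diagonal) schedule over the H × l grid of recurrent units. A minimal Python sketch, with illustrative names not taken from the patent, groups the units into waves whose members can all run in parallel:

```python
def wavefront_schedule(H, l):
    """Group the cells (layer h, time step t) of an H x l sub-RNN into
    waves; within one wave every cell's inputs are already computed, so
    all its cells can run in parallel (h + t is constant along a wave)."""
    waves = []
    for s in range(H + l - 1):            # H + l - 1 sequential waves
        wave = [(h, t) for h in range(H) for t in range(l) if h + t == s]
        waves.append(wave)
    return waves

# H = 4 layers, minimal subsequence of length l = 4 (zero-indexed cells)
waves = wavefront_schedule(H=4, l=4)
assert len(waves) == 4 + 4 - 1            # parallel depth H + l - 1
assert waves[0] == [(0, 0)]               # only h_1^1 runs in wave 1
assert max(len(w) for w in waves) == 4    # at most min(H, l) cells at once
```

Executing the grid wave by wave instead of cell by cell is what reduces the sub-RNN's complexity from H·l to H + l - 1.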
Specifically, the first-level mapping of the multi-level RNN parallel model on the multi-core CPU specifically is:
A core group consisting of processor cores sharing the same last-level cache computes the multi-layer sub-RNN for one minimal subsequence; the multi-layer sub-RNN in the model adopts model parallelism, distributing the units of different layers and different time steps that can be computed in parallel to different processor cores.
Further, the second-level mapping of the multi-level RNN parallel model on the multi-core CPU specifically comprises:
Each core group computes the multi-layer sub-RNN for one minimal subsequence to obtain the final output of the multi-layer RNN; these outputs serve as the inputs of the additional recurrent units for the stacked-RNN calculation, and the core groups communicate through shared memory. During the stacked-RNN calculation, the model in each core group is the same while the inputs differ, and the data between core groups is mapped in parallel.
Another technical solution of the invention is an RNN parallel model: the input sequence of the multi-layer RNN model is X = [x_1, x_2, ..., x_T], the scale of the network unrolled along the time dimension is H × T, where H is the number of hidden layers and T is the number of time steps of the sequence, and the recurrent units in the multi-layer RNN parallel model are chosen from SimpleRNN units, LSTM units, or GRU units.
Another technical solution of the present invention is a system for implementing an RNN parallel model on a multi-core CPU, including:
the dividing module, which, for the intra-layer parallelism of the multi-layer RNN parallel model, divides the original input sequence into several minimal subsequences using the SRNN (sliced recurrent neural network) segmentation method and applies the divided multi-layer RNN to each minimal subsequence;
the parallelization module, which analyzes the data dependencies within the multi-layer sub-RNN acting on each minimal subsequence using the inter-layer parallel method and parallelizes the recurrent units of different layers and different time steps to obtain the multi-level RNN parallel model;
the first mapping module, which performs the first-level mapping of the multi-level RNN parallel model on the multi-core CPU, namely model parallelism within a core group: a core group consisting of several processor cores sharing a cache computes the multi-layer sub-RNN for one minimal subsequence, different parts of the model are distributed within the core group, and the resulting outputs are used for data-parallel calculation between core groups;
the second mapping module, which performs the second-level mapping of the multi-level RNN parallel model on the multi-core CPU, namely data parallelism between core groups: each core group computes the output of its multi-layer sub-RNN, and these outputs serve as the inputs of the additional recurrent units for data-parallel calculation across core groups, realizing parallel training of the multi-level RNN parallel model on the multi-core CPU.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a method for realizing an RNN parallel model on a multi-core CPU (Central processing Unit). according to the data dependency of a multi-level RNN parallel model and the architectural characteristics of a multi-core processor, the multi-level RNN parallel model is mapped to the multi-core processor for training; by the fine-grained data and model division method, the characteristics of the multi-core processor architecture are fully utilized, and the parallel training method of the multilayer cyclic neural network with finer granularity is realized.
Furthermore, the SRNN segmentation method is used for intra-layer parallelism: the original input sequence is divided into several minimal subsequences, an extra recurrent unit is introduced to obtain the high-level feature information among the divided sequences, and the divided multi-layer sub-RNNs act on each minimal subsequence; there is no data dependency among the multi-layer sub-RNNs, so they can be computed in parallel. When the model is constructed with this intra-layer parallel method, the time complexity changes from H·T of the traditional multi-layer RNN model to H·T/n^k + n·k.
Furthermore, inter-layer parallelism parallelizes the recurrent units of different layers and different time steps by analyzing the data dependencies within the multi-layer sub-RNN acting on a minimal subsequence. By exploiting the potential parallelism and data dependencies within the multi-layer sub-RNN, the sub-RNN acting on each subsequence is further parallelized without changing the network structure. Constructing the model further with this inter-layer parallel method yields the multi-level RNN parallel model, whose time complexity changes from H·T/n^k + n·k of the previous step to H + T/n^k - 1 + n·k.
Further, for the proposed multi-level RNN parallel model, an implementation method on the multi-core processor architecture is designed: a core group consisting of processor cores sharing the same last-level cache computes the multi-layer sub-RNN for one minimal subsequence, and the units of different layers and different time steps that can be computed in parallel are distributed to different processor cores in model-parallel fashion. Communication between the units is frequent, but only hidden-state information is transmitted, so the traffic is small, and the shared cache of the processor cores makes model parallelism more effective.
Further, each core group computes the multi-layer sub-RNN for one minimal subsequence to obtain its final output, which serves as the input of the additional recurrent units; the stacked-RNN calculation is then performed, with the core groups communicating through shared memory. During the stacked-RNN calculation, the model in each core group is the same while the inputs differ, and communication is infrequent, so data between core groups is mapped in parallel in this way.
The invention also relates to an RNN parallel model that changes the structure of the traditional multi-layer RNN network model. The multi-level parallel structure of the model is as follows: within a layer, the SRNN segmentation method divides the original input sequence into several minimal subsequences, and the divided multi-layer sub-RNNs act on each minimal subsequence; between layers, the data dependencies within the multi-layer sub-RNN acting on each minimal subsequence are analyzed, the recurrent units of different layers and different time steps are computed in parallel, and the multi-layer sub-RNN is thereby further parallelized and optimized. The multi-level RNN parallel model formed in this way partitions the traditional multi-layer RNN model at a finer granularity and increases the training speed.
In summary, the method of the present invention builds multi-level parallelism within and between the layers of the traditional multi-layer RNN model and maps it onto the multi-core processor; through the fine-grained data and model partitioning method, the characteristics of the multi-core processor architecture are fully exploited, a finer-grained parallel scheme for the recurrent neural network is realized, and training of the network is accelerated.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic diagram of a multi-layer RNN model;
FIG. 2 is a schematic diagram of intra-layer parallel partitioning of a multi-layer RNN;
FIG. 3 is a schematic diagram of a multi-layer RNN, wherein (a) shows the multi-layer RNN acting on a minimal subsequence and (b) shows the inter-layer parallel optimization;
FIG. 4 is a schematic diagram of an RNN parallel model;
FIG. 5 is a schematic diagram of an FT-2000+/64 multi-core processor architecture;
FIG. 6 is a diagram illustrating a mapping method of an RNN parallel model on a multi-core processor.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
The invention provides a multi-level RNN parallel model and a method for realizing the same on a multi-core CPU (Central processing Unit), wherein the multi-level RNN parallel model carries out in-layer and interlayer multi-level parallel optimization aiming at the traditional multi-layer RNN structure, and maps the multi-level parallel mode onto a multi-core processor. The invention comprises the following steps: aiming at a multilayer RNN model, an SRNN segmentation method is used in parallel in a layer, and the segmented multilayer RNN acts on each minimum subsequence; the interlayer parallelism parallelizes different layers and different time steps by analyzing the data dependency relationship in the multi-layer sub RNN acting on the minimum subsequence. A multi-level RNN parallel model is constructed in the two parallel modes, and an implementation method on a multi-core processor is designed from the model parallel layer and the data parallel layer.
The invention discloses a method for implementing an RNN parallel model on a multi-core CPU, explained here with the FT-2000+ multi-core processor as an example; the method comprises the following steps:
s1, constructing a multilayer RNN model;
A multi-layer RNN model is constructed according to the size of the input sequence X = [x_1, x_2, ..., x_T], where T denotes the number of time steps of the sequence and the number of hidden layers is H; the network of the constructed multi-layer RNN model unrolled along the time dimension has scale H × T. The recurrent units in the multi-layer RNN model may be basic SimpleRNN units, LSTM units, GRU units, or other improved recurrent units.
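As a toy illustration of the unrolled H × T structure built in step S1, the following Python sketch uses a scalar SimpleRNN-style cell (all names and the scalar weights are illustrative assumptions; real cells would use weight matrices or LSTM/GRU gates):

```python
import math

def multilayer_rnn(seq, H, w_in=0.5, w_rec=0.5):
    """Unrolled H x T grid of scalar SimpleRNN-style cells.
    h[l][t] depends on h[l][t-1] (same layer, previous time step)
    and on h[l-1][t] (previous layer, same time step) -- exactly the
    dependencies the later inter-layer parallelization exploits."""
    T = len(seq)
    h = [[0.0] * T for _ in range(H)]
    for l in range(H):
        for t in range(T):
            below = seq[t] if l == 0 else h[l - 1][t]  # input from layer below
            left = h[l][t - 1] if t > 0 else 0.0       # state from previous step
            h[l][t] = math.tanh(w_in * below + w_rec * left)
    return h

# An H = 4, T = 16 network as in the example used in the description
states = multilayer_rnn([0.1 * t for t in range(16)], H=4)
assert len(states) == 4 and len(states[0]) == 16   # H x T unrolled grid
```

Computed serially like this, the grid costs H·T cell evaluations, which is the baseline the following steps improve on.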
S2, the SRNN segmentation method is applied within the layers of the multi-layer RNN model constructed in step S1 (along the time-step dimension): the original input sequence is divided into several minimal subsequences, and the divided multi-layer RNN acts on each minimal subsequence;
The input sequence X is first divided into n equal-length subsequences, so that X can be expressed as X = [N_1, N_2, ..., N_n], each subsequence of length l = T/n.
Each subsequence is then divided again into n equal-length subsequences, and the operation is repeated k times until minimal subsequences of suitable size are obtained; the number of minimal subsequences is n^k, each of length l = T/n^k.
An extra recurrent unit is introduced to obtain high-level feature information among the divided sequences, and the SRNN is constructed according to this division; the divided H-layer sub-RNNs extract the features of each minimal subsequence. The multi-layer sub-RNN processing each minimal subsequence thus has scale H × T/n^k, and the sub-RNNs are mutually independent and run in parallel.
When the SRNN is constructed, the time complexity is H·T/n^k + n·k, where H·T/n^k represents the computation time required by the multi-layer RNN acting on the smallest subsequence and n·k represents the computation time required by the additionally added recurrent units.
Consider a conventional multi-layer RNN model with T = 16 time steps and H = 4 layers, unrolled along the time dimension as shown in FIG. 1. The sequence is divided into n = 2 equal-length subsequences at each step, repeated k = 2 times, so the minimal subsequence length is l = 4; with the additional recurrent units added, the structure of the neural network after intra-layer parallel division is shown in FIG. 2.
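Plugging the example's numbers into the complexity expressions above gives a quick sanity check (a sketch under the assumption, taken from the complexity formula, that the additional recurrent units cost n·k):

```python
# Parameters from the example: T = 16 time steps, H = 4 layers,
# split into n = 2 parts, repeated k = 2 times
T, H, n, k = 16, 4, 2, 2

num_min = n ** k                      # number of minimal subsequences
l = T // n ** k                       # length of each minimal subsequence
serial = H * T                        # traditional multi-layer RNN
intra = H * l + n * k                 # after intra-layer (SRNN) division
intra_inter = (H + l - 1) + n * k     # after adding inter-layer wavefront

assert (num_min, l) == (4, 4)
assert (serial, intra, intra_inter) == (64, 20, 11)
```

So for this small example the parallel depth drops from 64 cell evaluations to 20 after the intra-layer division, and to 11 once the inter-layer wavefront parallelism is added.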
S3, parallelizing different layers and different time steps by analyzing the data dependency relationship in the multi-layer sub RNN acting on the minimum subsequence;
When the H-layer sub-RNN extracts the features of a minimal subsequence, the calculation result h_t^h of the t-th time step of the h-th layer serves as input to the unit at the same time step of the next layer, h_t^(h+1), and to the unit at the next time step of the same layer, h_(t+1)^h. Since h_t^(h+1) and h_(t+1)^h have no data dependency on each other, they can be computed in parallel. By this analysis, the units computable in parallel at any point lie on an anti-diagonal of the H × l grid, and their number is at most z = min(H, l). With this parallel approach, the computational complexity of the multi-layer sub-RNN acting on the smallest subsequence is reduced from the original H·l to H + l - 1.
Therefore, for the neural network constructed in FIG. 2, the original calculation flow of the multi-layer sub-RNN on a minimal subsequence is shown in FIG. 3(a): the 4 layers of recurrent units in time step 1 are calculated first, in the order h_1^1, h_1^2, h_1^3, h_1^4; then the units of time step 2 are calculated in the same way, and so on up to time step 4, which is equivalent to a serial calculation. After multi-level parallelization of the multi-layer sub-RNN according to the data dependencies of different layers and different time steps, as shown in FIG. 3(b): once h_1^1 is calculated, its result serves as the input of h_1^2 and h_2^1; when both of these units have their input data, h_1^2 and h_2^1 are computed simultaneously, and their outputs in turn serve as the inputs of h_1^3, h_2^2 and h_3^1, which are computed in parallel, where h_1^3 is the layer-3 unit in time step 1, h_2^2 the layer-2 unit in time step 2, and h_3^1 the layer-1 unit in time step 3.
The time complexity of the multi-level RNN parallel model is H + T/n^k - 1 + n·k. As shown in FIG. 4, the sequence data contains 16 time steps, parallelized along two dimensions: within a layer, the SRNN parallel mode of step S2 is used, i.e. the traditional multi-layer RNN model is divided along the time-step dimension according to the size of the minimal subsequence, the divided multi-layer RNN extracts the features of each minimal subsequence, and an additional recurrent unit obtains the high-level feature information among the divided sequences; between layers, the parallel mode of this step is applied to each multi-layer sub-RNN acting on a minimal subsequence, so that the traditional multi-layer RNN model is parallelized at a finer granularity.
S4, the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is model parallelism within a core group;
the FT processor offers three parallel levels. As shown in Fig. 5, the 64 processor cores integrated on one chip are divided into 8 panels; each panel contains 2 clusters, and each cluster contains 4 processor cores.
For the proposed multi-level RNN parallel model, an implementation method is designed for this multi-core processor architecture: the multi-layer sub-RNNs in the model adopt model parallelism, under which the units of different layers and different time steps that can be calculated in parallel are distributed to different processor cores. Communication between these units is frequent, but only hidden-state information is transmitted, so the communication volume is small; model parallelism is therefore implemented more effectively by exploiting the cache shared by the processor cores of a core group.
S5, the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallelism between core groups.
Referring to Fig. 6, following step S4, each core group calculates the multi-layer sub-RNN for one minimum subsequence to obtain that sub-RNN's final output, which serves as the input of the additional cyclic units for the stacked-RNN calculation; the core groups communicate through shared memory. During the stacked-RNN calculation, the model in each core group is the same, the inputs differ, and communication is infrequent, so the data between core groups is mapped in parallel according to this method.
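The two-level mapping of steps S4 and S5 can be sketched as follows, under the assumed core-group geometry of the FT processor described above (8 core groups of 4 cores each). The function name and the (group, core, unit) placement encoding are assumptions made here for illustration only.

```python
# Rough sketch of the two-level mapping: each minimum subsequence is assigned
# to one core group (data parallelism between core groups, step S5), and
# inside a group the independent units of one wavefront step are spread over
# that group's cores (model parallelism within a core group, step S4).

N_GROUPS, CORES_PER_GROUP = 8, 4   # assumed FT-style geometry

def map_units(subseq_id, wavefront_step):
    """Assign each parallel-computable (layer, time step) unit of one
    wavefront step to a (group, core) pair."""
    group = subseq_id % N_GROUPS                    # data-parallel level
    return [(group, i % CORES_PER_GROUP, unit)      # model-parallel level
            for i, unit in enumerate(wavefront_step)]

if __name__ == "__main__":
    # One wavefront step with 3 independent units on the same anti-diagonal:
    placement = map_units(subseq_id=2, wavefront_step=[(1, 3), (2, 2), (3, 1)])
    print(placement)   # [(2, 0, (1, 3)), (2, 1, (2, 2)), (2, 2, (3, 1))]
```

The design choice this illustrates is the one argued above: frequent, small hidden-state exchanges stay inside a cache-sharing core group, while the infrequent subsequence-level outputs cross group boundaries through shared memory.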
In another embodiment of the present invention, a system for implementing an RNN parallel model on a multi-core CPU is provided; the system can be used to carry out the above method for implementing the RNN parallel model on a multi-core CPU, and specifically comprises a dividing module, a parallelization module, a first mapping module and a second mapping module.
The dividing module is used for dividing the original input sequence into a plurality of minimum subsequences by using the SRNN (sliced recurrent neural network) segmentation method for intra-layer parallelism of the multi-layer RNN model, and applying the divided multi-layer RNN to each minimum subsequence;
the parallelization module analyzes the data dependencies in the multi-layer sub-RNN acting on each minimum subsequence by an inter-layer parallel method, and parallelizes the cyclic units of different layers and different time steps;
the first mapping module is used for mapping a multi-layer RNN parallel model on a multi-core CPU in a first-level mode, namely, the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating multi-layer sub-RNNs used for a minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for performing data parallel calculation between the core groups;
the second mapping module is used for the second-level mapping of the multi-layer RNN parallel model on the multi-core CPU, namely data parallelism between core groups; each core group obtains the output of its multi-layer sub-RNN through calculation, and the outputs are used as the inputs of the additional cyclic units for data-parallel calculation between the core groups, realizing parallel training of the multi-layer RNN parallel model on the multi-core CPU.
In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing core and control core of the terminal, adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor provided by this embodiment of the invention can be used to run the method for implementing the multi-level RNN parallel model on a multi-core CPU, comprising the following steps:
dividing an original input sequence into a plurality of minimum subsequences using the SRNN segmentation method for intra-layer parallelism of the multi-layer RNN model, and applying the divided multi-layer RNN to each minimum subsequence; analyzing the data dependencies in the multi-layer sub-RNN acting on each minimum subsequence by an inter-layer parallel method, and parallelizing the cyclic units of different layers and different time steps; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is model parallelism within a core group: a core group consisting of a plurality of processor cores sharing a cache calculates the multi-layer sub-RNN used for a minimum subsequence, different parts of the model are distributed within the core group, and the obtained calculation results are used for data-parallel calculation between core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallelism between core groups: each core group obtains the output of its multi-layer sub-RNN through calculation, and the outputs are used as the inputs of the additional cyclic units for data-parallel calculation between core groups, realizing parallel training of the multi-layer RNN parallel model on the multi-core CPU.
In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor can load and execute one or more instructions stored in the computer readable storage medium to realize the corresponding steps of the method for realizing the multi-level RNN parallel model on the multi-core CPU in the embodiment; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:
dividing an original input sequence into a plurality of minimum subsequences using the SRNN segmentation method for intra-layer parallelism of the multi-layer RNN model, and applying the divided multi-layer RNN to each minimum subsequence; analyzing the data dependencies in the multi-layer sub-RNN acting on each minimum subsequence by an inter-layer parallel method, and parallelizing the cyclic units of different layers and different time steps; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is model parallelism within a core group: a core group consisting of a plurality of processor cores sharing a cache calculates the multi-layer sub-RNN used for a minimum subsequence, different parts of the model are distributed within the core group, and the obtained calculation results are used for data-parallel calculation between core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallelism between core groups: each core group obtains the output of its multi-layer sub-RNN through calculation, and the outputs are used as the inputs of the additional cyclic units for data-parallel calculation between core groups, realizing parallel training of the multi-layer RNN parallel model on the multi-core CPU.
The invention solves the problem that the traditional multi-layer RNN is slow: by improving the overall structure, the RNN can be trained in parallel. Experimental results on data sets such as weather-prediction time series and natural language processing show that the multi-layer RNN parallel model provided by the invention is up to 4 times faster than the traditional RNN while keeping a comparable accuracy. When a traditional three-layer RNN model with a time-step length of 24 is changed into the multi-layer RNN parallel model structure, the training speed is improved by about 2.9 times; when a traditional five-layer RNN model with a time-step length of 48 is changed into the multi-layer RNN parallel model structure, the training speed is improved by about 3.7 times. Therefore, the larger the scale of the recurrent neural network, that is, the longer the time sequence or the greater the number of network layers, the better the parallel effect and the more obvious the speed-up.
In summary, the multi-level RNN parallel model of the present invention, and the method and system for implementing it on a multi-core CPU, combine intra-layer and inter-layer parallel modes of the multi-layer RNN and map this multi-level parallelism onto the FT processor. For processors with other multi-core architectures, the method provided by the invention can likewise be used to construct a multi-level RNN parallel model, dividing the data and the model in parallel according to the architecture of the processor.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.
Claims (10)
1. A realization method of an RNN parallel model on a multi-core CPU is characterized in that an SRNN segmentation method is used for layer-in-parallel of a multi-layer RNN model, an original input sequence is divided into a plurality of minimum subsequences, and the segmented multi-layer RNN acts on each minimum subsequence; analyzing the dependency relationship of data in the multi-layer sub-RNN acting on each minimum subsequence by adopting an interlayer parallel method, and parallelizing the cycle units of different layers and different time steps to obtain a multi-layer RNN parallel model; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is that the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating the multi-level sub-RNN used for the minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for carrying out data parallel calculation between the core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallel between core groups, each core group obtains the output of multi-layer sub-RNNs through calculation, the output is used as the input of an additional circulation unit to perform data parallel calculation between the core groups, and parallel training of the RNN parallel model on the multi-core CPU is achieved.
2. The method according to claim 1, wherein the intra-layer parallelism using the SRNN segmentation method specifically comprises:
dividing the input sequence X into N subsequences of equal length, the length of each subsequence being T/N; dividing each subsequence again into N equal-length subsequences, and repeating this operation k times to obtain N^k minimum subsequences, each minimum subsequence having a length of T/N^k; and introducing additional cyclic units to obtain the high-level feature information between the cut subsequences so as to construct the SRNN model, wherein each minimum subsequence uses the cut H-layer sub-RNN to extract features.
3. The method of claim 2, wherein the time complexity in constructing the SRNN network is O(H · T/N^k + kN).
4. The method according to claim 1, wherein parallelizing the different layers and time steps is embodied as:
in the layers of the multi-layer RNN parallel model, the original multi-layer RNN is divided along the time-step dimension according to the size of the minimum subsequence, the divided multi-layer sub-RNNs extract the minimum-subsequence features, and additional cyclic units are introduced to obtain the high-level feature information between the cut subsequences; between the layers of the multi-layer RNN parallel model, inter-layer parallelization is performed on each multi-layer sub-RNN acting on a minimum subsequence.
5. The method of claim 4, wherein, when the sub-RNN with H layers is used to extract the minimum subsequence features, the calculation result h_t^l of the t-th time step of the l-th layer serves as the input of the same time step of the next layer, h_t^{l+1}, and of the next time step of the same layer, h_{t+1}^l; the computational complexity of the multi-layer sub-RNN acting on the minimum subsequence becomes O(H + T/N^k - 1).
6. The method according to claim 4, wherein, after multi-layer parallelization is performed on the multi-layer sub-RNN acting on the minimum subsequence according to the data dependencies of different layers and different time steps, when the layer-1 cyclic unit h_1^1 in the 1st time step has been calculated, h_1^1 is taken as the input of the layer-2 cyclic unit h_1^2 in the 1st time step and the layer-1 cyclic unit h_2^1 in the 2nd time step; when the data input conditions of h_1^2 and h_2^1 are both satisfied, h_1^2 and h_2^1 are calculated simultaneously; the outputs of h_1^2 and h_2^1 are then taken as the inputs of h_1^3, h_2^2 and h_3^1, which are calculated in parallel, wherein h_1^3 is the layer-3 cyclic unit in the 1st time step, h_2^2 is the layer-2 cyclic unit in the 2nd time step, and h_3^1 is the layer-1 cyclic unit in the 3rd time step.
7. The method of claim 1, wherein the first-level mapping of the multi-level RNN parallel model on the multi-core CPU is specifically:
a core group consisting of processor cores sharing the same last-level cache is used to calculate the multi-layer sub-RNN used for the minimum subsequence; the multi-layer sub-RNN in the model adopts model parallelism, distributing the units of different layers and different time steps that can be calculated in parallel to different processor cores.
8. The method according to claim 7, wherein the second-level mapping of the multi-level RNN parallel model on the multi-core CPU is specifically:
each core group is used to calculate the multi-layer sub-RNN used for a minimum subsequence to obtain the final output of that multi-layer sub-RNN, the final outputs are used as the inputs of the additional cyclic units, and the stacked-RNN calculation is then performed, the core groups communicating through shared memory; during the stacked-RNN calculation, the model in each core group is the same and the inputs are different, and the data between the core groups is mapped in parallel.
9. An RNN parallel model for use in the method of claim 1, wherein the input sequence of the RNN parallel model is X = [x_1, x_2, …, x_T], the scale of the network expanded along the time dimension is H × T, where H is the number of hidden layers and T is the number of time steps of the sequence, and the cyclic unit in the RNN parallel model is a SimpleRNN unit, an LSTM unit or a GRU unit.
10. A realization system of RNN parallel model on multi-core CPU is characterized by comprising:
the dividing module is used for dividing the original input sequence into a plurality of minimum subsequences by using the SRNN (sliced recurrent neural network) segmentation method for intra-layer parallelism of the multi-layer RNN model, and applying the divided multi-layer RNN to each minimum subsequence;
the parallelization module analyzes the dependency relationship of data in the multi-layer sub-RNN acting on each minimum subsequence by adopting an interlayer parallelization method, and parallelizes the cycle units of different layers and different time steps to obtain a multi-layer RNN parallelization model;
the first mapping module is used for mapping a multi-layer RNN parallel model on a multi-core CPU in a first-level mode, namely, the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating multi-layer sub-RNNs used for a minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for performing data parallel calculation between the core groups;
the second mapping module is used for the second-level mapping of the multi-layer RNN parallel model on the multi-core CPU, namely data parallelism between core groups; each core group obtains the output of its multi-layer sub-RNN through calculation, and the outputs are used as the inputs of the additional cyclic units for data-parallel calculation between the core groups, realizing parallel training of the multi-layer RNN parallel model on the multi-core CPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111204314.XA CN114154616B (en) | 2021-10-15 | 2021-10-15 | RNN parallel model and method and system for implementing RNN parallel model on multi-core CPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114154616A true CN114154616A (en) | 2022-03-08 |
CN114154616B CN114154616B (en) | 2023-08-18 |
Family
ID=80462735
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111204314.XA Active CN114154616B (en) | 2021-10-15 | 2021-10-15 | RNN parallel model and method and system for implementing RNN parallel model on multi-core CPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114154616B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3179415A1 (en) * | 2015-12-11 | 2017-06-14 | Baidu USA LLC | Systems and methods for a multi-core optimized recurrent neural network |
CN106875013A (en) * | 2015-12-11 | 2017-06-20 | 百度(美国)有限责任公司 | The system and method for optimizing Recognition with Recurrent Neural Network for multinuclear |
US20190311245A1 (en) * | 2018-04-09 | 2019-10-10 | Microsoft Technology Licensing, Llc | Deep learning model scheduling |
CN109086865A (en) * | 2018-06-11 | 2018-12-25 | 上海交通大学 | A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network |
Non-Patent Citations (3)
Title |
---|
冯诗影; 韩文廷; 金旭; 迟孟贤; 安虹: "Training acceleration method for recurrent neural networks in speech recognition models", Journal of Chinese Computer Systems, no. 12 *
庄连生; 吕扬; 杨健; 李厚强: "Time-frequency joint long-term recurrent neural network", Journal of Computer Research and Development, no. 12 *
陈虎; 高波涌; 陈莲娜; 余翠: "Sentiment classification model combining attention mechanism and bidirectional sliced GRU", Journal of Chinese Computer Systems, no. 09 *
Also Published As
Publication number | Publication date |
---|---|
CN114154616B (en) | 2023-08-18 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |