CN114154616A - RNN parallel model and implementation method and system thereof on multi-core CPU - Google Patents

RNN parallel model and implementation method and system thereof on multi-core CPU

Info

Publication number
CN114154616A
Authority
CN
China
Prior art keywords
rnn
parallel
layer
model
core
Prior art date
Legal status
Granted
Application number
CN202111204314.XA
Other languages
Chinese (zh)
Other versions
CN114154616B (en)
Inventor
董小社
余星达
陈维多
何欣瑞
王强
陈衡
王龙翔
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202111204314.XA priority Critical patent/CN114154616B/en
Publication of CN114154616A publication Critical patent/CN114154616A/en
Application granted granted Critical
Publication of CN114154616B publication Critical patent/CN114154616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an RNN parallel model and a method and system for implementing it on a multi-core CPU. Intra-layer and inter-layer parallel optimization is performed on the multi-layer RNN structure: the original input sequence is divided into a number of minimum subsequences, and the segmented multi-layer RNN acts on each minimum subsequence; an inter-layer parallel method then analyzes the data dependencies within the multi-layer sub-RNN acting on each minimum subsequence and parallelizes the loop units of different layers and different time steps. Model parallelism within a core group maps the multi-level RNN parallel model onto the first level of the multi-core CPU, and data parallelism between core groups maps it onto the second level, so that the characteristics of the multi-core CPU architecture are fully exploited and the multi-level RNN parallel model is trained in parallel on the multi-core CPU. The invention makes full use of the multi-core processor architecture, realizes a finer-grained parallel mode for the recurrent neural network, and accelerates training of the network.

Description

RNN parallel model and implementation method and system thereof on multi-core CPU
Technical Field
The invention belongs to the technical field of artificial intelligence and parallel computing, and particularly relates to an RNN parallel model and a method and a system for realizing the RNN parallel model on a multi-core CPU.
Background
With the rapid development of deep learning, recurrent neural networks (RNNs) have been widely applied to natural language processing and time-series tasks, including artificial-intelligence applications such as sentiment classification, machine translation, intelligent question answering, and sequence prediction. An RNN can capture the order information of an input sequence and pass information from one time step to the next, thereby capturing correlations in the sequence over time. Because the original RNN suffers from vanishing and exploding gradients, long short-term memory networks (LSTM) and gated recurrent units (GRU) are currently the most widely used recurrent units and achieve good performance on a variety of tasks. As data volumes grow and networks become deeper, feed-forward and convolutional neural networks (CNNs) can be parallelized successfully because they do not require any internal state linking past and future data. During RNN inference and training, however, data dependencies make RNNs difficult to parallelize, so training them takes a great deal of time, which limits academic research and industrial application.
To address this problem, some studies replace the RNN with a CNN for natural language processing and time-series tasks; although this exploits the parallel acceleration of CNNs, a conventional CNN cannot capture the order information of the sequence, so accuracy suffers to some extent. Other studies improve RNN training speed by improving the recurrent unit, for example the Simple Recurrent Unit (SRU) and the Minimal Gated Unit (MGU); these simplify the gating mechanism and reduce the number of trainable parameters, achieving good acceleration, but they do not change the connection pattern of the conventional recurrent neural network. Research that does change the connection pattern, such as the Sliced Recurrent Neural Network (SRNN), achieves parallelization by dividing the sequence into several subsequences and introduces extra parameters to obtain high-level feature information across the divided sequences; without changing the gating mechanism of the recurrent unit, this is faster than the connection pattern of the conventional RNN model, but the SRNN mainly uses a single-layer RNN to extract features from the minimum subsequences. For more complex multivariate sequence data, a multi-layer RNN can extract more features and achieve better results than a single-layer RNN; however, in a multi-layer RNN there are data dependencies between different time steps within the same layer and between different layers within the same time step, since each layer needs the output of the previous layer to perform further feature extraction.
It is therefore of great significance to parallelize and optimize the conventional multi-layer RNN model, and to distribute its sequence data and network model reasonably so as to make full use of the characteristics of the multi-core CPU architecture.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the above deficiencies in the prior art, an RNN parallel model and a method and system for implementing it on a multi-core CPU. The multi-level RNN parallel model performs intra-layer and inter-layer multi-level parallel optimization on the conventional multi-layer RNN structure and maps this multi-level parallel mode onto a multi-core processor; through a fine-grained data and model partitioning method, the characteristics of the multi-core CPU architecture are fully exploited, a finer-grained parallel mode for the recurrent neural network is realized, and training of the network is accelerated.
The invention adopts the following technical scheme:
a method for realizing RNN parallel model on multi-core CPU, using SRNN segmentation method for layer parallel of multi-layer RNN parallel model, dividing original input sequence into many minimum subsequences, and acting segmented multi-layer RNN on each minimum subsequence; analyzing the dependency relationship of data in the multi-layer sub-RNN acting on each minimum subsequence by adopting an interlayer parallel method, and parallelizing the cycle units of different layers and different time steps to obtain a multi-layer RNN parallel model; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is that the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating the multi-level sub-RNN used for the minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for carrying out data parallel calculation between the core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallel between core groups, each core group obtains the output of the multi-level sub-RNN through calculation, the output is used as the input of an additional circulation unit to perform data parallel calculation between the core groups, and parallel training of the multi-level RNN parallel model on the multi-core CPU is achieved.
Specifically, the intra-layer SRNN segmentation method comprises the following steps:
the input sequence X is divided into n equal-length subsequences, each of length T/n; each subsequence is divided again into n equal-length subsequences, and the operation is repeated k times, yielding n^k minimum subsequences, each of length l = T/n^k; extra loop units are introduced to obtain high-level feature information across the segmented subsequences, the SRNN model is constructed, and each minimum subsequence uses the segmented H-layer sub-RNN to extract features.
Further, when the SRNN network is constructed, the time complexity is H·T/n^k + n·k, where H·T/n^k represents the computation time required by the multi-layer RNN acting on the smallest subsequence and n·k represents the computation time required by the additionally added loop units.
Specifically, parallelizing the different layers and different time steps comprises:
within the layers of the multi-layer RNN parallel model, the multi-layer RNN is divided along the time-step dimension according to the size of the minimum subsequence, the divided multi-layer RNN extracts the features of each minimum subsequence, and extra loop units are introduced to obtain high-level feature information across the segmented sequences; between the layers of the multi-layer RNN parallel model, each multi-layer sub-RNN acting on a minimum subsequence is parallelized across layers.
Further, when the H-layer sub-RNN is used to extract the features of a minimum subsequence, the calculation result a_t^h of the t-th time step of the h-th layer (the subscript denoting the time step and the superscript the layer) serves as the input of the same time step of the next layer, a_t^(h+1), and of the next time step of the same layer, a_(t+1)^h; the computational complexity of the multi-layer sub-RNN acting on the smallest subsequence becomes H + l - 1.
Further, after the multi-layer RNN acting on the minimum subsequence is parallelized across layers according to the data dependencies of different layers and different time steps: once the layer-1 loop unit of time step 1, a_1^1, has finished its calculation, a_1^1 serves as the input of the layer-2 loop unit of time step 1, a_1^2, and of the layer-1 loop unit of time step 2, a_2^1; when the data-input conditions of a_1^2 and a_2^1 are both satisfied, the two are calculated simultaneously; their outputs then serve as the inputs of a_1^3, a_2^2 and a_3^1, which are calculated in parallel, where a_1^3 is the layer-3 loop unit of time step 1, a_2^2 the layer-2 loop unit of time step 2, and a_3^1 the layer-1 loop unit of time step 3.
Specifically, the first-level mapping of the multi-level RNN parallel model onto the multi-core CPU is as follows:
a core group consisting of processor cores that share the same last-level cache computes the multi-layer sub-RNN for a minimum subsequence, and the multi-layer sub-RNN in the model adopts model parallelism within the group, distributing the units of different layers and different time steps that can be computed in parallel to different processor cores.
Further, the second-level mapping of the multi-level RNN parallel model onto the multi-core CPU is as follows:
each core group computes the multi-layer sub-RNN for its minimum subsequence to obtain the final output of that multi-layer sub-RNN, which serves as the input of the additional loop units for the subsequent stacked-RNN calculation; the core groups communicate through shared memory. During the stacked-RNN calculation the model in each core group is identical while the inputs differ, and the data between core groups are mapped in parallel.
Another technical solution of the invention is an RNN parallel model: the input sequence of the multi-layer RNN model is X = [x_1, x_2, ..., x_T], the network unrolled along the time dimension has scale H × T, where H is the number of hidden layers and T is the number of time steps of the sequence, and the loop units in the multi-level RNN parallel model may be SimpleRNN units, LSTM units or GRU units.
Another technical solution of the present invention is a system for implementing an RNN parallel model on a multi-core CPU, comprising:
a partitioning module, which, for the intra-layer parallelism of the multi-layer RNN model, uses the SRNN (sliced recurrent neural network) segmentation method to divide the original input sequence into a number of minimum subsequences and applies the segmented multi-layer RNN to each minimum subsequence;
a parallelization module, which analyzes the data dependencies within the multi-layer sub-RNN acting on each minimum subsequence using an inter-layer parallel method and parallelizes the loop units of different layers and different time steps, yielding the multi-level RNN parallel model;
a first mapping module, which performs the first-level mapping of the multi-level RNN parallel model onto the multi-core CPU, namely model parallelism within a core group: a core group consisting of several processor cores sharing a cache computes the multi-layer sub-RNN for a minimum subsequence, different parts of the model are distributed within the core group, and the resulting outputs are used for the data-parallel computation between core groups; and
a second mapping module, which performs the second-level mapping of the multi-level RNN parallel model onto the multi-core CPU, namely data parallelism between core groups: each core group computes the output of its multi-layer sub-RNN, which serves as the input of the additional loop units for the data-parallel computation between core groups, realizing parallel training of the multi-level RNN parallel model on the multi-core CPU.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a method for realizing an RNN parallel model on a multi-core CPU (Central processing Unit). according to the data dependency of a multi-level RNN parallel model and the architectural characteristics of a multi-core processor, the multi-level RNN parallel model is mapped to the multi-core processor for training; by the fine-grained data and model division method, the characteristics of the multi-core processor architecture are fully utilized, and the parallel training method of the multilayer cyclic neural network with finer granularity is realized.
Furthermore, the intra-layer parallelism uses the SRNN segmentation method: the original input sequence is divided into a number of minimum subsequences, extra loop units are introduced to obtain high-level feature information across the segmented sequences, and the segmented multi-layer sub-RNNs act on the minimum subsequences; since there is no data dependency between the multi-layer sub-RNNs, they can be computed in parallel. When the model is constructed with this intra-layer parallel method, the time complexity changes from H·T for the conventional multi-layer RNN model to H·T/n^k + n·k.
Furthermore, the inter-layer parallelism analyzes the data dependencies within the multi-layer sub-RNN acting on a minimum subsequence and parallelizes the loop units of different layers and different time steps. By exploiting the potential parallelism and the data dependencies within the multi-layer sub-RNN, the sub-RNN acting on a subsequence is further parallelized without changing the network structure. Constructing the model further with this inter-layer parallel method yields the multi-level RNN parallel model, whose time complexity changes from H·T/n^k + n·k of the previous step to (H + T/n^k - 1) + n·k.
Further, for the proposed multi-level RNN parallel model, an implementation method for the multi-core processor architecture is designed: a core group consisting of processor cores sharing the same last-level cache computes the multi-layer sub-RNN for a minimum subsequence, and the units of different layers and different time steps that can be computed in parallel are distributed to different processor cores in a model-parallel manner. Communication between these units is frequent, but only hidden-state information is transferred, so the communication volume is small, and the shared cache of the processor cores makes this model parallelism more effective.
Further, each core group computes the multi-layer sub-RNN for its minimum subsequence to obtain the final output of that sub-RNN, which serves as the input of the additional loop units for the subsequent stacked-RNN calculation; the core groups communicate through shared memory. During the stacked-RNN calculation the model in each core group is identical, the inputs differ and communication is infrequent, so the data between core groups are mapped in parallel in this manner.
The RNN parallel model of the invention changes the structure of the conventional multi-layer RNN network model. Its multi-level parallel structure is as follows: within the layers, the SRNN segmentation method divides the original input sequence into a number of minimum subsequences and the segmented multi-layer sub-RNNs act on each minimum subsequence; between the layers, the data dependencies within the multi-layer sub-RNN acting on each minimum subsequence are analyzed so that loop units of different layers and different time steps are computed in parallel, further optimizing the multi-layer sub-RNN. In this way the multi-level RNN parallel model is formed, the conventional multi-layer RNN model is partitioned at a finer granularity, and training is accelerated.
In summary, the method of the invention builds intra-layer and inter-layer multi-level parallelism into the conventional multi-layer RNN model and maps this multi-level parallel mode onto the multi-core processor; through the fine-grained data and model partitioning method, the characteristics of the multi-core processor architecture are fully exploited, a finer-grained parallel mode for the recurrent neural network is realized, and training of the network is accelerated.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic diagram of a conventional multi-layer RNN model;
FIG. 2 is a schematic diagram of intra-layer parallel partitioning of a multi-layer RNN;
FIG. 3 is a schematic diagram of a multi-layer RNN, in which (a) shows the multi-layer RNN acting on a minimum subsequence and (b) shows the inter-layer parallel optimization;
FIG. 4 is a schematic diagram of an RNN parallel model;
FIG. 5 is a schematic diagram of an FT-2000+/64 multi-core processor architecture;
FIG. 6 is a diagram illustrating a mapping method of an RNN parallel model on a multi-core processor.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
The invention provides a multi-level RNN parallel model and a method for implementing it on a multi-core CPU. The multi-level RNN parallel model performs intra-layer and inter-layer multi-level parallel optimization on the conventional multi-layer RNN structure and maps this multi-level parallel mode onto a multi-core processor. The invention proceeds as follows: for the multi-layer RNN model, the intra-layer parallelism uses the SRNN segmentation method and the segmented multi-layer RNN acts on each minimum subsequence; the inter-layer parallelism analyzes the data dependencies within the multi-layer sub-RNN acting on a minimum subsequence and parallelizes the different layers and different time steps. The multi-level RNN parallel model is constructed from these two parallel modes, and an implementation method on the multi-core processor is designed at the model-parallel level and the data-parallel level.
The method of the invention for implementing an RNN parallel model on a multi-core CPU is described below, taking the FT-2000+ multi-core processor as an example, and comprises the following steps:
s1, constructing a multilayer RNN model;
constructing a multilayer RNN model according to the size of an input sequence, wherein the input sequence X is [ X ]1,x2,...,xT]Which isAnd the middle T represents the time step of the sequence, the number of hidden layers is H, and the network size of the constructed multilayer RNN model expanded from the time dimension is H x T. The loop elements in the multi-layer RNN model are selected from basic SimpleRNN elements, LSTM elements, GRU elements, or other modified loop elements.
S2, applying the SRNN segmentation method within the layers (along the time-step dimension) of the multi-layer RNN model constructed in step S1: the original input sequence is divided into a number of minimum subsequences, and the segmented multi-layer RNN acts on each minimum subsequence;
The input sequence X is first divided into n equal-length subsequences, so that X can be expressed as X = [N_1, N_2, ..., N_n], each subsequence of length T/n. Each subsequence is then divided again into n equal-length subsequences, and this operation is repeated k times until minimum subsequences of suitable size are obtained; there are n^k minimum subsequences, each of length l = T/n^k. Extra loop units are introduced to obtain high-level feature information across the segmented sequences, and the SRNN is constructed according to this division, the minimum subsequences using the segmented H-layer sub-RNNs to extract sequence features. The multi-layer sub-RNN that processes each minimum subsequence therefore has scale H × T/n^k, and these sub-RNNs are mutually parallel.
When the SRNN is constructed in this way, the time complexity is H·T/n^k + n·k, where H·T/n^k represents the computation time required by the multi-layer RNN acting on the smallest subsequence and n·k represents the computation time required by the additionally added loop units.
Consider a conventional multi-layer RNN model for a sequence with T = 16 time steps and H = 4 layers; unrolled along the time dimension it is shown in FIG. 1. If the sequence is divided into n = 2 equal-length subsequences at each step and the division is repeated k = 2 times, the length of each minimum subsequence is l = 4; with the extra loop units added, the structure of the neural network after intra-layer parallel division is shown in FIG. 2.
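The segmentation itself can be sketched in a few lines. The following is an illustration under the example dimensions above (T = 16, n = 2, k = 2, H = 4); function and variable names are ours rather than from the patent, and the step counts printed at the end correspond to the complexity expressions H·T, H·T/n^k + n·k and (H + T/n^k - 1) + n·k used in this description:
```python
# Minimal sketch of the intra-layer (SRNN-style) segmentation of step S2,
# with the example dimensions T = 16, n = 2, k = 2, H = 4.
def slice_sequence(seq, n, k):
    """Recursively split seq into n parts, k times, yielding n**k minimum subsequences."""
    parts = [seq]
    for _ in range(k):
        parts = [p[i * len(p) // n:(i + 1) * len(p) // n] for p in parts for i in range(n)]
    return parts

T, n, k, H = 16, 2, 2, 4
x = list(range(T))                        # stand-in for the input sequence x_1 ... x_T
mins = slice_sequence(x, n, k)
l = T // n ** k                           # length of each minimum subsequence
assert len(mins) == n ** k and all(len(m) == l for m in mins)

serial_steps    = H * T                   # conventional multi-layer RNN: H*T sequential unit evaluations
sliced_steps    = H * l + n * k           # after intra-layer segmentation (sub-RNNs run in parallel)
wavefront_steps = (H + l - 1) + n * k     # after the additional inter-layer parallelization of step S3
print(len(mins), l, serial_steps, sliced_steps, wavefront_steps)   # 4 4 64 20 11
```
With these dimensions the sequence is cut into 4 minimum subsequences of length 4, matching FIG. 2.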
S3, parallelizing different layers and different time steps by analyzing the data dependency relationship in the multi-layer sub RNN acting on the minimum subsequence;
when the sub-RNN with the H layer is used for extracting the minimum subsequence feature, the calculation result of the t-th time step of the H layer is obtained
Figure BDA0003306189240000104
In other words, it will be the same time step as the lower layer
Figure BDA0003306189240000105
And the next time step of the layer
Figure BDA0003306189240000106
The input of (a) is performed,
Figure BDA0003306189240000107
and
Figure BDA0003306189240000108
there is no data dependency, and therefore,
Figure BDA0003306189240000109
and
Figure BDA00033061892400001010
are computable in parallel. According to the analysis, the calculation result of the t time step of the h layer
Figure BDA00033061892400001011
The units that can be calculated in parallel are:
Figure BDA00033061892400001012
[z=min(H,l)]
With this parallel approach, the computational complexity of the multi-layer sub-RNN acting on the smallest subsequence is reduced from the original H·l to H + l - 1.
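As a quick worked check (our own arithmetic, not taken from the patent text), with the example dimensions used in this embodiment (H = 4 layers, minimum-subsequence length l = 4) the reduction is:
```latex
% Worked check (illustration only) of the inter-layer parallelization gain
% for H = 4 layers and minimum-subsequence length l = 4:
\[
  \underbrace{H \cdot l}_{\text{serial sub-RNN}} = 4 \times 4 = 16
  \qquad\longrightarrow\qquad
  \underbrace{H + l - 1}_{\text{wavefront schedule}} = 4 + 4 - 1 = 7 .
\]
```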
Therefore, for the neural network constructed in FIG. 2, the original calculation flow of the multi-layer sub-RNN on a minimum subsequence is shown in FIG. 3(a): the 4 loop units of the 1st time step are calculated first, in the order a_1^1, a_1^2, a_1^3, a_1^4; the 4 loop units of the 2nd time step are then calculated in the same way, and so on until the 4th time step has been processed, which is equivalent to a serial calculation. After the multi-layer sub-RNN acting on the minimum subsequence is parallelized across layers according to the data dependencies of different layers and different time steps, as shown in FIG. 3(b): once a_1^1 has been calculated, its result serves as the input of a_1^2 and a_2^1; when both of these inputs are available, a_1^2 and a_2^1 can be calculated simultaneously; their outputs then serve as the inputs of a_1^3, a_2^2 and a_3^1, which are calculated in parallel.
The time complexity of the multi-level RNN parallel model is therefore (H + T/n^k - 1) + n·k.
As shown in FIG. 4, the sequence data contain 16 time steps and are parallelized along two dimensions. Within the layers, the SRNN parallel mode of step S2 is used: the conventional multi-layer RNN model is divided along the time-step dimension according to the size of the minimum subsequence, the divided multi-layer RNN extracts the features of each minimum subsequence, and extra loop units are introduced to obtain high-level feature information across the segmented sequences. Between the layers, the parallel mode of this step is applied to each multi-layer sub-RNN acting on a minimum subsequence, giving the conventional multi-layer RNN model a finer-grained parallelization.
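The dependency analysis above amounts to an anti-diagonal (wavefront) schedule over the H × l grid of loop units. A minimal sketch of such a schedule is given below; it is an illustration only (the cell function is a placeholder rather than the patent's recurrent unit, and a real implementation would bind the parallel units to the processor cores of a core group as described in steps S4 and S5):
```python
# Sketch of the inter-layer (wavefront) schedule of step S3 for one minimum
# subsequence: the unit a_t^h at layer h, time step t depends only on
# a_(t-1)^h and a_t^(h-1), so all units with h + t constant can run in
# parallel. H = 4 layers and l = 4 time steps as in FIG. 3.
from concurrent.futures import ThreadPoolExecutor

H, l = 4, 4

def cell(h, t, below, left):
    """Stand-in loop unit; `below` and `left` are the hidden states it would consume."""
    return f"a_{t}^{h}"           # placeholder result

def wavefront(x):
    out = {}
    with ThreadPoolExecutor(max_workers=min(H, l)) as pool:
        # diagonal d groups the units with (h - 1) + (t - 1) == d
        for d in range(H + l - 1):
            ready = [(h, t) for h in range(1, H + 1) for t in range(1, l + 1)
                     if (h - 1) + (t - 1) == d]
            futures = {
                (h, t): pool.submit(cell, h, t,
                                    out.get((h - 1, t), x[t - 1] if h == 1 else None),
                                    out.get((h, t - 1)))
                for h, t in ready
            }
            out.update({key: f.result() for key, f in futures.items()})
    return [out[(H, t)] for t in range(1, l + 1)]   # top-layer outputs of the sub-RNN

print(wavefront(["x1", "x2", "x3", "x4"]))
```
Each diagonal contains at most min(H, l) units, so the whole grid is covered in H + l - 1 = 7 sequential stages, matching the complexity above.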
S4, the first-level mapping of the multi-level RNN parallel model onto the multi-core CPU: model parallelism within a core group;
The FT processor offers three parallel levels. As shown in FIG. 5, the 64 processor cores integrated on the chip are divided into 8 panels, each panel contains 2 clusters, and each cluster contains 4 processor cores.
For the proposed multi-level RNN parallel model, an implementation method matching this multi-core processor architecture is designed: the multi-layer sub-RNN in the model adopts model parallelism, so that the units of different layers and different time steps that can be computed in parallel are distributed to different processor cores. Communication between these units is frequent, but only hidden-state information is transferred, so the communication volume is small, and the characteristic that the processor cores share a cache makes this model parallelism more effective.
S5, the second-level mapping of the multi-level RNN parallel model onto the multi-core CPU: data parallelism between core groups.
Referring to FIG. 6, each core group from step S4 computes the multi-layer sub-RNN for its minimum subsequence, and the final output of that multi-layer sub-RNN serves as the input of the additional loop units for the stacked-RNN calculation; the core groups communicate through shared memory. During the stacked-RNN calculation the model in each core group is identical, the inputs differ and communication is infrequent, so the data between core groups are mapped in parallel in this manner.
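Putting the two mapping levels together, the following sketch mimics the structure with one worker process per core group. This is an assumption made for illustration: the patent's implementation targets FT-2000+ clusters whose cores share a cache and communicate through shared memory, which ordinary Python processes do not model, and sub_rnn and extra_recurrent_pass are placeholders for the multi-layer sub-RNN and the additional loop units:
```python
# Rough sketch of the two-level mapping of steps S4/S5 (illustration only):
# each worker process stands in for one core group and computes the
# multi-layer sub-RNN for its minimum subsequence; the per-group outputs
# then feed the additional loop units that stitch the segments together.
from multiprocessing import Pool

def sub_rnn(min_subseq):
    """Stand-in for the H-layer sub-RNN applied to one minimum subsequence."""
    return sum(min_subseq)          # placeholder for the sub-RNN's final output

def extra_recurrent_pass(outputs):
    """Stand-in for the additional loop units over the n**k group outputs."""
    state = 0.0
    for o in outputs:               # sequential recurrence over the group outputs
        state = 0.5 * state + o     # placeholder recurrence
    return state

if __name__ == "__main__":
    minimum_subseqs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    with Pool(processes=len(minimum_subseqs)) as pool:      # one process per core group
        group_outputs = pool.map(sub_rnn, minimum_subseqs)  # level 1: sub-RNNs computed group by group
    print(extra_recurrent_pass(group_outputs))              # level 2: outputs merged by the extra loop units
```
In the actual mapping each worker would additionally apply the model-parallel wavefront schedule of step S4 across the 4 cores of its cluster.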
In another embodiment of the present invention, an implementation system of an RNN parallel model on a multi-core CPU is provided, where the implementation system can be used to implement the implementation method of the RNN parallel model on the multi-core CPU, and specifically, the implementation system of the RNN parallel model on the multi-core CPU includes a partitioning module, a parallelization module, a first mapping module, and a second mapping module.
The partitioning module, for the intra-layer parallelism of the multi-layer RNN model, uses the SRNN (sliced recurrent neural network) segmentation method to divide the original input sequence into a number of minimum subsequences and applies the segmented multi-layer RNN to each minimum subsequence;
the parallelization module analyzes the data dependencies within the multi-layer RNN acting on each minimum subsequence using an inter-layer parallel method and parallelizes the loop units of different layers and different time steps;
the first mapping module performs the first-level mapping of the multi-level RNN parallel model onto the multi-core CPU, namely model parallelism within a core group: a core group consisting of several processor cores sharing a cache computes the multi-layer sub-RNN for a minimum subsequence, different parts of the model are distributed within the core group, and the resulting outputs are used for the data-parallel computation between core groups;
the second mapping module performs the second-level mapping of the multi-level RNN parallel model onto the multi-core CPU, namely data parallelism between core groups: each core group computes the output of its multi-layer sub-RNN, which serves as the input of the additional loop units for the data-parallel computation between core groups, realizing parallel training of the multi-level RNN parallel model on the multi-core CPU.
In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on; it is the computing and control core of the terminal and is adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor provided by this embodiment of the invention can be used to run the method for implementing the multi-level RNN parallel model on a multi-core CPU, which comprises the following steps:
using an SRNN segmentation method for intra-layer parallel of the multilayer RNN model, dividing an original input sequence into a plurality of minimum subsequences, and acting the segmented multilayer RNN on each minimum subsequence; analyzing the dependency relationship of data in the multi-layer RNN acting on each minimum subsequence by adopting an interlayer parallel method, and parallelizing the loop units of different layers and different time steps; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is that the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating the multi-level sub-RNN used for the minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for carrying out data parallel calculation between the core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallel between cores; and each core group obtains the output of the multi-layer sub-RNN through calculation, and the output is used as the input of an additional circulation unit to perform data parallel calculation between the core groups, so that the parallel training of a multi-layer RNN parallel model on the multi-core CPU is realized.
In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor can load and execute one or more instructions stored in the computer readable storage medium to realize the corresponding steps of the method for realizing the multi-level RNN parallel model on the multi-core CPU in the embodiment; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:
using an SRNN segmentation method for intra-layer parallel of the multilayer RNN model, dividing an original input sequence into a plurality of minimum subsequences, and acting the segmented multilayer RNN on each minimum subsequence; analyzing the dependency relationship of data in the multi-layer RNN acting on each minimum subsequence by adopting an interlayer parallel method, and parallelizing the loop units of different layers and different time steps; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is that the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating the multi-level sub-RNN used for the minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for carrying out data parallel calculation between the core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallel between cores; and each core group obtains the output of the multi-layer sub-RNN through calculation, and the output is used as the input of an additional circulation unit to perform data parallel calculation between the core groups, so that the parallel training of a multi-layer RNN parallel model on the multi-core CPU is realized.
The invention addresses the slow training speed of the conventional multi-layer RNN; by improving the overall structure, the RNN can be trained in parallel. Experimental results on data sets such as weather-prediction time series and natural language processing show that the proposed multi-level RNN parallel model is up to 4 times faster than the conventional RNN while maintaining comparable accuracy. When a conventional three-layer RNN model with time-step length 24 is converted into the multi-level RNN parallel model structure, training speed improves by about 2.9 times; when a conventional five-layer RNN model with time-step length 48 is converted, training speed improves by about 3.7 times. Thus, the larger the scale of the recurrent neural network, that is, the longer the time series or the greater the number of layers, the better the parallel effect and the more obvious the speed-up.
In summary, the multi-level RNN parallel model and the method and system for implementing it on a multi-core CPU build intra-layer and inter-layer multi-level parallelism into the multi-layer RNN and map this multi-level parallel mode onto the FT (Phytium) multi-core processor. For processors with other multi-core architectures, the method of the invention can likewise be used to construct the multi-level RNN parallel model and to partition the data and the model in parallel according to the processor architecture.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A realization method of an RNN parallel model on a multi-core CPU is characterized in that an SRNN segmentation method is used for layer-in-parallel of a multi-layer RNN model, an original input sequence is divided into a plurality of minimum subsequences, and the segmented multi-layer RNN acts on each minimum subsequence; analyzing the dependency relationship of data in the multi-layer sub-RNN acting on each minimum subsequence by adopting an interlayer parallel method, and parallelizing the cycle units of different layers and different time steps to obtain a multi-layer RNN parallel model; the first-level mapping method of the multi-level RNN parallel model on the multi-core CPU is that the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating the multi-level sub-RNN used for the minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for carrying out data parallel calculation between the core groups; the second-level mapping method of the multi-level RNN parallel model on the multi-core CPU is data parallel between core groups, each core group obtains the output of multi-layer sub-RNNs through calculation, the output is used as the input of an additional circulation unit to perform data parallel calculation between the core groups, and parallel training of the RNN parallel model on the multi-core CPU is achieved.
2. The method according to claim 1, wherein the intra-layer SRNN segmentation method specifically comprises:
dividing the input sequence X into n equal-length subsequences, each of length T/n; dividing each subsequence again into n equal-length subsequences, and repeating the operation k times to obtain n^k minimum subsequences, each of length l = T/n^k; and introducing extra loop units to obtain high-level feature information across the segmented subsequences and constructing the SRNN model, each minimum subsequence using the segmented H-layer sub-RNN to extract features.
3. The method of claim 2, wherein the time complexity when constructing the SRNN network is H·T/n^k + n·k, where H·T/n^k represents the computation time required by the multi-layer RNN acting on the smallest subsequence and n·k represents the computation time required by the additionally added loop units.
4. The method according to claim 1, wherein parallelizing the different layers and time steps is embodied as:
in layers of a multi-layer RNN parallel model, dividing an original multi-layer RNN from a time step dimension according to the size of a minimum subsequence, extracting the minimum subsequence feature by using the divided multi-layer RNN, and introducing an additional circulation unit to obtain high-layer feature information among cut sequences; and for the interlayer of the multi-layer RNN parallel model, performing interlayer parallelization on each multi-layer sub-RNN acting on the minimum subsequence.
5. The method of claim 4, wherein, when the H-layer sub-RNN is used to extract the features of a minimum subsequence, the calculation result a_t^h of the t-th time step of the h-th layer serves as the input of the same time step of the next layer, a_t^(h+1), and of the next time step of the same layer, a_(t+1)^h, and the computational complexity of the multi-layer sub-RNN acting on the smallest subsequence becomes H + l - 1.
6. The method according to claim 4, wherein, after the multi-layer sub-RNN acting on the minimum subsequence is parallelized across layers according to the data dependencies of different layers and different time steps: once the layer-1 loop unit of time step 1, a_1^1, has finished its calculation, a_1^1 serves as the input of the layer-2 loop unit of time step 1, a_1^2, and of the layer-1 loop unit of time step 2, a_2^1; when the data-input conditions of a_1^2 and a_2^1 are both satisfied, the two are calculated simultaneously; their outputs then serve as the inputs of a_1^3, a_2^2 and a_3^1, which are calculated in parallel, where a_1^3 is the layer-3 loop unit of time step 1, a_2^2 the layer-2 loop unit of time step 2, and a_3^1 the layer-1 loop unit of time step 3.
7. The method of claim 1, wherein the first-level mapping of the multi-level RNN parallel model on the multi-core CPU is specifically:
and a core group consisting of processor cores sharing the same last-level cache is used for calculating the multi-layer sub-RNN used for the minimum subsequence, and the multi-layer sub-RNN in the model adopts a model parallel mode to distribute units capable of performing parallel calculation in different layers and different time steps to different processor cores.
8. The method according to claim 7, wherein the second-level mapping of the multi-level RNN parallel model on the multi-core CPU is specifically:
using each core group to calculate a multi-layer sub-RNN used for the minimum subsequence to obtain the final output of the multi-layer RNN, using the final output as the input of an additional cycle unit, and then performing stack RNN calculation, wherein the core groups are communicated through a shared memory; when the stack RNN calculation is carried out, the models in each core group are the same, the inputs are different, and the data between the core groups are mapped in parallel.
9. An RNN parallel model for use in the method of claim 1, wherein the input sequence of the RNN parallel model is X = [x_1, x_2, ..., x_T], the network unrolled along the time dimension has scale H × T, where H is the number of hidden layers and T is the number of time steps of the sequence, and the loop units in the RNN parallel model are SimpleRNN units, LSTM units or GRU units.
10. A realization system of RNN parallel model on multi-core CPU is characterized by comprising:
the partitioning module is used for dividing the original input sequence into a plurality of minimum subsequences by using the SRNN (sliced recurrent neural network) segmentation method for the intra-layer parallelism of the multi-layer RNN model, and applying the segmented multi-layer RNN to each minimum subsequence;
the parallelization module analyzes the dependency relationship of data in the multi-layer sub-RNN acting on each minimum subsequence by adopting an interlayer parallelization method, and parallelizes the cycle units of different layers and different time steps to obtain a multi-layer RNN parallelization model;
the first mapping module is used for mapping a multi-layer RNN parallel model on a multi-core CPU in a first-level mode, namely, the model in a core group is parallel, the core group consisting of a plurality of processor cores sharing cache is used for calculating multi-layer sub-RNNs used for a minimum subsequence, different parts of the model are distributed in the core group, and the obtained calculation result is used for performing data parallel calculation between the core groups;
the second mapping module is used for performing the second-level mapping of the multi-level RNN parallel model on the multi-core CPU, namely data parallelism between core groups: each core group obtains the output of its multi-layer sub-RNN through calculation, and this output is used as the input of the additional loop units for the data-parallel calculation between core groups, so that parallel training of the multi-level RNN parallel model on the multi-core CPU is realized.
CN202111204314.XA 2021-10-15 2021-10-15 RNN parallel model and method and system for implementing RNN parallel model on multi-core CPU Active CN114154616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111204314.XA CN114154616B (en) 2021-10-15 2021-10-15 RNN parallel model and method and system for implementing RNN parallel model on multi-core CPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111204314.XA CN114154616B (en) 2021-10-15 2021-10-15 RNN parallel model and method and system for implementing RNN parallel model on multi-core CPU

Publications (2)

Publication Number Publication Date
CN114154616A true CN114154616A (en) 2022-03-08
CN114154616B CN114154616B (en) 2023-08-18

Family

ID=80462735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111204314.XA Active CN114154616B (en) 2021-10-15 2021-10-15 RNN parallel model and method and system for implementing RNN parallel model on multi-core CPU

Country Status (1)

Country Link
CN (1) CN114154616B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3179415A1 (en) * 2015-12-11 2017-06-14 Baidu USA LLC Systems and methods for a multi-core optimized recurrent neural network
CN109086865A (en) * 2018-06-11 2018-12-25 上海交通大学 A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network
US20190311245A1 (en) * 2018-04-09 2019-10-10 Microsoft Technology Licensing, Llc Deep learning model scheduling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3179415A1 (en) * 2015-12-11 2017-06-14 Baidu USA LLC Systems and methods for a multi-core optimized recurrent neural network
CN106875013A (en) * 2015-12-11 2017-06-20 百度(美国)有限责任公司 The system and method for optimizing Recognition with Recurrent Neural Network for multinuclear
US20190311245A1 (en) * 2018-04-09 2019-10-10 Microsoft Technology Licensing, Llc Deep learning model scheduling
CN109086865A (en) * 2018-06-11 2018-12-25 上海交通大学 A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
冯诗影; 韩文廷; 金旭; 迟孟贤; 安虹: "Training acceleration method for recurrent neural networks in speech recognition models" (循环神经网络在语音识别模型中的训练加速方法), Journal of Chinese Computer Systems (小型微型计算机系统), no. 12 *
庄连生; 吕扬; 杨健; 李厚强: "Time-frequency combined long-term recurrent neural network" (时频联合长时循环神经网络), Journal of Computer Research and Development (计算机研究与发展), no. 12 *
陈虎; 高波涌; 陈莲娜; 余翠: "Sentiment classification model combining an attention mechanism with bidirectional sliced GRU" (结合注意力机制与双向切片GRU的情感分类模型), Journal of Chinese Computer Systems (小型微型计算机系统), no. 09 *

Also Published As

Publication number Publication date
CN114154616B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
Liu et al. Implementation of training convolutional neural networks
Chen et al. Big data deep learning: challenges and perspectives
WO2021233342A1 (en) Neural network construction method and system
Zainab et al. Fpga based implementations of rnn and cnn: A brief analysis
WO2021218517A1 (en) Method for acquiring neural network model, and image processing method and apparatus
CN109496294A (en) The Compilation Method and system of artificial intelligence process device, storage medium and terminal
US11429855B2 (en) Acceleration of neural networks using depth-first processing
CN110991631A (en) Neural network acceleration system based on FPGA
Jin et al. Training large scale deep neural networks on the intel xeon phi many-core coprocessor
WO2022007867A1 (en) Method and device for constructing neural network
CN110543939A (en) hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN114742225A (en) Neural network reasoning acceleration method based on heterogeneous platform
CN114492723A (en) Neural network model training method, image processing method and device
Xiyuan et al. A Review of FPGA‐Based Custom Computing Architecture for Convolutional Neural Network Inference
Park et al. Speculative backpropagation for CNN parallel training
Sun et al. Optimized light-weight convolutional neural networks for histopathologic cancer detection
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
US20230306236A1 (en) Device and method for executing lstm neural network operation
CN114519425A (en) Convolution neural network acceleration system with expandable scale
Wai et al. A scalable FPGA based accelerator for Tiny-YOLO-v2 using OpenCL
CN114154616A (en) RNN parallel model and implementation method and system thereof on multi-core CPU
CN109978143B (en) Stack type self-encoder based on SIMD architecture and encoding method
CN114008636A (en) Optimizing machine learning model performance
CN111160535A (en) DGCNN model acceleration method based on Hadoop
WO2023071658A1 (en) Ai model processing method and apparatus, and ai model computing method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant