CN116431562B - Multi-head attention mechanism fusion calculation distribution method based on acceleration processor


Info

Publication number
CN116431562B
Authority
CN
China
Prior art keywords
fusion
operator
data
calculation
slave core
Prior art date
Legal status
Active
Application number
CN202310687654.5A
Other languages
Chinese (zh)
Other versions
CN116431562A (en)
Inventor
徐旎林 (Xu Nilin)
闫夏超 (Yan Xiachao)
高伟 (Gao Wei)
Current Assignee
Taichu Wuxi Electronic Technology Co., Ltd.
Original Assignee
Taichu Wuxi Electronic Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Taichu Wuxi Electronic Technology Co., Ltd.
Priority to CN202310687654.5A
Publication of CN116431562A
Application granted
Publication of CN116431562B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/80 - Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/30007 - Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 - Arithmetic instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the field of data processing and discloses a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, comprising the following steps: acquiring the slave core information, the data to be processed in the memory, and the calculation requirements of the data to be processed; performing fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain fusion operators and the calculation logic corresponding to each fusion operator; and sequentially calling the interfaces corresponding to the fusion operators to start the slave cores, so that each slave core calculates the data to be processed with the corresponding operators according to the calculation logic of each fusion operator in turn, obtaining the calculation result. Combining operators into fusion operators makes full use of hardware resources: compared with processing only a single operator at a time, it reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.

Description

Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor.
Background
With the growth of computing power, deep learning has become an important breakthrough in the field of artificial intelligence over the last decade. Increasingly complex applications must process ever larger volumes of data, driving the need for more efficient and faster supporting frameworks. To improve a framework's computational efficiency, the specific hardware structure must be combined with the algorithmic structure of each operator so that the two match and hardware resources are fully used, achieving the best computational performance.
Most current AI acceleration processors adopt a heterogeneous many-core design, which delivers efficient computing performance while reducing power consumption to some extent. The heterogeneous many-core architecture comprises a master core and a slave core array that share memory space. In the prior art, however, each calculation must write the operator's output back to memory, and before the next calculation the data must be read from main memory back into the slave-core cache. This repeated data movement inflates the volume of memory traffic, occupies bandwidth, and degrades performance.
Disclosure of Invention
In view of this, the embodiment of the invention provides a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, so as to solve the problems of large data access quantity and low efficiency in the data distribution calculation process in the prior art.
In a first aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion computation allocation method based on an acceleration processor, which is applied to a main core of the acceleration processor, where the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processes operators of a plurality of operation types, and the method includes:
acquiring slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
and sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator in sequence, and a calculation result is obtained.
By reasonably fusing parts of the calculation process, the fusion-operator combination makes full use of the slave cores. Compared with processing only a single operator at a time, fusing operators reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, stores intermediate results directly in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
In an optional implementation manner, the performing fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator and a third fusion operator includes:
extracting the operation type of each operator in each slave core from the slave core information;
and carrying out fusion association on the operators according to the calculation requirement and the operation type to obtain a first fusion operator, a second fusion operator and a third fusion operator.
By dividing and merging operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
In a second aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion computation allocation method based on an acceleration processor, which is applied to a slave core of the acceleration processor, where the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types, and the method includes:
acquiring an interface for starting a fusion operator corresponding to a slave core, and determining a current fusion operator to be processed and a corresponding calculation logic; the method comprises the steps that a master core carries out fusion association on operators of a slave core based on calculation requirements and slave core information by acquiring slave core information, to-be-processed data in a memory and calculation requirements of the to-be-processed data, and a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator are obtained, and interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator are sequentially called to start the slave core;
extracting the current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
The combination of fused operators reduces the startup overhead of the slave core, avoids frequent memory reads and writes, and stores intermediate results directly on the slave core for the next calculation. In particular, data produced by simple non-matrix-multiplication calculations need not be written back to memory at all, so the volume of memory traffic is reduced as far as possible, memory-bandwidth contention is avoided, and the impact of the memory bottleneck is greatly reduced.
In an optional implementation manner, when the fusion operator to be processed is the first fusion operator, the current data to be processed is the data corresponding to the first fusion operator in the memory and includes first data to be processed and first data to be transposed, and the calculating the current data to be processed by using the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory includes:
performing matrix multiplication operation on the first data to be processed to obtain first intermediate data;
performing transposition on the first data to be transposed to obtain first transposed data;
performing matrix multiplication operation on the first transposed data to obtain second intermediate data;
and performing matrix multiplication operation based on the first intermediate data and the second intermediate data to obtain a first calculation result, and storing the first calculation result into the memory.
By this division of the fusion operator, the first intermediate data, the first transposed data and the second intermediate data can all be passed through the slave core's cache without being written back to memory, reducing repeated data fetches, the volume of memory traffic, and bandwidth pressure.
In an optional implementation manner, when the fusion operator to be processed is a second fusion operator, the current data to be processed is a second data to be processed corresponding to the second fusion operator in a memory and a first calculation result of the first fusion operator, and the calculating the current data to be processed by using the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory includes:
performing matrix multiplication operation on the second data to be processed to obtain third intermediate data;
performing non-matrix multiplication operation on the first calculation result to obtain fourth intermediate data;
and performing matrix multiplication operation based on the third intermediate data and the fourth intermediate data to obtain a second calculation result, and storing the second calculation result into a memory.
By this division of the fusion operator, the third intermediate data and the fourth intermediate data can be passed through the slave core's cache without being written back to memory, reducing repeated data fetches, the volume of memory traffic, and bandwidth pressure.
In an optional implementation manner, when the fusion operator to be processed is a third fusion operator, the current data to be processed is third data to be processed corresponding to the third fusion operator in a memory and a second calculation result of the second fusion operator, and the calculating the current data to be processed by using the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory includes:
and performing matrix multiplication operation on the third data to be processed and the second calculation result to obtain a third calculation result, and storing the third calculation result into a memory.
In an alternative embodiment, the slave core further comprises a first buffer space and a second buffer space;
The first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
When the first buffer space is being used for calculation, the second buffer space fetches data from the memory; after the calculation in the first buffer space completes, the two buffer spaces are swapped. In this way, the memory-access process and the calculation process hide each other, different calculation modes proceed simultaneously, and overall performance improves.
In a third aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion calculation allocation device based on an acceleration processor, which is applied to a main core of the acceleration processor, where the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the device including:
the first acquisition module is used for acquiring the slave core information, the data to be processed in the memory and the calculation requirement of the data to be processed;
the association module is used for carrying out fusion association on operators of the slave cores based on the calculation requirement and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
And the calling module is used for sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator in sequence, and a calculation result is obtained.
In a fourth aspect, an embodiment of the present invention provides a multi-head attention mechanism fusion calculation allocation device based on an acceleration processor, which is applied to a slave core of the acceleration processor, where the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the device including:
the second acquisition module is used for acquiring the interface for starting the fusion operator corresponding to the slave core, and determining the current fusion operator to be processed and its corresponding calculation logic; the master core, by acquiring the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed, performs fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each fusion operator, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave cores;
The extraction module is used for extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and the calculation module is used for calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
In a fifth aspect, embodiments of the present invention provide a multi-headed attention mechanism fusion computing dispensing system based on an acceleration processor, comprising: a master core, a memory and a slave core array formed by a plurality of slave cores, wherein each slave core processes operators of a plurality of operation types;
the master core acquires slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result;
The slave core acquires an interface for starting a fusion operator corresponding to the slave core, and determines a fusion operator to be processed currently and calculation logic corresponding to the fusion operator;
extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 2 is a flow diagram of another multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 3 is a flow diagram of yet another multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 4 is a flow diagram of yet another multi-head attention mechanism fusion calculation distribution method based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a multi-head attention mechanism fusion computing device based on an acceleration processor in accordance with an embodiment of the present invention;
FIG. 6 is a block diagram of another multi-head attention mechanism fusion computing device based on an acceleration processor in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the growth of computing power, deep learning has become an important breakthrough in the field of artificial intelligence over the last decade. Increasingly complex applications must process ever larger volumes of data, driving the need for more efficient and faster supporting frameworks. To improve a framework's computational efficiency, the specific hardware structure must be combined with the algorithmic structure of each operator so that the two match and hardware resources are fully used, achieving the best computational performance.
The multi-headed attentiveness mechanism is an artificial intelligence technique that allows a neural network to concentrate on portions of critical information while ignoring non-important portions when processing sequence data. Attention mechanisms have been widely used in the fields of natural language processing, computer vision, speech recognition, and the like.
The calculation of the multi-head attention mechanism can be expressed as:

head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i · K_iᵀ / √d) · V_i

O = MultiHead(Q, K, V) = Concat(head_1, …, head_h) · Wo

where Q (query) is the query, K (key) and V (value) are the keys and their corresponding values, O (output) is the output corresponding to the query, and d is the per-head projection dimension (proj_size below).

Q and O are in correspondence, i.e. the sequence lengths of Q and O are the same; K and V are in correspondence, i.e. the sequence lengths of K and V are the same. During calculation, each query in Q is scored against all of K, the scores are multiplied by the corresponding V, and finally the output results are combined.

In the above formula, the product of each K and Q is a single value; the values along the K sequence are normalized by Softmax, then multiplied by the corresponding V, and finally all the results are accumulated to obtain the corresponding output.
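As a concrete reference for the formula above, the following is a minimal NumPy sketch of the multi-head computation. It uses the dimension names (seq_len, hidden_size, nheads, proj_size) introduced later in this description; the helper names and the 1/√d scaling convention are assumptions of the sketch, not part of the patent text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, Wq, Wk, Wv, Wo, nheads):
    seq_len, hidden_size = q.shape
    proj_size = Wq.shape[1] // nheads
    # Project the inputs, then split into heads: [nheads, seq_len, proj_size].
    Q = (q @ Wq).reshape(seq_len, nheads, proj_size).transpose(1, 0, 2)
    K = (k @ Wk).reshape(seq_len, nheads, proj_size).transpose(1, 0, 2)
    V = (v @ Wv).reshape(seq_len, nheads, proj_size).transpose(1, 0, 2)
    # Each query is scored against all keys: [nheads, seq_len, seq_len].
    S = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(proj_size))
    # Scores weight the values, then heads are merged and projected.
    H = (S @ V).transpose(1, 0, 2).reshape(seq_len, nheads * proj_size)
    return H @ Wo   # O: [seq_len, hidden_size]
```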
The current AI acceleration processor mostly adopts heterogeneous many-core design, and can obtain high-efficiency computing performance and reduce power consumption to a certain extent. The heterogeneous many-core architecture comprises a master core and a slave core array, which share memory space. Each slave core has its own independent cache space, and can process the calculation data in parallel. Each slave core is provided with an accelerating component which is specially used for processing matrix multiplication or other operations, so that the computing speed is improved to the greatest extent.
When such a processor currently handles deep-learning training applications, the main core usually calls the different operator interfaces in sequence according to the network's requirements; data is read from main memory into the slave core's cache space, the corresponding calculation is performed on it, and after processing completes the calculation result is written back to memory.
Taking the multi-head attention mechanism as an example, the dimension information of each parameter must first be defined. The dimension of the input attention_in (q, k, v) is [seq_len, hidden_size], the dimension of the input weights kernel (Wq, Wk, Wv) is [hidden_size, nheads × proj_size],
and the dimension of the output weight (Wo) is [nheads × proj_size, hidden_size]. If the multi-head attention mechanism is implemented according to the traditional method and the formula definition, it basically splits into three operators, namely the gemm operator, the batch_gemm operator and the softmax operator, with the following calculation steps:
1) First, attention_in and kernel are read from main memory into the slave-core cache, matrix calculation is performed in the slave core to obtain Q, K, V with dimensions [seq_len, nheads × proj_size], and Q, K, V are written back to main memory.
2) Q and K are read from main memory into the slave-core cache again and their dimensions are adjusted: Q is adjusted to [nheads, seq_len, proj_size] and K to [nheads, proj_size, seq_len], after which Q and K are written back to main memory once more.
3) Q and K are read from main memory into the slave-core cache, batch matrix calculation is performed on Q and K to obtain QK with dimensions [nheads, seq_len, seq_len], and QK is written back to main memory.
4) QK is read from main memory into the slave-core cache again, softmax calculation is performed on QK to obtain S_QK with dimensions [nheads, seq_len, seq_len], and S_QK is written back to main memory.
5) S_QK and V are read from main memory into the slave-core cache, batch matrix calculation is performed on S_QK and V to obtain V_S_QK with dimensions [nheads, seq_len, proj_size], and V_S_QK is written back to main memory.
6) V_S_QK and Wo are read from main memory into the slave-core cache, matrix calculation is performed on V_S_QK and Wo to obtain attention_out with dimensions [seq_len, hidden_size], and attention_out is written back to main memory.
The above calculation process has obvious disadvantages:
1) The output of each calculation serves as the input of the next. However, because the traditional approach splits the whole flow into three operators, gemm, batch_gemm and softmax, each operator's output must be written back to main memory and then read back into the slave-core cache before the next calculation can proceed. This repeated data movement inflates the volume of memory traffic, occupies bandwidth, and degrades performance. The most obvious redundant load occurs when computing the matrix multiplication of Q and K: Q and K must be read in twice, and the first read performs only a dimension adjustment in the slave core, without any additional computation.
2) A slave core performs only one of matrix computation (gemm, batch_gemm) or non-matrix computation (softmax) at any moment, yet on the acceleration processor matrix and non-matrix computations could run in parallel on different components within the core.
The embodiment of the invention provides a multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, which is characterized by acquiring slave core information, data to be processed in a memory and calculation requirements of the data to be processed. And carrying out fusion association on operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator. And sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator sequentially, and a calculation result is obtained. Compared with the situation that only a single operator is processed at the same time, the combination of the fusion processing operators can reduce the starting cost of the slave core, avoid frequent memory read-write, directly store the intermediate result in the slave core for the next calculation, reduce the memory access data amount, avoid the memory bandwidth competition and greatly reduce the influence of memory bottleneck.
In accordance with an embodiment of the present invention, there is provided an embodiment of a multi-headed attention mechanism fusion computing method based on an acceleration processor, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a main core of the acceleration processor, such as a SWAI chip, and the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types, and fig. 1 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in fig. 1, the flowchart includes the following steps:
step S101: the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed are acquired. Specifically, the slave core information includes the calculation types supported by each slave core, and the data to be processed in the memory can be partitioned along row and column dimensions to optimize the data access pattern.
Step S102: and carrying out fusion association on operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator. Specifically, by dividing and fusing operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
Step S103: and sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator sequentially, and a calculation result is obtained. Specifically, the method of combining the fusion operators is favorable for fully using hardware resources, and compared with the condition that only a single operator is processed at the same time, the method of combining the fusion operators can reduce the starting overhead of the slave core and avoid frequent memory reading and writing.
Through steps S101 to S103, the multi-head attention mechanism fusion calculation distribution method based on the acceleration processor provided by the embodiment of the invention makes full use of hardware resources through the fusion-operator combination. Compared with processing only a single operator at a time, fusing operators reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, keeps intermediate results in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a main core of the acceleration processor, such as a SWAI chip, and the acceleration processor includes a main core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types, and fig. 2 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in fig. 2, the flowchart includes the following steps:
step S201: the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed are acquired. Specifically, the slave core information includes the calculation types supported by each slave core, and the data to be processed in the memory can be partitioned along row and column dimensions to optimize the data access pattern.
Step S202: and carrying out fusion association on operators of the slave cores based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator. Specifically, by dividing and fusing operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
Step S203: and sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator sequentially, and a calculation result is obtained. Specifically, the method of combining the fusion operators is favorable for fully using hardware resources, and compared with the condition that only a single operator is processed at the same time, the method of combining the fusion operators can reduce the starting overhead of the slave core and avoid frequent memory reading and writing.
Specifically, step S202 described above includes:
step S2021: the operation type of each operator in each slave core is extracted from the slave core information.
Step S2022: and carrying out fusion association on operators according to the calculation requirements and the operation types to obtain a first fusion operator, a second fusion operator and a third fusion operator.
Specifically, by dividing and fusing operators, the data parallel processing capability of the slave cores of the acceleration processor is fully utilized, and a reasonable allocation strategy is provided for the calculation tasks of the operators by combining the hardware structure of the acceleration processor, so that the overall operator performance is improved.
Specifically, according to the calculation logic and the data dependencies, the whole calculation process can be decomposed into three large fusion operators. Within each fusion operator, only the output of the final calculation must be written back to the memory; the results of the other stages serve as intermediate results for the next calculation and need not be written back.
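The master-core side of this allocation can be pictured with the short sketch below; the operator groupings mirror the three fusion operators detailed later, while the function names and data structures are hypothetical stand-ins, not the chip's actual API.

```python
def launch_slave_cores(op_name, logic):
    """Stand-in for calling a fused operator's interface: a real
    implementation would start the slave-core array, let each core run
    `logic` on its share of the data, and block until all cores sync."""
    print(f"fused operator {op_name}: slave cores execute {logic}")

def master_core_dispatch():
    # Fusion association: which primitive operators each fused operator
    # covers, i.e. the calculation logic of each fused operator.
    fused_ops = [
        ("A", ["gemm(K)", "gemm(Q)", "batch_gemm(QK)"]),
        ("B", ["gemm(V)", "softmax(QK)", "batch_gemm(hi)"]),
        ("C", ["gemm(attention_out)"]),
    ]
    # One slave-core launch per fused operator instead of one per primitive
    # operator: intermediates stay in the slave-core caches, and only each
    # fused operator's final output is written back to main memory.
    for name, logic in fused_ops:
        launch_slave_cores(name, logic)

master_core_dispatch()
```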
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which may be used for a slave core of the acceleration processor, such as a SWAI chip, where the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, each slave core processes operators of a plurality of operation types, and fig. 3 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in fig. 3, the flowchart includes the following steps:
step S301: acquiring the interface for starting the fusion operator corresponding to the slave core, and determining the fusion operator currently to be processed and its corresponding calculation logic. That is, the master core, by acquiring the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed, performs fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each fusion operator, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave cores. Specifically, by reasonably fusing parts of the calculation process, the fusion-operator combination makes full use of the slave cores: compared with processing only a single operator at a time, it reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, stores intermediate results directly in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
Step S302: extracting from the memory the current data to be processed corresponding to the fusion operator currently being processed. Specifically, the data in the memory can be partitioned along row and column dimensions, which optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
Step S303: calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory. Specifically, the whole calculation process is decomposed into three large fusion operators; within each fusion operator, only the output of the final calculation must be written back to the memory, while the results of the other stages serve as intermediate results for the next calculation and need not be written back. This optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a slave core of the acceleration processor, such as a SWAI chip, and the acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types. FIG. 4 is a flowchart of the multi-head attention mechanism fusion calculation allocation method based on the acceleration processor according to an embodiment of the present invention, and as shown in FIG. 4, the flow includes the following steps:
Step S401: acquiring the interface for starting the fusion operator corresponding to the slave core, and determining the fusion operator currently to be processed and its corresponding calculation logic. That is, the master core, by acquiring the slave core information, the data to be processed in the memory and the calculation requirements of the data to be processed, performs fusion association on the operators of the slave cores based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each fusion operator, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave cores. Specifically, by reasonably fusing parts of the calculation process, the fusion-operator combination makes full use of the slave cores: compared with processing only a single operator at a time, it reduces the startup overhead of the slave cores, avoids frequent memory reads and writes, stores intermediate results directly in the slave core for the next calculation, reduces the volume of memory traffic, avoids memory-bandwidth contention, and greatly reduces the impact of the memory bottleneck.
Step S402: extracting from the memory the current data to be processed corresponding to the fusion operator currently being processed. Specifically, the data in the memory can be partitioned along row and column dimensions, which optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
Step S403: calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory. Specifically, the whole calculation process is decomposed into three large fusion operators; within each fusion operator, only the output of the final calculation must be written back to the memory, while the results of the other stages serve as intermediate results for the next calculation and need not be written back. This optimizes the data access pattern, reduces the volume of memory traffic, improves access efficiency, and lessens the impact of the memory-access bottleneck on overall operator performance.
Specifically, when the fusion operator to be processed in the step S401 is the first fusion operator, the current data to be processed is the data to be processed corresponding to the first fusion operator in the memory, the current data to be processed includes the first data to be processed and the first data to be transposed, and the step S403 includes:
and a step a1 of performing matrix multiplication operation on the first data to be processed to obtain first intermediate data.
And a step a2, obtaining first transposed data by carrying out transposition processing on the first data to be transposed.
And a step a3, performing matrix multiplication operation on the first transposed data to obtain second intermediate data.
And a step a4, performing matrix multiplication operation based on the first intermediate data and the second intermediate data to obtain a first calculation result, and storing the first calculation result into a memory.
Specifically, since numerous data-transposition operations are involved in the calculation, the acceleration processor supports transposed data reads: its transposed-read instruction can replace the usual sequential read followed by a manual data rearrangement on the slave core, improving overall calculation efficiency.
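The effect can be illustrated with NumPy standing in for the slave-core cache: the fused variant below consumes the transposed view in a single pass instead of materializing the transpose in a first pass. The actual transposed-read instruction is chip-specific and is not shown.

```python
import numpy as np

k = np.random.rand(128, 64)

# Conventional: pass 1 reads k and writes kT back; pass 2 reads kT again.
kT = k.T.copy()          # first pass over the data, transpose only
out_two_pass = kT @ k    # second pass, the actual computation

# Transposed read: the transpose happens as part of the single load,
# so no intermediate is written back and the data is touched once.
out_one_pass = k.T @ k

assert np.allclose(out_two_pass, out_one_pass)
```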
Specifically, when the fusion operator to be processed in the step S401 is the second fusion operator, the current data to be processed is the second data to be processed corresponding to the second fusion operator in the memory and the first calculation result of the first fusion operator, and the step S403 includes:
and b1, performing matrix multiplication operation on the second data to be processed to obtain third intermediate data.
And b2, performing non-matrix multiplication operation on the first calculation result to obtain fourth intermediate data.
And b3, performing matrix multiplication operation based on the third intermediate data and the fourth intermediate data to obtain a second calculation result, and storing the second calculation result into a memory.
Specifically, when the fusion operator to be processed in the step S401 is the third fusion operator, the current data to be processed is the third data to be processed corresponding to the third fusion operator in the memory and the second calculation result of the second fusion operator, and the step S403 includes:
And c1, performing matrix multiplication operation on the third data to be processed and the second calculation result to obtain a third calculation result, and storing the third calculation result into a memory.
In this embodiment, a multi-head attention mechanism fusion calculation allocation method based on an acceleration processor is provided, which can be used for a slave core of the acceleration processor, such as a SWAI chip. The acceleration processor includes a master core, a memory, and a slave core array formed by a plurality of slave cores, where each slave core processes operators of a plurality of operation types, and each slave core further includes a first buffer space and a second buffer space. The first buffer space and the second buffer space are used alternately for fetching data from the memory and for data processing.
Specifically, by providing a double-buffer mechanism, memory access and computation are executed in parallel using the two cache spaces. During computation, a matrix-multiplication instruction and a non-matrix-multiplication instruction can be submitted at the same time, so the two types of computing components execute in parallel and the computation time is shortened; the memory-access time and the computation time thus hide each other, accelerating the overall calculation.
In one embodiment, a SWAI chip is described as an example, where the SWAI chip includes a master core and a slave core array, and the master core and the slave core share a memory space, and each slave core has its own cache space and an acceleration component that specifically processes matrix multiplication and other operations. Based on the hardware structure of the SWAI chip, the computation process of the multi-head attention mechanism can be decomposed into three large fusion operators:
1) Fuse the calculations Q = W_{q,i}·q and K = W_{k,i}·k together with QK = (kᵀ·W_{k,i}ᵀ)·(W_{q,i}·q);
2) fuse the calculation V = W_{v,i}·v with softmax(QK) and hi = batch_gemm(V, softmax(QK));
3) finally, combine all the obtained hi to complete the calculation attention_out = gemm(hi, Wo).
the three fusion operators are named as A, B and C, and the complete calculation process of the operator A can be described as follows:
1. the main core calls an A interface and starts the slave core;
2. W_{k,i} and k are read directly from main memory into the slave-core cache via hardware transposition and are stored there as kᵀ and W_{k,i}ᵀ;
3. on the slave core, the computation Kᵀ = gemm(kᵀ, W_{k,i}ᵀ) is completed using the matrix-multiplication acceleration component, while W_{q,i} and q are read from main memory into the slave-core cache at the same time;
4. on the slave core, the matrix-multiplication acceleration component completes the computation Q = gemm(W_{q,i}, q); since Kᵀ is already computed and has not been written back to main memory, the component can be used again immediately after Q to complete QK = gemm(kᵀ·W_{k,i}ᵀ, W_{q,i}·q), and during the computation the finished QK is written back block by block from the slave-core cache to main memory;
5. the master core confirms that all slave cores have completed their computation and synchronizes; the calculation of fusion operator A is then complete.
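Numerically, operator A amounts to the per-head NumPy sketch below. It uses row-major [seq_len, hidden_size] arrays, so the products appear transposed relative to the notation above; the dict standing in for main memory and the per-head weight slicing are assumptions of the sketch.

```python
import numpy as np

def fused_operator_A(mem, nheads):
    seq_len, hidden = mem["k"].shape
    proj = mem["Wk"].shape[1] // nheads
    QK = np.empty((nheads, seq_len, seq_len))
    for i in range(nheads):
        Wk_i = mem["Wk"][:, i * proj:(i + 1) * proj]
        Wq_i = mem["Wq"][:, i * proj:(i + 1) * proj]
        # Transposed read leaves k^T and Wk_i^T in the cache; K^T is
        # computed there and never written back to main memory.
        K_T = Wk_i.T @ mem["k"].T          # [proj, seq_len], cache only
        Q = mem["q"] @ Wq_i                # [seq_len, proj], cache only
        # Only QK, the fused operator's final result, goes back to memory.
        QK[i] = Q @ K_T
    mem["QK"] = QK
    return QK
```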
The complete calculation process for the B operator can be described as:
1. the master core calls the B interface and starts the slave core;
2. W_{v,i}, v and QK are read from main memory into the slave-core cache;
3. on the slave core, the computation V = gemm(W_{v,i}, v) is completed using the matrix-multiplication acceleration component, while the computation S_QK = softmax(QK) is completed simultaneously using the non-matrix-multiplication acceleration component;
4. on the slave core, the matrix-multiplication acceleration component performs the computation hi = batch_gemm(V, softmax(QK)), while the finished hi is written back block by block from the slave-core cache to main memory;
5. the master core confirms that all slave cores have completed their computation and synchronizes; the calculation of fusion operator B is then complete.
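A matching per-head sketch of operator B follows, under the same row-major convention and illustrative memory dict. In NumPy the matrix and non-matrix parts necessarily run in sequence, so the hardware parallelism between the two acceleration components is indicated only in the comments.

```python
import numpy as np

def fused_operator_B(mem, nheads):
    seq_len = mem["QK"].shape[1]
    proj = mem["Wv"].shape[1] // nheads
    hi = np.empty((nheads, seq_len, proj))
    for i in range(nheads):
        Wv_i = mem["Wv"][:, i * proj:(i + 1) * proj]
        # Matrix-multiplication component: V = gemm(Wv_i, v).
        V = mem["v"] @ Wv_i
        # Non-matrix component, on hardware in parallel with the gemm:
        # S_QK = softmax(QK), computed entirely in the slave-core cache.
        QK = mem["QK"][i]
        e = np.exp(QK - QK.max(-1, keepdims=True))
        S_QK = e / e.sum(-1, keepdims=True)
        # batch_gemm, then write the finished hi back to main memory.
        hi[i] = S_QK @ V
    mem["hi"] = hi
    return hi
```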
the complete calculation process for the C operator can be described as:
1. the master core calls a C interface and starts the slave core;
2. W_{o,i} and hi are read from main memory into the slave-core cache;
3. on the slave core, the computation attention_out = gemm(hi, W_{o,i}) is completed using the matrix-multiplication acceleration component, while the finished attention_out is written back block by block from the slave-core cache to main memory;
4. the master core confirms that all slave cores have completed their computation and synchronizes; the calculation of fusion operator C is then complete.
In the above calculation process, main-memory data can be fetched into the slave-core cache already transposed. For the computation Kᵀ = gemm(kᵀ, W_{k,i}ᵀ) in fusion operator A, it suffices to call the transposed-read instruction to fetch W_{k,i} and k into the slave-core cache in transposed form, after which the next calculation can proceed directly. There is no need to make two passes over the data as in the common approach, where the first pass only transposes the data and the second pass performs the actual computation, increasing the volume of memory traffic and occupying extra bandwidth.
Preferably, since the memory-access units of the SWAI chip are asynchronous, a double-buffering technique can be applied in the calculation of each fusion operator to improve data-transfer efficiency and reduce transfer latency. With double buffering there are two buffers, one receiving data while the other is being processed, so reception and processing proceed in parallel. For example, the operator A calculation allocates two buffers in the slave-core cache, the first as the compute buffer and the second as the memory-access buffer. A cold-load step comes first: a memory-access instruction is issued to fetch the first blocks of W_{k,i} and k into the compute buffer. Then the double-buffered loop begins: the next memory-access instruction is issued to fetch the next blocks of W_{k,i} and k into the memory-access buffer; once the data in the compute buffer has fully arrived, matrix multiplication is performed on it, in parallel with the fetch of the next block; and after the computation completes, the memory-access buffer and the compute buffer are swapped. In this way the memory-access process and the calculation process hide each other and overall performance improves.
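The loop structure can be sketched as follows, with Python lists standing in for the two slave-core buffers and a synchronous load standing in for the asynchronous memory-access instruction; on the real chip the load and the compute overlap, which the sketch can only indicate in comments.

```python
def load(block):
    """Stand-in for the asynchronous memory-access instruction."""
    return list(block)

def double_buffered(blocks, compute):
    results = []
    # Cold load: fetch the first block into the compute buffer.
    compute_buf = load(blocks[0])
    for nxt in blocks[1:]:
        access_buf = load(nxt)                # fetch of the next block...
        results.append(compute(compute_buf))  # ...overlaps this computation
        # Swap: the freshly loaded buffer becomes the compute buffer.
        compute_buf, access_buf = access_buf, compute_buf
    results.append(compute(compute_buf))      # drain the last block
    return results

print(double_buffered([[1, 2], [3, 4], [5, 6]], sum))   # [3, 7, 11]
```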
Each type of computation in the slave core corresponds to its own high-efficiency computing component and has asynchronous properties, which allows the double-buffer scheme to be optimized further for fusion operator B. There are three types of data operations in fusion operator B: matrix multiplication (gemm and batch_gemm), non-matrix multiplication (softmax), and memory access; on the SWAI chip these can be completed by three different high-efficiency components within the core, and the fusion operator lets them proceed in parallel. The specific method is as follows: two buffers are allocated in the slave-core cache, the first as the compute buffer and the second as the memory-access buffer. A cold-load step again comes first: the first blocks of W_{v,i} and v and the first block of QK are read from main memory into the compute buffer. Then the double-buffered loop begins: the next blocks of W_{v,i} and v and the next QK are read from main memory into the memory-access buffer; once the data in the compute buffer has fully arrived, the computation is performed on it, in parallel with the fetch of the next block; and after the computation completes, the memory-access buffer and the compute buffer are swapped. In this way the memory-access process and the calculation process hide each other and overall performance improves.
Within each iteration, a matrix-multiplication instruction is submitted first to calculate V = gemm(Wv,i, v); once that instruction has been issued, a non-matrix-multiplication instruction is submitted to calculate S_QK = softmax(QK). Throughout the computation, the matrix and non-matrix calculations are carried out by different hardware units and proceed in parallel. This combined computation is itself parallel to the fetch of the data for the next buffer, and after it completes the memory-access buffer and the compute buffer are swapped. In this way the memory access and the computation hide each other, different calculation types run simultaneously, and overall performance improves.
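The sketch below illustrates that intra-iteration parallelism; mme_submit_gemm(), simd_submit_softmax() and units_wait_all() are assumed, illustrative names for handles to the distinct matrix and non-matrix compute units, not a documented API.

```c
extern void mme_submit_gemm(const float *a, const float *b, float *c,
                            int m, int k, int n);       /* async submit */
extern void simd_submit_softmax(const float *in, float *out,
                                int rows, int cols);    /* async submit */
extern void units_wait_all(void);

void operator_b_step(const float *wv, const float *v, float *v_out,
                     const float *qk, float *s_qk,
                     int m, int k, int n, int rows, int cols)
{
    /* V = gemm(Wv,i, v) is issued to the matrix unit ... */
    mme_submit_gemm(wv, v, v_out, m, k, n);
    /* ... and S_QK = softmax(QK) runs concurrently on a non-matrix unit. */
    simd_submit_softmax(qk, s_qk, rows, cols);
    /* Both results feed the next multiplication, so join here. */
    units_wait_all();
}
```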
The SWAI chip has 32 slave computing cores per core group, arranged in 4 rows and 8 columns, and the operator computation tasks must be distributed to these 32 slave cores. Taking fusion operator C as an example, its core calculation is attention_out = gemm(Wo,i, hi), where the data dimension of hi is [nheads, seq_len, proj_size] and the data dimension of Wo,i is [nheads, proj_size, hidden_size]. The gemm calculation can be converted into one batch_gemm calculation, i.e., nheads gemm calculations, each child gemm completing a matrix multiplication of [seq_len, proj_size] x [proj_size, hidden_size]. In typical model networks nheads is usually a multiple of 4; if not, it can be padded with zeros to a multiple of 4. nheads can then be sliced across the 4 rows of each core group of the SWAI chip. The advantage of this slicing is that each row of the core group processes its own heads in full parallelism, without interfering with the others. Meanwhile, seq_len can be further sliced across the 8 columns of each core group, and the Wo,i data can be fetched from main memory to the slave core high-speed cache space in row-broadcast memory access mode, improving memory access efficiency. After each row completes the gemm calculations for its allocated heads, the results of the 4 rows must be merged once; this data reduction can be realized efficiently through slave core communication along the columns.
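The index arithmetic of this two-level split can be sketched as follows. The function and field names are illustrative assumptions, and nheads is assumed to have already been zero-padded to a multiple of 4.

```c
typedef struct {
    int head_begin, head_end;   /* heads handled by this core's row    */
    int seq_begin, seq_end;     /* seq_len slice handled by its column */
} work_t;

work_t split_operator_c(int core_id, int nheads, int seq_len)
{
    enum { ROWS = 4, COLS = 8 };         /* core-group geometry */
    int row = core_id / COLS;
    int col = core_id % COLS;

    int heads_per_row = nheads / ROWS;   /* exact after padding */
    int seq_per_col = (seq_len + COLS - 1) / COLS;

    work_t w;
    w.head_begin = row * heads_per_row;
    w.head_end = w.head_begin + heads_per_row;
    w.seq_begin = col * seq_per_col;
    w.seq_end = (col + 1) * seq_per_col;
    if (w.seq_end > seq_len)
        w.seq_end = seq_len;
    return w;
}
```

Each of the 32 cores calls this with its own core_id, so rows own disjoint head ranges and columns own disjoint seq_len slices, which is exactly the two-level slicing described above.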
The SWAI chip comprises a master core and a slave core array that share memory space; each slave core has its own cache space and can process computing tasks in parallel. Each slave core also has acceleration units dedicated to matrix multiplication and other operations. The multi-head attention mechanism invokes different operator combinations many times over the course of its computation, and these computations divide into two major categories: matrix multiplication operations and non-matrix multiplication operations.
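As a toy illustration of this two-way classification, the dispatch below routes each operator to the unit matching its category; all names are assumptions rather than a real SWAI API.

```c
typedef enum { OP_MATMUL, OP_NON_MATMUL } op_class_t;

typedef struct {
    op_class_t cls;      /* which category the operator belongs to */
    const char *name;    /* e.g. "gemm", "softmax", "transpose"    */
} op_t;

extern void run_on_matrix_unit(const op_t *op, void *args);
extern void run_on_general_unit(const op_t *op, void *args);

void dispatch(const op_t *op, void *args)
{
    if (op->cls == OP_MATMUL)
        run_on_matrix_unit(op, args);    /* dedicated matmul accelerator */
    else
        run_on_general_unit(op, args);   /* softmax, transpose, etc.     */
}
```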
This embodiment also provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, applied to the master core of the acceleration processor, where the acceleration processor includes the master core, a memory and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types. As shown in Fig. 5, the device includes:
The first obtaining module 501 is configured to obtain the slave core information, the data to be processed in the memory, and the computing requirement of the data to be processed.
The association module 502 is configured to perform fusion association on the operators of the slave core based on the calculation requirement and the slave core information, to obtain a first fusion operator, a second fusion operator, a third fusion operator, and calculation logic corresponding to each fusion operator.
The calling module 503 is configured to sequentially call the interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed with the corresponding operators according to the calculation logic of each fusion operator in turn, obtaining a calculation result.
In some alternative embodiments, the association module 502 includes:
The extraction unit is configured to extract the operation type of each operator in each slave core from the slave core information.
The fusion unit is configured to fuse and associate the operators according to the calculation requirement and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator.
This embodiment further provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides a multi-head attention mechanism fusion calculation distribution device based on an acceleration processor, applied to a slave core of the acceleration processor, where the acceleration processor includes a master core, a memory and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types. As shown in Fig. 6, the device includes:
The second obtaining module 601 is configured to obtain the interface that starts the fusion operator corresponding to the slave core, and to determine the fusion operator currently to be processed and its corresponding calculation logic. Here, the master core, having acquired the slave core information, the data to be processed in the memory, and the calculation requirements of that data, fuses and associates the operators of the slave core based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator, the third fusion operator and the calculation logic corresponding to each, and sequentially calls the interfaces corresponding to the first, second and third fusion operators to start the slave core.
The extracting module 602 is configured to extract, from the memory, the current data to be processed corresponding to the fusion operator currently to be processed.
The calculating module 603 is configured to calculate, according to the calculating logic, the current data to be processed by using the corresponding operator to obtain a calculation result, and write the calculation result into the memory.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The multi-head attention mechanism fusion calculation distribution device based on an acceleration processor of this embodiment is presented in the form of functional units, where a unit may be an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or another device capable of providing the functionality described above.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the invention also provides a multi-head attention mechanism fusion calculation distribution system based on the acceleration processor, which comprises the following steps: a master core, a memory and a slave core array formed by a plurality of slave cores, wherein each slave core processes operators of a plurality of operation types;
the master core acquires slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result;
the slave core acquires an interface for starting a fusion operator corresponding to the slave core, and determines a fusion operator to be processed currently and calculation logic corresponding to the fusion operator;
extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (8)

1. A multi-head attention mechanism fusion calculation distribution method based on an acceleration processor, which is applied to a main core of the acceleration processor, wherein the acceleration processor comprises the main core, a memory and a slave core array formed by a plurality of slave cores, and each slave core processes operators of a plurality of operation types, and the method is characterized by comprising the following steps:
Acquiring slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, carrying out fusion association on operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator;
sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result;
wherein the performing fusion association on the operators of the slave cores based on the calculation requirement and the slave core information to obtain a first fusion operator, a second fusion operator and a third fusion operator comprises the following steps:
extracting the operation type of each operator in each slave core from the slave core information;
carrying out fusion association on the operators according to the calculation requirement and the operation type to obtain a first fusion operator, a second fusion operator and a third fusion operator;
the slave core further comprises a first buffer space and a second buffer space;
the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
2. A multi-headed attention mechanism fusion calculation allocation method based on an acceleration processor, applied to a slave core of the acceleration processor, the acceleration processor comprising a master core, a memory and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the method comprising:
acquiring an interface for starting a fusion operator corresponding to the slave core, and determining a fusion operator currently to be processed and its corresponding calculation logic; wherein the master core, by acquiring slave core information, data to be processed in the memory and calculation requirements of the data to be processed, performs fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator, and sequentially calls interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core; and wherein the performing fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirements and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
Extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory;
the slave core further comprises a first buffer space and a second buffer space;
the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
3. The multi-head attention mechanism fusion calculation distribution method based on an acceleration processor according to claim 2, wherein when the fusion operator to be processed is the first fusion operator, the current data to be processed is the data in the memory corresponding to the first fusion operator and includes first data to be processed and first data to be transposed, and wherein calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory comprises:
performing matrix multiplication operation on the first data to be processed to obtain first intermediate data;
performing transposition processing on the first data to be transposed to obtain first transposed data;
performing matrix multiplication operation on the first transposed data to obtain second intermediate data;
and performing matrix multiplication operation based on the first intermediate data and the second intermediate data to obtain a first calculation result, and storing the first calculation result into a memory.
4. The multi-head attention mechanism fusion calculation distribution method based on an acceleration processor according to claim 3, wherein when the fusion operator to be processed is the second fusion operator, the current data to be processed is second data to be processed in the memory corresponding to the second fusion operator together with the first calculation result of the first fusion operator, and wherein calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory comprises:
performing matrix multiplication operation on the second data to be processed to obtain third intermediate data;
performing non-matrix multiplication operation on the first calculation result to obtain fourth intermediate data;
and performing matrix multiplication operation based on the third intermediate data and the fourth intermediate data to obtain a second calculation result, and storing the second calculation result into a memory.
5. The multi-head attention mechanism fusion calculation distribution method based on an acceleration processor according to claim 4, wherein when the fusion operator to be processed is the third fusion operator, the current data to be processed is third data to be processed in the memory corresponding to the third fusion operator together with the second calculation result of the second fusion operator, and wherein calculating the current data to be processed with the corresponding operator according to the calculation logic to obtain a calculation result and writing the calculation result into the memory comprises:
And performing matrix multiplication operation on the third data to be processed and the second calculation result to obtain a third calculation result, and storing the third calculation result into a memory.
6. A multi-headed attention mechanism fusion computing and allocation device based on an acceleration processor, applied to a main core of the acceleration processor, the acceleration processor comprising a main core, a memory and a slave core array formed by a plurality of slave cores, each slave core processing operators of a plurality of operation types, the device comprising:
the first acquisition module is used for acquiring the slave core information, the data to be processed in the memory and the calculation requirement of the data to be processed;
the association module is used for performing fusion association on the operators of the slave cores based on the calculation requirement and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator; wherein the performing fusion association on the operators of the slave cores based on the calculation requirement and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirement and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
The calling module is used for sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to the calculation logic of each fusion operator in sequence to obtain a calculation result; the slave core further comprises a first buffer space and a second buffer space; the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
7. A multi-headed attention mechanism fusion computing apparatus based on an acceleration processor, for use with a slave core of the acceleration processor, the acceleration processor comprising a master core, a memory and a slave core array of a plurality of slave cores, each slave core processing operators of a plurality of operation types, comprising:
the second acquisition module is used for acquiring an interface for starting a fusion operator corresponding to the slave core and determining a fusion operator currently to be processed and its corresponding calculation logic; wherein the master core, by acquiring slave core information, data to be processed in the memory and calculation requirements of the data to be processed, performs fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator, and sequentially calls interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start the slave core; and wherein the performing fusion association on the operators of the slave core based on the calculation requirements and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirements and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
The extraction module is used for extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
the calculation module is used for calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory; the slave core further comprises a first buffer space and a second buffer space; the first buffer space and the second buffer space are used for alternately performing a process of acquiring data from the memory and a data processing process.
8. A multi-headed attention mechanism fusion computing dispensing system based on an acceleration processor, comprising: a master core, a memory and a slave core array formed by a plurality of slave cores, wherein each slave core processes operators of a plurality of operation types;
the master core acquires slave core information, data to be processed in a memory and calculation requirements of the data to be processed;
based on the calculation requirement and the slave core information, performing fusion association on the operators of the slave core to obtain a first fusion operator, a second fusion operator, a third fusion operator and calculation logic corresponding to each fusion operator; wherein the performing fusion association on the operators of the slave core based on the calculation requirement and the slave core information to obtain the first fusion operator, the second fusion operator and the third fusion operator comprises: extracting the operation type of each operator in each slave core from the slave core information; and performing fusion association on the operators according to the calculation requirement and the operation types to obtain the first fusion operator, the second fusion operator and the third fusion operator;
Sequentially calling interfaces corresponding to the first fusion operator, the second fusion operator and the third fusion operator to start a slave core, so that the slave core calculates the data to be processed by utilizing the corresponding operators according to calculation logic of each fusion operator sequentially to obtain a calculation result; the slave core further comprises a first buffer space and a second buffer space; the first buffer space and the second buffer space are used for alternately carrying out a process of acquiring data from the memory and a data processing process;
the slave core acquires an interface for starting a fusion operator corresponding to the slave core, and determines a fusion operator to be processed currently and calculation logic corresponding to the fusion operator;
extracting current to-be-processed data corresponding to the current to-be-processed fusion operator from the memory;
and calculating the current data to be processed by utilizing a corresponding operator according to the calculation logic to obtain a calculation result, and writing the calculation result into the memory.
CN202310687654.5A 2023-06-12 2023-06-12 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor Active CN116431562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310687654.5A CN116431562B (en) 2023-06-12 2023-06-12 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor


Publications (2)

Publication Number Publication Date
CN116431562A (en) 2023-07-14
CN116431562B (en) 2023-11-28

Family

ID=87081807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310687654.5A Active CN116431562B (en) 2023-06-12 2023-06-12 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor

Country Status (1)

Country Link
CN (1) CN116431562B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805155B (en) * 2023-08-25 2024-01-19 太初(无锡)电子科技有限公司 LSTM network processing method, device, equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095583A (en) * 2016-06-20 2016-11-09 Master-slave core cooperative computing and programming framework based on the new Sunway processor
CN112527393A (en) * 2019-09-18 2021-03-19 无锡江南计算技术研究所 Instruction scheduling optimization device and method for master-slave fusion architecture processor
CN112527262A (en) * 2019-09-19 2021-03-19 无锡江南计算技术研究所 Automatic vector optimization method for non-uniform width of deep learning framework compiler
CN113012023A (en) * 2021-02-22 2021-06-22 中国科学技术大学 Video analysis acceleration method and system based on many-core processor
CN114661460A (en) * 2022-02-15 2022-06-24 无锡江南计算技术研究所 AI framework two-stage parallel acceleration method for heterogeneous many-core processor
CN115203126A (en) * 2022-09-15 2022-10-18 太初(无锡)电子科技有限公司 Operator fusion processing method, device, equipment and storage medium
CN115952393A (en) * 2023-03-13 2023-04-11 山东大学 Forward computing method and system of multi-head attention mechanism based on super computer


Also Published As

Publication number Publication date
CN116431562A (en) 2023-07-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant