WO2019136751A1 - Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal - Google Patents


Info

Publication number
WO2019136751A1
Authority
WO
WIPO (PCT)
Prior art keywords: data, module, artificial intelligence, storage module, matrix
Application number
PCT/CN2018/072663
Other languages
French (fr)
Chinese (zh)
Inventor
肖梦秋
Original Assignee
深圳鲲云信息科技有限公司
Application filed by 深圳鲲云信息科技有限公司
Priority to PCT/CN2018/072663 priority Critical patent/WO2019136751A1/en
Priority to CN201880002151.7A priority patent/CN109416755B/en
Publication of WO2019136751A1 publication Critical patent/WO2019136751A1/en
Priority to US16/929,819 priority patent/US11874898B2/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present invention relates to the field of artificial intelligence, and in particular to an artificial intelligence parallel processing method and apparatus, a readable storage medium, and a terminal.
  • An artificial intelligence algorithm is a neural network model algorithm that simulates the human brain, and its computational load is enormous: AlphaGo, which likewise relies on artificial intelligence algorithms, requires thousands of traditional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, as artificial intelligence enjoys a new wave of revival, traditional processors are becoming a bottleneck that hinders its spread.
  • The object of the present invention is to provide an artificial intelligence parallel processing method and an artificial intelligence processing apparatus that solve technical problems in the prior art such as insufficient parallelism in artificial intelligence algorithm processing.
  • To this end, the present invention provides an artificial intelligence parallel processing method applied to a processing module. The method includes: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size; and causing the data transmission module to transmit the channel data fetched according to the preset data size to a convolution operation module, where the convolution operation module includes a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data.
  • Fetching the plurality of channel data from the external storage module according to the preset data size specifically includes: fetching each channel data from the external storage module into a first storage module in a 1*1 data size; fetching each channel data from the first storage module into a second storage module in a pv*1 data size, where pv is the data transmission parallelism and the number of columns of the channel data is an integer multiple of pv; fetching each channel data from the second storage module into a matrix module in a pv*k data size, where k is the size of the convolution kernel matrix; and fetching each channel data from the matrix module in a pv*k*k data size to perform parallel convolution operations with the plurality of convolution kernel matrices.
  • Fetching each channel data from the second storage module into the matrix module in a pv*k data size specifically includes: grouping the channel data into groups of k rows each; and having the data transmission module perform the following operation on each group in turn: in each clock cycle, sequentially fetch from the group a first to-be-processed datum of data size pv*k until the entire group has been fetched.
  • Fetching each channel data from the matrix module in a pv*k*k data size specifically includes: for each group of data, starting from the second fetched first to-be-processed datum, combining each first to-be-processed datum with the last two columns of the preceding one to form a second to-be-processed datum of (pv+2)*k data size; and, for each second to-be-processed datum, performing matrix extraction with a step size of 1 to obtain pv third to-be-processed data of size k*k, each of which is used for parallel convolution operations with the plurality of convolution kernel matrices.
  • The plurality of convolution kernel matrices includes a plurality of weight matrices with different weights, each of which performs a convolution operation with the third to-be-processed data simultaneously.
  • The present invention further provides an artificial intelligence parallel processing apparatus, including: an external storage module that stores a plurality of channel data; a processing module communicatively connected to the external storage module; a data transmission module for fetching the plurality of channel data from the external storage module according to a preset data size and transmitting them; and a convolution operation module including a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data fetched according to the preset data size.
  • The artificial intelligence parallel processing apparatus includes a first storage module for storing the channel data from the external storage module.
  • The artificial intelligence parallel processing apparatus includes a second storage module for storing the channel data from the first storage module.
  • The artificial intelligence parallel processing apparatus includes a matrix module for storing the channel data from the second storage module.
  • The present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the artificial intelligence parallel processing method.
  • The present invention provides an artificial intelligence processing terminal including a processor and a memory, where the memory stores a computer program and the processor is configured to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence parallel processing method.
  • The artificial intelligence parallel processing method, apparatus, readable storage medium, and terminal of the present invention have the following advantageous effects: the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes parallel convolution operations with hardware such as a convolution operation circuit. Especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency.
  • FIG. 1 is a flow chart showing a method for parallel processing of artificial intelligence according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing a data matrix to be processed in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing an artificial intelligence parallel processing apparatus according to an embodiment of the present invention.
  • The artificial intelligence parallel processing method is applied to a processing module, which may be, for example, an ARM module, an MCU module, or an SoC module.
  • The artificial intelligence parallel processing method specifically includes:
  • The data transmission module fetches a plurality of channel data from the external storage module according to a preset data size.
  • The data transmission module may transfer data by DMA.
  • DMA stands for Direct Memory Access and is used for data transfer between the external memory and the Programmable Logic (PL) side.
  • A DMA transfer is a high-speed data transfer operation that allows data to be read and written directly between external devices and memory without CPU intervention.
  • The external storage module may be, for example, a DDR memory arranged outside the Programmable Logic side and used to store a plurality of channel data.
  • The channel data are the data to be processed and are usually stored in memory in the form of data matrices.
  • The data transmission module transmits the fetched channel data to a convolution operation module for parallel convolution operations with a plurality of convolution kernel matrices.
  • The convolution operation module is a convolution operation circuit, which may be composed of multipliers and adders.
  • The convolution operation module includes a plurality of convolution kernel matrices, each with different weights.
  • For example, an image has three channels of data, R, G, and B, i.e., three two-dimensional matrices, each of size K*K; assume K is the odd number 3. Further assume that the data transmission module fetches the channel data in a data size of an 8*3*3 matrix, i.e., the data transmission module fetches eight 3*3 matrices at a time.
  • If the three two-dimensional matrices R, G, and B are not convolved in parallel, three consecutive computations are needed to finish the operation, which is time-consuming and inefficient.
  • Preferably, the three two-dimensional matrices R, G, and B are convolved in parallel with the eight 3*3 matrices so that each group of eight 3*3 matrices yields 8*3 convolution result values.
  • The present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting that of the next, and it realizes parallel convolution with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency.
  • The data transmission module fetches the channel data from the external storage module into the first storage module in a 1*1 data size.
  • The first storage module may be a RAM or ROM memory, for example DDR3 or DDR4 SDRAM.
  • FIG. 2 shows a schematic diagram of channel data in an embodiment of the present invention.
  • The data transmission module fetches the channel data from the first storage module into the second storage module in a pv*1 data size.
  • Here pv is the data transmission parallelism, indicating the number of columns of to-be-processed data transferred each time; its size is tied to the efficiency of the artificial intelligence parallel processing method, and the number of columns of the channel data is an integer multiple of pv.
  • In this embodiment pv = 8 and the channel data form a 34*40 matrix; a schematic of the transmission module fetching channel data in an 8*1 data size is described below with a specific illustration.
  • FIG. 3 shows a schematic diagram of the data transmission module fetching channel data in an embodiment of the present invention.
  • The data transmission module starts from the leftmost side of the first row of to-be-processed data and fetches 8*1 data at a time until the entire first row has been fetched. On the same principle, it continues with the second row, the third row, and so on, until the entire 34*40 matrix has been fetched.
  • After the data transmission module has stored the 34*40 matrix in the second storage module, it fetches it again row-wise in a pv*k data size, where k is the size of the convolution kernel matrix; the convolution kernel matrix is the weight matrix used in the convolution operation and may be set as an odd-order matrix, which in this embodiment is a 3*3 matrix. That is, the data transmission module fetches the 34*40 matrix from the second storage module in batches of 8*3 matrices and places them into the matrix module for data combination.
  • In each clock cycle, the data transmission module fetches 8*3 matrices from the first three rows of the 34*40 matrix in order from left to right; that is, five 8*3 matrices in total can be fetched from the first three rows. On the same principle, after the first three rows have been fetched, the data transmission module continues with the to-be-processed data of the subsequent rows.
  • The rectangular dashed boxes R1 to R5 in FIG. 2 mark the five 8*3 matrices in the first three rows.
  • FIG. 4 shows a schematic diagram of the data transmission module fetching data in an embodiment of the present invention.
  • The first 8*3 matrix M1 fetched by the data transmission module from the second storage module is treated specially in order to raise the pipelining of the artificial intelligence computation: because the first 8*3 matrix fetched from each row band can yield fewer than 8 convolution result values, it is marked as invalid data.
  • The convolution result of the 8*3 matrix M1 is therefore an invalid value.
  • The data transmission module then fetches the second 8*3 matrix M2, and M2 is combined with the last two columns of the 8*3 matrix M1 into a 10*3 matrix M12.
  • The line L1 indicates the matrix data combined with each other.
  • Combining the data matrix M2 with the last two columns of the data matrix M1 yields a data matrix M12 of (pv+2), i.e., 10, columns.
  • The 10*3 matrix M12 can undergo matrix extraction with a step size of 1, yielding eight 3*3 matrices.
  • The rectangular dashed box R6 starts at the position it covers in FIG. 4 and moves right column by column with a step size of 1, producing one 3*3 matrix per column moved.
  • The box R6 can thus move 7 times in total within the 10*3 matrix M12, giving eight 3*3 matrices in all, i.e., pv k*k matrices.
  • The eight 3*3 matrices are transmitted to the convolution operation module to be convolved in parallel with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values.
  • The data transmission module then fetches the third 8*3 matrix M3, and M3 is combined with the last two columns of the 8*3 matrix M2 into a 10*3 matrix M23; the line L2 indicates the matrix data combined with each other.
  • Combining the data matrix M3 with the last two columns of the data matrix M2 yields a data matrix M23 with 10 columns.
  • The 10*3 matrix M23 can undergo matrix extraction with a step size of 1 to obtain eight 3*3 matrices; these eight 3*3 to-be-processed data matrices are transmitted to the convolution operation module to be convolved with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values.
  • On the same principle, the data transmission module completes the processing of the entire 34*40 matrix after a number of clock cycles.
  • An artificial intelligence parallel processing apparatus includes: a first storage module 51, a second storage module 52, a data transmission module 53, a processing module 54, and a matrix module 55.
  • The first storage module 51, second storage module 52, data transmission module 53, and matrix module 55, together with a convolution operation module 56, are arranged on the Programmable Logic side 50 of an FPGA, commonly called the PL side.
  • The data transmission module is specifically configured to transmit the channel data over the system bus from the external storage module 57 to the first storage module 51 in a 1*1 data size, fetch them from the first storage module 51 and transmit them to the second storage module 52 in a pv*1 data size, fetch them from the second storage module 52 and transmit them to the matrix module in a pv*k data size, and finally fetch them from the matrix module and transmit them to the convolution operation module 56 in a pv*k*k data size.
  • The convolution operation module 56 is provided with a plurality of convolution kernel matrices for parallel convolution operations, specifically: convolution kernel matrix 1, convolution kernel matrix 2, ..., convolution kernel matrix n.
  • The first storage module 51 may be, for example, a BRAM memory, i.e., Block RAM, a RAM storage resource of an FPGA (Field-Programmable Gate Array).
  • The processing module 54 may be, for example, an ARM module, an MCU module, or an SoC module.
  • The implementation of the artificial intelligence processing apparatus is similar to that of the artificial intelligence parallel processing method and is therefore not repeated; those skilled in the art should be able to understand its principle and implementation on the basis of the method.
  • The aforementioned computer program may be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
  • The present invention also provides an artificial intelligence processing terminal, including a processor and a memory, where the memory stores a computer program and the processor is configured to execute it, so as to cause the terminal to perform the artificial intelligence parallel processing method.
  • The above memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory.
  • The above processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • In summary, the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next, and it realizes parallel convolution operations with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency, effectively overcoming various shortcomings of the prior art, and has high industrial utilization value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Multi Processors (AREA)

Abstract

An artificial intelligence parallel processing method, for use in a processing module (54), the method comprising: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size (S101); and causing the data transmission module to transmit the fetched channel data to a convolution operation module for parallel convolution operations with a plurality of convolution kernel matrices (S102). The method does not need to wait for the convolution operation of one convolution kernel matrix to finish before carrying out the convolution operation of the next, and it implements parallel convolution operations by means of a hardware device such as a convolution operation circuit; particularly for large amounts of data, this greatly improves the efficiency of convolution operations compared with software computation. Processing parallelism and computational efficiency are thus greatly improved.

Description

Artificial intelligence parallel processing method, device, readable storage medium, and terminal

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to an artificial intelligence parallel processing method and device, a readable storage medium, and a terminal.

Background Art

Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence.

An artificial intelligence algorithm is a neural network model algorithm that simulates the human brain, and its computational load is enormous: AlphaGo, which likewise uses artificial intelligence algorithms, requires thousands of traditional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, as artificial intelligence enjoys a new wave of revival, traditional processors are becoming a bottleneck that hinders its spread.

However, the parallelism of current artificial intelligence algorithm processing is insufficient, making such algorithms inefficient. How to achieve highly parallel artificial intelligence processing has therefore become a key problem in the field of artificial intelligence technology.
Summary of the Invention

In view of the above shortcomings of the prior art, an object of the present invention is to provide an artificial intelligence parallel processing method and an artificial intelligence processing device, so as to solve technical problems in the prior art such as insufficient parallelism in artificial intelligence algorithm processing.

To achieve the above and other related objects, the present invention provides an artificial intelligence parallel processing method applied to a processing module. The method includes: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size; and causing the data transmission module to transmit the channel data fetched according to the preset data size to a convolution operation module, where the convolution operation module includes a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data.
In an embodiment of the present invention, causing the data transmission module to fetch the plurality of channel data from the external storage module according to the preset data size specifically includes: fetching each channel data from the external storage module into a first storage module in a 1*1 data size; fetching each channel data from the first storage module into a second storage module in a pv*1 data size, where pv is the data transmission parallelism and the number of columns of the channel data is an integer multiple of pv; fetching each channel data from the second storage module into a matrix module in a pv*k data size, where k is the size of the convolution kernel matrix; and fetching each channel data from the matrix module in a pv*k*k data size to perform parallel convolution operations with the plurality of convolution kernel matrices. A sketch of these four stages is given below.
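As a minimal, non-authoritative sketch of the staged transfers above, the following Python fragment only tracks shapes and counts in software; the illustrative values pv = 8, k = 3 and the 34*40 channel matrix are taken from the embodiment described later, and the patent realizes these stages in hardware rather than in code:

```python
import numpy as np

pv, k = 8, 3                                  # illustrative values from the embodiment
channel = np.arange(34 * 40).reshape(34, 40)  # stand-in for one channel's data matrix

# Stage 1: external storage -> first storage module, one 1*1 element per transfer.
stage1_count = channel.size                   # 1360 single-element transfers

# Stage 2: first -> second storage module, pv*1 column strips per transfer.
strips = [channel[r, c:c + pv] for r in range(34) for c in range(0, 40, pv)]

# Stage 3: second storage module -> matrix module, pv*k tiles (k rows, pv columns).
tiles = [channel[r:r + k, c:c + pv]
         for r in range(0, 34 - k + 1, k)     # groups of k rows each
         for c in range(0, 40, pv)]

# Stage 4: matrix module -> convolution module, pv windows of k*k each (pv*k*k).
print(stage1_count, len(strips), len(tiles))  # 1360, 170, 55
```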
In an embodiment of the present invention, fetching each channel data from the second storage module into the matrix module in a pv*k data size specifically includes: grouping the channel data into groups of k rows each; and, through the data transmission module, performing the following operation on each group in turn: in each clock cycle, sequentially fetching from the group a first to-be-processed datum of data size pv*k until the entire group has been fetched.

In an embodiment of the present invention, fetching each channel data from the matrix module in a pv*k*k data size specifically includes: for each group of data, starting from the second fetched first to-be-processed datum, combining each first to-be-processed datum with the last two columns of the preceding first to-be-processed datum to form second to-be-processed data of data size (pv+2)*k; and, for each second to-be-processed datum, performing matrix extraction with a step size of 1 to obtain pv third to-be-processed data of size k*k, where each third to-be-processed datum is used for parallel convolution operations with the plurality of convolution kernel matrices.

In an embodiment of the present invention, the plurality of convolution kernel matrices includes a plurality of weight matrices with different weights, each of which performs a convolution operation with the third to-be-processed data simultaneously.
To achieve the above and other related objects, the present invention provides an artificial intelligence parallel processing device, including: an external storage module storing a plurality of channel data; a processing module communicatively connected to the external storage module; a data transmission module for fetching the plurality of channel data from the external storage module according to a preset data size and transmitting them; and a convolution operation module including a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data fetched according to the preset data size.

In an embodiment of the present invention, the artificial intelligence parallel processing device includes a first storage module for storing the channel data from the external storage module.

In an embodiment of the present invention, the artificial intelligence parallel processing device includes a second storage module for storing the channel data from the first storage module.

In an embodiment of the present invention, the artificial intelligence parallel processing device includes a matrix module for storing the channel data from the second storage module.

To achieve the above and other related objects, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the artificial intelligence parallel processing method.

To achieve the above and other related objects, the present invention provides an artificial intelligence processing terminal, including a processor and a memory, where the memory is used to store a computer program and the processor is configured to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence parallel processing method.

As described above, the artificial intelligence parallel processing method, device, readable storage medium, and terminal of the present invention have the following advantageous effects: the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes parallel convolution operations with hardware such as a convolution operation circuit. Especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency.
Brief Description of the Drawings

FIG. 1 is a flowchart of an artificial intelligence parallel processing method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a to-be-processed data matrix according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of a data transmission module fetching to-be-processed data according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of a data transmission module fetching to-be-processed data according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of an artificial intelligence parallel processing device according to an embodiment of the present invention.
Description of Reference Numerals

R1~R6        Rectangular dashed boxes
D1~D3        8*1 data
M1           8*3 matrix
M2           8*3 matrix
M3           8*3 matrix
M12          10*3 matrix
M23          10*3 matrix
L1           Straight line
L2           Straight line
T1           Clock cycle
T2           Clock cycle
T3           Clock cycle
50           Programmable Logic (PL) side
51           First storage module
52           Second storage module
53           Data transmission module
54           Processing module
55           Matrix module
56           Convolution operation module
57           External storage module
S101~S102    Steps
Detailed Description

The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other different specific embodiments, and the details in this specification may be modified or changed in various ways from different viewpoints and for different applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments may be combined with one another.

It should be noted that the drawings provided in the following embodiments merely illustrate the basic concept of the present invention in a schematic manner; the drawings show only the components related to the present invention rather than the number, shape, and size of components in an actual implementation. In practice, the type, quantity, and proportion of each component may vary arbitrarily, and the component layout may be more complex.
As shown in FIG. 1, a flowchart of an artificial intelligence parallel processing method in an embodiment of the present invention is presented. The artificial intelligence parallel processing method is applied to a processing module, which may be, for example, an ARM module, an MCU module, or an SoC module. The method specifically includes:

S101: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size.

The data transmission module may transfer data by DMA. DMA stands for Direct Memory Access and is used for data transfer between the external memory and the Programmable Logic (PL) side. A DMA transfer is a high-speed data transfer operation that allows data to be read and written directly between external devices and memory without CPU intervention.

The external storage module may be, for example, a DDR memory arranged outside the Programmable Logic side and used to store a plurality of channel data. The channel data are the data to be processed and are usually stored in memory in the form of data matrices.

S102: causing the data transmission module to transmit the fetched channel data to a convolution operation module for parallel convolution operations with a plurality of convolution kernel matrices.

The convolution operation module is a convolution operation circuit, which may be a circuit composed of multipliers and adders. The convolution operation module includes a plurality of convolution kernel matrices, each with different weights. For example, an image has three channels of data, R, G, and B, i.e., three two-dimensional matrices, each of size K*K; assume K is the odd number 3. Further assume that the data transmission module fetches the channel data in a data size of an 8*3*3 matrix, i.e., the data transmission module fetches eight 3*3 matrices at a time.
If the three two-dimensional matrices R, G, and B are not convolved in parallel, three consecutive computations are needed to finish the operation, which is time-consuming and inefficient. In the present invention, preferably, the three two-dimensional matrices R, G, and B are convolved in parallel with the eight 3*3 matrices so that each group of eight 3*3 matrices yields 8*3 convolution result values. The present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting that of the next, and it realizes parallel convolution with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency. A vectorized sketch of this parallelism follows.
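By way of illustration only (this NumPy fragment and its random values are the editor's assumption, not part of the patent), all n kernels can be applied to all pv windows in a single contraction, the software analogue of what the multiplier/adder circuits compute simultaneously:

```python
import numpy as np

# Assumed values matching the embodiment: pv = 8 windows of k*k = 3*3,
# n = 3 convolution kernel (weight) matrices; data values are random.
rng = np.random.default_rng(0)
pv, k, n = 8, 3, 3
windows = rng.random((pv, k, k))   # pv to-be-processed k*k matrices
kernels = rng.random((n, k, k))    # n convolution kernel matrices

# Sequential reference: one kernel after another (what the patent avoids).
seq = np.array([[np.sum(w * kern) for w in windows] for kern in kernels])

# Parallel form: one tensor contraction produces all n*pv results together,
# analogous to the 3*8 result values per clock cycle described in the text.
par = np.einsum('nij,pij->np', kernels, windows)
assert np.allclose(seq, par)
```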
The principle by which the data transmission module fetches channel data from the external storage module according to the preset data size is explained below with a specific embodiment.

The data transmission module fetches data from the external storage module into the first storage module in a 1*1 data size. The first storage module may be a RAM or ROM memory, for example DDR3 or DDR4 SDRAM.

As shown in FIG. 2, a schematic diagram of channel data in an embodiment of the present invention is presented. The data transmission module fetches data from the first storage module into the second storage module in a pv*1 data size, where pv is the data transmission parallelism, indicating the number of columns of to-be-processed data the data transmission module transfers each time; its size is tied to the efficiency of the artificial intelligence parallel processing method, and the number of columns of the channel data is an integer multiple of pv. In this embodiment, the data transmission parallelism pv = 8 and the channel data form a 34*40 matrix, so the data transmission module fetches the 34*40 matrix from the first storage module into the second storage module in an 8*1 data size. A schematic of the transmission module fetching channel data in an 8*1 data size is described below with a specific illustration.

As shown in FIG. 3, a schematic diagram of the data transmission module fetching channel data in an embodiment of the present invention is presented. The data transmission module starts from the leftmost side of the first row of to-be-processed data and fetches 8*1 data at a time until the entire first row has been fetched. By the same principle, the data transmission module continues with the second row, the third row, and so on, until the entire 34*40 matrix has been fetched.

Specifically, taking the first row as an example, the data transmission module fetches the first 8*1 matrix D1 and places it at address Addr=0 in the second storage module, fetches the second 8*1 matrix D2 and places it at Addr=1, fetches the third 8*1 matrix D3 and places it at Addr=2, and so on, until the entire 34*40 matrix has been moved from the first storage module into the second storage module, as the short sketch below illustrates.
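A minimal sketch of this address layout, with made-up values (the patent itself specifies only the order Addr = 0, 1, 2, ...):

```python
# Each pv*1 strip of a row is written to consecutive addresses in the
# second storage module: D1 -> Addr=0, D2 -> Addr=1, D3 -> Addr=2, ...
pv, cols = 8, 40
row = list(range(cols))                       # one 40-element row of channel data
strips = [row[c:c + pv] for c in range(0, cols, pv)]
for addr, strip in enumerate(strips):
    print(f"Addr={addr}: {strip}")
```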
After the data transmission module has stored the 34*40 matrix in the second storage module, it fetches it again row-wise in a pv*k data size, where k is the size of the convolution kernel matrix; the convolution kernel matrix is the weight matrix used in the convolution operation and may be set as an odd-order matrix, which in this embodiment is a 3*3 matrix. That is, the data transmission module fetches the 34*40 matrix from the second storage module in batches of 8*3 matrices and places them into the matrix module for data combination.
As shown in FIG. 2, in each clock cycle the data transmission module fetches 8*3 matrices from the first three rows of the 34*40 matrix in order from left to right; that is, a total of five 8*3 matrices can be fetched from the first three rows. By the same principle, after the first three rows have been fetched, the data transmission module continues to fetch the to-be-processed data of the subsequent rows. For ease of understanding, the rectangular dashed boxes R1 to R5 in FIG. 2 mark the five 8*3 matrices in the first three rows.

As shown in FIG. 4, a schematic diagram of the data transmission module fetching channel data in an embodiment of the present invention is presented. The first 8*3 matrix M1 fetched from the second storage module in the first clock cycle T1 is treated specially in order to raise the pipelining of the artificial intelligence computation: because the first 8*3 matrix fetched from each row band can yield fewer than 8 convolution result values, the first 8*3 matrix of each row band is marked as invalid data, i.e., the convolution result of the 8*3 matrix M1 is an invalid value.

In the second clock cycle T2, the data transmission module fetches the second 8*3 matrix M2, and M2 is combined with the last two columns of the 8*3 matrix M1 into a 10*3 matrix M12; in the figure, the line L1 indicates the matrix data combined with each other. By combining the data matrix M2 with the last two columns of the data matrix M1, a data matrix M12 of (pv+2), i.e., 10, columns is obtained.
The 10*3 matrix M12 can undergo matrix extraction with a step size of 1, yielding eight 3*3 matrices. Specifically, the rectangular dashed box R6 shown in FIG. 4 starts at the position it covers in the figure and moves to the right column by column with a step size of 1, producing one 3*3 matrix per column moved. The box R6 can thus move 7 times in total within the 10*3 matrix M12, for a total of eight 3*3 matrices, i.e., pv k*k matrices. These eight 3*3 matrices are transmitted to the convolution operation module to be convolved in parallel with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values. A sketch of this combine-and-extract step follows.
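The following sketch mirrors the combine-and-extract step in NumPy; the shapes follow the text, while the data values are made up by the editor for illustration:

```python
import numpy as np

# An 8*3 tile keeps the last two columns of its predecessor, giving a
# 3 x (pv+2) = 3 x 10 combined matrix (e.g. M12) from which pv = 8
# overlapping 3*3 windows are cut with stride 1 (the moving box R6).
pv, k = 8, 3
rng = np.random.default_rng(1)
m1 = rng.random((k, pv))                 # previous tile: k rows, pv columns
m2 = rng.random((k, pv))                 # current tile
m12 = np.hstack([m1[:, -2:], m2])        # combined 3 x 10 matrix

# Stride-1 extraction yields pv windows of size k*k.
windows = [m12[:, i:i + k] for i in range(pv)]
assert len(windows) == pv and windows[0].shape == (k, k)
```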
Similarly, in the third clock cycle T3, the data transmission module fetches the third 8*3 matrix M3, and M3 is combined with the last two columns of the 8*3 matrix M2 into a 10*3 matrix M23; in the figure, the line L2 indicates the matrix data combined with each other. Combining the data matrix M3 with the last two columns of the data matrix M2 yields a data matrix M23 with 10 columns. The 10*3 matrix M23 can undergo matrix extraction with a step size of 1 to obtain eight 3*3 matrices; these eight 3*3 to-be-processed data matrices are transmitted to the convolution operation module to be convolved with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values. By analogy, and on the same principle, the data transmission module completes the processing of the entire 34*40 matrix after a number of clock cycles.

As shown in FIG. 5, an artificial intelligence parallel processing device in an embodiment of the present invention includes a first storage module 51, a second storage module 52, a data transmission module 53, a processing module 54, and a matrix module 55. The first storage module 51, second storage module 52, data transmission module 53, and matrix module 55, together with a convolution operation module 56, are arranged on the Programmable Logic side 50 of an FPGA, commonly called the PL side.

The data transmission module is specifically configured to transmit the channel data over the system bus from the external storage module 57 to the first storage module 51 in a 1*1 data size, fetch them from the first storage module 51 and transmit them to the second storage module 52 in a pv*1 data size, fetch them from the second storage module 52 and transmit them to the matrix module in a pv*k data size, and finally fetch them from the matrix module and transmit them to the convolution operation module 56 in a pv*k*k data size.

The convolution operation module 56 is provided with a plurality of convolution kernel matrices for parallel convolution operations, specifically: convolution kernel matrix 1, convolution kernel matrix 2, ..., convolution kernel matrix n.

The first storage module 51 may be, for example, a BRAM memory, i.e., Block RAM, which is a RAM storage resource of an FPGA (Field-Programmable Gate Array). The processing module 54 may be, for example, an ARM module, an MCU module, or an SoC module.

The implementation of the artificial intelligence processing device is similar to that of the artificial intelligence parallel processing method and is therefore not repeated; those skilled in the art should be able to understand the principle and implementation of the artificial intelligence processing device on the basis of the artificial intelligence parallel processing method.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be accomplished by hardware related to a computer program. The aforementioned computer program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.

The present invention also provides an artificial intelligence processing terminal, including a processor and a memory, where the memory is used to store a computer program and the processor is configured to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence parallel processing method.

The above memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory.

The above processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes parallel convolution operations with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency. The present invention thus effectively overcomes various shortcomings of the prior art and has high industrial utilization value.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (11)

  1. An artificial intelligence parallel processing method, applied to a processing module, the method comprising:
    causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size;
    causing the data transmission module to transmit the fetched channel data to a convolution operation module;
    wherein the convolution operation module comprises a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data.
  2. The artificial intelligence parallel processing method according to claim 1, wherein causing the data transmission module to fetch the plurality of channel data from the external storage module according to the preset data size specifically comprises:
    fetching each of the channel data from the external storage module to a first storage module according to a 1*1 data size;
    fetching each of the channel data from the first storage module to a second storage module according to a pv*1 data size, where pv is the data transmission parallelism and the number of columns of the channel data is an integer multiple of pv;
    fetching each of the channel data from the second storage module to a matrix module according to a pv*k data size, where k is the size of the convolution kernel matrix;
    fetching each of the channel data from the matrix module according to a pv*k*k data size to perform parallel convolution operations with the plurality of convolution kernel matrices.
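As a non-authoritative illustration of the staged fetch in claim 2, the following Python sketch models the three intermediate buffers with plain lists; the function name stage_channel, the use of NumPy arrays, and the list-based buffers are assumptions of this sketch, not part of the disclosure.

```python
import numpy as np

def stage_channel(channel, pv, k):
    """Model the staged fetch of claim 2 for one channel (a 2-D array).

    1*1 elements fill the first storage module, pv*1 row segments fill the
    second storage module, and pv*k blocks fill the matrix module; the final
    pv*k*k fetch out of the matrix module is sketched after claim 4 below.
    """
    rows, cols = channel.shape
    assert cols % pv == 0, "claim 2: columns must be an integer multiple of pv"

    # 1*1 reads into the first storage module
    first_storage = [channel[r, c] for r in range(rows) for c in range(cols)]
    # pv*1 reads into the second storage module
    second_storage = [channel[r, c:c + pv]
                      for r in range(rows) for c in range(0, cols, pv)]
    # pv*k reads into the matrix module (k rows tall, pv columns wide)
    matrix_module = [channel[r:r + k, c:c + pv]
                     for r in range(0, rows - k + 1, k)
                     for c in range(0, cols, pv)]
    return first_storage, second_storage, matrix_module

if __name__ == "__main__":
    demo = np.arange(6 * 8).reshape(6, 8)   # 6 rows, 8 columns, pv = 4
    buffers = stage_channel(demo, pv=4, k=3)
```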
  3. The artificial intelligence parallel processing method according to claim 2, wherein fetching each of the channel data from the second storage module to the matrix module according to the pv*k data size specifically comprises:
    grouping the channel data into one group per k rows;
    performing, by the data transmission module, the following operation on each group of data in turn: in each clock cycle, sequentially fetching first to-be-processed data with a data size of pv*k from the group, until all data of the group has been fetched.
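A minimal sketch of the per-group, per-clock-cycle fetch of claim 3, under the same assumption that a channel is a NumPy array; the generator name fetch_groups and its yielding of one block list per row group are illustrative choices of this sketch.

```python
def fetch_groups(channel, pv, k):
    """Sketch of claim 3: group the channel k rows at a time and, within each
    group, fetch one pv*k first-to-be-processed block per clock cycle, left
    to right, until the group is exhausted.

    Yields one list of (k, pv)-shaped blocks per row group.
    """
    rows, cols = channel.shape
    for top in range(0, rows - k + 1, k):     # one group per k rows
        group = channel[top:top + k, :]
        yield [group[:, left:left + pv]       # one block per clock cycle
               for left in range(0, cols, pv)]
```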
  4. The artificial intelligence parallel processing method according to claim 3, wherein fetching each of the channel data from the matrix module according to the pv*k*k data size specifically comprises:
    for each group of data, starting from the second fetched first to-be-processed data, combining each first to-be-processed data with the last two columns of the preceding first to-be-processed data to form second to-be-processed data with a (pv+2)*k data size;
    for each of the second to-be-processed data, performing matrix extraction with a step size of 1 to obtain pv k*k third to-be-processed data, where each of the third to-be-processed data is used to perform parallel convolution operations with the plurality of convolution kernel matrices.
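The window extraction of claim 4 can be pictured as follows. This sketch assumes k = 3, the kernel size for which a buffer of width pv + 2 yields exactly pv stride-1 windows of width k; the name sliding_windows is an assumption of the sketch.

```python
import numpy as np

def sliding_windows(group_blocks, pv, k=3):
    """Sketch of claim 4 for one row group: from the second fetched block on,
    prefix each pv*k block with the last two columns of its predecessor to
    form a (pv+2)*k buffer, then extract pv k*k windows at a step size of 1.
    """
    for prev, block in zip(group_blocks, group_blocks[1:]):
        combined = np.hstack([prev[:, -2:], block])   # shape (k, pv + 2)
        for left in range(pv):                        # pv windows, stride 1
            yield combined[:, left:left + k]          # each of shape (k, k)
```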
  5. The artificial intelligence parallel processing method according to claim 4, wherein the plurality of convolution kernel matrices comprise a plurality of weight matrices with different weights, which are convolved with the third to-be-processed data simultaneously.
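For claim 5, a vectorized NumPy einsum stands in for the hardware that convolves all weight matrices with one window simultaneously; multi_kernel_mac is a hypothetical helper of this sketch, not a component named in the disclosure.

```python
import numpy as np

def multi_kernel_mac(window, weight_matrices):
    """Sketch of claim 5: apply every weight matrix to one k*k window in a
    single vectorized step, modelling the simultaneous convolution with all
    kernel matrices.

    window:          one k*k third-to-be-processed matrix.
    weight_matrices: array of shape (num_kernels, k, k), each with
                     different weights.
    """
    # one multiply-accumulate result per kernel matrix
    return np.einsum('nij,ij->n', weight_matrices, window)
```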
  6. An artificial intelligence parallel processing apparatus, comprising:
    an external storage module storing a plurality of channel data;
    a processing module communicatively connected to the external storage module;
    a data transmission module, configured to fetch the plurality of channel data from the external storage module according to a preset data size and transmit them;
    a convolution operation module comprising a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data fetched according to the preset data size.
  7. The artificial intelligence processing apparatus according to claim 6, further comprising:
    a first storage module, configured to store the channel data from the external storage module.
  8. The artificial intelligence processing apparatus according to claim 7, further comprising:
    a second storage module, configured to store the channel data from the first storage module.
  9. The artificial intelligence processing apparatus according to claim 8, further comprising:
    a matrix module, configured to store the channel data from the second storage module.
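Tying the sketches above together, a hypothetical end-to-end pass through the module chain of claims 6 to 9 might read as follows; the dimensions, kernel count, and random weights are illustrative assumptions only.

```python
import numpy as np

pv, k = 4, 3
channel = np.arange(6 * 8, dtype=float).reshape(6, 8)  # 8 columns = 2 * pv
kernels = np.random.rand(5, k, k)                      # 5 distinct weight matrices

outputs = [multi_kernel_mac(window, kernels)             # claim 5
           for group in fetch_groups(channel, pv, k)     # claims 2 and 3
           for window in sliding_windows(group, pv, k)]  # claim 4
# outputs: 8 windows (2 row groups x 4 stride-1 windows), each yielding
# 5 parallel convolution results, one per kernel matrix.
```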
  10. A computer-readable storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, the artificial intelligence parallel processing method according to any one of claims 1 to 5 is implemented.
  11. An artificial intelligence processing terminal, comprising a processor and a memory;
    the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the terminal performs the artificial intelligence parallel processing method according to any one of claims 1 to 5.
PCT/CN2018/072663 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal WO2019136751A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/072663 WO2019136751A1 (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
CN201880002151.7A CN109416755B (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and device, readable storage medium and terminal
US16/929,819 US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072663 WO2019136751A1 (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072665 Continuation-In-Part WO2019136752A1 (en) 2018-01-15 2018-01-15 Artificial intelligence convolution processing method and device, readable storage medium and terminal

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2018/072665 Continuation-In-Part WO2019136752A1 (en) 2018-01-15 2018-01-15 Artificial intelligence convolution processing method and device, readable storage medium and terminal
US16/929,819 Continuation-In-Part US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Publications (1)

Publication Number Publication Date
WO2019136751A1 (en)

Family

ID=65462117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072663 WO2019136751A1 (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal

Country Status (2)

Country Link
CN (1) CN109416755B (en)
WO (1) WO2019136751A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298441B (en) * 2019-05-24 2022-01-11 深圳云天励飞技术有限公司 Data processing method, electronic device and computer readable storage medium
CN110928216B (en) * 2019-11-14 2020-12-15 深圳云天励飞技术有限公司 Artificial intelligence device
CN113705795A (en) * 2021-09-16 2021-11-26 深圳思谋信息科技有限公司 Convolution processing method and device, convolution neural network accelerator and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132794A1 (en) * 2007-11-16 2009-05-21 Paul Michael Ebert Method and apparatus for performing complex calculations in a multiprocessor array
CN106530210A (en) * 2016-10-31 2017-03-22 北京大学 Equipment and method for realizing parallel convolution calculation based on resistive random access memory array
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN106909970A (en) * 2017-01-12 2017-06-30 南京大学 A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328644A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Adaptive selection of artificial neural networks
CN106228238B (en) * 2016-07-27 2019-03-22 中国科学技术大学苏州研究院 Accelerate the method and system of deep learning algorithm on field programmable gate array platform
CN106845635A (en) * 2017-01-24 2017-06-13 东南大学 CNN convolution kernel hardware design methods based on cascade form
CN106951395B (en) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 Parallel convolution operations method and device towards compression convolutional neural networks
CN106970896B (en) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306949A (en) * 2019-07-31 2021-02-02 中科寒武纪科技股份有限公司 Data processing method and device and related product
CN112306949B (en) * 2019-07-31 2022-11-01 中科寒武纪科技股份有限公司 Data processing method and device and related product
CN112132275A (en) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and device

Also Published As

Publication number Publication date
CN109416755A (en) 2019-03-01
CN109416755B (en) 2021-11-23

Similar Documents

Publication Publication Date Title
WO2019136751A1 (en) Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
JP6977239B2 (en) Matrix multiplier
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
CN112214726B (en) Operation accelerator
US9886377B2 (en) Pipelined convolutional operations for processing clusters
WO2017185389A1 (en) Device and method for use in executing matrix multiplication operations
WO2019136762A1 (en) Artificial intelligence processor and processing method applied thereto
US11544191B2 (en) Efficient hardware architecture for accelerating grouped convolutions
TWI827432B (en) Computing apparatus, machine learning computing apparatus, combined processing apparatus, neural network chip, electronic device, board, and computing method
JP2021521516A (en) Accelerators and systems for accelerating operations
CN108388537B (en) Convolutional neural network acceleration device and method
WO2018107383A1 (en) Neural network convolution computation method and device, and computer-readable storage medium
WO2019136764A1 (en) Convolutor and artificial intelligent processing device applied thereto
WO2019136752A1 (en) Artificial intelligence convolution processing method and device, readable storage medium and terminal
WO2019136750A1 (en) Artificial intelligence-based computer-aided processing device and method, storage medium, and terminal
US11550586B2 (en) Method and tensor traversal engine for strided memory access during execution of neural networks
Jeon et al. HMC-MAC: Processing-in memory architecture for multiply-accumulate operations with hybrid memory cube
CN115066692A (en) Apparatus and method for representing sparse matrices in neural networks
CN110837483B (en) Tensor dimension transformation method and device
CN109726822B (en) Operation method, device and related product
WO2019127507A1 (en) Data processing method and device, dma controller, and computer readable storage medium
WO2024027039A1 (en) Data processing method and apparatus, and device and readable storage medium
KR20210014561A (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
US11874898B2 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
WO2020103883A1 (en) Method for executing matrix multiplication, circuit and soc

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899322

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899322

Country of ref document: EP

Kind code of ref document: A1