WO2019136751A1 - Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal - Google Patents


Info

Publication number
WO2019136751A1
Authority
WO
WIPO (PCT)
Prior art keywords: data, module, artificial intelligence, storage module, matrix
Application number
PCT/CN2018/072663
Other languages
French (fr)
Chinese (zh)
Inventor
肖梦秋
Original Assignee
深圳鲲云信息科技有限公司
Application filed by 深圳鲲云信息科技有限公司
Priority to PCT/CN2018/072663 priority Critical patent/WO2019136751A1/en
Priority to CN201880002151.7A priority patent/CN109416755B/en
Publication of WO2019136751A1 publication Critical patent/WO2019136751A1/en
Priority to US16/929,819 priority patent/US11874898B2/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present invention relates to the field of artificial intelligence, and in particular to an artificial intelligence parallel processing method and apparatus, a readable storage medium, and a terminal.
  • An artificial intelligence algorithm is a neural network model algorithm that simulates the human brain, and its computational load is enormous: AlphaGo, which likewise relies on artificial intelligence algorithms, requires thousands of traditional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, as artificial intelligence enjoys a new wave of revival, traditional processors are becoming a bottleneck that hinders its spread.
  • The object of the present invention is to provide an artificial intelligence parallel processing method and an artificial intelligence processing apparatus that solve technical problems in the prior art such as insufficient parallelism in artificial intelligence algorithm processing.
  • To this end, the present invention provides an artificial intelligence parallel processing method applied to a processing module. The method includes: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size; and causing the data transmission module to transmit the channel data fetched according to the preset data size to a convolution operation module, where the convolution operation module includes a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data.
  • Fetching the plurality of channel data from the external storage module according to the preset data size specifically includes: fetching each channel data from the external storage module into a first storage module in a 1*1 data size; fetching each channel data from the first storage module into a second storage module in a pv*1 data size, where pv is the data transmission parallelism and the number of columns of the channel data is an integer multiple of pv; fetching each channel data from the second storage module into a matrix module in a pv*k data size, where k is the size of the convolution kernel matrix; and fetching each channel data from the matrix module in a pv*k*k data size to perform parallel convolution operations with the plurality of convolution kernel matrices.
  • Fetching each channel data from the second storage module into the matrix module in a pv*k data size specifically includes: grouping the channel data into groups of k rows each; and having the data transmission module perform the following operation on each group in turn: in each clock cycle, sequentially fetch from the group a first to-be-processed datum of data size pv*k until the entire group has been fetched.
  • Fetching each channel data from the matrix module in a pv*k*k data size specifically includes: for each group of data, starting from the second fetched first to-be-processed datum, combining each first to-be-processed datum with the last two columns of the preceding one to form a second to-be-processed datum of (pv+2)*k data size; and, for each second to-be-processed datum, performing matrix extraction with a step size of 1 to obtain pv third to-be-processed data of size k*k, each of which is used for parallel convolution operations with the plurality of convolution kernel matrices.
  • The plurality of convolution kernel matrices includes a plurality of weight matrices with different weights, each of which performs a convolution operation with the third to-be-processed data simultaneously.
  • The present invention further provides an artificial intelligence parallel processing apparatus, including: an external storage module that stores a plurality of channel data; a processing module communicatively connected to the external storage module; a data transmission module for fetching the plurality of channel data from the external storage module according to a preset data size and transmitting them; and a convolution operation module including a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data fetched according to the preset data size.
  • The artificial intelligence parallel processing apparatus includes a first storage module for storing the channel data from the external storage module.
  • The artificial intelligence parallel processing apparatus includes a second storage module for storing the channel data from the first storage module.
  • The artificial intelligence parallel processing apparatus includes a matrix module for storing the channel data from the second storage module.
  • The present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the artificial intelligence parallel processing method.
  • The present invention provides an artificial intelligence processing terminal including a processor and a memory, where the memory stores a computer program and the processor is configured to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence parallel processing method.
  • The artificial intelligence parallel processing method, apparatus, readable storage medium, and terminal of the present invention have the following advantageous effects: the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes parallel convolution operations with hardware such as a convolution operation circuit. Especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency.
  • FIG. 1 is a flow chart showing a method for parallel processing of artificial intelligence according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing a data matrix to be processed in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing data to be processed by a data transmission module according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing an artificial intelligence parallel processing apparatus according to an embodiment of the present invention.
  • The artificial intelligence parallel processing method is applied to a processing module, which may be, for example, an ARM module, an MCU module, or an SoC module.
  • The artificial intelligence parallel processing method specifically includes:
  • The data transmission module fetches a plurality of channel data from the external storage module according to a preset data size.
  • The data transmission module may transfer data by DMA.
  • DMA stands for Direct Memory Access and is used for data transfer between the external memory and the Programmable Logic (PL) side.
  • A DMA transfer is a high-speed data transfer operation that allows data to be read and written directly between external devices and memory without CPU intervention.
  • The external storage module may be, for example, a DDR memory arranged outside the Programmable Logic side and used to store a plurality of channel data.
  • The channel data are the data to be processed and are usually stored in memory in the form of data matrices.
  • The data transmission module transmits the fetched channel data to a convolution operation module for parallel convolution operations with a plurality of convolution kernel matrices.
  • The convolution operation module is a convolution operation circuit, which may be composed of multipliers and adders.
  • The convolution operation module includes a plurality of convolution kernel matrices, each with different weights.
  • For example, an image has three channels of data, R, G, and B, i.e., three two-dimensional matrices, each of size K*K; assume K is the odd number 3. Further assume that the data transmission module fetches the channel data in a data size of an 8*3*3 matrix, i.e., the data transmission module fetches eight 3*3 matrices at a time.
  • If the three two-dimensional matrices R, G, and B are not convolved in parallel, three consecutive computations are needed to finish the operation, which is time-consuming and inefficient.
  • Preferably, the three two-dimensional matrices R, G, and B are convolved in parallel with the eight 3*3 matrices so that each group of eight 3*3 matrices yields 8*3 convolution result values.
  • The present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting that of the next, and it realizes parallel convolution with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency.
  • The data transmission module fetches the channel data from the external storage module into the first storage module in a 1*1 data size.
  • The first storage module may be a RAM or ROM memory, for example DDR3 or DDR4 SDRAM.
  • FIG. 2 shows a schematic diagram of channel data in an embodiment of the present invention.
  • The data transmission module fetches the channel data from the first storage module into the second storage module in a pv*1 data size.
  • Here pv is the data transmission parallelism, indicating the number of columns of to-be-processed data transferred each time; its size is tied to the efficiency of the artificial intelligence parallel processing method, and the number of columns of the channel data is an integer multiple of pv.
  • In this embodiment pv = 8 and the channel data form a 34*40 matrix; a schematic of the transmission module fetching channel data in an 8*1 data size is described below with a specific illustration.
  • FIG. 3 shows a schematic diagram of the data transmission module fetching channel data in an embodiment of the present invention.
  • The data transmission module starts from the leftmost side of the first row of to-be-processed data and fetches 8*1 data at a time until the entire first row has been fetched. On the same principle, it continues with the second row, the third row, and so on, until the entire 34*40 matrix has been fetched.
  • After the data transmission module has stored the 34*40 matrix in the second storage module, it fetches it again row-wise in a pv*k data size, where k is the size of the convolution kernel matrix; the convolution kernel matrix is the weight matrix used in the convolution operation and may be set as an odd-order matrix, which in this embodiment is a 3*3 matrix. That is, the data transmission module fetches the 34*40 matrix from the second storage module in batches of 8*3 matrices and places them into the matrix module for data combination.
  • In each clock cycle, the data transmission module fetches 8*3 matrices from the first three rows of the 34*40 matrix in order from left to right; that is, five 8*3 matrices in total can be fetched from the first three rows. On the same principle, after the first three rows have been fetched, the data transmission module continues with the to-be-processed data of the subsequent rows.
  • The rectangular dashed boxes R1 to R5 in FIG. 2 mark the five 8*3 matrices in the first three rows.
  • FIG. 4 shows a schematic diagram of the data transmission module fetching data in an embodiment of the present invention.
  • The first 8*3 matrix M1 fetched by the data transmission module from the second storage module is treated specially in order to raise the pipelining of the artificial intelligence computation: because the first 8*3 matrix fetched from each row band can yield fewer than 8 convolution result values, it is marked as invalid data.
  • The convolution result of the 8*3 matrix M1 is therefore an invalid value.
  • The data transmission module then fetches the second 8*3 matrix M2, and M2 is combined with the last two columns of the 8*3 matrix M1 into a 10*3 matrix M12.
  • The line L1 indicates the matrix data combined with each other.
  • Combining the data matrix M2 with the last two columns of the data matrix M1 yields a data matrix M12 of (pv+2), i.e., 10, columns.
  • The 10*3 matrix M12 can undergo matrix extraction with a step size of 1, yielding eight 3*3 matrices.
  • The rectangular dashed box R6 starts at the position it covers in FIG. 4 and moves right column by column with a step size of 1, producing one 3*3 matrix per column moved.
  • The box R6 can thus move 7 times in total within the 10*3 matrix M12, giving eight 3*3 matrices in all, i.e., pv k*k matrices.
  • The eight 3*3 matrices are transmitted to the convolution operation module to be convolved in parallel with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values.
  • The data transmission module then fetches the third 8*3 matrix M3, and M3 is combined with the last two columns of the 8*3 matrix M2 into a 10*3 matrix M23; the line L2 indicates the matrix data combined with each other.
  • Combining the data matrix M3 with the last two columns of the data matrix M2 yields a data matrix M23 with 10 columns.
  • The 10*3 matrix M23 can undergo matrix extraction with a step size of 1 to obtain eight 3*3 matrices; these eight 3*3 to-be-processed data matrices are transmitted to the convolution operation module to be convolved with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values.
  • On the same principle, the data transmission module completes the processing of the entire 34*40 matrix after a number of clock cycles.
  • An artificial intelligence parallel processing apparatus includes: a first storage module 51, a second storage module 52, a data transmission module 53, a processing module 54, and a matrix module 55.
  • The first storage module 51, second storage module 52, data transmission module 53, and matrix module 55, together with a convolution operation module 56, are arranged on the Programmable Logic side 50 of an FPGA, commonly called the PL side.
  • The data transmission module is specifically configured to transmit the channel data over the system bus from the external storage module 57 to the first storage module 51 in a 1*1 data size, fetch them from the first storage module 51 and transmit them to the second storage module 52 in a pv*1 data size, fetch them from the second storage module 52 and transmit them to the matrix module in a pv*k data size, and finally fetch them from the matrix module and transmit them to the convolution operation module 56 in a pv*k*k data size.
  • The convolution operation module 56 is provided with a plurality of convolution kernel matrices for parallel convolution operations, specifically: convolution kernel matrix 1, convolution kernel matrix 2, ..., convolution kernel matrix n.
  • The first storage module 51 may be, for example, a BRAM memory, i.e., Block RAM, a RAM storage resource of an FPGA (Field-Programmable Gate Array).
  • The processing module 54 may be, for example, an ARM module, an MCU module, or an SoC module.
  • The implementation of the artificial intelligence processing apparatus is similar to that of the artificial intelligence parallel processing method and is therefore not repeated; those skilled in the art should be able to understand its principle and implementation on the basis of the method.
  • The aforementioned computer program may be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
  • The present invention also provides an artificial intelligence processing terminal, including a processor and a memory, where the memory stores a computer program and the processor is configured to execute it, so as to cause the terminal to perform the artificial intelligence parallel processing method.
  • The above memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory.
  • The above processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • In summary, the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next, and it realizes parallel convolution operations with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency, effectively overcoming various shortcomings of the prior art, and has high industrial utilization value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Multi Processors (AREA)

Abstract

An artificial intelligence parallel processing method, for use in a processing module (54), the method comprising: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size (S101); and causing the data transmission module to transmit the fetched channel data to a convolution operation module for parallel convolution operations with a plurality of convolution kernel matrices (S102). The method does not need to wait for the convolution operation of one convolution kernel matrix to finish before carrying out the convolution operation of the next, and it implements parallel convolution operations by means of a hardware device such as a convolution operation circuit; particularly for large amounts of data, this greatly improves the efficiency of convolution operations compared with software computation. Processing parallelism and computational efficiency are thus greatly improved.

Description

Artificial intelligence parallel processing method, device, readable storage medium, and terminal

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to an artificial intelligence parallel processing method and device, a readable storage medium, and a terminal.

Background Art

Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence.

An artificial intelligence algorithm is a neural network model algorithm that simulates the human brain, and its computational load is enormous: AlphaGo, which likewise uses artificial intelligence algorithms, requires thousands of traditional processors (CPUs) and hundreds of graphics processors (GPUs). Clearly, as artificial intelligence enjoys a new wave of revival, traditional processors are becoming a bottleneck that hinders its spread.

However, the parallelism of current artificial intelligence algorithm processing is insufficient, making such algorithms inefficient. How to achieve highly parallel artificial intelligence processing has therefore become a key problem in the field of artificial intelligence technology.
Summary of the Invention

In view of the above shortcomings of the prior art, an object of the present invention is to provide an artificial intelligence parallel processing method and an artificial intelligence processing device, so as to solve technical problems in the prior art such as insufficient parallelism in artificial intelligence algorithm processing.

To achieve the above and other related objects, the present invention provides an artificial intelligence parallel processing method applied to a processing module. The method includes: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size; and causing the data transmission module to transmit the channel data fetched according to the preset data size to a convolution operation module, where the convolution operation module includes a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data.
In an embodiment of the present invention, causing the data transmission module to fetch the plurality of channel data from the external storage module according to the preset data size specifically includes: fetching each channel data from the external storage module into a first storage module in a 1*1 data size; fetching each channel data from the first storage module into a second storage module in a pv*1 data size, where pv is the data transmission parallelism and the number of columns of the channel data is an integer multiple of pv; fetching each channel data from the second storage module into a matrix module in a pv*k data size, where k is the size of the convolution kernel matrix; and fetching each channel data from the matrix module in a pv*k*k data size to perform parallel convolution operations with the plurality of convolution kernel matrices. A sketch of these four stages is given below.
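As a minimal, non-authoritative sketch of the staged transfers above, the following Python fragment only tracks shapes and counts in software; the illustrative values pv = 8, k = 3 and the 34*40 channel matrix are taken from the embodiment described later, and the patent realizes these stages in hardware rather than in code:

```python
import numpy as np

pv, k = 8, 3                                  # illustrative values from the embodiment
channel = np.arange(34 * 40).reshape(34, 40)  # stand-in for one channel's data matrix

# Stage 1: external storage -> first storage module, one 1*1 element per transfer.
stage1_count = channel.size                   # 1360 single-element transfers

# Stage 2: first -> second storage module, pv*1 column strips per transfer.
strips = [channel[r, c:c + pv] for r in range(34) for c in range(0, 40, pv)]

# Stage 3: second storage module -> matrix module, pv*k tiles (k rows, pv columns).
tiles = [channel[r:r + k, c:c + pv]
         for r in range(0, 34 - k + 1, k)     # groups of k rows each
         for c in range(0, 40, pv)]

# Stage 4: matrix module -> convolution module, pv windows of k*k each (pv*k*k).
print(stage1_count, len(strips), len(tiles))  # 1360, 170, 55
```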
In an embodiment of the present invention, fetching each channel data from the second storage module into the matrix module in a pv*k data size specifically includes: grouping the channel data into groups of k rows each; and, through the data transmission module, performing the following operation on each group in turn: in each clock cycle, sequentially fetching from the group a first to-be-processed datum of data size pv*k until the entire group has been fetched.

In an embodiment of the present invention, fetching each channel data from the matrix module in a pv*k*k data size specifically includes: for each group of data, starting from the second fetched first to-be-processed datum, combining each first to-be-processed datum with the last two columns of the preceding first to-be-processed datum to form second to-be-processed data of data size (pv+2)*k; and, for each second to-be-processed datum, performing matrix extraction with a step size of 1 to obtain pv third to-be-processed data of size k*k, where each third to-be-processed datum is used for parallel convolution operations with the plurality of convolution kernel matrices.

In an embodiment of the present invention, the plurality of convolution kernel matrices includes a plurality of weight matrices with different weights, each of which performs a convolution operation with the third to-be-processed data simultaneously.
To achieve the above and other related objects, the present invention provides an artificial intelligence parallel processing device, including: an external storage module storing a plurality of channel data; a processing module communicatively connected to the external storage module; a data transmission module for fetching the plurality of channel data from the external storage module according to a preset data size and transmitting them; and a convolution operation module including a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data fetched according to the preset data size.

In an embodiment of the present invention, the artificial intelligence parallel processing device includes a first storage module for storing the channel data from the external storage module.

In an embodiment of the present invention, the artificial intelligence parallel processing device includes a second storage module for storing the channel data from the first storage module.

In an embodiment of the present invention, the artificial intelligence parallel processing device includes a matrix module for storing the channel data from the second storage module.

To achieve the above and other related objects, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the artificial intelligence parallel processing method.

To achieve the above and other related objects, the present invention provides an artificial intelligence processing terminal, including a processor and a memory, where the memory is used to store a computer program and the processor is configured to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence parallel processing method.

As described above, the artificial intelligence parallel processing method, device, readable storage medium, and terminal of the present invention have the following advantageous effects: the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes parallel convolution operations with hardware such as a convolution operation circuit. Especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency.
Brief Description of the Drawings

FIG. 1 is a flowchart of an artificial intelligence parallel processing method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a to-be-processed data matrix according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of a data transmission module fetching to-be-processed data according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of a data transmission module fetching to-be-processed data according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of an artificial intelligence parallel processing device according to an embodiment of the present invention.
Description of Reference Numerals

R1~R6        Rectangular dashed boxes
D1~D3        8*1 data
M1           8*3 matrix
M2           8*3 matrix
M3           8*3 matrix
M12          10*3 matrix
M23          10*3 matrix
L1           Straight line
L2           Straight line
T1           Clock cycle
T2           Clock cycle
T3           Clock cycle
50           Programmable Logic (PL) side
51           First storage module
52           Second storage module
53           Data transmission module
54           Processing module
55           Matrix module
56           Convolution operation module
57           External storage module
S101~S102    Steps
Detailed Description

The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other different specific embodiments, and the details in this specification may be modified or changed in various ways from different viewpoints and for different applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments may be combined with one another.

It should be noted that the drawings provided in the following embodiments merely illustrate the basic concept of the present invention in a schematic manner; the drawings show only the components related to the present invention rather than the number, shape, and size of components in an actual implementation. In practice, the type, quantity, and proportion of each component may vary arbitrarily, and the component layout may be more complex.
As shown in FIG. 1, a flowchart of an artificial intelligence parallel processing method in an embodiment of the present invention is presented. The artificial intelligence parallel processing method is applied to a processing module, which may be, for example, an ARM module, an MCU module, or an SoC module. The method specifically includes:

S101: causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size.

The data transmission module may transfer data by DMA. DMA stands for Direct Memory Access and is used for data transfer between the external memory and the Programmable Logic (PL) side. A DMA transfer is a high-speed data transfer operation that allows data to be read and written directly between external devices and memory without CPU intervention.

The external storage module may be, for example, a DDR memory arranged outside the Programmable Logic side and used to store a plurality of channel data. The channel data are the data to be processed and are usually stored in memory in the form of data matrices.

S102: causing the data transmission module to transmit the fetched channel data to a convolution operation module for parallel convolution operations with a plurality of convolution kernel matrices.

The convolution operation module is a convolution operation circuit, which may be a circuit composed of multipliers and adders. The convolution operation module includes a plurality of convolution kernel matrices, each with different weights. For example, an image has three channels of data, R, G, and B, i.e., three two-dimensional matrices, each of size K*K; assume K is the odd number 3. Further assume that the data transmission module fetches the channel data in a data size of an 8*3*3 matrix, i.e., the data transmission module fetches eight 3*3 matrices at a time.
If the three two-dimensional matrices R, G, and B are not convolved in parallel, three consecutive computations are needed to finish the operation, which is time-consuming and inefficient. In the present invention, preferably, the three two-dimensional matrices R, G, and B are convolved in parallel with the eight 3*3 matrices so that each group of eight 3*3 matrices yields 8*3 convolution result values. The present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting that of the next, and it realizes parallel convolution with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency. A vectorized sketch of this parallelism follows.
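By way of illustration only (this NumPy fragment and its random values are the editor's assumption, not part of the patent), all n kernels can be applied to all pv windows in a single contraction, the software analogue of what the multiplier/adder circuits compute simultaneously:

```python
import numpy as np

# Assumed values matching the embodiment: pv = 8 windows of k*k = 3*3,
# n = 3 convolution kernel (weight) matrices; data values are random.
rng = np.random.default_rng(0)
pv, k, n = 8, 3, 3
windows = rng.random((pv, k, k))   # pv to-be-processed k*k matrices
kernels = rng.random((n, k, k))    # n convolution kernel matrices

# Sequential reference: one kernel after another (what the patent avoids).
seq = np.array([[np.sum(w * kern) for w in windows] for kern in kernels])

# Parallel form: one tensor contraction produces all n*pv results together,
# analogous to the 3*8 result values per clock cycle described in the text.
par = np.einsum('nij,pij->np', kernels, windows)
assert np.allclose(seq, par)
```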
The principle by which the data transmission module fetches channel data from the external storage module according to the preset data size is explained below with a specific embodiment.

The data transmission module fetches data from the external storage module into the first storage module in a 1*1 data size. The first storage module may be a RAM or ROM memory, for example DDR3 or DDR4 SDRAM.

As shown in FIG. 2, a schematic diagram of channel data in an embodiment of the present invention is presented. The data transmission module fetches data from the first storage module into the second storage module in a pv*1 data size, where pv is the data transmission parallelism, indicating the number of columns of to-be-processed data the data transmission module transfers each time; its size is tied to the efficiency of the artificial intelligence parallel processing method, and the number of columns of the channel data is an integer multiple of pv. In this embodiment, the data transmission parallelism pv = 8 and the channel data form a 34*40 matrix, so the data transmission module fetches the 34*40 matrix from the first storage module into the second storage module in an 8*1 data size. A schematic of the transmission module fetching channel data in an 8*1 data size is described below with a specific illustration.

As shown in FIG. 3, a schematic diagram of the data transmission module fetching channel data in an embodiment of the present invention is presented. The data transmission module starts from the leftmost side of the first row of to-be-processed data and fetches 8*1 data at a time until the entire first row has been fetched. By the same principle, the data transmission module continues with the second row, the third row, and so on, until the entire 34*40 matrix has been fetched.

Specifically, taking the first row as an example, the data transmission module fetches the first 8*1 matrix D1 and places it at address Addr=0 in the second storage module, fetches the second 8*1 matrix D2 and places it at Addr=1, fetches the third 8*1 matrix D3 and places it at Addr=2, and so on, until the entire 34*40 matrix has been moved from the first storage module into the second storage module, as the short sketch below illustrates.
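A minimal sketch of this address layout, with made-up values (the patent itself specifies only the order Addr = 0, 1, 2, ...):

```python
# Each pv*1 strip of a row is written to consecutive addresses in the
# second storage module: D1 -> Addr=0, D2 -> Addr=1, D3 -> Addr=2, ...
pv, cols = 8, 40
row = list(range(cols))                       # one 40-element row of channel data
strips = [row[c:c + pv] for c in range(0, cols, pv)]
for addr, strip in enumerate(strips):
    print(f"Addr={addr}: {strip}")
```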
After the data transmission module has stored the 34*40 matrix in the second storage module, it fetches it again row-wise in a pv*k data size, where k is the size of the convolution kernel matrix; the convolution kernel matrix is the weight matrix used in the convolution operation and may be set as an odd-order matrix, which in this embodiment is a 3*3 matrix. That is, the data transmission module fetches the 34*40 matrix from the second storage module in batches of 8*3 matrices and places them into the matrix module for data combination.
As shown in FIG. 2, in each clock cycle the data transmission module fetches 8*3 matrices from the first three rows of the 34*40 matrix in order from left to right; that is, a total of five 8*3 matrices can be fetched from the first three rows. By the same principle, after the first three rows have been fetched, the data transmission module continues to fetch the to-be-processed data of the subsequent rows. For ease of understanding, the rectangular dashed boxes R1 to R5 in FIG. 2 mark the five 8*3 matrices in the first three rows.

As shown in FIG. 4, a schematic diagram of the data transmission module fetching channel data in an embodiment of the present invention is presented. The first 8*3 matrix M1 fetched from the second storage module in the first clock cycle T1 is treated specially in order to raise the pipelining of the artificial intelligence computation: because the first 8*3 matrix fetched from each row band can yield fewer than 8 convolution result values, the first 8*3 matrix of each row band is marked as invalid data, i.e., the convolution result of the 8*3 matrix M1 is an invalid value.

In the second clock cycle T2, the data transmission module fetches the second 8*3 matrix M2, and M2 is combined with the last two columns of the 8*3 matrix M1 into a 10*3 matrix M12; in the figure, the line L1 indicates the matrix data combined with each other. By combining the data matrix M2 with the last two columns of the data matrix M1, a data matrix M12 of (pv+2), i.e., 10, columns is obtained.
The 10*3 matrix M12 can undergo matrix extraction with a step size of 1, yielding eight 3*3 matrices. Specifically, the rectangular dashed box R6 shown in FIG. 4 starts at the position it covers in the figure and moves to the right column by column with a step size of 1, producing one 3*3 matrix per column moved. The box R6 can thus move 7 times in total within the 10*3 matrix M12, for a total of eight 3*3 matrices, i.e., pv k*k matrices. These eight 3*3 matrices are transmitted to the convolution operation module to be convolved in parallel with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values. A sketch of this combine-and-extract step follows.
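The following sketch mirrors the combine-and-extract step in NumPy; the shapes follow the text, while the data values are made up by the editor for illustration:

```python
import numpy as np

# An 8*3 tile keeps the last two columns of its predecessor, giving a
# 3 x (pv+2) = 3 x 10 combined matrix (e.g. M12) from which pv = 8
# overlapping 3*3 windows are cut with stride 1 (the moving box R6).
pv, k = 8, 3
rng = np.random.default_rng(1)
m1 = rng.random((k, pv))                 # previous tile: k rows, pv columns
m2 = rng.random((k, pv))                 # current tile
m12 = np.hstack([m1[:, -2:], m2])        # combined 3 x 10 matrix

# Stride-1 extraction yields pv windows of size k*k.
windows = [m12[:, i:i + k] for i in range(pv)]
assert len(windows) == pv and windows[0].shape == (k, k)
```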
Similarly, in the third clock cycle T3, the data transmission module fetches the third 8*3 matrix M3, and M3 is combined with the last two columns of the 8*3 matrix M2 into a 10*3 matrix M23; in the figure, the line L2 indicates the matrix data combined with each other. Combining the data matrix M3 with the last two columns of the data matrix M2 yields a data matrix M23 with 10 columns. The 10*3 matrix M23 can undergo matrix extraction with a step size of 1 to obtain eight 3*3 matrices; these eight 3*3 to-be-processed data matrices are transmitted to the convolution operation module to be convolved with the three 3*3 convolution kernel matrices, yielding 3*8 computation result values. By analogy, and on the same principle, the data transmission module completes the processing of the entire 34*40 matrix after a number of clock cycles.

As shown in FIG. 5, an artificial intelligence parallel processing device in an embodiment of the present invention includes a first storage module 51, a second storage module 52, a data transmission module 53, a processing module 54, and a matrix module 55. The first storage module 51, second storage module 52, data transmission module 53, and matrix module 55, together with a convolution operation module 56, are arranged on the Programmable Logic side 50 of an FPGA, commonly called the PL side.

The data transmission module is specifically configured to transmit the channel data over the system bus from the external storage module 57 to the first storage module 51 in a 1*1 data size, fetch them from the first storage module 51 and transmit them to the second storage module 52 in a pv*1 data size, fetch them from the second storage module 52 and transmit them to the matrix module in a pv*k data size, and finally fetch them from the matrix module and transmit them to the convolution operation module 56 in a pv*k*k data size.

The convolution operation module 56 is provided with a plurality of convolution kernel matrices for parallel convolution operations, specifically: convolution kernel matrix 1, convolution kernel matrix 2, ..., convolution kernel matrix n.

The first storage module 51 may be, for example, a BRAM memory, i.e., Block RAM, which is a RAM storage resource of an FPGA (Field-Programmable Gate Array). The processing module 54 may be, for example, an ARM module, an MCU module, or an SoC module.

The implementation of the artificial intelligence processing device is similar to that of the artificial intelligence parallel processing method and is therefore not repeated; those skilled in the art should be able to understand the principle and implementation of the artificial intelligence processing device on the basis of the artificial intelligence parallel processing method.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be accomplished by hardware related to a computer program. The aforementioned computer program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.

The present invention also provides an artificial intelligence processing terminal, including a processor and a memory, where the memory is used to store a computer program and the processor is configured to execute the computer program stored in the memory, so as to cause the terminal to execute the artificial intelligence parallel processing method.

The above memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory.

The above processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, the present invention does not need to wait for the convolution operation of one convolution kernel matrix to finish before starting the convolution operation of the next convolution kernel matrix, and it realizes parallel convolution operations with hardware such as a convolution operation circuit; especially for large amounts of data, this improves convolution efficiency dramatically compared with software computation. The artificial intelligence parallel processing method therefore greatly increases processing parallelism and computational efficiency. The present invention thus effectively overcomes various shortcomings of the prior art and has high industrial utilization value.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (11)

  1. An artificial intelligence parallel processing method, applied to a processing module, the method comprising:
    causing a data transmission module to fetch a plurality of channel data from an external storage module according to a preset data size;
    causing the data transmission module to transmit the fetched channel data to a convolution operation module;
    wherein the convolution operation module comprises a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data.
  2. The artificial intelligence parallel processing method according to claim 1, wherein causing the data transmission module to fetch the plurality of channel data from the external storage module according to the preset data size specifically comprises:
    fetching each of the channel data from the external storage module to a first storage module according to a 1*1 data size;
    fetching each of the channel data from the first storage module to a second storage module according to a pv*1 data size, where pv is the data transmission parallelism and the number of columns of the channel data is an integer multiple of pv;
    fetching each of the channel data from the second storage module to a matrix module according to a pv*k data size, where k is the size of the convolution kernel matrix;
    fetching each of the channel data from the matrix module according to a pv*k*k data size to perform parallel convolution operations with the plurality of convolution kernel matrices.
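As a non-authoritative illustration of the staged fetch in claim 2, the following Python sketch models the three intermediate buffers with plain lists; the function name stage_channel, the use of NumPy arrays, and the list-based buffers are assumptions of this sketch, not part of the disclosure.

```python
import numpy as np

def stage_channel(channel, pv, k):
    """Model the staged fetch of claim 2 for one channel (a 2-D array).

    1*1 elements fill the first storage module, pv*1 row segments fill the
    second storage module, and pv*k blocks fill the matrix module; the final
    pv*k*k fetch out of the matrix module is sketched after claim 4 below.
    """
    rows, cols = channel.shape
    assert cols % pv == 0, "claim 2: columns must be an integer multiple of pv"

    # 1*1 reads into the first storage module
    first_storage = [channel[r, c] for r in range(rows) for c in range(cols)]
    # pv*1 reads into the second storage module
    second_storage = [channel[r, c:c + pv]
                      for r in range(rows) for c in range(0, cols, pv)]
    # pv*k reads into the matrix module (k rows tall, pv columns wide)
    matrix_module = [channel[r:r + k, c:c + pv]
                     for r in range(0, rows - k + 1, k)
                     for c in range(0, cols, pv)]
    return first_storage, second_storage, matrix_module

if __name__ == "__main__":
    demo = np.arange(6 * 8).reshape(6, 8)   # 6 rows, 8 columns, pv = 4
    buffers = stage_channel(demo, pv=4, k=3)
```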
  3. The artificial intelligence parallel processing method according to claim 2, wherein fetching each of the channel data from the second storage module to the matrix module according to the pv*k data size specifically comprises:
    grouping the channel data into one group per k rows;
    performing, by the data transmission module, the following operation on each group of data in turn: in each clock cycle, sequentially fetching first to-be-processed data with a data size of pv*k from the group, until all data of the group has been fetched.
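A minimal sketch of the per-group, per-clock-cycle fetch of claim 3, under the same assumption that a channel is a NumPy array; the generator name fetch_groups and its yielding of one block list per row group are illustrative choices of this sketch.

```python
def fetch_groups(channel, pv, k):
    """Sketch of claim 3: group the channel k rows at a time and, within each
    group, fetch one pv*k first-to-be-processed block per clock cycle, left
    to right, until the group is exhausted.

    Yields one list of (k, pv)-shaped blocks per row group.
    """
    rows, cols = channel.shape
    for top in range(0, rows - k + 1, k):     # one group per k rows
        group = channel[top:top + k, :]
        yield [group[:, left:left + pv]       # one block per clock cycle
               for left in range(0, cols, pv)]
```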
  4. The artificial intelligence parallel processing method according to claim 3, wherein fetching each of the channel data from the matrix module according to the pv*k*k data size specifically comprises:
    for each group of data, starting from the second fetched first to-be-processed data, combining each first to-be-processed data with the last two columns of the preceding first to-be-processed data to form second to-be-processed data with a (pv+2)*k data size;
    for each of the second to-be-processed data, performing matrix extraction with a step size of 1 to obtain pv k*k third to-be-processed data, where each of the third to-be-processed data is used to perform parallel convolution operations with the plurality of convolution kernel matrices.
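The window extraction of claim 4 can be pictured as follows. This sketch assumes k = 3, the kernel size for which a buffer of width pv + 2 yields exactly pv stride-1 windows of width k; the name sliding_windows is an assumption of the sketch.

```python
import numpy as np

def sliding_windows(group_blocks, pv, k=3):
    """Sketch of claim 4 for one row group: from the second fetched block on,
    prefix each pv*k block with the last two columns of its predecessor to
    form a (pv+2)*k buffer, then extract pv k*k windows at a step size of 1.
    """
    for prev, block in zip(group_blocks, group_blocks[1:]):
        combined = np.hstack([prev[:, -2:], block])   # shape (k, pv + 2)
        for left in range(pv):                        # pv windows, stride 1
            yield combined[:, left:left + k]          # each of shape (k, k)
```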
  5. The artificial intelligence parallel processing method according to claim 4, wherein the plurality of convolution kernel matrices comprise a plurality of weight matrices with different weights, which are convolved with the third to-be-processed data simultaneously.
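For claim 5, a vectorized NumPy einsum stands in for the hardware that convolves all weight matrices with one window simultaneously; multi_kernel_mac is a hypothetical helper of this sketch, not a component named in the disclosure.

```python
import numpy as np

def multi_kernel_mac(window, weight_matrices):
    """Sketch of claim 5: apply every weight matrix to one k*k window in a
    single vectorized step, modelling the simultaneous convolution with all
    kernel matrices.

    window:          one k*k third-to-be-processed matrix.
    weight_matrices: array of shape (num_kernels, k, k), each with
                     different weights.
    """
    # one multiply-accumulate result per kernel matrix
    return np.einsum('nij,ij->n', weight_matrices, window)
```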
  6. An artificial intelligence parallel processing apparatus, comprising:
    an external storage module storing a plurality of channel data;
    a processing module communicatively connected to the external storage module;
    a data transmission module, configured to fetch the plurality of channel data from the external storage module according to a preset data size and transmit them;
    a convolution operation module comprising a plurality of convolution kernel matrices for performing parallel convolution operations with the channel data fetched according to the preset data size.
  7. The artificial intelligence processing apparatus according to claim 6, further comprising:
    a first storage module, configured to store the channel data from the external storage module.
  8. The artificial intelligence processing apparatus according to claim 7, further comprising:
    a second storage module, configured to store the channel data from the first storage module.
  9. The artificial intelligence processing apparatus according to claim 8, further comprising:
    a matrix module, configured to store the channel data from the second storage module.
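Tying the sketches above together, a hypothetical end-to-end pass through the module chain of claims 6 to 9 might read as follows; the dimensions, kernel count, and random weights are illustrative assumptions only.

```python
import numpy as np

pv, k = 4, 3
channel = np.arange(6 * 8, dtype=float).reshape(6, 8)  # 8 columns = 2 * pv
kernels = np.random.rand(5, k, k)                      # 5 distinct weight matrices

outputs = [multi_kernel_mac(window, kernels)             # claim 5
           for group in fetch_groups(channel, pv, k)     # claims 2 and 3
           for window in sliding_windows(group, pv, k)]  # claim 4
# outputs: 8 windows (2 row groups x 4 stride-1 windows), each yielding
# 5 parallel convolution results, one per kernel matrix.
```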
  10. A computer-readable storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, the artificial intelligence parallel processing method according to any one of claims 1 to 5 is implemented.
  11. An artificial intelligence processing terminal, comprising a processor and a memory;
    the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the terminal performs the artificial intelligence parallel processing method according to any one of claims 1 to 5.
PCT/CN2018/072663 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal WO2019136751A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/072663 WO2019136751A1 (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
CN201880002151.7A CN109416755B (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and device, readable storage medium and terminal
US16/929,819 US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072663 WO2019136751A1 (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072665 Continuation-In-Part WO2019136752A1 (en) 2018-01-15 2018-01-15 Artificial intelligence convolution processing method and device, readable storage medium and terminal

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2018/072665 Continuation-In-Part WO2019136752A1 (en) 2018-01-15 2018-01-15 Artificial intelligence convolution processing method and device, readable storage medium and terminal
US16/929,819 Continuation-In-Part US11874898B2 (en) 2018-01-15 2020-07-15 Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal

Publications (1)

Publication Number Publication Date
WO2019136751A1 (en)

Family

ID=65462117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072663 WO2019136751A1 (en) 2018-01-15 2018-01-15 Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal

Country Status (2)

Country Link
CN (1) CN109416755B (en)
WO (1) WO2019136751A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298441B (en) * 2019-05-24 2022-01-11 深圳云天励飞技术有限公司 Data processing method, electronic device and computer readable storage medium
CN110928216B (en) * 2019-11-14 2020-12-15 深圳云天励飞技术有限公司 Artificial intelligence device
CN113705795A (en) * 2021-09-16 2021-11-26 深圳思谋信息科技有限公司 Convolution processing method and device, convolution neural network accelerator and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132794A1 (en) * 2007-11-16 2009-05-21 Paul Michael Ebert Method and apparatus for performing complex calculations in a multiprocessor array
CN106530210A (en) * 2016-10-31 2017-03-22 北京大学 Equipment and method for realizing parallel convolution calculation based on resistive random access memory array
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN106909970A (en) * 2017-01-12 2017-06-30 南京大学 A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328644A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Adaptive selection of artificial neural networks
CN106228238B (en) * 2016-07-27 2019-03-22 中国科学技术大学苏州研究院 Accelerate the method and system of deep learning algorithm on field programmable gate array platform
CN106845635A (en) * 2017-01-24 2017-06-13 东南大学 CNN convolution kernel hardware design methods based on cascade form
CN106951395B (en) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 Parallel convolution operations method and device towards compression convolutional neural networks
CN106970896B (en) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306949A (en) * 2019-07-31 2021-02-02 中科寒武纪科技股份有限公司 Data processing method and device and related product
CN112306949B (en) * 2019-07-31 2022-11-01 中科寒武纪科技股份有限公司 Data processing method and device and related product
CN112132275A (en) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and device

Also Published As

Publication number Publication date
CN109416755A (en) 2019-03-01
CN109416755B (en) 2021-11-23

Similar Documents

Publication Publication Date Title
WO2019136751A1 (en) Artificial intelligence parallel processing method and apparatus, computer readable storage medium, and terminal
JP6977239B2 (en) Matrix multiplier
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
CN112214726B (en) Operation accelerator
US9886377B2 (en) Pipelined convolutional operations for processing clusters
WO2017185389A1 (en) Device and method for use in executing matrix multiplication operations
WO2019136762A1 (en) Artificial intelligence processor and processing method applied thereto
US11544191B2 (en) Efficient hardware architecture for accelerating grouped convolutions
TWI827432B (en) Computing apparatus, machine learning computing apparatus, combined processing apparatus, neural network chip, electronic device, board, and computing method
JP2021521516A (en) Accelerators and systems for accelerating operations
CN108388537B (en) Convolutional neural network acceleration device and method
WO2018107383A1 (en) Neural network convolution computation method and device, and computer-readable storage medium
WO2019136764A1 (en) Convolutor and artificial intelligent processing device applied thereto
WO2019136752A1 (en) Artificial intelligence convolution processing method and device, readable storage medium and terminal
WO2019136750A1 (en) Artificial intelligence-based computer-aided processing device and method, storage medium, and terminal
US11550586B2 (en) Method and tensor traversal engine for strided memory access during execution of neural networks
Jeon et al. HMC-MAC: Processing-in memory architecture for multiply-accumulate operations with hybrid memory cube
CN115066692A (en) Apparatus and method for representing sparse matrices in neural networks
CN110837483B (en) Tensor dimension transformation method and device
CN109726822B (en) Operation method, device and related product
WO2019127507A1 (en) Data processing method and device, dma controller, and computer readable storage medium
WO2024027039A1 (en) Data processing method and apparatus, and device and readable storage medium
KR20210014561A (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
US11874898B2 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
WO2020103883A1 (en) Method for executing matrix multiplication, circuit and soc

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899322

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.11.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899322

Country of ref document: EP

Kind code of ref document: A1