WO2022088140A1 - AI chip and adjacency list sampling method - Google Patents

AI chip and adjacency list sampling method

Info

Publication number
WO2022088140A1
WO2022088140A1 PCT/CN2020/125656 CN2020125656W WO2022088140A1 WO 2022088140 A1 WO2022088140 A1 WO 2022088140A1 CN 2020125656 W CN2020125656 W CN 2020125656W WO 2022088140 A1 WO2022088140 A1 WO 2022088140A1
Authority
WO
WIPO (PCT)
Prior art keywords
adjacency list
adjacency
random numbers
npu
list
Prior art date
Application number
PCT/CN2020/125656
Other languages
English (en)
French (fr)
Inventor
李承扬
朱幸尔
杜霄鹏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN202080106754.9A priority Critical patent/CN116529709A/zh
Priority to PCT/CN2020/125656 priority patent/WO2022088140A1/zh
Publication of WO2022088140A1 publication Critical patent/WO2022088140A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/58Random or pseudo-random number generators

Definitions

  • the present application relates to the field of graph neural networks, in particular to an artificial intelligence (artificial intelligence, AI) chip and an adjacency list sampling method.
  • AI: artificial intelligence
  • Step 1: transpose the rows and columns of the adjacency list.
  • Step 2: generate random numbers.
  • Step 3: shuffle the transposed adjacency list according to the generated random numbers.
  • Step 4: transpose the rows and columns of the shuffled adjacency list.
  • Step 5: split the transposed adjacency list to obtain the sampled adjacency list.
  • steps 2 and 3 together constitute a random shuffle operation. The deployment scheme differs depending on the hardware.
  • the embodiments of the present application provide an AI chip and an adjacency list sampling method; based on the structural characteristics of the AI chip, the adjacency list sampling process is redesigned to reduce computation time and memory overhead.
  • an embodiment of the present application provides an AI chip that includes a random number generator and a neural network processing unit (NPU), where the random number generator is connected to the NPU;
  • a random number generator for generating K random numbers
  • the NPU is used to transpose the rows and columns of the input first adjacency list to obtain the second adjacency list.
  • the size of the first adjacency list is M*N, where M and N are both integers greater than 0; the size of the second adjacency list is N*M. The second adjacency list is shuffled according to the K random numbers to obtain the third adjacency list, whose size is K*M, and the target adjacency list is obtained from the third adjacency list, where the size of the target adjacency list is M*S and S is an integer less than N.
  • the sampling process of the adjacency list has been redesigned: random numbers are generated by the CPU or DSA, and the row-column transposition and shuffling of the adjacency list are performed by the NPU, which is equivalent to having the NPU carry out the sampling of the adjacency list.
  • this fusion of operators reduces computation time and memory overhead.
  • the random number generator is a CPU or a domain-specific accelerator (DSA).
  • DSA: domain-specific accelerator
  • the value range of the K random numbers is [0, N-1]; in the aspect of shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list,
  • the NPU is specifically used for:
  • obtaining K first vectors, in one-to-one correspondence with the K random numbers,
  • where the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers;
  • arranging the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.
  • generating the K random numbers is performed before the NPU transposes the first adjacency list, so that when the adjacency list is sampled multiple times, the random-number-generation process can be hidden within the NPU's computation, further reducing the sampling time.
  • the NPU is specifically used for:
  • as for the number of random numbers, the random number generator can generate them according to a user instruction, in which case the number of random numbers is S; it can also generate them by default, in which case the number of random numbers is N.
  • an embodiment of the present application provides an adjacency list sampling method, and the method is applied to an AI chip.
  • the AI chip includes a random number generator and an NPU, including:
  • the random number generator generates K random numbers; the NPU obtains the first adjacency list, whose size is M*N, where M and N are both integers greater than 0; the NPU transposes the first adjacency list to obtain the second adjacency list, whose size is N*M; the NPU shuffles the second adjacency list according to the K random numbers to obtain the third adjacency list, whose size is K*M; the NPU obtains the target adjacency list from the third adjacency list, whose size is M*S, where S is an integer less than N.
  • the value range of the K random numbers is [0, N-1], and the NPU shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list includes:
  • the NPU obtains K first vectors from the second adjacency list according to the K random numbers, with the K first vectors in one-to-one correspondence with the K random numbers; the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, where i is the value of the random number corresponding to the j-th first vector among the K random numbers;
  • the first vectors are arranged to obtain the third adjacency list.
  • generating the K random numbers is performed before the NPU transposes the first adjacency list.
  • the NPU obtaining the target adjacency list according to the third adjacency list includes:
  • an embodiment of the present application provides an adjacency list sampling device, the adjacency list sampling device has the function of implementing the second aspect above, and the function can be implemented by hardware or by executing corresponding software in hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • embodiments of the present application provide a computer program product, including computer instructions, which, when the computer instructions are executed on an electronic device, cause the electronic device to execute part or all of the method described in the second aspect.
  • an embodiment of the present application provides a computer storage medium for storing the computer software instructions used by the AI chip described in the first aspect or by the adjacency list sampling device described in the third aspect, including a program designed to carry out the above aspects.
  • Fig. 1a is a schematic diagram of an adjacency list sampling process
  • Figure 1b is a schematic diagram of an adjacency list sampling algorithm running on a CPU
  • Figure 1c is a schematic diagram of an adjacency list sampling algorithm running on a CPU+GPU architecture
  • Fig. 2a is a schematic diagram of the architecture of an AI chip provided by an embodiment of the application;
  • Fig. 2b is a schematic flowchart of an AI chip-based adjacency list sampling process provided by an embodiment of the application;
  • Fig. 2c is a schematic flowchart of another AI chip-based adjacency list sampling process provided by an embodiment of the application;
  • Fig. 2d is a schematic diagram of the execution order of the operations when the adjacency list is sampled twice;
  • Fig. 2e is a schematic diagram of the architecture of another AI chip provided by an embodiment of the application;
  • Fig. 3a is a schematic flowchart of another AI chip-based adjacency list sampling process provided by an embodiment of the application;
  • Fig. 3b is a schematic flowchart of another AI chip-based adjacency list sampling process provided by an embodiment of the application;
  • Fig. 4 is a schematic flowchart of an adjacency list sampling method provided by an embodiment of the present application.
  • An adjacency list is a graph storage structure, and each vertex of the graph has an adjacency list.
  • the adjacency list is a linear list, and the adjacency list of a vertex v in the graph contains all the vertices that are adjacent to the vertex v.
  • the size of the adjacency list specifically refers to the size of the matrix when the adjacency list is stored in the form of a matrix, so the adjacency list can also be regarded as a matrix.
  • FIG. 2a is a schematic diagram of an AI chip architecture provided by an embodiment of the present application.
  • the AI chip includes a random number generator 201 and a neural network processor 202, wherein the random number generator 201 is connected to the neural network processor 202;
  • a random number generator 201 configured to generate K random numbers, where K is an integer greater than 0 and the value range of the K random numbers is [0, N-1];
  • the neural network processor 202 is configured to perform row-column transposition on the input first adjacency list to obtain a second adjacency list, the scale of the first adjacency list is M*N, and the scale of the second adjacency list is N*M; according to K random numbers are used to rearrange the second adjacency list out of order to obtain the third adjacency list, the scale of which is K*M; the target adjacency list is obtained according to the third adjacency list, and the scale of the target adjacency list is M*S, where S is an integer less than N.
  • the random number generator 201 may be a CPU or a DSA.
  • in the aspect of shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list, the neural network processor 202 is specifically configured to:
  • obtain K first vectors, in one-to-one correspondence with the K random numbers,
  • where the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers;
  • arrange the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.
  • the neural network processor 202 obtains the K first vectors from the second adjacency list according to the K random numbers as follows: the neural network processor 202 maps each of the K random numbers to a random address, where the random address is the storage address in the cache of the corresponding first-column element of the second adjacency list; the neural network processor 202 then obtains the first vector corresponding to that random number from the cache according to the random address and the number of columns of the second adjacency list, thereby obtaining the K first vectors.
  • the values of the K random numbers may be partially identical, or may all differ from one another.
  • the random number generator 201 generates three random numbers in sequence: 2, 0, and 1.
  • the neural network processor 202 obtains three first vectors from the second adjacency list according to these three random numbers: (C1, C2, C3, C4), (A1, A2, A3, A4), and (B1, B2, B3, B4). Arranging these three first vectors in the order in which the random numbers were generated yields the third adjacency list, which can be expressed as:
  • in the aspect of obtaining the target adjacency list according to the third adjacency list, the neural network processor 202 is specifically configured to:
  • the random number generator 201 may generate K random numbers in the default manner, where K equals N; the fourth adjacency list obtained by the neural network processor 202 in the above manner then has size M*N and is split to complete the sampling of the adjacency list and obtain the above target adjacency list, as shown in Fig. 2b. The random number generator 201 may also generate K random numbers based on a user instruction, where K is the number of columns of the sampled adjacency list; in that case the fourth adjacency list obtained in the above manner is itself the target adjacency list, as shown in Fig. 2c.
  • in other words, when the user does not specify how many random numbers the random number generator 201 should generate, the random number generator 201 generates a default number of random numbers in the default manner; assuming the size of the first adjacency list is M*N, this default number is generally N. When the user determines that the size of the sampled adjacency list is M*S, the user informs the random number generator 201 of S, and the random number generator 201 then generates S random numbers. As described above, the size of the third adjacency list obtained by shuffling is determined by the number of random numbers, so whether to perform matrix splitting can be decided according to the number of random numbers generated by the random number generator 201.
  • the random number generator 201 generates the K random numbers before the neural network processor 202 transposes the first adjacency list.
  • the sampling of the adjacency list is performed repeatedly.
  • Fig. 2d shows the execution order of the operations when the adjacency list is sampled twice. As shown in Fig. 2d, after the NPU finishes the second row-column transposition operation, the NPU performs the third row-column transposition operation to begin the next sampling of the adjacency list.
  • for the random number generator, generating random numbers is time-consuming, and for the NPU, performing a row-column transposition operation is time-consuming. To reduce this cost, the random number generator performs the second random-number-generation operation before the NPU performs the third row-column transposition operation; that is, the random number generator performs the second random-number-generation operation while the NPU is performing the first shuffle operation and the second row-column transposition operation, or while the NPU is performing the second row-column transposition operation. This is equivalent to hiding the entire
  • random-number-generation process within the NPU's computation, which reduces the sampling time of the adjacency list.
  • the above-mentioned neural network processor 202 can implement the above process through a graph neural network, where the number of random numbers K is a hyperparameter of the graph neural network and the first adjacency list and the K random numbers are the inputs of the graph neural network; the neural network processor 202 invokes the graph neural network to execute the actions described above, thereby obtaining the target adjacency list.
  • the AI chip 200 includes a neural network processor 202, a central processing unit 203, and a domain-specific accelerator 204, where the central processing unit 203 and the domain-specific accelerator 204 are both connected to the neural network processor 202;
  • the central processing unit 203 or the domain-specific accelerator 204 is used to generate the above K random numbers; for the specific implementation of the neural network processor 202, refer to the relevant description of Fig. 2a, which is not repeated here.
  • the CPU or DSA in the AI chip generates 25 random numbers and stores them in high bandwidth memory (HBM)/DDR; the NPU obtains the 25 random numbers from the HBM and maps them to 25 random addresses; the NPU obtains the first adjacency list, whose size is 5120*100, from the HBM/DDR and transposes its rows and columns to obtain the second adjacency list.
  • the above 25 random addresses are the storage addresses in the cache of the corresponding first-column elements of the second adjacency list; the NPU reads 25 first vectors from the cache according to the 25 random addresses and the column count 5120.
  • the size of the third adjacency list is 25*5120; its rows and columns are then transposed to obtain the fourth adjacency list.
  • the size of the fourth adjacency list is 5120*25.
  • the fourth adjacency list is the target adjacency list, that is, the adjacency list obtained after sampling.
  • the sampling process of the adjacency list is redesigned: random numbers are generated by the CPU or DSA, and the row-column transposition and shuffling of the adjacency list are performed by the NPU.
  • this is equivalent to having the NPU carry out the sampling of the adjacency list, which fuses the operators, reduces computation time and memory overhead, and hides random number generation within the NPU's computation, further reducing the sampling time.
  • FIG. 4 is a schematic flowchart of an adjacency list sampling method provided by an embodiment of the present application.
  • the method is applied to an AI chip that includes a random number generator and an NPU. As shown in Fig. 4, the method includes:
  • a random number generator generates K random numbers.
  • the NPU acquires the first adjacency list, and performs row and column transposition on the first adjacency list to obtain the second adjacency list.
  • the value range of the K random numbers is [0, N-1], and the NPU shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list includes:
  • the NPU obtains K first vectors from the second adjacency list according to the K random numbers, with the K first vectors in one-to-one correspondence with the K random numbers; the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, where i is the value of the random number corresponding to the j-th first vector among the K random numbers;
  • the first vectors are arranged to obtain the third adjacency list.
  • generating the K random numbers is performed before the NPU transposes the first adjacency list.
  • the NPU obtaining the target adjacency list according to the third adjacency list includes:
  • the NPU performs out-of-order rearrangement on the second adjacency list according to the K random numbers to obtain a third adjacency list; and obtains a target adjacency list according to the third adjacency list.
  • Embodiments of the present application further provide a computer storage medium, where the computer storage medium can store a program that, when executed, implements some or all of the steps of any adjacency list sampling method described in the above method embodiments.
  • the aforementioned storage media include various media that can store program code, such as USB drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, and optical disks.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, and some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connections shown or discussed may be implemented through interfaces; the indirect coupling or communication connections between devices or units may be electrical or take other forms.
  • the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
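As a sanity check on the fused design described above, the following NumPy sketch (our illustration; the patent targets NPU hardware, and all variable names are ours) compares the baseline transpose-shuffle-transpose flow with a direct column gather and confirms they produce the same sampled adjacency list when the same random indices are used.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, S = 6, 10, 4                         # M vertices, N neighbors, sample S of them
adj = rng.integers(0, 100, size=(M, N))    # first adjacency list, M*N
idx = rng.integers(0, N, size=S)           # K = S random numbers in [0, N-1]

# Flow from the text: transpose -> shuffle rows by idx -> transpose back
t = adj.T                                  # second adjacency list, N*M
shuffled = t[idx]                          # third adjacency list, K*M
baseline = shuffled.T                      # target adjacency list, M*S

# The whole pipeline collapses to a single column gather
fused = adj[:, idx]

assert np.array_equal(baseline, fused)
```

This equivalence is why fusing the operators on one processor saves both intermediate memory and data movement.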

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

An artificial intelligence (AI) chip (200) and an adjacency list sampling method. The AI chip (200) includes a random number generator (201) and an NPU (202). The random number generator (201) is used to generate K random numbers. The NPU (202) is used to transpose the rows and columns of an input first adjacency list to obtain a second adjacency list, where the size of the first adjacency list is M*N and the size of the second adjacency list is N*M; to shuffle the second adjacency list according to the K random numbers to obtain a third adjacency list of size K*M; and to obtain a target adjacency list of size M*S from the third adjacency list, where S is an integer less than N. Based on the structural characteristics of the AI chip (200), the adjacency list sampling process is redesigned, reducing computation time and memory overhead.

Description

AI chip and adjacency list sampling method

Technical field

The present application relates to the field of graph neural networks, and in particular to an artificial intelligence (AI) chip and an adjacency list sampling method.

Background

In scenarios such as recommendation and search, smart cities, and intelligent transportation, features usually need to be extracted from non-Euclidean data. For example, in a social network, individuals and events form a graph structure; likewise, in e-commerce, individuals and products form a graph structure. These graphs contain hundreds of millions of nodes, and the large number of relationships between nodes form the edges. What a graph neural network mainly learns is how a node aggregates the features of its neighboring nodes and adjacent edges. Because the number of edges adjacent to a node has, in theory, no upper bound, training or inference may consume a great deal of time and enough memory to overflow. The industry therefore usually samples the adjacency list to limit the number of edges adjacent to each node, thereby bounding the time and memory overhead of training or inference.

The current algorithm flow for sampling an adjacency list is shown in Fig. 1a. Step 1: transpose the rows and columns of the adjacency list. Step 2: generate random numbers. Step 3: shuffle the transposed adjacency list according to the generated random numbers. Step 4: transpose the rows and columns of the shuffled adjacency list. Step 5: split the transposed adjacency list to obtain the sampled adjacency list. Steps 2 and 3 together constitute a random shuffle operation. The deployment scheme differs depending on the hardware. When the above sampling scheme is implemented on a CPU, as shown in Fig. 1b, steps 1-5 are all executed on the CPU, but the CPU performs the row-column transposition and matrix splitting operations poorly. As shown in Fig. 1c, when the scheme is implemented on a GPU+CPU architecture, steps 1, 4, and 5 are executed on the GPU while steps 2 and 3 are executed on the CPU, but because this involves moving data GPU->CPU->GPU, it incurs additional time overhead.
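The five-step baseline flow above can be sketched in NumPy (an illustrative model written by the editor, not code from the patent; the permutation stands in for the random shuffle of steps 2-3):

```python
import numpy as np

rng = np.random.default_rng(42)
M, N, S = 4, 8, 3
adj = rng.integers(0, 10, size=(M, N))   # adjacency list of M vertices, N neighbors each

# Step 1: row-column transposition                 -> N*M
t = adj.T
# Steps 2-3: random shuffle of the N rows
perm = rng.permutation(N)
t = t[perm]
# Step 4: row-column transposition back            -> M*N
t = t.T
# Step 5: matrix split, keep the first S columns   -> M*S
sampled = t[:, :S]

assert sampled.shape == (M, S)
```

On a CPU+GPU split, steps 1, 4, and 5 would run on the GPU and the shuffle on the CPU, which is where the GPU->CPU->GPU data movement cost comes from.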
Summary

Embodiments of the present application provide an AI chip and an adjacency list sampling method; based on the structural characteristics of the AI chip, the adjacency list sampling process is redesigned, reducing computation time and memory overhead.

In a first aspect, an embodiment of the present application provides an AI chip. The AI chip includes a random number generator and a neural network processing unit (NPU), where the random number generator is connected to the NPU.

The random number generator is used to generate K random numbers.

The NPU is used to transpose the rows and columns of an input first adjacency list to obtain a second adjacency list, where the size of the first adjacency list is M*N, M and N are both integers greater than 0, and the size of the second adjacency list is N*M; to shuffle the second adjacency list according to the K random numbers to obtain a third adjacency list of size K*M; and to obtain a target adjacency list of size M*S from the third adjacency list, where S is an integer less than N.

The adjacency list sampling process is redesigned: the random numbers are generated by the CPU or DSA, and the row-column transposition and shuffling of the adjacency list are performed by the NPU. This is equivalent to having the NPU carry out the entire sampling of the adjacency list, which fuses the operators and reduces computation time and memory overhead.

In a feasible embodiment, the random number generator is a CPU or a domain-specific accelerator (DSA).

In a feasible embodiment, the value range of the K random numbers is [0, N-1], and in the aspect of shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list, the NPU is specifically used for:

obtaining K first vectors from the second adjacency list according to the K random numbers, where the K first vectors correspond one-to-one to the K random numbers, the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers; and arranging the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.

In a feasible embodiment, the random number generator generates the K random numbers before the NPU transposes the first adjacency list, so that when the adjacency list is sampled multiple times, the random-number-generation process can be hidden within the NPU's computation, further reducing the sampling time.

In a feasible embodiment, in the aspect of obtaining the target adjacency list from the third adjacency list, the NPU is specifically used for:

transposing the rows and columns of the third adjacency list to obtain a fourth adjacency list of size M*K; when K equals N, splitting the fourth adjacency list to obtain the target adjacency list; and when K is less than N, the fourth adjacency list is the target adjacency list and S equals K.

As for the number of random numbers, the random number generator may generate them according to a user instruction, in which case the number of random numbers is S, or by default, in which case the number of random numbers is N.

In a second aspect, an embodiment of the present application provides an adjacency list sampling method. The method is applied to an AI chip that includes a random number generator and an NPU, and includes:

the random number generator generates K random numbers; the NPU obtains a first adjacency list of size M*N, where M and N are both integers greater than 0; the NPU transposes the first adjacency list to obtain a second adjacency list of size N*M; the NPU shuffles the second adjacency list according to the K random numbers to obtain a third adjacency list of size K*M; and the NPU obtains a target adjacency list of size M*S from the third adjacency list, where S is an integer less than N.

In a feasible embodiment, the value range of the K random numbers is [0, N-1], and the NPU shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list includes:

the NPU obtains K first vectors from the second adjacency list according to the K random numbers, where the K first vectors correspond one-to-one to the K random numbers, the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers; and the NPU arranges the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.

In a feasible embodiment, the random number generator generates the K random numbers before the NPU transposes the first adjacency list.

In a feasible embodiment, the NPU obtaining the target adjacency list from the third adjacency list includes:

transposing the rows and columns of the third adjacency list to obtain a fourth adjacency list of size M*K; when K equals N, splitting the fourth adjacency list to obtain the target adjacency list; and when K is less than N, the fourth adjacency list is the target adjacency list and S equals K.

In a third aspect, an embodiment of the present application provides an adjacency list sampling device having the functionality to implement the second aspect above; this functionality may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functionality.

In a fourth aspect, an embodiment of the present application provides a computer program product including computer instructions that, when run on an electronic device, cause the electronic device to perform some or all of the method described in the second aspect.

In a fifth aspect, an embodiment of the present application provides a computer storage medium for storing the computer software instructions used by the AI chip described in the first aspect or by the adjacency list sampling device described in the third aspect, including a program designed to carry out the above aspects.
Brief description of the drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below.

Fig. 1a is a schematic diagram of an adjacency list sampling process;

Fig. 1b is a schematic diagram of an adjacency list sampling algorithm running on a CPU;

Fig. 1c is a schematic diagram of an adjacency list sampling algorithm running on a CPU+GPU architecture;

Fig. 2a is a schematic diagram of the architecture of an AI chip provided by an embodiment of the present application;

Fig. 2b is a schematic flowchart of an AI chip-based adjacency list sampling process provided by an embodiment of the present application;

Fig. 2c is a schematic flowchart of another AI chip-based adjacency list sampling process provided by an embodiment of the present application;

Fig. 2d is a schematic diagram of the execution order of the operations when the adjacency list is sampled twice;

Fig. 2e is a schematic diagram of the architecture of another AI chip provided by an embodiment of the present application;

Fig. 3a is a schematic flowchart of another AI chip-based adjacency list sampling process provided by an embodiment of the present application;

Fig. 3b is a schematic flowchart of another AI chip-based adjacency list sampling process provided by an embodiment of the present application;

Fig. 4 is a schematic flowchart of an adjacency list sampling method provided by an embodiment of the present application.
Detailed description

To help those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings.

First, some terms used in the present application are explained.

An adjacency list is a graph storage structure; every vertex of the graph has an adjacency list. An adjacency list is a linear list, and the adjacency list of a vertex v in the graph contains all vertices adjacent to v.

The size of an adjacency list refers to the dimensions of the matrix when the adjacency list is stored in matrix form, so an adjacency list can also be regarded as a matrix.

Referring first to Fig. 2a, Fig. 2a is a schematic diagram of an AI chip architecture provided by an embodiment of the present application. As shown in Fig. 2a, the AI chip includes a random number generator 201 and a neural network processor 202, where the random number generator 201 is connected to the neural network processor 202.

The random number generator 201 is configured to generate K random numbers, where K is an integer greater than 0 and the value range of the K random numbers is [0, N-1].

The neural network processor 202 is configured to transpose the rows and columns of an input first adjacency list to obtain a second adjacency list, where the size of the first adjacency list is M*N and the size of the second adjacency list is N*M; to shuffle the second adjacency list according to the K random numbers to obtain a third adjacency list of size K*M; and to obtain a target adjacency list of size M*S from the third adjacency list, where S is an integer less than N.

Optionally, the random number generator 201 may be a CPU or a DSA.

In a feasible embodiment, in the aspect of shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list, the neural network processor 202 is specifically configured to:

obtain K first vectors from the second adjacency list according to the K random numbers, where the K first vectors correspond one-to-one to the K random numbers, the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers; and arrange the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.

Specifically, the neural network processor 202 obtaining the K first vectors from the second adjacency list according to the K random numbers includes: the neural network processor 202 maps each of the K random numbers to a random address, where the random address is the storage address in the cache of the corresponding first-column element of the second adjacency list; the neural network processor 202 then obtains the first vector corresponding to that random number from the cache according to the random address and the number of columns of the second adjacency list, thereby obtaining the K first vectors.
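The address mapping described above can be modeled with a flat row-major buffer standing in for the cache (an editorial sketch; `gather_row` and all names are ours, not from the patent): each random number r maps to the address of row r's first-column element, and the column count tells how many elements to read from that address.

```python
import numpy as np

M, N = 4, 6                        # second adjacency list has N rows and M columns
second = np.arange(N * M).reshape(N, M)
flat = second.ravel()              # row-major "cache" layout

def gather_row(buf, r, cols):
    # address of the first-column element of row r, then read `cols` elements
    base = r * cols
    return buf[base:base + cols]

idx = [2, 0, 1]                    # K random numbers in [0, N-1]
vectors = [gather_row(flat, r, M) for r in idx]
third = np.stack(vectors)          # arranged in generation order -> K*M

assert np.array_equal(third, second[idx])
```

Reading rows by computed address is what lets the shuffle run as a gather on the NPU instead of a separate shuffle kernel.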
It should be noted here that the values of the K random numbers may be partially identical, or may all differ from one another.

For example, suppose the second adjacency list is:
A1 A2 A3 A4
B1 B2 B3 B4
C1 C2 C3 C4
The random number generator 201 generates three random numbers in sequence: 2, 0, and 1. The neural network processor 202 obtains three first vectors from the second adjacency list according to these three random numbers: (C1, C2, C3, C4), (A1, A2, A3, A4), and (B1, B2, B3, B4). Arranging these three first vectors in the order in which the random numbers were generated yields the third adjacency list, which can be expressed as:
C1 C2 C3 C4
A1 A2 A3 A4
B1 B2 B3 B4
In a feasible embodiment, in the aspect of obtaining the target adjacency list from the third adjacency list, the neural network processor 202 is specifically configured to:

transpose the rows and columns of the third adjacency list to obtain a fourth adjacency list of size M*K; when K equals N, split the fourth adjacency list to obtain the target adjacency list; when K is less than N, the fourth adjacency list is the target adjacency list and S equals K.

The random number generator 201 may generate the K random numbers in the default manner, in which case K equals N; the fourth adjacency list obtained by the neural network processor 202 in the above manner then has size M*N and needs to be split to complete the sampling of the adjacency list and obtain the above target adjacency list, as shown in Fig. 2b. The random number generator 201 may also generate the K random numbers based on a user instruction, in which case K is the number of columns of the sampled adjacency list, so the fourth adjacency list obtained in the above manner is itself the target adjacency list, as shown in Fig. 2c.

In other words, when the user does not specify how many random numbers the random number generator 201 should generate, the random number generator 201 generates a default number of random numbers in the default manner; assuming the first adjacency list has size M*N, this default number is generally N. When the user determines that the sampled adjacency list has size M*S, the user informs the random number generator 201 of S, and the random number generator 201 then generates S random numbers. As described above, the size of the third adjacency list obtained by shuffling is determined by the number of random numbers, so whether to perform matrix splitting can be decided according to the number of random numbers generated by the random number generator 201.
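The two paths (default K = N with a final split, versus user-specified K = S with no split) can be sketched as follows. This is an editorial illustration: `sample_adjacency` is a hypothetical helper, and the split width used in the default branch is an assumption for demonstration, since the patent leaves the default split size to the deployment.

```python
import numpy as np

def sample_adjacency(adj, k, rng):
    """Sketch of the NPU-side flow: k == n means default mode
    (shuffle then split); k < n means the user asked for k columns."""
    m, n = adj.shape
    idx = rng.integers(0, n, size=k)      # K random numbers in [0, n-1]
    third = adj.T[idx]                    # shuffle-by-gather, K*M
    fourth = third.T                      # M*K
    if k == n:
        s = n // 2                        # illustrative split width (our assumption)
        return fourth[:, :s]              # matrix split -> target M*S
    return fourth                         # already the target, S == K

rng = np.random.default_rng(1)
adj = rng.integers(0, 9, size=(5, 8))
assert sample_adjacency(adj, 8, rng).shape == (5, 4)   # default: split
assert sample_adjacency(adj, 3, rng).shape == (5, 3)   # user-specified S: no split
```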
In a feasible embodiment, the random number generator 201 generates the K random numbers before the neural network processor 202 transposes the first adjacency list. For the random number generator and the NPU, the sampling of the adjacency list is performed repeatedly. Fig. 2d illustrates the execution order of the operations when the adjacency list is sampled twice. As shown in Fig. 2d, after the NPU finishes the second row-column transposition operation, the NPU performs the third row-column transposition operation to begin the next sampling of the adjacency list. Because generating random numbers is time-consuming for the random number generator, and performing a row-column transposition operation is time-consuming for the NPU, the random number generator performs the second random-number-generation operation before the NPU performs the third row-column transposition operation in order to reduce this cost; that is, the random number generator performs the second random-number-generation operation while the NPU is performing the first shuffle operation and the second row-column transposition operation, or while the NPU is performing the second row-column transposition operation. This is equivalent to hiding the entire random-number-generation process within the NPU's computation, which reduces the sampling time of the adjacency list.
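The overlap described above can be modeled with a worker thread standing in for the random number generator while the main thread plays the NPU (an editorial sketch of the scheduling idea only; real hardware would pipeline this without Python threads):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(7)
adj = rng.integers(0, 100, size=(512, 64))   # first adjacency list

def generate_randoms(k, n):
    # random-number generation for the NEXT sampling round
    return np.random.default_rng(123).integers(0, n, size=k)

with ThreadPoolExecutor(max_workers=1) as pool:
    # kick off random-number generation while the "NPU" is still
    # busy with the row-column transposition
    future = pool.submit(generate_randoms, 25, adj.shape[1])
    second = adj.T                            # row-column transposition
    idx = future.result()                     # random numbers ready by now
    third = second[idx]                       # shuffle-by-gather

assert third.shape == (25, 512)
```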
In a feasible embodiment, the above neural network processor 202 can implement the above process through a graph neural network, where the number of random numbers K is a hyperparameter of the graph neural network and the first adjacency list and the K random numbers are the inputs of the graph neural network; the neural network processor 202 invokes the graph neural network to execute the actions described above, thereby obtaining the target adjacency list.

In an optional embodiment, as shown in Fig. 2e, the AI chip 200 includes a neural network processor 202, a central processing unit 203, and a domain-specific accelerator 204, where the central processing unit 203 and the domain-specific accelerator 204 are both connected to the neural network processor 202; the central processing unit 203 or the domain-specific accelerator 204 is used to generate the above K random numbers. For the specific implementation of the neural network processor 202, refer to the relevant description of Fig. 2a, which is not repeated here.

In a specific example, as shown in Fig. 3a and Fig. 3b, the CPU or DSA in the AI chip generates 25 random numbers and stores them in high bandwidth memory (HBM)/DDR; the NPU obtains the 25 random numbers from the HBM and maps them to 25 random addresses; the NPU obtains the first adjacency list, whose size is 5120*100, from the HBM/DDR and transposes its rows and columns to obtain the second adjacency list. The above 25 random addresses are the storage addresses in the cache of the corresponding first-column elements of the second adjacency list. The NPU reads 25 first vectors from the cache according to the 25 random addresses and the column count 5120, and then arranges the 25 first vectors according to the order in which the 25 random numbers were generated to obtain the third adjacency list; this process may also be called random carry. The size of the third adjacency list is 25*5120. The rows and columns of the third adjacency list are then transposed to obtain the fourth adjacency list, whose size is 5120*25. The fourth adjacency list is the target adjacency list, that is, the adjacency list obtained after sampling.
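The concrete 5120*100 example above reduces to a few array operations in NumPy (an editorial sketch with synthetic data; the real flow runs on the NPU against HBM/DDR):

```python
import numpy as np

rng = np.random.default_rng(0)
first = rng.integers(0, 1000, size=(5120, 100))   # first adjacency list, 5120*100
idx = rng.integers(0, 100, size=25)               # 25 random numbers from CPU/DSA

second = first.T                                  # second adjacency list, 100*5120
third = second[idx]                               # random carry, 25*5120
fourth = third.T                                  # fourth = target adjacency list, 5120*25

assert fourth.shape == (5120, 25)
assert np.array_equal(fourth, first[:, idx])      # same result as a direct column gather
```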
It can be seen that in the solution of the embodiments of the present application, based on the structural characteristics of the AI chip, the adjacency list sampling process is redesigned: the random numbers are generated by the CPU or DSA, and the row-column transposition and shuffling of the adjacency list are performed by the NPU. This is equivalent to having the NPU carry out the entire sampling of the adjacency list, which fuses the operators and reduces computation time and memory overhead, while hiding random number generation within the NPU's computation further reduces the sampling time.

Referring to Fig. 4, Fig. 4 is a schematic flowchart of an adjacency list sampling method provided by an embodiment of the present application. The method is applied to an AI chip that includes a random number generator and an NPU. As shown in Fig. 4, the method includes:

S401. The random number generator generates K random numbers.

S402. The NPU obtains a first adjacency list and transposes its rows and columns to obtain a second adjacency list.

In a feasible embodiment, the value range of the K random numbers is [0, N-1], and the NPU shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list includes:

the NPU obtains K first vectors from the second adjacency list according to the K random numbers, where the K first vectors correspond one-to-one to the K random numbers, the elements of the j-th of the K first vectors are the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers; and the NPU arranges the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.

In a feasible embodiment, the random number generator generates the K random numbers before the NPU transposes the first adjacency list.

In a feasible embodiment, the NPU obtaining the target adjacency list from the third adjacency list includes:

transposing the rows and columns of the third adjacency list to obtain a fourth adjacency list of size M*K; when K equals N, splitting the fourth adjacency list to obtain the target adjacency list; when K is less than N, the fourth adjacency list is the target adjacency list and S equals K.

S403. The NPU shuffles the second adjacency list according to the K random numbers to obtain a third adjacency list, and obtains a target adjacency list from the third adjacency list.

It should be pointed out here that for the specific implementation of steps S401-S403, reference may be made to the relevant descriptions of the embodiments shown in Fig. 2a, Fig. 2b, Fig. 2c, Fig. 2d, Fig. 3a, and Fig. 3b above, which are not repeated here.

Embodiments of the present application further provide a computer storage medium, where the computer storage medium can store a program that, when executed, implements some or all of the steps of any adjacency list sampling method described in the above method embodiments. The aforementioned storage media include various media that can store program code, such as USB drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, and optical disks.

It should be noted that, for brevity, the foregoing method embodiments are all described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.

In the above embodiments, each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative: the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, and some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be implemented through interfaces, and the indirect coupling or communication connections between devices or units may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (9)

  1. An artificial intelligence (AI) chip, characterized in that the AI chip includes a random number generator and a neural network processing unit (NPU), where the random number generator is connected to the NPU;
    the random number generator is used to generate K random numbers;
    the NPU is used to transpose the rows and columns of an input first adjacency list to obtain a second adjacency list, where the size of the first adjacency list is M*N, M and N are both integers greater than 0, and the size of the second adjacency list is N*M; to shuffle the second adjacency list according to the K random numbers to obtain a third adjacency list of size K*M; and to obtain a target adjacency list of size M*S from the third adjacency list, where S is an integer less than N.
  2. The chip according to claim 1, characterized in that the random number generator is a central processing unit (CPU) or a domain-specific accelerator (DSA).
  3. The chip according to claim 1 or 2, characterized in that the random number generator generates the K random numbers before the NPU transposes the first adjacency list.
  4. The chip according to any one of claims 1-3, characterized in that the value range of the K random numbers is [0, N-1], and in the aspect of shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list, the NPU is specifically used for:
    obtaining K first vectors from the second adjacency list according to the K random numbers, where the K first vectors correspond one-to-one to the K random numbers, the elements of the j-th of the K first vectors include the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers;
    arranging the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.
  5. The chip according to any one of claims 1-4, characterized in that in the aspect of obtaining the target adjacency list from the third adjacency list, the NPU is specifically used for:
    transposing the rows and columns of the third adjacency list to obtain a fourth adjacency list of size M*K;
    when K equals N, splitting the fourth adjacency list to obtain the target adjacency list;
    when K is less than N, the fourth adjacency list is the target adjacency list, and S equals K.
  6. An adjacency list sampling method, characterized in that the method is applied to an artificial intelligence (AI) chip, where the AI chip includes a random number generator and an NPU, and the method includes:
    the random number generator generates K random numbers;
    the NPU obtains a first adjacency list, where the size of the first adjacency list is M*N and M and N are both integers greater than 0;
    the NPU transposes the rows and columns of the first adjacency list to obtain a second adjacency list, where the size of the second adjacency list is N*M;
    the NPU shuffles the second adjacency list according to the K random numbers to obtain a third adjacency list, where the size of the third adjacency list is K*M;
    the NPU obtains a target adjacency list from the third adjacency list, where the size of the target adjacency list is M*S and S is an integer less than N.
  7. The method according to claim 6, characterized in that the random number generator generates the K random numbers before the NPU transposes the first adjacency list.
  8. The method according to claim 6 or 7, characterized in that the value range of the K random numbers is [0, N-1], and the NPU shuffling the second adjacency list according to the K random numbers to obtain the third adjacency list includes:
    the NPU obtains K first vectors from the second adjacency list according to the K random numbers, where the K first vectors correspond one-to-one to the K random numbers, the elements of the j-th of the K first vectors include the elements of the i-th row of the second adjacency list, and i is the value of the random number corresponding to the j-th first vector among the K random numbers;
    the NPU arranges the K first vectors in the order in which the K random numbers were generated, to obtain the third adjacency list.
  9. The method according to any one of claims 6-8, characterized in that the NPU obtaining the target adjacency list from the third adjacency list includes:
    transposing the rows and columns of the third adjacency list to obtain a fourth adjacency list of size M*K;
    when K equals N, splitting the fourth adjacency list to obtain the target adjacency list;
    when K is less than N, the fourth adjacency list is the target adjacency list, and S equals K.
PCT/CN2020/125656 2020-10-31 2020-10-31 AI chip and adjacency list sampling method WO2022088140A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080106754.9A CN116529709A (zh) 2020-10-31 2020-10-31 一种ai芯片及邻接表采样方法
PCT/CN2020/125656 WO2022088140A1 (zh) 2020-10-31 2020-10-31 一种ai芯片及邻接表采样方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/125656 WO2022088140A1 (zh) 2020-10-31 2020-10-31 AI chip and adjacency list sampling method

Publications (1)

Publication Number Publication Date
WO2022088140A1 true WO2022088140A1 (zh) 2022-05-05

Family

ID=81381642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125656 WO2022088140A1 (zh) 2020-10-31 2020-10-31 AI chip and adjacency list sampling method

Country Status (2)

Country Link
CN (1) CN116529709A (zh)
WO (1) WO2022088140A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7080287B2 (en) * 2002-07-11 2006-07-18 International Business Machines Corporation First failure data capture
US7398276B2 (en) * 2002-05-30 2008-07-08 Microsoft Corporation Parallel predictive compression and access of a sequential list of executable instructions
CN102075974A (zh) * 2011-01-10 2011-05-25 张俊虎 无线传感器网络高邻接度资源搜索方法
CN102880739A (zh) * 2012-07-31 2013-01-16 中国兵器科学研究院 一种基于邻接表的网络最小路集确定方法
CN103345508A (zh) * 2013-07-04 2013-10-09 北京大学 一种适用于社会网络图的数据存储方法及系统
CN109145133A (zh) * 2018-07-26 2019-01-04 昆明理工大学 一种基于结构一致性的加权图聚集方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7398276B2 (en) * 2002-05-30 2008-07-08 Microsoft Corporation Parallel predictive compression and access of a sequential list of executable instructions
US7080287B2 (en) * 2002-07-11 2006-07-18 International Business Machines Corporation First failure data capture
CN102075974A (zh) * 2011-01-10 2011-05-25 张俊虎 无线传感器网络高邻接度资源搜索方法
CN102880739A (zh) * 2012-07-31 2013-01-16 中国兵器科学研究院 一种基于邻接表的网络最小路集确定方法
CN103345508A (zh) * 2013-07-04 2013-10-09 北京大学 一种适用于社会网络图的数据存储方法及系统
CN109145133A (zh) * 2018-07-26 2019-01-04 昆明理工大学 一种基于结构一致性的加权图聚集方法

Also Published As

Publication number Publication date
CN116529709A (zh) 2023-08-01

Similar Documents

Publication Publication Date Title
CN107025206B (zh) 一种量子傅立叶变换实现量子线路设计的方法
WO2017167095A1 (zh) 一种模型的训练方法和装置
Penkovsky et al. Efficient design of hardware-enabled reservoir computing in FPGAs
US20220255721A1 (en) Acceleration unit and related apparatus and method
Campeanu A mapping study on microservice architectures of Internet of Things and cloud computing solutions
JP2022502762A (ja) ニューラルネットワーク捜索方法、装置、プロセッサ、電子機器、記憶媒体及びコンピュータプログラム
CN109844774B (zh) 一种并行反卷积计算方法、单引擎计算方法及相关产品
WO2022179075A1 (zh) 一种数据处理方法、装置、计算机设备及存储介质
WO2022088140A1 (zh) 一种ai芯片及邻接表采样方法
CN115862751B (zh) 基于边特征更新聚合注意力机制的量子化学性质计算方法
JP6698061B2 (ja) 単語ベクトル変換装置、方法、及びプログラム
Zheng et al. Stochastic synchronization for an array of hybrid neural networks with random coupling strengths and unbounded distributed delays
Gerdt et al. Some algorithms for calculating unitary matrices for quantum circuits
CN114936645A (zh) 基于多矩阵变换的cnot量子线路最近邻综合优化方法
WO2015143708A1 (zh) 后缀数组的构造方法及装置
WO2021179117A1 (zh) 神经网络通道数搜索方法和装置
CN114237548A (zh) 基于非易失性存储器阵列的复数点乘运算的方法及系统
JP2002157237A (ja) マルチレベル不完全ブロック分解による前処理を行う処理装置
Sun et al. Efficient knowledge graph embedding training framework with multiple gpus
JP2021005242A (ja) 情報処理装置、情報処理プログラム、及び情報処理方法
Tripathy et al. Distributed Matrix-Based Sampling for Graph Neural Network Training
Kasarkin et al. New iteration parallel-based method for solving graph NP-complete problems with reconfigurable computer systems
WO2022064602A1 (ja) 信号処理装置、方法及びプログラム
US20240037433A1 (en) Method and device for constructing quantum circuit of qram architecture, and method and device for parsing quantum address data
US20240095565A1 (en) Method, Device, Storage Medium and Electronic Device for Data Reading

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20959301

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080106754.9

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20959301

Country of ref document: EP

Kind code of ref document: A1