CN104123178B - Parallelism constraint detection method based on GPUs - Google Patents

Parallelism constraint detection method based on GPUs

Info

Publication number
CN104123178B
CN104123178B CN201410358441.9A
Authority
CN
China
Prior art keywords
node
current
result
constraint
pointer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410358441.9A
Other languages
Chinese (zh)
Other versions
CN104123178A (en)
Inventor
许畅
马晓星
吕建
眭骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CVIC Software Engineering Co Ltd
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201410358441.9A priority Critical patent/CN104123178B/en
Publication of CN104123178A publication Critical patent/CN104123178A/en
Application granted granted Critical
Publication of CN104123178B publication Critical patent/CN104123178B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Multi Processors (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present invention is a method for detecting constraints in parallel based on a graphics processor. Steps: 1) Using quantifiers as split points, a constraint is divided into several processing units; by scheduling these units, recursion is eliminated from the detection process and the degree of parallelism is maximized. 2) According to the current processing unit and the information sets, a corresponding number of GPU threads is generated; each thread computes its variable assignment from its own thread id and processes the unit under that assignment. A processing unit with an assignment is called a parallel computing unit, the smallest unit that can be processed in parallel on the GPU. 3) A two-level index/result-pool storage strategy: the variable-length results produced by the nodes of all parallel computing units are stored in a result pool, while the index stores each result's starting address and length within the pool. This strategy "allocates space serially, writes results in parallel" and achieves a high write speed.

Description

Parallelized Constraint Detection Method Based on a Graphics Processor

Technical Field

The invention relates to a parallelized constraint detection method based on a graphics processor (GPU).

Background

Constraint checking is a commonly used method for verifying the validity of information. A constraint expresses a relationship that should hold for one piece of information or among several pieces. In general, a constraint is built by connecting several kinds of nodes: "universal quantifier" nodes, "existential quantifier" nodes, "and" nodes, "or" nodes, "implies" nodes, "not" nodes, and "function" nodes, each describing a specific relationship. Checking a constraint means comparing the acquired information against the predefined constraint; a piece or group of information that violates the constraint is invalid. Constraint checking is usually embedded in other applications.

There are currently two main approaches to constraint checking: incremental checking and parallel checking. Both, however, depend entirely on the central processing unit (CPU), and therefore consume a large amount of computing resources that should serve other applications. The computation of the present method no longer depends on the CPU; instead, it relies mainly on the graphics processing unit (GPU). The method thus speeds up constraint checking while ensuring that ample computing resources remain available to other applications.

Summary of the Invention

Aiming at the deficiencies of the prior art, namely that current constraint checking takes too much time and occupies too many resources, the present invention proposes a GPU-based constraint checking method whose core consists of three parts: constraint preprocessing, a parallel strategy, and a storage strategy.

The technical solution of the present invention is a parallelized constraint detection method based on a graphics processor, comprising:

Constraint preprocessing, a quantifier-based constraint splitting method, specifically:

Step 1. Designate the constraint's head node as the current node and start splitting from it;

Step 2. If the current node is a "universal quantifier" or "existential quantifier" node, split this part into two sub-parts: one ending with the quantifier node, the other starting from the quantifier node's child; designate the child of the quantifier node as the current node and continue splitting;

Step 3. If the current node is an "and", "or", or "implies" node, designate its left child as the current node and continue splitting; after the left child has been processed, designate its right child as the current node and continue splitting;

Step 4. If the current node is a "not" node, designate its child as the current node and continue splitting;

Step 5. If the current node is a "function" node, stop the recursion for the current branch;

After splitting, a constraint is transformed into several processing units; the units are disjoint, and together they constitute the constraint.

A parallel strategy, a processing-unit-based parallel processing method, specifically:

Step 1. Compute the required number of threads N. Let the variables <v1, v2, ..., vn> on the path from the parent node of the current processing unit to the constraint's head node correspond to the context information sets <S1, S2, ..., Sn>, containing <I1, I2, ..., In> pieces of information respectively; then N = I1 × I2 × ... × In. If the path from the current processing unit to the head node contains no variable, or the current processing unit contains the head node, then N = 1;

Step 2. Generate N GPU threads with ids 0 to N-1 (the id is assigned automatically by the GPU); each thread independently computes its corresponding assignment from its own id. Let the integer value Mi = j mean that variable vi takes the j-th piece of information in its corresponding set Si (0 ≤ Mi < Ii); the value of Mi is then obtained by the following steps:

i: set size = 1, cur = n;

ii: if cur ≥ 1, go to sub-step iii; otherwise stop;

iii: size = size × Icur; cur = cur - 1; go to sub-step ii.

Step 3. Each thread maps its computed assignment onto the processing unit, producing the parallel computing unit that the thread must process; each thread processes its parallel computing unit independently;

All these GPU threads execute concurrently, and there are no dependencies among them.

A storage strategy. The two-level index/result-pool storage method mainly comprises three parts: 1) an index array, with two fields per entry: the starting position pos of the result and its length len; 2) a result array; 3) a result-array position pointer Pointer (the position pointer for short), which may only be written under mutual exclusion. Let the results produced by the n threads have lengths l1, l2, ..., li, ..., ln; the two-level index/result-pool storage method is specifically:

Step 1. Each thread computes, according to its assignment, the storage position of its current node in the index array;

Step 2. The threads acquire the result-array position pointer under mutual exclusion. Suppose the i-th thread acquires it; the thread sets the node's starting position pos to the current value of Pointer, and then

Step 3. updates the value of the position pointer: Pointer ← Pointer + li, where li is the length of the result produced by the i-th thread;

Step 4. After the update, the thread releases the result-array position pointer for use by other threads, writes its result into the result array, and fills the result's length into the node's len field.

Beneficial effects of the present invention: the invention uses the GPU efficiently for constraint checking. Constraint splitting removes recursion from constraint processing, adapting it to the way a GPU works; the processing-unit-based parallel strategy lets each thread locate and process its data independently, increasing the method's concurrency; and the concurrent storage strategy significantly improves storage efficiency. While improving the efficiency of constraint checking, the invention greatly reduces the dependence on CPU resources, so that more CPU resources can serve other applications. Moreover, since the GPU and the CPU can execute simultaneously without waiting for each other, the method gains additional efficiency from this overlap.

Brief Description of the Drawings

Fig. 1 is a constraint-processing example of the present invention.

Fig. 2 shows the context mapping and the parallel strategy in the computation of the present invention.

Fig. 3 shows the two-level storage strategy of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

This embodiment is a GPU-based constraint checking method whose core consists of three parts: constraint preprocessing, a parallel strategy, and a storage strategy. Specifically:

1. Constraint preprocessing. The present invention proposes a quantifier-based constraint splitting method comprising the following steps:

a) Designate the constraint's head node as the current node and start splitting from it;

b) If the current node is a "universal quantifier" or "existential quantifier" node, split this part into two sub-parts: one ending with the quantifier node, the other starting from the quantifier node's child; designate the child of the quantifier node as the current node and continue splitting;

c) If the current node is an "and", "or", or "implies" node, designate its left child as the current node and continue splitting; after the left child has been processed, designate its right child as the current node and continue splitting;

d) If the current node is a "not" node, designate its child as the current node and continue splitting;

e) If the current node is a "function" node, stop the recursion for the current branch. After splitting, a constraint is transformed into several processing units; the units are disjoint, and together they constitute the constraint. Figure 1 shows a constraint with the following meaning: for any taxi in city A, the distance it travels within a period of time must lie within a reasonable range. Under the algorithm above, this constraint is split into three parts, as indicated by the dotted lines in the figure.
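The splitting steps a) to e) can be sketched in code. This is a minimal sketch under stated assumptions: the node representation, the names `Node` and `split`, and the tree shape assumed for the Fig. 1 constraint (two nested universal quantifiers over an implication of two function leaves) are all illustrative, not from the patent.

```python
# Minimal sketch of the quantifier-based constraint splitting (steps a-e).
FORALL, EXISTS, AND, OR, IMPLIES, NOT, FUNC = range(7)

class Node:
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

def split(node, unit=None, units=None):
    """Collect processing units; a new unit starts below each quantifier."""
    if units is None:
        units = []
    if unit is None:
        unit = []
        units.append(unit)
    unit.append(node)                          # the node joins the current unit
    if node.kind in (FORALL, EXISTS):          # step b): unit ends at the quantifier,
        child_unit = []
        units.append(child_unit)               # a fresh unit starts at its child
        split(node.children[0], child_unit, units)
    elif node.kind in (AND, OR, IMPLIES):      # step c): left subtree, then right
        split(node.children[0], unit, units)
        split(node.children[1], unit, units)
    elif node.kind == NOT:                     # step d): descend into the child
        split(node.children[0], unit, units)
    # FUNC is a leaf: step e), stop recursion for this branch
    return units

# Assumed shape of the Fig. 1 constraint: forall a . forall b . (f1 implies f2)
c = Node(FORALL, [Node(FORALL, [Node(IMPLIES, [Node(FUNC), Node(FUNC)])])])
print(len(split(c)))  # 3
```

On the assumed tree the sketch yields three disjoint units, matching the three dotted regions described for Fig. 1.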

2. Parallel strategy. The processing-unit-based parallel processing method comprises the following steps:

a) Compute the required number of threads N. Let the variables <v1, v2, ..., vn> on the path from the parent node of the current processing unit to the constraint's head node correspond to the context information sets <S1, S2, ..., Sn>, containing <I1, I2, ..., In> pieces of information respectively; then N = I1 × I2 × ... × In. If the path from the current processing unit to the head node contains no variable, or the current processing unit contains the head node, then N = 1;

b) Generate N GPU threads with ids 0 to N-1 (the id is assigned automatically by the GPU); each thread independently computes its corresponding assignment from its own id. Let the integer value Mi = j mean that variable vi takes the j-th piece of information in its corresponding set Si (0 ≤ Mi < Ii); the value of Mi is then obtained by the following steps:

i. set size = 1, cur = n;

ii. if cur ≥ 1, go to iii; otherwise stop;

iii. size = size × Icur; cur = cur - 1; go to ii.

c) Each thread maps its computed assignment onto the processing unit, producing the parallel computing unit that it must process; each thread processes its parallel computing unit independently.

All these GPU threads execute concurrently, with no dependencies among them. Take the constraint of Fig. 1 as an example, and suppose two pieces of taxi information from city A have been received: taxi 1 and taxi 2. When processing unit 1 is computed, by step a) there are 2 variables (a and b) on the path from this processing unit to the root node, each of which can take two values (taxi 1 or taxi 2), so N = 4; by step b), 4 threads are generated (with ids 0, 1, 2, and 3). Each thread computes the values of its variables independently. Taking the thread with id 3 as an example, the assignments of the two variables are computed as follows:

size = 1, cur = 2;

size = 1 × Icur = 1 × 2 = 2, cur = cur - 1 = 1;

since cur ≥ 1, the process continues, giving M1 = 1; finally M1 = 1 and M2 = 1. Note that information is numbered from 0, so M1 = 1, M2 = 1 means that both variables take the second piece of information in their respective sets, i.e. (a = taxi 2, b = taxi 2). Mapping these values onto the processing unit yields a parallel computing unit. Parallel computing unit group 1 in Fig. 2 shows the four generated parallel computing units.
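The id-to-assignment computation above can be sketched as follows. Note a labeled assumption: the published sub-steps i to iii show only how size is accumulated, while the extraction of each Mi as (id // size) mod Ii is supplied here as the standard mixed-radix decode, chosen because it reproduces the stated result M1 = 1, M2 = 1 for thread id 3.

```python
# Hedged sketch of the thread-id -> assignment decode. The extraction line is
# an assumption (standard mixed-radix decode), not spelled out in the patent.
def decode(tid, I):
    """Map a GPU thread id to an assignment (M1, ..., Mn) with Mi in [0, Ii)."""
    n = len(I)
    M = [0] * n
    size, cur = 1, n                              # sub-step i
    while cur >= 1:                               # sub-step ii
        M[cur - 1] = (tid // size) % I[cur - 1]   # assumed extraction step
        size, cur = size * I[cur - 1], cur - 1    # sub-step iii
    return M

# Two variables (a, b), each over {taxi 1, taxi 2}: I = [2, 2], so N = 4 threads.
for tid in range(4):
    print(tid, decode(tid, [2, 2]))
# thread 3 yields [1, 1]: (a = taxi 2, b = taxi 2), as stated in the text
```

Each of the four ids decodes to a distinct assignment, so the threads cover all variable combinations without coordination.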

3. Storage strategy. The two-level index/result-pool storage method mainly comprises three parts: 1) an index array, with two fields per entry: the starting position pos of the result and its length len; 2) a result array; 3) a result-array position pointer Pointer (the position pointer for short), which may only be written under mutual exclusion. Let the results produced by the n threads have lengths l1, l2, ..., li, ..., ln; the storage procedure of the method is as follows:

a) Each thread computes, according to its assignment, the storage position of its current node in the index array;

b) The threads acquire the position pointer under mutual exclusion. Suppose the i-th thread acquires it; the thread sets the node's starting position pos to the current value of Pointer, and then

c) updates the value of the position pointer: Pointer ← Pointer + li, where li is the length of the result produced by the i-th thread;

d) After the update, the thread releases the position pointer for use by other threads, writes its result into the result array, and fills the result's length into the node's len field.

Figure 3 illustrates this storage strategy. Suppose three threads are processing three nodes at the same time, and all of them need to write results to memory; suppose their results occupy 1, 3, and 2 storage slots respectively, so that all three simultaneously request access to the position pointer (current value 0). Suppose thread t1 obtains access first: it sets the starting position of its own result to the current value of Pointer (i.e. 0), updates Pointer to 0 + 1 = 1, and then releases the pointer for threads t2 and t3 to access. After updating the position pointer, t1 writes its result into the result array and records its length (i.e. 1) in the index array; the first element of the index array thus records the starting position (0) and length (1) of the result produced by t1. Suppose t2 obtains access to Pointer before t3. Because of t1's update, the current value of Pointer is 1, so t2's result starts at position 1; t2 updates Pointer to 1 + 3 = 4, releases it, writes its result into the result array, and records its length (i.e. 3) in the index array. Under this strategy space is allocated serially, but multiple threads write their results in parallel and without conflict.
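The Fig. 3 walkthrough can be sketched with ordinary threads and a lock standing in for the GPU's mutually exclusive pointer access; the names `ResultPool` and `write` are illustrative, not from the patent. The writes below are issued in the order t1, t2, t3 to reproduce the walkthrough's deterministic positions; with truly concurrent callers any acquisition order still yields disjoint, conflict-free spans.

```python
# Hedged sketch of the two-level index / result-pool storage (Fig. 3).
import threading

class ResultPool:
    def __init__(self, n_nodes, pool_size):
        self.index = [{"pos": None, "len": None} for _ in range(n_nodes)]
        self.results = [None] * pool_size   # the result pool
        self.pointer = 0                    # next free slot in the pool
        self.lock = threading.Lock()        # only pointer updates are serialized

    def write(self, node, data):
        with self.lock:                     # serial space allocation ...
            pos = self.pointer
            self.pointer += len(data)
        # ... followed by a conflict-free write outside the lock
        self.results[pos:pos + len(data)] = data
        self.index[node] = {"pos": pos, "len": len(data)}
        return pos

pool = ResultPool(n_nodes=3, pool_size=6)
for node, length in enumerate([1, 3, 2]):   # result lengths of t1, t2, t3
    print(pool.write(node, ["r"] * length))  # prints 0, then 1, then 4
```

Only the two pointer operations sit inside the critical section, so the expensive part, copying results into the pool, proceeds in parallel, which is the "allocate serially, write in parallel" property claimed for the strategy.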

Parts 1 and 2 above describe the method of checking constraints in parallel on the GPU; part 3 describes an efficient way, suited to the GPU, of writing the results in parallel.

The above embodiment describes only part of the functionality of the present invention; the embodiment and the drawings are not intended to limit the invention. Any equivalent change or refinement made without departing from the spirit and scope of the invention likewise falls within its scope of protection, which shall therefore be defined by the claims of this application.

Claims (4)

1. A parallelized constraint detection method based on a graphics processor, characterized in that it comprises:

a quantifier-based constraint splitting step;
a processing-unit-based parallel processing step;
a storage strategy step;

the quantifier-based constraint splitting step being specifically:

Step 1. Designate the constraint's head node as the current node and start splitting from it;
Step 2. If the current node is a "universal quantifier" or "existential quantifier" node, split the node into two sub-parts: one ending with the quantifier node, the other starting from the quantifier node's child; designate the child of the quantifier node as the current node and continue splitting;
Step 3. If the current node is an "and", "or", or "implies" node, designate its left child as the current node and continue splitting; after the left child has been processed, designate its right child as the current node and continue splitting;
Step 4. If the current node is a "not" node, designate its child as the current node and continue splitting;
Step 5. If the current node is a "function" node, stop the recursion for the current branch;
after splitting, a constraint is transformed into several processing units; the units are disjoint, and together they constitute the constraint.

2. The parallelized constraint detection method according to claim 1, characterized in that the processing-unit-based parallel processing step is specifically:

Step 1. Compute the required number of threads N. Let the variables <v1, v2, ..., vn> on the path from the parent node of the current processing unit to the constraint's head node correspond to the context information sets <S1, S2, ..., Sn>, containing <I1, I2, ..., In> pieces of information respectively; then N = I1 × I2 × ... × In. If the path from the current processing unit to the head node contains no variable, or the current processing unit contains the head node, then N = 1;
Step 2. Generate N GPU threads with ids 0 to N-1 (the id is assigned automatically by the GPU); each thread independently computes its corresponding assignment from its own id, where the integer value Mi = j means that variable vi takes the j-th piece of information in its corresponding set Si (0 ≤ Mi < Ii);
Step 3. Each thread maps its computed assignment onto the processing unit, producing the parallel computing unit that it must process; each thread processes its parallel computing unit independently;
all these GPU threads execute concurrently, with no dependencies among them.

3. The parallelized constraint detection method according to claim 2, characterized in that the value of Mi in Step 2 is obtained by the following steps:

Sub-step i: set size = 1, cur = n;
Sub-step ii: if cur ≥ 1, go to sub-step iii; otherwise stop;
Sub-step iii: size = size × Icur; cur = cur - 1; go to sub-step ii.

4. The parallelized constraint detection method according to claim 1, characterized in that the storage strategy step is specifically a two-level index/result-pool storage method comprising three parts: 1) an index array, with two fields per entry: the starting position pos of the result and its length len; 2) a result array; 3) a result-array position pointer Pointer, which may only be written under mutual exclusion;

let the results produced by the n threads have lengths l1, l2, ..., li, ..., ln; the two-level index/result-pool storage method is then:

Step 1. Each thread computes, according to its assignment, the storage position of its current node in the index array;
Step 2. The threads acquire the result-array position pointer under mutual exclusion. Suppose the i-th thread acquires it; the thread sets the node's starting position pos to the current value of Pointer, and then
Step 3. updates the value of the position pointer: Pointer ← Pointer + li, where li is the length of the result produced by the i-th thread;
Step 4. After the update, the thread releases the result-array position pointer for use by other threads, writes its result into the result array, and fills the result's length into the node's len field.
CN201410358441.9A 2014-07-25 2014-07-25 Parallelism constraint detection method based on GPUs Active CN104123178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410358441.9A CN104123178B (en) 2014-07-25 2014-07-25 Parallelism constraint detection method based on GPUs


Publications (2)

Publication Number Publication Date
CN104123178A CN104123178A (en) 2014-10-29
CN104123178B true CN104123178B (en) 2017-05-17

Family

ID=51768602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410358441.9A Active CN104123178B (en) 2014-07-25 2014-07-25 Parallelism constraint detection method based on GPUs

Country Status (1)

Country Link
CN (1) CN104123178B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016187043A1 (en) 2015-05-15 2016-11-24 Cox Automotive, Inc. Parallel processing for solution space partitions
CA3000456A1 (en) * 2015-10-05 2017-04-13 Cox Automotive, Inc. Parallel processing for solution space partitions

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7120903B2 (en) * 2001-09-26 2006-10-10 Nec Corporation Data processing apparatus and method for generating the data of an object program for a parallel operation apparatus
CN101593129A (en) * 2008-05-28 2009-12-02 国际商业机器公司 Triggering has the method and apparatus of execution of a plurality of incidents of restriction relation
CN103201764A (en) * 2010-11-12 2013-07-10 高通股份有限公司 Parallel image processing using multiple processors
CN103294775A (en) * 2013-05-10 2013-09-11 苏州祥益网络科技有限公司 Police service cloud image recognition vehicle management and control system based on geographic space-time constraint




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220126

Address after: 250014 No. 41-1 Qianfo Shandong Road, Jinan City, Shandong Province

Patentee after: SHANDONG CVIC SOFTWARE ENGINEERING Co.,Ltd.

Address before: 210093 No. 22, Hankou Road, Gulou District, Jiangsu, Nanjing

Patentee before: NANJING University