CN111626916A - Information processing method, device and equipment


Info

Publication number
CN111626916A
Authority
CN
China
Prior art keywords
processing device, objects, task execution, thread, processing
Legal status
Granted
Application number
CN202010484303.0A
Other languages
Chinese (zh)
Other versions
CN111626916B
Inventor
祝娟
朱元昊
Current Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202010484303.0A
Publication of CN111626916A
Application granted
Publication of CN111626916B
Legal status: Active

Classifications

    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals


Abstract

An information processing method, an information processing device and information processing equipment are disclosed, wherein the method comprises the following steps: a plurality of first thread blocks of a first processing device respectively acquire corresponding object subsets in an object set, and obtain first objects in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets, wherein the plurality of object subsets corresponding to the plurality of first thread blocks are included in the object set; a second thread block of the first processing device processes a plurality of first objects obtained by the plurality of first thread blocks to obtain a second object in the plurality of first objects; and the first thread blocks respectively filter a plurality of objects in the corresponding object subset based on the second object obtained by the second thread block to obtain an updated corresponding object subset, wherein the updated object subsets corresponding to the first thread blocks are included in the updated object set.

Description

Information processing method, device and equipment
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to an information processing method, apparatus, and device.
Background
In the process of target detection through a neural network model, a large number of candidate frames are generated. Screening these candidate frames into detection frames is currently commonly performed using Non-Maximum Suppression (NMS).
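As a point of reference, a conventional sequential form of this screening can be sketched as follows. This is a minimal sketch for illustration only, not the method of this disclosure; the Box layout, the iou helper and the threshold name are assumptions. Because each decision depends on all previously kept boxes, the loop is inherently serial.

```cuda
// Minimal host-side sketch of classic sequential NMS (reference only, not the disclosed method).
// A candidate box is (x1, y1, x2, y2) with a confidence score; all names are assumptions.
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

static float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    float inter = std::max(0.f, ix2 - ix1) * std::max(0.f, iy2 - iy1);
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

std::vector<Box> sequentialNms(std::vector<Box> boxes, float iouThreshold) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& candidate : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (iou(candidate, k) > iouThreshold) { suppressed = true; break; }
        }
        if (!suppressed) kept.push_back(candidate);   // keep boxes that survive every comparison
    }
    return kept;
}
```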
With the growing demand for target detection and the increasing diversity and complexity of detection scenes, how to improve the efficiency of non-maximum suppression processing has become an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides an information processing scheme.
According to an aspect of the present disclosure, there is provided an information processing method, the method including: a plurality of first thread blocks of a first processing device respectively acquire corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets, wherein the plurality of object subsets corresponding to the plurality of first thread blocks are included in the object set; a second thread block of the first processing device processes a plurality of first objects obtained by the plurality of first thread blocks to obtain a second object in the plurality of first objects; and the first thread blocks respectively filter a plurality of objects in the corresponding object subset based on the second object obtained by the second thread block to obtain an updated corresponding object subset, wherein the updated object subsets corresponding to the first thread blocks are included in the updated object set.
In connection with any of the embodiments provided by this disclosure, each object included in the subset of objects is processed by a separate thread in the first thread block.
In combination with any embodiment provided by the present disclosure, the filtering, by the first thread blocks, the plurality of objects in the corresponding object subset based on the second object obtained by the second thread block, respectively, to obtain the updated corresponding object subset, including: the first thread block determining a weight coefficient for each object of a plurality of objects contained in a corresponding subset of objects based on the second object; the first thread block performs filtering processing on the plurality of objects based on a weight coefficient of each of the plurality of objects.
In combination with any embodiment provided by the present disclosure, obtaining a first object in the corresponding object subset by processing a plurality of objects included in the corresponding object subset includes: in response to receiving a first task execution instruction sent by a second processing device, the plurality of first thread blocks respectively obtain a first object in a corresponding object subset by processing a plurality of objects included in the corresponding object subset, wherein the first task execution instruction includes the plurality of object subsets; the method further comprises the following steps: and the first processing equipment sends a first task execution response to the second processing equipment, wherein the first task execution response comprises the first objects obtained by the plurality of first thread blocks respectively.
In combination with any one of the embodiments provided by the present disclosure, the obtaining, by the second thread block of the first processing device, a second object in the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks includes: in response to receiving a second task execution instruction sent by the second processing device, the second thread block obtains a second object in the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks, where the second task execution instruction includes the plurality of first objects; the method further comprises the following steps: and the first processing device sends a second task execution response to the second processing device, wherein the second task execution response comprises the second object obtained by the second thread block.
In combination with any embodiment provided by the present disclosure, the filtering, by the first thread blocks, the plurality of objects in the corresponding object subset respectively based on the second object obtained by the second thread block includes: in response to receiving a third task execution instruction sent by the second processing device, the multiple first thread blocks respectively perform filtering processing on multiple objects in corresponding object subsets based on the second object obtained by the second thread block, where the third task execution instruction includes the second object; the method further comprises the following steps: and the first processing equipment sends a third task execution response to the second processing equipment, wherein the third task execution response comprises the filtered plurality of object subsets.
In combination with any embodiment provided by the present disclosure, the method further comprises: the first processing device synchronizes data among the plurality of first thread blocks and/or inside the second thread block by means of a memory fence.
In combination with any embodiment provided by the present disclosure, the method further comprises: the first processing device establishes a memory fence; the first processing device performs accumulation counting once when detecting that the plurality of first thread blocks finish write operation on the memory each time, wherein the write operation is used for writing a first object obtained by the first thread blocks into the memory; the second thread block of the first processing device reads the plurality of first objects from the memory if the count value reaches the number of the plurality of first thread blocks.
In connection with any embodiment provided by the disclosure, the first processing device includes a GPU and the second processing device includes a CPU.
According to an aspect of the present disclosure, there is provided an information processing method, the method including: a second processing device obtains an object set, wherein the object set includes a plurality of object subsets, and each object subset includes a plurality of objects; sending a first task execution instruction to a first processing device, the first task execution instruction including the plurality of object subsets; and receiving a first task execution response sent by the first processing device, wherein the first task execution response includes processing results of the first processing device on the plurality of object subsets through a plurality of thread blocks.
In connection with any embodiment provided by the present disclosure, each object included in the subset of objects is assigned to a separate thread in the thread block for processing.
In connection with any embodiment provided by the disclosure, the first task execution response includes a first object that each thread block of a plurality of first thread blocks of the first processing device derives from a corresponding subset of objects.
In combination with any embodiment provided by the present disclosure, the method further comprises: sending a second task execution instruction to the first processing device, where the second task execution instruction includes a plurality of first objects obtained by the plurality of first thread blocks; and receiving a second task execution response sent by the first processing device, wherein the second task execution response comprises the second object obtained by a second thread block of the first processing device.
In combination with any embodiment provided by the present disclosure, the method further comprises: sending a third task execution instruction to the first processing device, where the third task execution instruction includes the second object obtained by the second thread block; and receiving a third task execution response sent by the first processing device, wherein the third task execution response comprises the filtered plurality of object subsets.
In connection with any embodiment provided by the present disclosure, the first task execution instruction is configured to instruct the first processing device to perform data synchronization within a thread block and/or between thread blocks by means of a memory fence; the first task execution response includes the updated plurality of object subsets obtained by filtering the plurality of object subsets by the first processing device.
According to an aspect of the present disclosure, there is provided an information processing apparatus, the apparatus including: an obtaining unit, configured to enable a plurality of first thread blocks of a first processing device to respectively obtain corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets, where the plurality of object subsets corresponding to the plurality of first thread blocks are included in the object set; the processing unit is used for enabling a second thread block of the first processing device to process a plurality of first objects obtained by the plurality of first thread blocks to obtain a second object in the plurality of first objects; a filtering unit, configured to enable the first thread blocks to filter, based on the second object obtained by the second thread block, a plurality of objects in corresponding object subsets, respectively, so as to obtain updated corresponding object subsets, where the updated object subsets corresponding to the first thread blocks are included in the updated object set.
In connection with any of the embodiments provided by this disclosure, each object included in the subset of objects is processed by a separate thread in the first thread block.
In combination with any embodiment provided by the present disclosure, the filtering unit is specifically configured to: the first thread block determining a weight coefficient for each object of a plurality of objects contained in a corresponding subset of objects based on the second object; the first thread block performs filtering processing on the plurality of objects based on a weight coefficient of each of the plurality of objects.
In combination with any one of the embodiments provided by the present disclosure, the obtaining unit is specifically configured to: in response to receiving a first task execution instruction sent by a second processing device, the plurality of first thread blocks respectively obtain a first object in a corresponding object subset by processing a plurality of objects included in the corresponding object subset, wherein the first task execution instruction includes the plurality of object subsets; the apparatus further includes a first sending unit, configured to enable the first processing device to send a first task execution response to the second processing device, where the first task execution response includes the first objects obtained by the plurality of first thread blocks, respectively.
In combination with any one of the embodiments provided by the present disclosure, the processing unit is specifically configured to: in response to receiving a second task execution instruction sent by the second processing device, the second thread block obtains a second object in the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks, where the second task execution instruction includes the plurality of first objects; the apparatus further includes a second sending unit, configured to send, by the first processing device, a second task execution response to the second processing device, where the second task execution response includes the second object obtained by the second thread block.
In combination with any embodiment provided by the present disclosure, the filtering unit is specifically configured to: in response to receiving a third task execution instruction sent by the second processing device, the multiple first thread blocks respectively perform filtering processing on multiple objects in corresponding object subsets based on the second object obtained by the second thread block, where the third task execution instruction includes the second object; the apparatus further includes a third sending unit, configured to send, by the first processing device, a third task execution response to the second processing device, where the third task execution response includes the filtered subset of the plurality of objects.
In combination with any one of the embodiments provided by the present disclosure, the apparatus further includes a first synchronization unit, configured to perform, by the first processing device, data synchronization between the plurality of first thread blocks and/or inside the second thread block in a memory fence manner.
In combination with any one of the embodiments provided by the present disclosure, the apparatus further includes a second synchronization unit, configured to establish a memory fence by the first processing device; the first processing device performs accumulation counting once when detecting that the plurality of first thread blocks finish write operation on the memory each time, wherein the write operation is used for writing a first object obtained by the first thread blocks into the memory; the second thread block of the first processing device reads the plurality of first objects from the memory if the count value reaches the number of the plurality of first thread blocks.
In connection with any embodiment provided by the disclosure, the first processing device includes a GPU and the second processing device includes a CPU.
According to an aspect of the present disclosure, there is provided an information processing apparatus, the apparatus including: an acquisition unit configured to cause a second processing device to acquire an object set including a plurality of object subsets each including a plurality of objects; a sending unit, configured to send a first task execution instruction to a first processing device, where the first task execution instruction includes the plurality of object subsets; a receiving unit, configured to receive a first task execution response sent by the first processing device, where the first task execution response includes a processing result of the first processing device on the plurality of object subsets through a plurality of thread blocks.
In connection with any embodiment provided by the present disclosure, each object included in the subset of objects is assigned to a separate thread in the thread block for processing.
In connection with any embodiment provided by the disclosure, the first task execution response includes a first object that each thread block of a plurality of first thread blocks of the first processing device derives from a corresponding subset of objects.
In combination with any embodiment provided by the present disclosure, the apparatus further includes a second sending unit, configured to send a second task execution instruction to the first processing device, where the second task execution instruction includes a plurality of first objects obtained by the plurality of first thread blocks; and receiving a second task execution response sent by the first processing device, wherein the second task execution response comprises the second object obtained by a second thread block of the first processing device.
In combination with any embodiment provided by the present disclosure, the apparatus further includes a third sending unit, configured to send a third task execution instruction to the first processing device, where the third task execution instruction includes the second object obtained by the second thread block; receiving a third task execution response sent by the first processing device, wherein the third task execution response comprises the filtered plurality of object subsets.
In connection with any embodiment provided by the present disclosure, the first task execution instruction is configured to instruct the first processing device to perform data synchronization within a thread block and/or between thread blocks by means of a memory fence; the first task execution response includes the updated plurality of object subsets obtained by filtering the plurality of object subsets by the first processing device.
According to an aspect of the present disclosure, there is provided an information processing apparatus including a memory for storing computer instructions executable on a processor, and the processor for executing the computer instructions to implement the information processing method according to any one of the embodiments of the present disclosure.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a first processing device, implements the information processing method according to any one of the embodiments of the present disclosure, and which, when executed by a second processing device, implements the information processing method according to any one of the embodiments of the present disclosure.
According to an aspect of the present disclosure, there is provided a computer system including a first processing device and a second processing device, wherein the first processing device and the second processing device are the information processing apparatus according to any one of the embodiments of the present disclosure; alternatively, the first processing device and the second processing device are processors that implement the information processing method according to any one of the embodiments of the present disclosure.
In an information processing method, an apparatus, and a device provided in any embodiment of the present disclosure, a plurality of first thread blocks of a first processing device respectively obtain corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets; a second thread block of the first processing device processes a plurality of first objects obtained by the plurality of first thread blocks to obtain a second object in the plurality of first objects; the first thread blocks respectively filter a plurality of objects in the corresponding object subset based on the second object obtained by the second thread block to obtain the updated corresponding object subset, so that the object set is updated in a parallel manner, and the processing efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in one or more embodiments of the present specification or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some of the embodiments described in this specification, and other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an information processing method according to at least one embodiment of the present disclosure;
fig. 2A is a schematic diagram of a process of obtaining a second object in an information processing method according to at least one embodiment of the present disclosure;
fig. 2B is a schematic process diagram of filtering a plurality of objects in a subset of objects in an information processing method according to at least one embodiment of the present disclosure;
fig. 3 is a flowchart of another information processing method according to at least one embodiment of the present disclosure;
fig. 4 is a flowchart of another information processing method according to at least one embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an information processing apparatus according to at least one embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of another information processing apparatus according to at least one embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in one or more embodiments of the present disclosure, the technical solutions in one or more embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in one or more embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all embodiments. All other embodiments that can be derived by one of ordinary skill in the art from one or more embodiments of the disclosure without making any creative effort shall fall within the scope of protection of the disclosure.
Fig. 1 is a flowchart of an information processing method according to at least one embodiment of the present disclosure. As shown in fig. 1, the method includes steps 101 to 103.
In step 101, a plurality of first thread blocks of a first processing device respectively obtain corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets, where the plurality of object subsets corresponding to the plurality of first thread blocks are included in the object set.
In the embodiment of the present disclosure, the first processing device may include a graphics processing unit (GPU) or other processors with parallel processing capability, such as various types of AI processors, but the embodiment of the present disclosure is not limited thereto.
In the first processing device, the smallest logical unit is a thread; a plurality of threads form a thread block (block), and a thread block is loaded onto a hardware resource of the first processing device to run.
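As an illustration of this hierarchy (a generic CUDA sketch, not code from this disclosure), each thread can identify its thread block through blockIdx and its position inside the block through threadIdx, so one thread can be mapped to one object of a per-block subset:

```cuda
// Generic sketch of the thread / thread-block mapping assumed in the following description:
// the grid holds one thread block per object subset, and each thread in a block handles
// one object of that subset. All names are illustrative assumptions.
__global__ void mapThreadsToObjects(int* blockOfObject, int numObjects) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global object index for this thread
    if (idx >= numObjects) return;
    blockOfObject[idx] = blockIdx.x;                  // record which thread block owns this object
}
```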
In a disclosed embodiment, the set of objects includes a plurality of processing objects of the information processing task, the types of which may vary based on different scenarios and information processing tasks. For example, in a scene in which an image is subjected to target detection by a neural network model, the object set may be a plurality of candidate frames obtained by the neural network model, that is, the object set is a candidate frame set, and accordingly, the information processing task is configured to perform non-maximum suppression processing on the candidate frame set.
According to the number of objects included in the object set, the object set may be divided into a plurality of object subsets, each object subset including a plurality of objects. The specific division manner may be determined based on different requirements, for example, even allocation or allocation based on a specific policy, and the division manner may be specified by a user or by a preset rule, which is not limited in this disclosure. The plurality of object subsets contained in the object set are allocated to different thread blocks of the first processing device, so that each thread block respectively obtains a corresponding object subset in the object set and can process it. By processing the plurality of objects included in the corresponding object subset, a thread block can obtain a first object in that object subset. The first object may be one or more objects selected from the plurality of objects included in the object subset according to the information processing task; for example, the first object is an object whose parameter value in the object subset satisfies a specific condition (for example, maximum or minimum), or the thread block may determine a score of each object in the object subset based on a certain rule, for example, the score of the object is the intersection-over-union (IoU) ratio between the object and a reference object, and then determine the object with the largest score as the first object. In this step, for ease of description, the thread blocks that acquire the object subsets are referred to as first thread blocks.
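One possible realization of this step is a shared-memory reduction in which every first thread block keeps only its highest-scoring object. The sketch below is an illustration under stated assumptions (scores laid out as one flat array, the subset size equal to the block size and a power of two), not code taken from the disclosure.

```cuda
// Sketch: each first thread block reduces its object subset to the index of its
// highest-scoring object (the "first object"). Assumes blockDim.x equals the subset
// size and is a power of two; all names and the array layout are assumptions.
__global__ void blockArgmaxKernel(const float* scores, int numObjects, int* blockBestIdx) {
    extern __shared__ float sharedScore[];            // blockDim.x floats ...
    int* sharedIdx = (int*)&sharedScore[blockDim.x];  // ... followed by blockDim.x ints

    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;   // one object per thread
    bool valid = globalIdx < numObjects;
    sharedScore[threadIdx.x] = valid ? scores[globalIdx] : -1e30f;
    sharedIdx[threadIdx.x] = valid ? globalIdx : -1;
    __syncthreads();

    // Tree reduction inside the block: keep the larger score at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride &&
            sharedScore[threadIdx.x + stride] > sharedScore[threadIdx.x]) {
            sharedScore[threadIdx.x] = sharedScore[threadIdx.x + stride];
            sharedIdx[threadIdx.x]   = sharedIdx[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) blockBestIdx[blockIdx.x] = sharedIdx[0];  // this block's first object
}
```

With the worked example used later (4 subsets of 4 candidate frames), a launch such as blockArgmaxKernel<<<4, 4, 4 * (sizeof(float) + sizeof(int))>>>(d_scores, 16, d_blockBestIdx) would assign one subset to each first thread block; all of these values are illustrative.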
In step 102, the second thread block of the first processing device obtains a second object of the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks.
The second object may be obtained by processing the plurality of first objects obtained by the respective first thread blocks. The second object may be one or more objects selected from a plurality of first objects according to the information processing task, for example, the second object is an object satisfying a specific condition among the plurality of first objects, for example, a certain parameter value satisfies a specific condition, or a result obtained by calculating the parameter value satisfies a specific condition, and as an example, the second object may be an object having a largest score among the plurality of first objects, but the embodiment of the present disclosure is not limited thereto.
In step 103, the first thread blocks filter the objects in the corresponding object subsets based on the second object obtained by the second thread block, so as to obtain the updated corresponding object subsets, where the updated object subsets corresponding to the first thread blocks are included in the updated object set.
And each first thread block in the plurality of first thread blocks acquires the second object, and performs filtering processing on a plurality of objects in an object subset corresponding to the first thread block according to the second object to obtain an updated object subset. And filtering the object subset corresponding to each first thread block to obtain an updated object subset, so that the object set is also updated. As an example, a parameter value or a specific calculation result of each object in the object subset may be compared with a parameter value or a specific calculation result of the second object, and the object is determined to be retained or filtered according to a relationship between the parameter values, so as to implement filtering on the object subset. As another example, a parameter value or a specific calculation result of each object in the object subset may be updated based on the second object to obtain an update result, and the object may be determined to be retained or filtered based on the update result and a preset threshold, but the embodiment of the present disclosure is not limited thereto.
In the embodiment of the present disclosure, a plurality of first thread blocks of a first processing device respectively obtain corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets; a second thread block of the first processing device processes a plurality of first objects obtained by the plurality of first thread blocks to obtain a second object in the plurality of first objects; the first thread blocks respectively filter a plurality of objects in the corresponding object subset based on the second object obtained by the second thread block to obtain the updated corresponding object subset, so that the object set is updated in a parallel manner, and the processing efficiency is improved.
The information processing method provided by at least one embodiment of the present disclosure is described below by taking an example in which a GPU performs non-maximum suppression processing on a candidate frame set obtained in a target detection process. It will be appreciated by those skilled in the art that the method may also be applied to other scenarios, not just those described below.
During the target detection process, a series of candidate frames is generated, each containing location parameters and a confidence score. From the generated series of candidate frames, a candidate frame set (object set) may be obtained, which may be part or all of the generated candidate frames. According to the number of candidate frames contained in the candidate frame set, the candidate frame set is divided into a plurality of candidate frame subsets (object subsets), wherein each candidate frame subset comprises a plurality of candidate frames (objects).
Taking the example that the candidate frame set includes 4 candidate frame subsets, each candidate frame subset includes 4 candidate frames, referring to the processing procedure shown in fig. 2A, the 4 first thread blocks 201 in the GPU respectively obtain one candidate frame subset (object subset) 211, and each first thread block 201 can obtain the candidate frame with the highest confidence score in the candidate frame subset, that is, the first object 212, by comparing the confidence scores of the 4 candidate frames included in the obtained candidate frame subset 211.
The second thread block 202 obtains the first objects 212 in the 4 first thread blocks 201, and compares the confidence scores of the obtained 4 first objects 212 to obtain the candidate box with the highest confidence score, i.e. the second object 213. The second object 213 is the frame candidate with the highest confidence in the entire set of frame candidates.
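A corresponding sketch for this second step is a single thread block that reduces the per-block winners to the global winner; again the layout and names are assumptions, not taken from the disclosure.

```cuda
// Sketch: a single "second thread block" compares the first objects produced by the
// first thread blocks and keeps the highest-scoring one (the "second object").
// Assumes one thread per first object and a power-of-two blockDim.x; names are assumptions.
__global__ void globalArgmaxKernel(const float* scores, const int* blockBestIdx,
                                   int numFirstBlocks, int* globalBestIdx) {
    extern __shared__ float sharedScore[];
    int* sharedIdx = (int*)&sharedScore[blockDim.x];

    bool valid = threadIdx.x < numFirstBlocks;
    int candidate = valid ? blockBestIdx[threadIdx.x] : -1;   // one first object per thread
    sharedScore[threadIdx.x] = valid ? scores[candidate] : -1e30f;
    sharedIdx[threadIdx.x] = candidate;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride &&
            sharedScore[threadIdx.x + stride] > sharedScore[threadIdx.x]) {
            sharedScore[threadIdx.x] = sharedScore[threadIdx.x + stride];
            sharedIdx[threadIdx.x]   = sharedIdx[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) *globalBestIdx = sharedIdx[0];      // the second object's index
}
```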
Referring to fig. 2B, in the next processing procedure, the 4 first thread blocks 201 respectively obtain the second objects 213, and respectively perform filtering processing on the respective candidate frame subsets 211 based on the second objects 213 to obtain updated candidate frame subsets 214, where the updated candidate frame subsets 214 are included in the updated candidate frame sets, that is, filtering processing is performed on each candidate frame subset, so that filtering processing on the entire candidate frame set is also implemented.
In some embodiments, all of the plurality of objects included in the object subset are processed by the same thread in the first thread block; in this case, the first thread block may optionally include a single thread. Alternatively, some of the plurality of objects in the object subset are processed by the same thread in the first thread block; in this case, optionally, the two or more objects handled by the same thread may be processed in a certain order, while processing between different threads is performed in parallel, but the embodiment of the present disclosure is not limited thereto.
In some embodiments, each object included in the subset of objects is processed by a separate thread in the first thread block.
The objects contained in the object subset are distributed to different threads in the first thread block, so that each thread obtains one object, each object can be processed by one independent thread, the parallel processing process of obtaining the first object is realized, and the information processing efficiency is improved.
In some embodiments, the filtering of the plurality of objects in the subset of objects may be accomplished in the following manner.
First, the first thread block determines a weight coefficient for each of a plurality of objects included in a corresponding subset of objects based on the second object.
The second object is allocated to each thread of the first thread block so that each thread obtains the second object. Each thread has obtained an object in the object subset during a previous processing procedure, and a weight coefficient of that object may be determined according to a parameter value of the object and a parameter value of the second object, where the weight coefficient may be a function output smaller than 1 obtained by using the parameter value of the object and the parameter value of the second object as inputs of a weight function. Still taking the non-maximum suppression processing on the candidate frame set as an example, the intersection-over-union (IoU) between the globally highest-scoring candidate frame (the second object) and the candidate frame (object) held by a thread in the first thread block is first determined, and the obtained IoU is taken as the input of the weight function to obtain the weight coefficient corresponding to that candidate frame.
Next, the first thread block performs filtering processing on the plurality of objects based on the weight coefficient of each of the plurality of objects.
In one example, the weight coefficient of the object may be multiplied by a parameter value of the object to obtain an updated parameter value, and whether to retain the object is determined by checking whether the updated parameter value satisfies a set condition. For example, after the weight coefficient corresponding to a candidate frame is obtained, it may be multiplied by the confidence score of the candidate frame to obtain an updated confidence score. In the process shown in fig. 2B, the gray-filled circles contained in the object subset 215 represent candidate boxes whose confidence scores have been updated. In the case that the updated confidence score is less than a preset retention threshold, the candidate box is not retained, i.e., it does not participate in subsequent calculations; in fig. 2B, the circles with slashes represent candidate boxes that are not retained.
In the embodiment of the disclosure, the weight coefficient of each object is determined according to the second object, and the plurality of objects in the object subset are filtered according to the weight coefficient, so that the recall rate of the target object is improved.
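Under the interpretation above, in which an IoU-driven weight rescales each confidence score and low scores are dropped, the per-thread filtering step might look like the following sketch. The linear weight (1 minus IoU), the flat box layout and every name here are assumptions for illustration; the disclosure itself only requires some weight function with output smaller than 1.

```cuda
// Sketch: each thread rescales the confidence score of its own candidate box by a
// weight derived from the box's IoU with the second object (the globally best box),
// then marks boxes that fall below the retention threshold as filtered.
// Box layout (x1, y1, x2, y2 per box), the linear weight and all names are assumptions.
__device__ float boxIoU(const float* a, const float* b) {
    float ix1 = fmaxf(a[0], b[0]), iy1 = fmaxf(a[1], b[1]);
    float ix2 = fminf(a[2], b[2]), iy2 = fminf(a[3], b[3]);
    float inter = fmaxf(0.f, ix2 - ix1) * fmaxf(0.f, iy2 - iy1);
    float areaA = (a[2] - a[0]) * (a[3] - a[1]);
    float areaB = (b[2] - b[0]) * (b[3] - b[1]);
    return inter / (areaA + areaB - inter);
}

__global__ void filterKernel(const float* boxes, float* scores, int* keep,
                             int bestIdx, int numObjects, float keepThreshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;               // one object per thread
    if (idx >= numObjects || idx == bestIdx || keep[idx] == 0) return;
    float w = 1.f - boxIoU(&boxes[4 * idx], &boxes[4 * bestIdx]);  // weight coefficient < 1
    scores[idx] *= w;                                              // updated confidence score
    if (scores[idx] < keepThreshold) keep[idx] = 0;                // filtered out of its subset
}
```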
The information processing method provided by the embodiments of the present disclosure can be divided into two parts as a whole: obtaining the second object and updating the object set, where obtaining the second object is implemented by executing steps 101 to 102, and updating the object set is implemented by executing step 103. The solving of the second object and the updating of the object set are repeated until all objects to be processed have been covered by the traversal task.
When the information processing method provided by the embodiment of the disclosure is implemented by the GPU, since the operations between the blocks of the GPU are asynchronous, the problem of how to implement data synchronization between the blocks when updating data needs to be considered.
In some embodiments, the first processing device updates and synchronizes data by receiving a call from a second processing device and sending a processing result to the second processing device.
In the embodiment of the present disclosure, the second processing device may include a Central Processing Unit (CPU) or other processors having similar processing capabilities.
Fig. 3 shows a flowchart of an information processing method according to at least one embodiment of the present disclosure. As shown in fig. 3, the method may include:
in step 301, a second processing device obtains a set of objects.
Wherein the set of objects comprises a plurality of subsets of objects, each subset of objects comprising a plurality of objects.
Step 302, the second processing device sends a first task execution instruction to the first processing device.
Wherein the first task execution instruction includes the plurality of subsets of objects.
In one example, the first task execution instruction is used for the CPU to invoke a GPU kernel, assign the plurality of object subsets to the plurality of first thread blocks of the first processing device, respectively, and assign the plurality of objects included in each object subset to the plurality of threads in the corresponding first thread block, respectively, such that each thread in each first thread block obtains one object.
Step 303, in response to receiving a first task execution instruction sent by a second processing device, the multiple first thread blocks respectively obtain a first object in the corresponding object subset by processing multiple objects included in the corresponding object subset.
In one example, each thread in each first thread block of the first processing device obtains one object in the corresponding object subset, and the first object in that object subset is obtained by processing the objects held by the respective threads.
Step 304, the first processing device sends a first task execution response to the second processing device, where the first task execution response includes the first objects obtained by the plurality of first thread blocks, respectively.
Through steps 302 to 304, the second processing device completes one call of the first processing device, completes processing on a plurality of object subsets by the first processing device, and obtains a processing result, namely, the first object.
Step 305, the second processing device sends a second task execution instruction to the first processing device.
Wherein the second task execution instruction includes a plurality of the first objects obtained by the plurality of first thread blocks.
In one example, the second task execution instruction is used for the CPU to invoke the GPU kernel again and to distribute the plurality of first objects received by the second processing device to the second thread block of the first processing device.
Step 306, in response to receiving a second task execution instruction sent by the second processing device, the second thread block obtains a second object of the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks.
In one example, each thread in the second thread block of the first processing device obtains a first object, and the second object among the plurality of first objects is obtained by processing the first objects held by the respective threads.
Step 307, the first processing device sends a second task execution response to the second processing device, where the second task execution response includes the second object obtained by the second thread block.
Through steps 305 to 307, the second processing device completes the call of the first processing device again, completes the processing for the plurality of first objects with the first processing device, and obtains a processing result, i.e., the second object.
Through the above steps, the second object is obtained. For non-maximum suppression in the target detection task, the candidate box with the maximum global confidence is obtained. Next, the process of updating is entered.
Step 308, the second processing device sends a third task execution instruction to the first processing device.
Wherein the third task execution instruction comprises the second object obtained by the second thread block.
In one example, the third task execution instruction is used for the CPU to invoke the GPU kernel a third time and to allocate the second object received by the second processing device to the plurality of first thread blocks of the first processing device and to each thread of each first thread block.
Step 309, in response to receiving a third task execution instruction sent by the second processing device, the multiple first thread blocks respectively perform filtering processing on multiple objects in the corresponding object subset based on the second object obtained by the second thread block.
In one example, each thread of each first thread block of the first processing device obtains the second object, and the first thread block performs filtering processing on the objects of the threads according to the second object and obtains a plurality of filtered object subsets.
Step 310, the first processing device sends a third task execution response to the second processing device, where the third task execution response includes the filtered subset of the plurality of objects.
Through steps 308 to 310, the second processing device completes the third call of the first processing device, completes the filtering of the plurality of object subsets by using the first processing device, and obtains the filtered plurality of object subsets, that is, completes the updating of the object set.
In the embodiment of the present disclosure, the second processing device calls the first processing device multiple times, performing data allocation and receiving the processing results of the first processing device on the data in each calling process, so that the updated data can be allocated to the first processing device and the first processing device can perform parallel processing based on the updated data, thereby improving processing efficiency and achieving data synchronization during the processing performed by the first processing device.
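A host-side sketch of this call sequence is given below. The three kernel prototypes correspond to the per-block reduction, the global reduction and the filtering step sketched earlier; every name, signature and launch configuration is an assumption, and the copying back of the full task execution responses is reduced to the single index that the third call needs.

```cuda
// Sketch of the CPU-driven variant: one iteration issues three kernel calls, with the
// second object's index read back between the second and third call. The prototypes
// stand in for the kernels sketched above; all names and signatures are assumptions.
#include <cuda_runtime.h>

__global__ void blockArgmaxKernel(const float* scores, int numObjects, int* blockBestIdx);
__global__ void globalArgmaxKernel(const float* scores, const int* blockBestIdx,
                                   int numFirstBlocks, int* globalBestIdx);
__global__ void filterKernel(const float* boxes, float* scores, int* keep,
                             int bestIdx, int numObjects, float keepThreshold);

void nmsIteration(const float* d_boxes, float* d_scores, int* d_keep,
                  int* d_blockBestIdx, int* d_globalBestIdx,
                  int numSubsets, int objectsPerSubset, float keepThreshold) {
    int numObjects = numSubsets * objectsPerSubset;
    size_t shMem1 = objectsPerSubset * (sizeof(float) + sizeof(int));
    size_t shMem2 = numSubsets * (sizeof(float) + sizeof(int));

    // First task execution instruction: every first thread block finds its first object.
    blockArgmaxKernel<<<numSubsets, objectsPerSubset, shMem1>>>(d_scores, numObjects, d_blockBestIdx);
    cudaDeviceSynchronize();                                  // first task execution response ready

    // Second task execution instruction: one second thread block finds the second object.
    globalArgmaxKernel<<<1, numSubsets, shMem2>>>(d_scores, d_blockBestIdx, numSubsets, d_globalBestIdx);
    int bestIdx = -1;
    cudaMemcpy(&bestIdx, d_globalBestIdx, sizeof(int), cudaMemcpyDeviceToHost);  // second response

    // Third task execution instruction: every first thread block filters its own subset.
    filterKernel<<<numSubsets, objectsPerSubset>>>(d_boxes, d_scores, d_keep,
                                                   bestIdx, numObjects, keepThreshold);
    cudaDeviceSynchronize();                                  // third task execution response ready
}
```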
In some embodiments, the first processing device synchronizes data between the plurality of first thread blocks and/or within the second thread block by means of a memory fence.
Taking the processing procedure in step 101 as an example, the plurality of first thread blocks of the first processing device respectively process the plurality of objects included in the corresponding object subsets to obtain the first object in each object subset. Since the operations of the thread blocks are not visible to each other, it cannot be ensured that the next operation is performed only after the processing of all thread blocks is completed, i.e., synchronization across thread blocks cannot be guaranteed.
In order to solve the above problem, at least one embodiment of the present disclosure proposes a data synchronization method. As shown in fig. 4, the method may include:
in step 401, the second processing device obtains a set of objects.
Wherein the set of objects comprises a plurality of subsets of objects, each subset of objects comprising a plurality of objects.
Step 402, the second processing device sends a first task execution instruction to the first processing device.
Wherein the first task execution instruction includes the plurality of subsets of objects.
In one example, the first task execution instruction is used for the CPU to invoke a GPU kernel, assign the plurality of object subsets to the plurality of first thread blocks of the first processing device, respectively, and assign the plurality of objects included in each object subset to the plurality of threads in the corresponding first thread block, respectively, such that each thread in each first thread block obtains one object.
Step 403, in response to receiving a first task execution instruction sent by a second processing device, the multiple first thread blocks respectively obtain a first object in the corresponding object subset by processing multiple objects included in the corresponding object subset.
In one example, each thread in each first thread block of the first processing device obtains one object in the corresponding object subset, and the first object in that object subset is obtained by processing the objects held by the respective threads.
In step 404, the first processing device establishes a memory fence.
By establishing a memory fence, the results of the write operation of the current thread to the global memory may be made visible to the threads in the remaining thread blocks.
Step 405, the first processing device performs an accumulation count each time it detects that the plurality of first thread blocks complete a write operation to the memory, where the write operation is used to write the first object obtained by the first thread block into the memory.
The accumulated count may be implemented by an atomic addition operation, so that the completion condition of writing the first object into the memory by each first thread block may be monitored.
Step 406, in a case that the count value reaches the number of the first thread blocks, a second thread block of the first processing device reads the first objects from the memory.
Under the condition that the first objects are ensured to be written into the memory by all the first thread blocks, the second thread blocks read the first objects from the memory, and data synchronization among the first thread blocks is realized.
Step 407, the second thread block obtains a second object by processing the plurality of first objects obtained by the plurality of first thread blocks. For a specific process of obtaining the second object, please refer to the description of step 102, which is not described herein again.
For the second object obtained in step 407, it is also necessary to ensure that the second thread block performs reading after completing the operation of writing the second object into the memory.
In step 408, the first processing device establishes a memory fence.
In step 409, the first processing device performs an accumulation count each time it detects that the second thread block completes a write operation to the memory, where the write operation is used to write the second object obtained by the second thread block into the memory.
In step 410, when the count value reaches the number of the second thread blocks, the plurality of first thread blocks of the first processing device reads the plurality of second objects from the memory.
The implementation process of steps 408-410 is similar to steps 404-406, and for the specific process, refer to the description of steps 404-406.
And under the condition that all threads of the second thread blocks are ensured to finish writing the second objects into the memory, the plurality of first thread blocks read the plurality of second objects from the memory, and the data synchronization in the second thread blocks is realized.
In step 411, the plurality of first thread blocks respectively filter the plurality of objects in the corresponding object subset based on the second object obtained by the second thread block.
When the second object needs to be calculated again after the object subsets have been updated, establishing a memory fence can likewise ensure that all the object subsets have been updated, so as to implement data synchronization between the multiple first thread blocks.
In the embodiment of the present disclosure, the first processing device performs data synchronization between the plurality of first thread blocks and/or inside the second thread block in a memory fence manner, so that the processing efficiency is improved, and the resource consumption of interaction between the first processing device and the second processing device is reduced.
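A common CUDA realization of this fence-and-count hand-off is sketched below: every first thread block makes its write visible with __threadfence() and increments a global counter atomically, and the block that observes the counter reaching the number of first thread blocks takes over the role of the second thread block. This is a generic pattern with assumed names and a fused single-kernel layout, not code from the disclosure.

```cuda
// Sketch of inter-block synchronization with a memory fence and an accumulation count.
// Each first thread block writes its first object, publishes the write with
// __threadfence(), and increments a device-wide counter; the block whose increment
// brings the count to gridDim.x knows that all writes are visible and performs the
// second-object reduction. All names and the fused layout are assumptions.
__device__ unsigned int blocksDone = 0;

__global__ void fusedNmsKernel(const float* scores, int numObjects,
                               int* blockBestIdx, int* globalBestIdx) {
    // ... each first thread block reduces its subset and writes blockBestIdx[blockIdx.x] ...

    __shared__ bool isLastBlock;
    if (threadIdx.x == 0) {
        __threadfence();                                 // make this block's write visible globally
        unsigned int done = atomicAdd(&blocksDone, 1u);  // accumulation count of finished blocks
        isLastBlock = (done == gridDim.x - 1);           // count reached the number of first blocks
    }
    __syncthreads();

    if (isLastBlock) {
        // This block now acts as the "second thread block": it can safely read every
        // entry of blockBestIdx and reduce them to the second object.
        // ... reduction over blockBestIdx[0 .. gridDim.x - 1] into *globalBestIdx ...
        if (threadIdx.x == 0) blocksDone = 0;            // reset the counter for the next round
    }
}
```

Applied in the opposite direction, the same fence-and-count idea lets the first thread blocks read the second object only after it has been completely written, which is what steps 408 to 410 describe.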
In some embodiments, the second processing device may set at least one variable for each object in the object set, and the update status is monitored by changing the value of the variable during the update process.
In one example, a reservation (keep) variable is set for each object in the object set to indicate whether the object is reserved in the updating process, for example, a variable value of 1 indicates that the object is reserved, and a variable value of 0 indicates that the object is not reserved (filtered out), that is, the object is not involved in subsequent calculation. The initial value of the reserved variable may be set to 1. An ordering (order) variable may also be set to indicate the ordering of the objects in the object set, and an initial value of the ordering variable for each object may be set to 0. It will be appreciated by those skilled in the art that the above arrangement of variable values is merely an example and may be arranged in other ways, and the disclosure is not limited thereto.
In the information processing method according to at least one embodiment of the present disclosure, during the processing of the object set, since the first object is one of the objects obtained from the object subset corresponding to each first thread block, and the second object is one of the objects obtained from the first objects of the respective first thread blocks, the determination of the second object is performed by comparing the objects in each object subset and then comparing the first objects. Accordingly, each object participating in the comparison may be sorted according to a certain parameter value of the object or a score determined based on a certain rule, and a value different from 0 is assigned to its sorting variable.
After the second object is determined, a stage of filtering the plurality of objects in each subset of objects is entered. For an object determined to be reserved, the value of a reserved variable of the object may be set to 1; for an object determined not to be reserved, the value of the reserved variable for that object may be set to 0. And updating the object subset is realized by performing retention or filtering processing on each object in the object subset.
After the update of the object subsets is completed, if every object to be processed has either been assigned a non-zero value of its sorting variable or has its reserved variable set to 0, the processing ends; otherwise, the process of obtaining the second object and updating the object subsets is executed again until this condition is met.
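A sketch of how such bookkeeping might look on the host side is given below; the encodings (1 for retained, 0 for filtered or not yet ranked) follow the example values above, while the function names and layout are assumptions.

```cuda
// Sketch of the per-object "reserved" (keep) and "sorting" (order) bookkeeping and of
// the termination test described above. All names and the host-side layout are
// illustrative assumptions; the per-pass GPU work is represented only by comments.
#include <vector>

bool allObjectsResolved(const std::vector<int>& keep, const std::vector<int>& order) {
    for (size_t i = 0; i < keep.size(); ++i)
        if (keep[i] != 0 && order[i] == 0)   // still retained but not yet ranked
            return false;                    // at least one object needs another pass
    return true;
}

int main() {
    int numObjects = 16;
    std::vector<int> keep(numObjects, 1);    // initial value 1: every object is retained
    std::vector<int> order(numObjects, 0);   // initial value 0: no object has been ranked yet
    // Each pass would assign the next non-zero rank to order[] for the second object and
    // clear keep[] for filtered objects (see the GPU sketches above), repeating until:
    bool done = allObjectsResolved(keep, order);
    return done ? 0 : 1;
}
```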
At least one embodiment of the present disclosure also provides an information processing apparatus, as shown in fig. 5, the apparatus including: an obtaining unit 501, configured to enable a plurality of first thread blocks of a first processing device to respectively obtain corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets, where the plurality of object subsets corresponding to the plurality of first thread blocks are included in the object set; a processing unit 502, configured to enable a second thread block of a first processing device to obtain a second object in a plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks; a filtering unit 503, configured to enable the first thread blocks to filter, based on the second object obtained by the second thread block, a plurality of objects in corresponding object subsets, respectively, to obtain updated corresponding object subsets, where the updated object subsets corresponding to the first thread blocks are included in the updated object set.
In some embodiments, each object included in the subset of objects is processed by a separate thread in the first thread block.
In some embodiments, the filtering unit is specifically configured to: the first thread block determining a weight coefficient for each object of a plurality of objects contained in a corresponding subset of objects based on the second object; the first thread block performs filtering processing on the plurality of objects based on a weight coefficient of each of the plurality of objects.
In some embodiments, the obtaining unit is specifically configured to: in response to receiving a first task execution instruction sent by a second processing device, the plurality of first thread blocks respectively obtain a first object in a corresponding object subset by processing a plurality of objects included in the corresponding object subset, wherein the first task execution instruction includes the plurality of object subsets; the apparatus further includes a first sending unit, configured to enable the first processing device to send a first task execution response to the second processing device, where the first task execution response includes the first objects obtained by the plurality of first thread blocks, respectively.
In some embodiments, the processing unit is specifically configured to: in response to receiving a second task execution instruction sent by the second processing device, the second thread block obtains a second object in the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks, where the second task execution instruction includes the plurality of first objects; the apparatus further includes a second sending unit, configured to send, by the first processing device, a second task execution response to the second processing device, where the second task execution response includes the second object obtained by the second thread block.
In some embodiments, the filtering unit is specifically configured to: in response to receiving a third task execution instruction sent by the second processing device, the multiple first thread blocks respectively perform filtering processing on multiple objects in corresponding object subsets based on the second object obtained by the second thread block, where the third task execution instruction includes the second object; the apparatus further includes a third sending unit, configured to send, by the first processing device, a third task execution response to the second processing device, where the third task execution response includes the filtered plurality of object subsets.
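Read together, the three units above describe one instruction/response exchange per phase. Under the GPU/CPU mapping mentioned below, such an exchange can be sketched as three kernel launches with copies of the inputs and results between the two devices. In the host-side sketch below the kernels are empty stand-ins and every name is hypothetical; only the control flow of the three exchanges is meant to be shown.

```cpp
// Illustrative host-side sketch of the three instruction/response exchanges,
// assuming the first processing device is a GPU driven by a CPU as the
// second processing device. The kernels are empty stand-ins (so the copied
// results are placeholders); only the control flow is meant.
#include <cuda_runtime.h>
#include <vector>

struct Object { float score; int reserved; };

__global__ void phase1SelectFirstObjects(const Object*, int*) {}                  // stand-in
__global__ void phase2SelectSecondObject(const Object*, const int*, int, int*) {} // stand-in
__global__ void phase3FilterSubsets(Object*, int) {}                              // stand-in

int main() {
    const int num_blocks = 8, subset = 128, n = num_blocks * subset;
    std::vector<Object> host_objs(n, Object{0.0f, 1});

    Object* d_objs; int* d_first; int* d_second;
    cudaMalloc(&d_objs, n * sizeof(Object));
    cudaMalloc(&d_first, num_blocks * sizeof(int));
    cudaMalloc(&d_second, sizeof(int));
    cudaMemset(d_first, 0, num_blocks * sizeof(int));
    cudaMemset(d_second, 0, sizeof(int));

    // First task execution instruction: ship the object subsets to the GPU and
    // ask each first thread block for its first object.
    cudaMemcpy(d_objs, host_objs.data(), n * sizeof(Object), cudaMemcpyHostToDevice);
    phase1SelectFirstObjects<<<num_blocks, subset>>>(d_objs, d_first);

    // First task execution response: read back the per-block first objects.
    std::vector<int> first(num_blocks);
    cudaMemcpy(first.data(), d_first, num_blocks * sizeof(int), cudaMemcpyDeviceToHost);

    // Second instruction/response: one thread block selects the second object.
    phase2SelectSecondObject<<<1, num_blocks>>>(d_objs, d_first, num_blocks, d_second);
    int second = -1;
    cudaMemcpy(&second, d_second, sizeof(int), cudaMemcpyDeviceToHost);

    // Third instruction/response: every first thread block filters its subset.
    phase3FilterSubsets<<<num_blocks, subset>>>(d_objs, second);
    cudaMemcpy(host_objs.data(), d_objs, n * sizeof(Object), cudaMemcpyDeviceToHost);

    cudaFree(d_objs); cudaFree(d_first); cudaFree(d_second);
    return 0;
}
```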
In some embodiments, the apparatus further includes a first synchronization unit, configured to cause the first processing device to synchronize data among the plurality of first thread blocks and/or within the second thread block by means of a memory fence.
In some embodiments, the apparatus further includes a second synchronization unit, configured to cause the first processing device to establish a memory fence; to perform an accumulation count each time it detects that one of the plurality of first thread blocks has completed a write operation to the memory, where the write operation writes the first object obtained by that first thread block into the memory; and to cause the second thread block of the first processing device to read the plurality of first objects from the memory when the count value reaches the number of the plurality of first thread blocks.
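The write-then-count handshake described here closely matches a well-known CUDA pattern: each writing block issues a memory fence before atomically bumping a global counter, and whichever block observes the counter reach the number of first thread blocks can safely read all the written results. In the sketch below, the reading role is played by the last first thread block to finish rather than by a dedicated second thread block, the per-block result is passed in precomputed, and the kernel is launched with one block per subset; these simplifications and all names are assumptions for illustration.

```cpp
// Illustrative CUDA sketch of the fence-and-counter handshake: each first
// thread block writes its first object, issues a memory fence, and bumps a
// global counter; the block that sees the counter reach the number of first
// thread blocks then reads all first objects from global memory.
#include <cuda_runtime.h>

__device__ unsigned int g_done_blocks = 0;   // accumulation counter

__global__ void writeThenReduce(const float* block_best,   // per-block first object (precomputed)
                                float* first_objects,      // one slot per first thread block
                                float* second_object,
                                int num_first_blocks) {
    if (threadIdx.x == 0) {
        // Write this block's first object to global memory.
        first_objects[blockIdx.x] = block_best[blockIdx.x];

        // Make the write visible to every other block before counting it.
        __threadfence();

        // Count one completed write; atomicAdd returns the value before the add.
        unsigned int done = atomicAdd(&g_done_blocks, 1u);

        // The last block to finish reads all first objects and picks the
        // second object (here simply the maximum value, as a stand-in).
        if (done == (unsigned int)(num_first_blocks - 1)) {
            float best = first_objects[0];
            for (int i = 1; i < num_first_blocks; ++i)
                best = fmaxf(best, first_objects[i]);
            *second_object = best;
            g_done_blocks = 0;               // reset the counter for the next round
        }
    }
}
```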
In some embodiments, the first processing device comprises a GPU and the second processing device comprises a CPU.
At least one embodiment of the present disclosure also provides an information processing apparatus, as shown in fig. 6, the apparatus including: an obtaining unit 601, configured to cause a second processing device to obtain an object set, where the object set includes a plurality of object subsets, and each object subset includes a plurality of objects; a sending unit 602, configured to send a first task execution instruction to a first processing device, where the first task execution instruction includes the plurality of object subsets; a receiving unit 603, configured to receive a first task execution response sent by the first processing device, where the first task execution response includes processing results of the first processing device on the plurality of object subsets through a plurality of thread blocks.
In some embodiments, each object included in the subset of objects is assigned to a separate thread in the thread block for processing.
In some embodiments, the first task execution response includes a first object that each of a plurality of first thread blocks of the first processing device derives from a corresponding subset of objects.
In some embodiments, the apparatus further includes a second sending unit, configured to send a second task execution instruction to the first processing device, where the second task execution instruction includes a plurality of first objects obtained by the plurality of first thread blocks, and to receive a second task execution response sent by the first processing device, where the second task execution response includes the second object obtained by a second thread block of the first processing device.
In some embodiments, the apparatus further includes a third sending unit, configured to send a third task execution instruction to the first processing device, where the third task execution instruction includes the second object obtained by the second thread block, and to receive a third task execution response sent by the first processing device, where the third task execution response includes the filtered plurality of object subsets.
In some embodiments, the first task execution instruction is to instruct the first processing device to perform data synchronization within and/or between thread blocks by way of a memory fence; the first task execution response includes the updated plurality of object subsets obtained by filtering the plurality of object subsets by the first processing device.
Fig. 7 shows an electronic device provided in at least one embodiment of the present disclosure. The electronic device includes a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to execute the computer instructions to implement the information processing method according to any embodiment of the present disclosure.
At least one embodiment of the present specification also provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the information processing method according to any one of the embodiments of the present specification.
At least one embodiment of the present specification further provides a computer system, including a first processing device and a second processing device, where the first processing device and the second processing device are the information processing apparatus according to any one of the embodiments of the present specification; alternatively, the first processing device and the second processing device are processors that implement the information processing method according to any embodiment of the present specification.
As will be appreciated by one skilled in the art, one or more embodiments of the present specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the information processing apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief, and for relevant points reference may be made to the corresponding description of the method embodiments.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description sets forth only preferred embodiments of the present disclosure and is not intended to limit the scope of the one or more embodiments of the present disclosure; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the one or more embodiments of the present disclosure shall fall within the scope of the one or more embodiments of the present disclosure.

Claims (20)

1. An information processing method, characterized in that the method comprises:
a plurality of first thread blocks of a first processing device respectively acquire corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets, wherein the plurality of object subsets corresponding to the plurality of first thread blocks are included in the object set;
a second thread block of the first processing device processes a plurality of first objects obtained by the plurality of first thread blocks to obtain a second object in the plurality of first objects;
and the first thread blocks respectively filter a plurality of objects in the corresponding object subset based on the second object obtained by the second thread block to obtain an updated corresponding object subset, wherein the updated object subsets corresponding to the first thread blocks are included in the updated object set.
2. The method of claim 1, wherein each object included in the subset of objects is processed by a separate thread in the first thread block.
3. The method according to claim 1 or 2, wherein the filtering, by the first thread blocks, of the plurality of objects in the corresponding object subset based on the second object obtained by the second thread block to obtain the updated corresponding object subset, respectively, includes:
the first thread block determining a weight coefficient for each object of a plurality of objects contained in a corresponding subset of objects based on the second object;
the first thread block performs filtering processing on the plurality of objects based on a weight coefficient of each of the plurality of objects.
4. The method according to any one of claims 1 to 3, wherein the obtaining a first object in the corresponding object subset by processing a plurality of objects included in the corresponding object subset comprises:
in response to receiving a first task execution instruction sent by a second processing device, the plurality of first thread blocks respectively obtain a first object in a corresponding object subset by processing a plurality of objects included in the corresponding object subset, wherein the first task execution instruction includes the plurality of object subsets;
the method further comprises the following steps:
and the first processing device sends a first task execution response to the second processing device, wherein the first task execution response comprises the first objects obtained by the plurality of first thread blocks respectively.
5. The method according to any one of claims 1 to 4, wherein the second thread block of the first processing device obtains the second object of the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks, and comprises:
in response to receiving a second task execution instruction sent by the second processing device, the second thread block obtains a second object in the plurality of first objects by processing the plurality of first objects obtained by the plurality of first thread blocks, where the second task execution instruction includes the plurality of first objects;
the method further comprises the following steps:
and the first processing device sends a second task execution response to the second processing device, wherein the second task execution response comprises the second object obtained by the second thread block.
6. The method according to any one of claims 1 to 5, wherein the filtering, by the first thread blocks, on the basis of the second object obtained by the second thread block, of the plurality of objects in the corresponding object subset respectively comprises:
in response to receiving a third task execution instruction sent by the second processing device, the multiple first thread blocks respectively perform filtering processing on multiple objects in corresponding object subsets based on the second object obtained by the second thread block, where the third task execution instruction includes the second object;
the method further comprises the following steps:
and the first processing device sends a third task execution response to the second processing device, wherein the third task execution response comprises the filtered plurality of object subsets.
7. The method of claim 4, further comprising:
and the first processing device synchronizes data among the plurality of first thread blocks and/or inside the second thread block in a memory fence mode.
8. The method of claim 4, further comprising:
the first processing device establishes a memory fence;
the first processing device performs accumulation counting once when detecting that the plurality of first thread blocks finish write operation on the memory each time, wherein the write operation is used for writing a first object obtained by the first thread blocks into the memory;
the second thread block of the first processing device reads the plurality of first objects from the memory if the count value reaches the number of the plurality of first thread blocks.
9. The method of any of claims 1-8, wherein the first processing device comprises a GPU and the second processing device comprises a CPU.
10. An information processing method, characterized in that the method comprises:
the method comprises the steps that a second processing device obtains an object set, wherein the object set comprises a plurality of object subsets, and each object subset comprises a plurality of objects;
sending a first task execution instruction to a first processing device, the first task execution instruction including the plurality of subsets of objects;
and receiving a first task execution response sent by the first processing device, wherein the first task execution response comprises processing results of the first processing device on the plurality of object subsets through a plurality of thread blocks.
11. The method of claim 10, wherein each object included in the subset of objects is assigned to a separate thread in the thread block for processing.
12. The method of claim 10 or 11, wherein the first task execution response comprises a first object obtained from a corresponding subset of objects for each of a plurality of first thread blocks of the first processing device.
13. The method according to any one of claims 10 to 12, further comprising:
sending a second task execution instruction to the first processing device, where the second task execution instruction includes a plurality of first objects obtained by the plurality of first thread blocks;
and receiving a second task execution response sent by the first processing device, wherein the second task execution response comprises the second object obtained by a second thread block of the first processing device.
14. The method according to any one of claims 10 to 13, further comprising:
sending a third task execution instruction to the first processing device, where the third task execution instruction includes the second object obtained by the second thread block;
receiving a third task execution response sent by the first processing device, wherein the third task execution response comprises the filtered plurality of object subsets.
15. The method according to any one of claims 10 to 14, wherein the first task execution instruction is configured to instruct the first processing device to perform data synchronization within a thread block and/or between thread blocks by means of a memory fence;
the first task execution response includes the updated plurality of object subsets obtained by filtering the plurality of object subsets by the first processing device.
16. An information processing apparatus characterized in that the apparatus comprises:
an obtaining unit, configured to enable a plurality of first thread blocks of a first processing device to respectively obtain corresponding object subsets in an object set, and obtain a first object in the corresponding object subsets by processing a plurality of objects included in the corresponding object subsets, where the plurality of object subsets corresponding to the plurality of first thread blocks are included in the object set;
the processing unit is used for enabling a second thread block of the first processing device to process a plurality of first objects obtained by the plurality of first thread blocks to obtain a second object in the plurality of first objects;
a filtering unit, configured to enable the first thread blocks to filter, based on the second object obtained by the second thread block, a plurality of objects in corresponding object subsets, respectively, so as to obtain updated corresponding object subsets, where the updated object subsets corresponding to the first thread blocks are included in the updated object set.
17. An information processing apparatus characterized in that the apparatus comprises:
an acquisition unit configured to cause a second processing device to acquire an object set including a plurality of object subsets each including a plurality of objects;
a sending unit, configured to send a first task execution instruction to a first processing device, where the first task execution instruction includes the plurality of object subsets;
a receiving unit, configured to receive a first task execution response sent by the first processing device, where the first task execution response includes a processing result of the first processing device on the plurality of object subsets through a plurality of thread blocks.
18. An information processing apparatus, comprising a processor and a memory for storing computer instructions executable on the processor, wherein the processor is configured to execute the computer instructions to implement the method of any one of claims 1 to 9 or the method of any one of claims 10 to 15.
19. A computer-readable storage medium, on which a computer program is stored, which program, when executed by a first processing device, implements the method of any of claims 1 to 9, and, when executed by a second processing device, implements the method of any of claims 10 to 15.
20. A computer system comprising a first processing device and a second processing device, wherein,
the first processing device is the information processing apparatus of claim 16, and the second processing device is the information processing apparatus of claim 17; alternatively,
the first processing device is a processor implementing the method of any one of claims 1 to 9, and the second processing device is a processor for implementing the method of any one of claims 10 to 15.
CN202010484303.0A 2020-06-01 2020-06-01 Information processing method, device and equipment Active CN111626916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484303.0A CN111626916B (en) 2020-06-01 2020-06-01 Information processing method, device and equipment

Publications (2)

Publication Number Publication Date
CN111626916A true CN111626916A (en) 2020-09-04
CN111626916B CN111626916B (en) 2024-03-22

Family

ID=72272645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484303.0A Active CN111626916B (en) 2020-06-01 2020-06-01 Information processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN111626916B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109242168A (en) * 2018-08-27 2019-01-18 北京百度网讯科技有限公司 Determine the method, apparatus, equipment and computer readable storage medium of shortest path
WO2020042126A1 (en) * 2018-08-30 2020-03-05 华为技术有限公司 Focusing apparatus, method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韦春丹; 龚奕利; 李文海: "A GPU-based parallel processing framework for moving objects" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022160229A1 (en) * 2021-01-29 2022-08-04 华为技术有限公司 Apparatus and method for processing candidate boxes by using plurality of cores

Also Published As

Publication number Publication date
CN111626916B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN106803790B (en) A kind of upgrade control method and device of group system
CN110474966B (en) Method for processing cloud platform resource fragments and related equipment
CN109960587A (en) The storage resource distribution method and device of super fusion cloud computing system
CN107026900B (en) Shooting task allocation method and device
CN109862323A (en) Playback method, device and the processing equipment of multi-channel video
CN112084017B (en) Memory management method and device, electronic equipment and storage medium
CN111245732A (en) Flow control method, device and equipment
CN109151185A (en) A kind of method and device according to vehicle driving scene matching music type
CN114281521A (en) Method, system, device and medium for optimizing communication efficiency of deep learning heterogeneous resources
CN111626916A (en) Information processing method, device and equipment
CN111078391A (en) Service request processing method, device and equipment
CN110795234A (en) Resource scheduling method and device
CN116048740A (en) Task scheduling method and system based on many-core system, electronic equipment and medium
CN110704182A (en) Deep learning resource scheduling method and device and terminal equipment
CN112882819B (en) Method and device for setting chip working frequency
CN109358961B (en) Resource scheduling method and device with storage function
CN116501927A (en) Graph data processing system, method, equipment and storage medium
CN110008382B (en) Method, system and equipment for determining TopN data
CN116107753A (en) Task node distribution method and device, electronic equipment and storage medium
CN111338803A (en) Thread processing method and device
CN116360994A (en) Scheduling method, device, server and storage medium of distributed heterogeneous resource pool
CN112685158B (en) Task scheduling method and device, electronic equipment and storage medium
CN112256420B (en) Task allocation method and device and electronic equipment
CN114298294A (en) Neural network memory optimization method and device based on hardware accelerator
CN112862228A (en) Order distribution method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant