CN117311988B - Masked reduction operation optimization method, device, equipment and medium - Google Patents


Info

Publication number
CN117311988B
CN117311988B (granted publication of application CN202311585448.XA)
Authority
CN
China
Prior art keywords
thread
mask
value
channel
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311585448.XA
Other languages
Chinese (zh)
Other versions
CN117311988A (en)
Inventor
武桓州
周洲
董兆华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Muxi Integrated Circuit Nanjing Co ltd
Original Assignee
Muxi Integrated Circuit Nanjing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Muxi Integrated Circuit Nanjing Co ltd
Priority to CN202311585448.XA
Publication of CN117311988A
Application granted
Publication of CN117311988B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a masked reduction operation optimization method, device, equipment and medium, belonging to the field of data processing. The method replaces shared-memory reads and writes with direct inter-thread data access, and represents the grouping of threads by setting individual bits of a mask, so that several thread groups can be reduced in a single call. This improves the efficiency of reduction operations and saves storage space.

Description

Masked reduction operation optimization method, device, equipment and medium
Technical Field
The embodiments of the present disclosure relate to the field of data processing, and in particular to a masked reduction operation optimization method, device, equipment and medium.
Background
Reduction is a common way to parallelize computation in GPU programming for high-performance computing, and the mainstream GPU programming models provide reduction APIs for arithmetic summation, maximum, minimum, logical AND, logical OR and logical NOT. In GPU parallel programming, a reduction algorithm typically operates on a group of threads running in parallel over shared memory, and every thread in the group obtains the same result. In the prior art, a reduction algorithm cannot complete the computation of multiple thread groups in a single call and occupies the limited shared memory, so existing reduction APIs handle grouped-thread computation poorly. How to improve parallel efficiency and save storage space in reduction operations is a problem to be solved.
Disclosure of Invention
It is an object of the present invention to provide a masked reduction operation optimization method, device, equipment and medium that at least partially solve the above problems.
According to one aspect of the present disclosure, a masked reduction operation optimization method is provided, including:
Step S1: set a mask for each thread in the warp based on the usage scenario, where the bits of the mask mark, in order, whether the corresponding thread lanes in the warp participate in the reduction operation, and pass in the data of each thread on which the reduction is to be performed;
Step S2: each thread obtains its own lane value, which corresponds to the thread's position in the mask, and obtains the head lane value of the warp it belongs to based on the mask, the head lane value being the index of the first lane marked in the mask as participating in the reduction;
Step S3: each thread performs the reduction with a stride decreasing from large to small, processing the data of the partner thread selected by the stride according to the mask before combining it with the current thread's data;
Step S4: each thread obtains the reduction result from the head lane of its warp;
Step S5: a final value is returned according to each thread's mark in the mask.
In some embodiments, the mask may mark several consecutive lanes as participating in the reduction, or mark several non-adjacent lanes as participating.
In some embodiments, the current thread accesses another thread's data through inter-thread data access.
In some embodiments, step S3 specifically includes:
Step S301: initialize the stride variable stride to half the total number of threads in the warp, then go to step S302;
Step S302: judge whether stride is greater than 0; if so, go to step S303, otherwise go to step S4;
Step S303: take dest, the sum of the current lane value and stride, as the lane to read; obtain the data value from that lane and record it as temp; then go to step S304;
Step S304: judge whether the dest obtained in step S303 is marked in the mask as participating in the reduction; if so, go to step S306, otherwise go to step S305;
Step S305: reassign the temp obtained in step S303 to a default value, then go to step S306;
Step S306: combine the current thread's data value with temp, store the result in value, then go to step S307;
Step S307: shift stride right by 1 bit, then go to step S302.
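The loop of steps S301-S307 can be sketched as a host-side simulation in Python. This is a hypothetical model, not the patented implementation: the list indexing stands in for GPU inter-thread data access (e.g. warp shuffles), the lanes are updated in lockstep as on real hardware, and all names here are illustrative.

```python
def masked_reduce(mask, values, op, default):
    """Simulate steps S1-S5 for one warp of len(values) threads.

    mask    -- bit i set means lane i participates in the reduction (S1)
    values  -- per-lane input data (S1)
    op      -- binary reduction operator
    default -- the operator's default (identity) value from step S305
    Returns per-lane results: participating lanes receive the head lane's
    result, excluded lanes keep their original input (step S5).
    """
    warp = len(values)
    head = next(i for i in range(warp) if (mask >> i) & 1)  # step S2: first set bit
    acc = list(values)
    stride = warp // 2                                      # step S301
    while stride > 0:                                       # step S302
        nxt = []
        for lane in range(warp):                            # lanes run in lockstep
            dest = lane + stride                            # step S303
            if dest < warp and (mask >> dest) & 1:          # step S304
                temp = acc[dest]
            else:
                temp = default                              # step S305: excluded lane
            nxt.append(op(acc[lane], temp))                 # step S306
        acc = nxt
        stride >>= 1                                        # step S307
    result = acc[head]                                      # step S4: read head lane
    return [result if (mask >> i) & 1 else values[i] for i in range(warp)]
```

With a full mask 0xffffffff and summation, every lane of a 32-thread warp receives the warp total; with a grouped mask such as 0x3e0, only lanes 5-9 are combined and the other lanes keep their original inputs, matching step S5.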
In some embodiments, step S5 specifically includes:
Step S501: judge whether the current lane is marked in the mask as participating in the reduction; if so, go to step S502, otherwise go to step S503;
Step S502: return the result obtained in step S4;
Step S503: return the original data value passed in at step S1.
According to another aspect of the present disclosure, a masked reduction operation optimization device is provided, including:
an initialization module, which sets a mask for each thread in the warp based on the usage scenario, the bits of the mask marking in order whether the corresponding lanes in the warp participate in the reduction, and which passes in the data of each thread to be reduced;
a lane value acquisition module, used to obtain each thread's own lane value, which corresponds to the thread's position in the mask, and to obtain, based on the mask, the head lane value of the warp the current thread belongs to, the head lane value being the index of the first lane marked in the mask as participating in the reduction;
an operation module, used by each thread to perform the reduction with a stride decreasing from large to small, processing the data of the partner thread selected by the stride according to the mask before combining it with the current thread's data;
a result acquisition module, used by each thread to obtain the reduction result from the head lane of its warp;
and a result return module, used to return a final value according to each thread's mark in the mask.
In some embodiments, the operation module specifically includes:
a stride initialization module, for initializing the stride variable stride to half the total number of threads in the warp;
a first judging module, for judging whether stride is greater than 0;
a lane value obtaining module, for taking dest, the sum of the current lane value and stride, as the lane to read when stride is greater than 0, obtaining the data value from that lane and recording it as temp;
a second judging module, for judging whether dest is marked in the mask as participating in the reduction;
a reassigning module, for reassigning the obtained temp to a default value when dest is marked in the mask as not participating in the reduction;
a second operation module, for combining the current thread's data value with temp and storing the result in value;
and a third operation module, for shifting stride right by 1 bit.
In some embodiments, the result return module specifically includes:
a third judging module, for judging whether the current lane is marked in the mask as participating in the reduction;
and a first return module, for returning the corresponding result based on the judgment of the third judging module.
In some embodiments, the mask may mark several consecutive lanes as participating in the reduction, or mark several non-adjacent lanes as participating.
An embodiment of the present application also provides an electronic device including a memory and a processor, wherein the memory stores a computer program and the processor, by calling the computer program stored in the memory, executes the steps of the method in any of the above embodiments.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method in any of the above embodiments.
In this method, a mask is set for each thread in the warp based on the usage scenario; each thread obtains its own lane value, corresponding to its position in the mask, and obtains the head lane value of its warp based on the mask; each thread performs the reduction with a stride decreasing from large to small, processing the data of the partner thread according to the mask before combining it with its own data; each thread then obtains the result from the head lane of its warp, and a final value is returned according to each thread's mark in the mask. The method replaces shared-memory reads and writes with inter-thread data access and represents thread grouping through the bits of the mask, so that several thread groups can be reduced in a single call, improving reduction efficiency and saving storage space. At the same time, the method covers the existing reduction scenarios, keeps the API form consistent with mainstream GPU programming models, and its implementation does not depend on hardware-specific features.
Drawings
Fig. 1 is a schematic diagram of a masked reduction operation optimization method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of the use of a reduction API in a GPU according to an embodiment of the present application.
Fig. 3 is a schematic diagram of the overall execution flow of the reduction operation optimization according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a masked reduction operation optimization device according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings. The description of these embodiments is provided to assist understanding of the invention, but is not intended to limit it. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.
Example 1
Specifically, referring to fig. 1, the present disclosure provides a masked reduction operation optimization method. Thread grouping information is marked by a mask, and computation is performed through inter-thread data access; each computation is carried out by the threads of one warp. In a GPU, threads are scheduled and executed in units of warps, and the reduction here operates on the data of one warp; in general, the threads contained in a warp execute the same instruction, and each thread holds its own data value. A reduction operation is a parallel algorithm that produces a result by repeatedly applying an operation over several incoming data values. Such operations include minimum, sum of squares, logical AND/OR, vector dot product, maximum, summation, averaging and other types; it should be understood that this embodiment does not limit the type of operation. It can also be appreciated that such operations apply to a variety of scenarios; for example, the processed data may be audio, video, images, text, signals, etc. The basic implementation of this embodiment is as follows:
Step S1: set a mask for each thread in the warp based on the usage scenario, where the bits of the mask mark in order whether the corresponding thread lanes in the warp participate in the reduction, and pass in the data on which each thread's reduction is to be performed.
Before invoking the reduction API, each thread sets its mask according to the usage scenario, the bits of the mask marking in order whether the corresponding lanes in the warp participate in the reduction. In one embodiment, the mask may be represented as an unsigned integer of warpSize bits, where warpSize, the number of threads in a warp, is generally 32 or 64; a bit value of 1 means the lane with that index participates in the reduction, and 0 means it does not. For example, the hexadecimal mask 0xffffffff means that lanes 0-31 all participate. A lane here is the index of each thread among the warpSize parallel threads.
Step S2: each thread obtains its own lane value, which corresponds to its position in the mask, and obtains the head lane value of the warp it belongs to based on the mask; the head lane value is the index of the first lane marked in the mask as participating in the reduction.
Each thread obtains its own lane value self, then obtains the head lane value head of its thread group according to the mask value in the API. In some embodiments, the mask and value may be passed together through an API call; a schematic of reduction API usage in the GPU is shown in fig. 2.
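The bit convention described above can be sketched with two small helpers (a minimal illustration, assuming a 32-thread warp; the function names are not part of any API):

```python
WARP_SIZE = 32  # warpSize assumed to be 32, as in the example mask 0xffffffff

def participating_lanes(mask):
    """Lanes whose mask bit is 1 take part in the reduction."""
    return [lane for lane in range(WARP_SIZE) if (mask >> lane) & 1]

def head_lane(mask):
    """Head lane value of step S2: index of the lowest set bit of the mask."""
    return (mask & -mask).bit_length() - 1
```

For the full mask 0xffffffff, all 32 lanes participate and the head lane is 0; for a grouped mask such as 0x3e0, lanes 5-9 participate and the head lane is 5.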
Step S3: each thread performs the reduction with a stride decreasing from large to small, processing the data of the partner thread selected by the stride according to the mask before combining it with the current thread's data.
Each thread performs the reduction with a stride decreasing from large to small, accesses the partner thread's data through inter-thread data access, processes that data according to the mask setting, and then combines it with its own value.
It will be appreciated that the stride may be controlled by a variable and the reduction performed in a loop; this embodiment does not limit how the stride is set. Meanwhile, this embodiment realizes inter-thread communication through direct inter-thread data access and does not occupy the limited shared memory.
Step S4: each thread obtains the reduction result from the head lane of its warp.
Step S5: a final value is returned according to each thread's mark in the mask.
Each thread obtains the result from lane head of its warp and returns a final value depending on whether bit self of the mask is 1. The whole reduction is executed in parallel by warpSize threads, and the final goal is to reduce the value data of all warpSize threads; during the loop, each lane computes and stores a partial result of value in its own lane, and at the end of the algorithm only the result in lane head is correct.
The threads of the warp used in one computation are grouped through the mask setting, and the result obtained by the threads of each group is only that group's reduction result. The mask setting supports treating several consecutive threads as a group, treating non-adjacent threads as a group, and having only part of the threads participate in the reduction, which improves parallel efficiency.
In some embodiments, the overall execution flow of the reduction operation optimization is shown in fig. 3 and specifically includes the following steps:
Step 1: before calling the reduction API, each thread sets a mask according to the usage scenario and passes in the data value to be reduced.
Step 1.1: before entering the reduction algorithm, the threads of a warp may set a consistent mask value, as illustrated in Table 1 (split into upper and lower parts, Table 1 (upper) and Table 1 (lower)): in one specific application of the present disclosure, the warp size is 32, the reduction type is summation, and the mask of every thread is 0xffffffff. Different mask values may also be set, as illustrated in Table 2 (likewise split into Table 2 (upper) and Table 2 (lower)): in another specific application, the warp size is 32, the reduction type is summation, and the masks of the threads are 0x1f, 0x3e0, 0x7c00 and 0xf800, respectively. Each bit of the mask value marks whether the corresponding lane participates in the reduction; the mask may mark several consecutive lanes as 1, or several non-adjacent lanes as 1.
Step 1.2: determine the data value each thread is to compute on.
Step 2: each thread obtains its own lane value self, then obtains the head lane value head of its thread group according to the mask value in the API.
Step 2.1: obtain the current lane value through an API provided by the GPU programming model and record it as self.
Step 2.2: obtain the head lane value head of the current thread's group from the mask value, i.e., the index of the first bit marked 1, counting from bit 0 of the mask.
Step 3: each thread performs the reduction with a stride decreasing from large to small, accesses the partner thread's data through inter-thread data access, processes that data according to the mask setting, and then combines it with its own value.
Step 3.1: initialize the stride variable stride to half the total number of threads in the warp, then go to step 3.2.
Step 3.2: judge whether stride is greater than 0; if so, go to step 3.3, otherwise go to step 4.
Step 3.3: take dest, the sum of self and stride, as the lane to read; obtain the data value passed into that lane, using an API provided by the GPU programming model, and record it as temp; then go to step 3.4.
Step 3.4: judge whether the bit of the mask corresponding to the dest obtained in step 3.3 is marked 1; if so, go to step 3.6, otherwise go to step 3.5.
Step 3.5: reassign the temp obtained in step 3.3 to a default value; that is, if that lane does not participate in the reduction, the data read from it is replaced by a default value. The default value depends on the reduction type: 0 for addition, the minimum of the value data type for a maximum operation, the maximum of the value data type for a minimum operation, 1 for logical AND, 0 for logical OR, and 0 for logical XOR. Then go to step 3.6.
Step 3.6: combine the current thread's data value with temp and store the result in value, i.e., keep it in the current thread's own data; then go to step 3.7.
Step 3.7: shift stride right by 1 bit, then go to step 3.2.
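The default values of step 3.5 are the identity elements of the respective operators: combining any value with the default leaves it unchanged, so excluded lanes cannot affect the group result. A sketch (assuming 32-bit signed integer data; the dictionary and its keys are illustrative, not part of any API):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # bounds of the assumed int32 data type

# Identity ("default") value per reduction type, as listed in step 3.5.
DEFAULTS = {
    "add": 0,          # x + 0 == x
    "max": INT32_MIN,  # max(x, INT32_MIN) == x for any int32 x
    "min": INT32_MAX,  # min(x, INT32_MAX) == x for any int32 x
    "and": 1,          # logical AND with true is identity
    "or": 0,           # logical OR with false is identity
    "xor": 0,          # XOR with 0 is identity
}
```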
step 4: each thread obtains the result of the channel head in the wrap;
step 4.1, acquiring a value after the step 3 is completed from the thread channel head by using an API provided by the GPU programming model;
step 5: returning a final value according to whether self is marked as 1 in the mask;
step 5.1, judging whether the bit of self corresponding to the mask is marked as 1, if yes, entering step 5.2, otherwise, entering step 5.3;
step 5.2, returning to the value obtained in step 4.1;
step 5.3 returns the original value entered in step 1.
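The grouped-mask configuration of step 1.1 (Table 2) can be checked with a short sketch: each mask defines one group, and each group's head lane ends up holding the sum of that group's inputs. The mask values are those given in the text; the per-lane data is illustrative.

```python
group_masks = [0x1f, 0x3e0, 0x7c00, 0xf800]  # masks from step 1.1 (Table 2)
values = list(range(32))                     # lane i holds value i (illustrative)

# Expected per-group result: the sum over the lanes selected by each mask,
# i.e. what each group's head lane holds after the reduction.
group_sums = [
    sum(values[lane] for lane in range(32) if (mask >> lane) & 1)
    for mask in group_masks
]
print(group_sums)  # -> [10, 35, 60, 65]
```

Here 0x1f selects lanes 0-4, 0x3e0 lanes 5-9, 0x7c00 lanes 10-14 and 0xf800 lanes 11-15; as the text notes, a mask may select any subset of lanes, and the groups need not cover the whole warp.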
Example two
To achieve the above object, this embodiment proposes a masked reduction operation optimization device; referring specifically to fig. 4, it includes:
an initialization module, which sets a mask for each thread in the warp based on the usage scenario, the bits of the mask marking in order whether the corresponding lanes in the warp participate in the reduction, and which passes in the data of each thread to be reduced;
a lane value acquisition module, used to obtain each thread's own lane value, which corresponds to the thread's position in the mask, and to obtain, based on the mask, the head lane value of the warp the current thread belongs to, the head lane value being the index of the first lane marked in the mask as participating in the reduction;
an operation module, used by each thread to perform the reduction with a stride decreasing from large to small, processing the data of the partner thread selected by the stride according to the mask before combining it with the current thread's data;
a result acquisition module, used by each thread to obtain the reduction result from the head lane of its warp;
and a result return module, used to return a final value according to each thread's mark in the mask.
In some embodiments, the operation module specifically includes:
a stride initialization module, for initializing the stride variable stride to half the total number of threads in the warp;
a first judging module, for judging whether stride is greater than 0;
a lane value obtaining module, for taking dest, the sum of the current lane value and stride, as the lane to read when stride is greater than 0, obtaining the data value from that lane and recording it as temp;
a second judging module, for judging whether dest is marked in the mask as participating in the reduction;
a reassigning module, for reassigning the obtained temp to a default value when dest is marked in the mask as not participating in the reduction;
a second operation module, for combining the current thread's data value with temp and storing the result in value;
and a third operation module, for shifting stride right by 1 bit.
In some embodiments, the result return module specifically includes:
a third judging module, for judging whether the current lane is marked in the mask as participating in the reduction;
and a first return module, for returning the corresponding result based on the judgment of the third judging module.
In some embodiments, the mask may mark several consecutive lanes as participating in the reduction, or mark several non-adjacent lanes as participating.
Example III
Correspondingly, an embodiment of the present application also provides an electronic device, which may be a terminal or a server. Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The electronic device 500 includes a processor 501 having one or more processing cores, a memory 502 having one or more computer readable storage media, and a computer program stored on the memory 502 and executable on the processor. The processor 501 is electrically connected to the memory 502. It will be appreciated by those skilled in the art that the electronic device structure shown in the figures is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The processor 501 is a control center of the electronic device 500, connects various parts of the entire electronic device 500 using various interfaces and lines, and performs various functions of the electronic device 500 and processes data by running or loading software programs (computer programs) and/or units stored in the memory 502, and calling data stored in the memory 502, thereby performing overall monitoring of the electronic device 500.
In the embodiment of the present application, the processor 501 of the electronic device 500 loads the instructions corresponding to the processes of one or more application programs into the memory 502, and by executing the application programs stored in the memory 502 implements various functions according to the following steps:
Step S1: set a mask for each thread in the warp based on the usage scenario, where the bits of the mask mark, in order, whether the corresponding thread lanes in the warp participate in the reduction operation, and pass in the data of each thread on which the reduction is to be performed;
Step S2: each thread obtains its own lane value, which corresponds to the thread's position in the mask, and obtains the head lane value of the warp it belongs to based on the mask, the head lane value being the index of the first lane marked in the mask as participating in the reduction;
Step S3: each thread performs the reduction with a stride decreasing from large to small, processing the data of the partner thread selected by the stride according to the mask before combining it with the current thread's data;
Step S4: each thread obtains the reduction result from the head lane of its warp;
Step S5: a final value is returned according to each thread's mark in the mask.
The specific implementation of each operation may refer to the foregoing embodiments, and will not be repeated herein.
Optionally, as shown in fig. 5, the electronic device 500 further includes: a protocol operation optimization module 503, a communication module 504, an input unit 505, and a power supply 506. The processor 501 is electrically connected to the protocol operation optimization module 503, the communication module 504, the input unit 505, and the power supply 506, respectively. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 5 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The reduction operation optimization module 503 may be used to optimize reduction operations.
The communication module 504 may be used to communicate with other devices.
The input unit 505 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 506 is used to power the various components of the electronic device 500. Optionally, the power supply 506 may be logically connected to the processor 501 through a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 506 may also include one or more of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
Example IV
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods in the above embodiments may be completed by instructions, or by instructions controlling the associated hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium having stored therein a plurality of computer programs that can be loaded by a processor to perform the steps of the masked reduction operation optimization method provided by the embodiments of the present application. For example, the computer program may perform the following steps:
step S1, setting a mask for each thread in the warp based on the usage scenario, the mask marking, in channel order, whether each corresponding thread channel in the warp participates in the reduction operation, and passing in the data of each thread on which the reduction operation is to be performed;
step S2, each thread obtaining its own channel value, which corresponds to the thread's index in the mask, and obtaining, based on the mask, the first-thread-channel value of the warp to which the current thread belongs, the first-thread-channel value being the index of the first thread channel marked in the mask as participating in the reduction;
step S3, each thread performing the reduction operation with a stride decreasing from large to small, processing, based on the mask, the data of the other thread paired with it in the reduction and then combining that data with the current thread's own data;
step S4, each thread obtaining the reduction result held in the first thread channel of its warp;
step S5, returning a final value according to each thread's mark in the mask.
The specific implementation of each operation above may refer to the foregoing embodiments, and will not be repeated herein.
Wherein the computer-readable storage medium may comprise: read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, and the like.
Because the computer program stored in the storage medium can perform any step of the masked reduction operation optimization method provided in the embodiments of the present application, it can achieve the beneficial effects achievable by any such method; details are given in the foregoing embodiments and are not repeated here.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims (9)

1. A method of optimizing a masked reduction operation, the method comprising:
step S1, setting a mask for each thread in the warp based on the usage scenario, the mask marking, in channel order, whether each corresponding thread channel in the warp participates in the reduction operation, and passing in the data of each thread on which the reduction operation is to be performed;
step S2, each thread obtaining its own channel value, which corresponds to the thread's index in the mask, and obtaining, based on the mask, the first-thread-channel value of the warp to which the current thread belongs, the first-thread-channel value being the index of the first thread channel marked in the mask as participating in the reduction;
step S3, each thread performing the reduction operation with a stride decreasing from large to small, processing, based on the mask, the data of the other thread paired with it in the reduction and then combining that data with the current thread's own data;
step S4, each thread obtaining the reduction result held in the first thread channel of its warp;
step S5, returning a final value according to each thread's mark in the mask;
the step S3 specifically includes:
step S301, initializing a stride variable stride to half of the total number of threads in the warp, and proceeding to step S302;
step S302, judging whether stride is greater than 0; if yes, proceeding to step S303, otherwise proceeding to step S4;
step S303, taking the value dest, obtained by adding the current thread's channel value and stride, as the thread channel to be read, obtaining the data value from that thread channel, recording it as temp, and proceeding to step S304;
step S304, judging whether the channel dest obtained in step S303 is marked in the mask as participating in the reduction operation; if yes, proceeding to step S306, and if not, proceeding to step S305;
step S305, reassigning the temp obtained in step S303 to a default value, and proceeding to step S306;
step S306, combining the current thread's data value with temp, storing the operation result in value, and proceeding to step S307;
step S307, shifting the stride value right by 1 bit, and returning to step S302.
2. The method according to claim 1, characterized in that:
the mask may mark a continuous plurality of thread channels for specification operation, or may mark a plurality of thread channels for specification operation at intervals.
3. The method according to claim 1, characterized in that:
the current thread accesses data of another thread based on inter-thread data.
4. A method according to any one of claims 1-3, characterized in that:
the step S5 is specifically described as,
step S501, judging whether the current thread channel is marked as participating in the protocol operation in the mask, if so, entering step S502, and if not, entering step S503;
step S502, returning to the operation result obtained in the step S4;
step S503, returning to the original data value entered in step S1.
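The S501-S503 branch reduces to a single per-lane selection; the helper name `final_value` below is chosen for illustration:

```python
def final_value(lane_marked, reduced, original):
    """Step S5: a lane marked in the mask returns the reduction result (S502);
    an unmarked lane returns its original input from step S1 (S503)."""
    return reduced if lane_marked else original

print(final_value(True, 15, 4))    # → 15 (marked lane gets the result)
print(final_value(False, 15, 70))  # → 70 (unmarked lane keeps its input)
```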
5. A masked reduction operation optimization apparatus, the apparatus comprising:
an initialization module, configured to set a mask for each thread in the warp based on the usage scenario, the mask marking, in channel order, whether each corresponding thread channel in the warp participates in the reduction operation, and to pass in the data of each thread on which the reduction operation is to be performed;
a channel value acquisition module, configured to acquire each thread's own channel value, which corresponds to the thread's index in the mask, and to acquire, based on the mask, the first-thread-channel value of the warp to which the current thread belongs, the first-thread-channel value being the index of the first thread channel marked in the mask as participating in the reduction;
an operation module, configured for each thread to perform the reduction operation with a stride decreasing from large to small, processing, based on the mask, the data of the other thread paired with it in the reduction and then combining that data with the current thread's own data;
a result acquisition module, configured for each thread to acquire the reduction result held in the first thread channel of its warp;
and a result return module, configured to return a final value according to each thread's mark in the mask;
wherein the operation module specifically comprises:
an initialization stride variable module, configured to initialize a stride variable stride to half of the total number of threads in the warp;
a first judging module, configured to judge whether stride is greater than 0;
a thread channel value acquisition module, configured to, if stride is greater than 0, take the value dest obtained by adding the current thread's channel value and stride as the thread channel to be read, obtain the data value from that thread channel, and record it as temp;
a second judging module, configured to judge whether the channel dest is marked in the mask as participating in the reduction operation;
a reassigning module, configured to reassign the acquired temp to a default value if the channel dest is marked in the mask as not participating in the reduction operation;
a second operation module, configured to combine the current thread's data value with temp and store the operation result in value;
and a third operation module, configured to shift the stride value right by 1 bit.
6. The apparatus of claim 5, wherein the result return module specifically comprises:
a third judging module, configured to judge whether the current thread channel is marked in the mask as participating in the reduction operation;
and a first return module, configured to return the corresponding operation result based on the judgment result of the third judging module.
7. The apparatus according to any one of claims 5-6, wherein:
the mask may mark a continuous plurality of thread channels for specification operation, or may mark a plurality of thread channels for specification operation at intervals.
8. An electronic device, characterized in that: comprising a memory storing executable program code and a processor coupled to the memory; wherein the processor invokes executable program code stored in the memory to perform the method of any of claims 1-4.
9. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, performs the method of any of claims 1-4.
CN202311585448.XA 2023-11-27 2023-11-27 Protocol operation optimization method, device and equipment with mask and medium Active CN117311988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311585448.XA CN117311988B (en) 2023-11-27 2023-11-27 Protocol operation optimization method, device and equipment with mask and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311585448.XA CN117311988B (en) 2023-11-27 2023-11-27 Protocol operation optimization method, device and equipment with mask and medium

Publications (2)

Publication Number Publication Date
CN117311988A CN117311988A (en) 2023-12-29
CN117311988B true CN117311988B (en) 2024-03-12

Family

ID=89273874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311585448.XA Active CN117311988B (en) 2023-11-27 2023-11-27 Protocol operation optimization method, device and equipment with mask and medium

Country Status (1)

Country Link
CN (1) CN117311988B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104375899A (en) * 2014-11-21 2015-02-25 北京应用物理与计算数学研究所 Thread for high-performance computer NUMA perception and memory resource optimizing method and system
CN112241290A (en) * 2019-07-16 2021-01-19 辉达公司 Techniques for efficiently performing data conventions in parallel processing units
CN112783554A (en) * 2019-11-07 2021-05-11 辉达公司 Persistent scratchpad memory for inter-program data exchange
CN115437799A (en) * 2021-06-03 2022-12-06 辉达公司 Techniques for efficiently synchronizing multiple program threads
CN116361346A (en) * 2023-06-02 2023-06-30 山东浪潮科学研究院有限公司 Data table analysis method, device and equipment based on mask calculation and storage medium
CN116959540A (en) * 2023-08-16 2023-10-27 沐曦集成电路(上海)有限公司 Data verification system with writemask
CN117075909A (en) * 2023-10-11 2023-11-17 沐曦集成电路(南京)有限公司 Compiling method, electronic device and medium for realizing parallel programming

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424038B2 (en) * 2012-12-10 2016-08-23 Nvidia Corporation Compiler-controlled region scheduling for SIMD execution of threads
US11119820B2 (en) * 2019-03-15 2021-09-14 Intel Corporation Local memory sharing between kernels
CN112083954A (en) * 2019-06-13 2020-12-15 华夏芯(北京)通用处理器技术有限公司 Mask operation method of explicit independent mask register in GPU

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research and Implementation of GPU Covert Channels; Nie Fangzheng; China Master's Theses Full-text Database, Information Science and Technology (No. 7); I137-26 *
Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads;Teo Milanez等;《Parallel Computing》;第40卷(第9期);第548-558页 *
Compilation Optimization for Non-uniform Control Flow on DCU; Yang Xiaoyi et al.; Journal of Computer Applications; Vol. 38, No. 10, pp. 3170-3177 *

Also Published As

Publication number Publication date
CN117311988A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
CN110908667B (en) Method and device for joint compilation of neural network and electronic equipment
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
CN105320561A (en) Task management method and system
CN112035238A (en) Task scheduling processing method and device, cluster system and readable storage medium
US20230244537A1 (en) Efficient gpu resource allocation optimization method and system
CN110333946A (en) One kind being based on artificial intelligence cpu data processing system and method
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN111047022A (en) Computing device and related product
CN108664447A (en) A kind of multiplying method and device of matrix and vector
WO2016024508A1 (en) Multiprocessor device
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN109214512A (en) A kind of parameter exchange method, apparatus, server and the storage medium of deep learning
CN117311988B (en) Protocol operation optimization method, device and equipment with mask and medium
CN114443142A (en) Method, device, chip, electronic equipment and storage medium for processing loop instruction
CN115034365A (en) Multi-core parallel computing method for neural network processor
CN114237878A (en) Instruction control method, circuit, device and related equipment
CN111178373B (en) Operation method, device and related product
CN104391929A (en) Data flow transmitting method in ETL (extract, transform and load)
CN113902088A (en) Method, device and system for searching neural network structure
CN112306922B (en) Multi-data-to-multi-port arbitration method and related device
CN115858178B (en) Method, device, medium and equipment for sharing resources in convolution calculation
CN117291240B (en) Convolutional neural network accelerator and electronic device
US11640302B2 (en) SMID processing unit performing concurrent load/store and ALU operations
Calazan et al. Three alternatives for parallel GPU-based implementations of high performance particle swarm optimization
CN117391148A (en) Convolution calculation unit, AI operation array and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant