CN111488177A - Data processing method, data processing device, computer equipment and storage medium

Info

Publication number
CN111488177A
Authority
CN
China
Prior art keywords
data
instruction
target
data set
task
Prior art date
Legal status
Pending
Application number
CN202010290113.5A
Other languages
Chinese (zh)
Inventor
方佳瑞
赵成舵
于洋
周杰
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010290113.5A
Publication of CN111488177A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The application relates to a data processing method, a data processing apparatus, a computer device and a storage medium. The method comprises the following steps: acquiring a target task, and acquiring more than one data set corresponding to the target task in parallel; determining an instruction stream corresponding to the target task, the instruction stream comprising more than one kind of target instruction with a determined trigger order; and cyclically triggering the various kinds of target instructions according to the trigger order, and, each time a target instruction is triggered, taking each data set of the more than one data sets in turn as the trigger object, until the triggered target instructions have all been executed to obtain an operation result corresponding to the target task. By adopting the method, data processing efficiency can be improved.

Description

Data processing method, data processing device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, machine learning technology has emerged. Various machine learning models based on machine learning technology can improve model accuracy through a large amount of computation, and in practical applications rich online services can be deployed through machine learning models. A machine learning model usually includes a plurality of operators, and if the operation time of an operator is too long, the service response becomes too slow. For example, for the BERT model (a general pre-training language representation model) in the natural language field, rich online service scenarios may be deployed based on the BERT model in practical application scenarios. In such scenarios, a Graphics Processing Unit (GPU) is often used to process the BERT service computation in parallel to improve online response speed and reduce service delay. Softmax (normalization) is an important operator in the BERT model; if this operator runs on the GPU for a long time, the overall BERT inference task becomes inefficient.
In a conventional scheme, to improve the efficiency of batch processing of a large amount of data, the data to be processed is often divided into a set of minimum processing units, a plurality of minimum processing units are combined into a batch, and the data of each batch is processed in parallel. When a minimum unit is processed, if a plurality of instructions need to be issued and the operands of different instructions have dependency relationships, a subsequent instruction can only be issued after the execution of the preceding instruction has finished, which easily causes instruction-issue stalls. As a result, the overall processing of the plurality of minimum processing units takes a very long time, and the efficiency of batch processing of a large amount of data is low.
Disclosure of Invention
In view of the above, in order to solve the above technical problems, it is necessary to provide a data processing method, apparatus, computer device, and storage medium capable of improving processing efficiency when batch processing is performed on a large amount of data.
A method of data processing, the method comprising:
acquiring a target task, and acquiring more than one data set corresponding to the target task in parallel;
determining an instruction stream corresponding to the target task, the instruction stream comprising more than one kind of target instruction with a determined trigger order; and
cyclically triggering the various kinds of target instructions according to the trigger order, and, each time a target instruction is triggered, taking each data set of the more than one data sets in turn as the trigger object, until the triggered target instructions have all been executed to obtain an operation result corresponding to the target task.
A data processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire a target task and acquire more than one data set corresponding to the target task in parallel;
a determining module, configured to determine an instruction stream corresponding to the target task, the instruction stream comprising more than one kind of target instruction with a determined trigger order; and
an instruction triggering module, configured to cyclically trigger the various kinds of target instructions according to the trigger order and, each time a target instruction is triggered, take each data set of the more than one data sets in turn as the trigger object, until the triggered target instructions have all been executed to obtain an operation result corresponding to the target task.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a target task, and acquiring more than one data set corresponding to the target task in parallel;
determining an instruction stream corresponding to the target task, the instruction stream comprising more than one kind of target instruction with a determined trigger order; and
cyclically triggering the various kinds of target instructions according to the trigger order, and, each time a target instruction is triggered, taking each data set of the more than one data sets in turn as the trigger object, until the triggered target instructions have all been executed to obtain an operation result corresponding to the target task.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a target task, and acquiring more than one data set corresponding to the target task in parallel;
determining an instruction stream corresponding to the target task, the instruction stream comprising more than one kind of target instruction with a determined trigger order; and
cyclically triggering the various kinds of target instructions according to the trigger order, and, each time a target instruction is triggered, taking each data set of the more than one data sets in turn as the trigger object, until the triggered target instructions have all been executed to obtain an operation result corresponding to the target task.
According to the data processing method, the data processing apparatus, the computer device and the storage medium, in a scenario where a large amount of data needs to be processed, more than one data set corresponding to a target task is acquired in parallel, and an instruction stream corresponding to the target task is determined, the instruction stream comprising more than one kind of target instruction with a determined trigger order. Then, when the more than one data sets are processed, the various kinds of target instructions can be triggered cyclically according to the trigger order, and each time a target instruction is triggered, each data set of the more than one data sets is taken in turn as the trigger object, until the triggered target instructions have all been executed to obtain the operation result corresponding to the target task. In this way, within one cyclic triggering pass, the same kind of target instruction is triggered alternately for the different data sets, and then the next kind of target instruction is triggered alternately for the different data sets, until every kind of target instruction in the instruction stream has been triggered in turn, thereby realizing pipelined issuing of instructions. The parallel granularity of the target task is thus adjusted by increasing instruction-level parallelism, rather than executing the target task for a single data set in one go, which balances resource occupancy against parallel efficiency. For the more than one data sets as a whole, this parallel, alternating processing mode can greatly improve data processing efficiency.
Drawings
FIG. 1 is a diagram of an application environment of a data processing method in one embodiment;
FIG. 2 is a flow diagram illustrating a data processing method according to one embodiment;
FIG. 3 is a diagram illustrating the parallel partitioning of target tasks by a graphics processor, according to one embodiment;
FIG. 4 is a schematic diagram illustrating a comparison between an instruction-issue code fragment in a conventional manner and an instruction-issue code fragment in the present application, in another embodiment;
FIG. 5 is a flowchart illustrating steps of triggering various target instructions cyclically according to a triggering order in one embodiment, and when each target instruction is triggered, sequentially using each data set of more than one data set as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to a target task;
FIG. 6 is a design overview of an online service employing a Transformer inference engine system, under an embodiment;
FIG. 7(A) is a diagram illustrating the implementation of the Softmax operator in the conventional scheme according to one embodiment;
FIG. 7(B) is a diagram illustrating the implementation of the Softmax operator in the present application, in one embodiment;
FIG. 8(A) is a graph illustrating a comparison of performance of Softmax calculations implemented by the present application on a processor with Softmax calculations in a conventional scheme, in one embodiment;
FIG. 8(B) is a graph illustrating performance acceleration of the Softmax calculation implemented by the present application on a processor relative to the Softmax calculation in a conventional scheme, in one embodiment;
FIG. 8(C) is a graph illustrating a comparison of performance of Softmax calculations implemented by the present application on another processor in one embodiment with Softmax calculations in a conventional scheme;
FIG. 8(D) is a graph illustrating performance acceleration of the Softmax calculation implemented by the present application on another processor in one embodiment relative to the Softmax calculation in a conventional scheme;
FIG. 9 is a block diagram showing the structure of a data processing apparatus according to an embodiment;
FIG. 10 is a block diagram showing the structure of a data processing apparatus according to an embodiment;
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data processing method provided by the application can be applied to the application environment shown in fig. 1. Wherein user terminal 110 communicates with computer device 120 over a network. A user may initiate a service request through the user terminal 110 and the computer device 120 generates more than one computing task based on the service request, wherein the more than one computing task includes a target task. The computer device 120 may obtain the target task and obtain more than one data set corresponding to the target task in parallel; determining an instruction stream corresponding to the target task; the instruction stream comprises more than one target instruction determined by the trigger sequence; and circularly triggering various target instructions according to the triggering sequence, and when each target instruction is triggered, sequentially taking each data set of more than one data set as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to the target task. The computer device 120 may determine a service processing result corresponding to the service request based on each operation result and feed back the service processing result to the user terminal 110.
The user terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The computer device 120 may specifically be a terminal or a server, and the server may be implemented by an independent server or a server cluster composed of a plurality of servers.
It should be noted that the computer device 120 may be deployed with machine learning models having different functions, and different kinds of online services may be deployed through these models; for example, the computer device may provide a chat robot service, a search service, or a reading comprehension service through a machine learning model with a natural language processing (NLP) function (e.g., a BERT model).
The normalization operation maps a set of inputs to real numbers between 0 and 1 whose sum is 1, satisfying the form of a probability distribution. The formula is as follows:

$$\mathrm{softmax}(x_j) = \frac{\exp(x_j)}{\sum_{i=1}^{|x|}\exp(x_i)}$$

where $x_j$ denotes the j-th element of the one-dimensional array and $\sum_{i=1}^{|x|}\exp(x_i)$ denotes the sum of the array elements after the $\exp(x_i)$ operation. The denominator sums over $|x|$ elements, i.e., it is a reduction operation. In machine learning the operand of such a reduction operation is often a multidimensional tensor, which can be regarded as a batched version of the above formula: a high-dimensional input can be flattened and viewed as a two-dimensional matrix of shape (high_dim, leading_dim). When the reduction operation is performed on one array of length leading_dim, a total of high_dim such reduction operations, i.e., a batch of reduction operations, needs to be performed. When the target task is a reduction task, performing batched reduction on a large amount of data to be processed by the data processing method can greatly improve processing efficiency. Of course, the target task may also be another computing task that needs batch processing, which may depend on the specific application scenario; this is not limited in the embodiments of the present application.
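For concreteness, a naive CUDA sketch of this batched-reduction view follows; the kernel name, memory layout and one-thread-per-row mapping are our assumptions for illustration, not the patent's code, and this is the kind of baseline the embodiments below accelerate:

// Naive baseline (illustrative assumption): one thread per row of the
// (high_dim, leading_dim) matrix, each performing one full reduction.
// There are high_dim independent reductions, i.e. a batch of reductions.
__global__ void rowSumNaive(const float *in, float *out,
                            int high_dim, int leading_dim) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= high_dim) return;
    float sum = 0.0f;
    for (int j = 0; j < leading_dim; ++j)   // one full reduction per row
        sum += in[(size_t)row * leading_dim + j];
    out[row] = sum;                         // the reduction result of this row
}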
It should be noted that the data processing method mentioned in the embodiments of the present application mainly addresses the situation in which a machine learning model implemented by artificial intelligence technology needs to execute more than one computing task during operation, which includes the target task mentioned in the embodiments of the present application. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
It can be understood that the data processing method in the embodiments of the present application specifically relates to the machine learning technology of artificial intelligence. Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specially studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance.
The scheme provided by the embodiment of the application relates to an artificial intelligence machine learning technology, and is specifically explained in detail through the following embodiments:
in one embodiment, as shown in fig. 2, a data processing method is provided, which is described by taking the method as an example applied to the computer device 120 in fig. 1, and comprises the following steps:
step S202, acquiring a target task, and acquiring more than one data set corresponding to the target task in parallel.
The target task is a computing task to be executed, and may specifically be a reduction task. A data set is a collection of data; each datum in the data set can be regarded as an element, so a data set can also be called an element set. In one embodiment, the data in a data set may specifically be ordered data, such as a one-dimensional array.
Specifically, the computer device may receive a service request sent by the user terminal and generate a series of computing tasks based on the service request, where one of the computing tasks may specifically be the target task, and the target task may specifically be a reduction task. The computer device can obtain the data to be processed corresponding to the target task and decompose the data to be processed into more than one data block, where each data block may be composed of a plurality of data sets. The computer device can acquire a preset number of data sets in parallel at a time, and then execute the data processing method mentioned in the embodiments of the present application on the preset number of data sets to obtain the operation result corresponding to each data set, where the preset number is greater than one.
It will be appreciated that the series of computational tasks may be dependent, i.e., have a fixed order of execution. For example, after a certain computing task is completed to obtain a corresponding computing result, the subsequent computing tasks may execute the corresponding computing task based on the computing result. Of course, different computing tasks may also be executed in parallel, and there is no dependency relationship between them, which is not limited in the embodiment of the present application.
In one embodiment, each of the more than one data sets may specifically be a data set composed of raw data on which the target task has not yet been executed. Each of the more than one data sets may also be a data set composed of data on which the target task needs to be executed again, i.e., composed of intermediate operation results obtained from a preceding execution of the target task; this is not limited in the embodiments of the present application.
In an embodiment, before acquiring more than one data set corresponding to the target task in parallel, the data processing method further includes a data segmentation step, which specifically includes: acquiring an input matrix to be processed corresponding to the target task; cutting the input matrix into more than one data block along the high-dimensional direction according to a preset partition size; and dividing each data block into at least one first data group, the first data group including a preset number of data sets, the preset number being greater than one.
The input matrix is the source data of the target task to be executed, and may also be regarded as the data to be processed, that is, the sum of the data sets of all the target tasks to be executed. The input matrix may specifically be a feature-vector matrix obtained by the machine learning model during processing. The input matrix can be regarded as a two-dimensional matrix of shape (high_dim, leading_dim), and each one-dimensional array of length leading_dim can constitute one data set mentioned in the embodiments of the present application; the input matrix contains high_dim such data sets.
Specifically, the computer device may obtain the input matrix to be processed corresponding to the target task. Further, the input matrix is divided into more than one data block along the high dimension (high_dim) direction according to a preset partition size (for example, blk_size rows per block). The computer device can then process the data blocks in parallel so as to balance resource consumption and processing efficiency. In processing each data block, the computer device may divide the data block into at least one first data group, each first data group including a preset number of data sets, where the preset number is greater than one. That is, the computer device may form a preset number of data sets into one first data group and process one such group in parallel at a time. For example, when the preset number is 2, the computer device may perform reduction processing on 2 data sets in parallel at a time.
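A short host-side sketch of this segmentation arithmetic follows; blk_size and a group size of 2 are illustrative assumptions, not values fixed by the patent:

// Sketch of the segmentation described above: the (high_dim, leading_dim)
// input matrix is cut along the high dimension into data blocks of blk_size
// rows, and each data block is divided into first data groups of GROUP_SIZE
// data sets (the "preset number", greater than one).
constexpr int GROUP_SIZE = 2;

int numDataBlocks(int high_dim, int blk_size) {
    return (high_dim + blk_size - 1) / blk_size;      // ceil(high_dim / blk_size)
}

int numFirstDataGroupsPerBlock(int blk_size) {
    return (blk_size + GROUP_SIZE - 1) / GROUP_SIZE;  // ceil(blk_size / GROUP_SIZE)
}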
It will be appreciated that for each block of data, the computer device may process each block of data in parallel, as hardware resources allow. Of course, the computer device may also process another data block after completing the processing of one data block, which is not limited in this embodiment of the present application.
It should be noted that, facing rich online service scenarios based on machine learning models, the data processing volume is large, so a large number of computing processes can be supported by computer devices deployed with a Graphics Processing Unit (GPU); by processing a large number of computing tasks in parallel, the online response speed is improved and the service delay is reduced. Mapping such batched task operations onto a parallel architecture such as a GPU efficiently requires special techniques. Based on this, various embodiments of the present application provide an optimization method for efficient GPU-oriented batch task processing to accelerate the operation process; the core idea is to exploit the parallel operation capability along the high dimension of the input matrix.
In one embodiment, referring to FIG. 3, FIG. 3 is a schematic diagram of a graphics processor partitioning a target task in parallel in one embodiment. As shown in fig. 3, a computer device may assign one data block to one thread block for processing, where a thread block includes a number of threads and each thread processes one or more elements. Each thread block, corresponding to one data block, is hardware-scheduled to run on one SM (streaming multiprocessor) of the GPU. Within the SM are a number of SP (streaming processor) units, each running one thread of the thread block. The threads in the thread block are scheduled in units of Warp (thread bundle). The Warp is the most basic parallel granularity of the GPU; the threads of a thread block are divided into several Warps, each with 32 threads. The 32 threads in a Warp work together and execute the same instruction.
In one embodiment, for each data block, the computer device may assign it a corresponding thread block. The computer device may divide the data block into at least one first data group and execute the target task with the first data group as a whole. That is, the computer device may assign a corresponding thread bundle to each first data group, where one element of each data set in the first data group is stored in a thread register of the corresponding thread bundle for invocation by the corresponding thread. In this way, the computer device can sequentially execute the target operation corresponding to the target task on the respective first data groups, thereby realizing the processing of the data block. When each data block is processed, one first data group is processed at a time, so the occupancy of the GPU stream processors and the parallel efficiency are balanced by adjusting the task parallel granularity, which greatly improves data processing efficiency.
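A hedged sketch of the launch configuration implied by this mapping follows; the kernel name, the block size of 4 rows and the group size of 2 are illustrative assumptions, not values from the patent:

// One CUDA thread block per data block, one warp (32 threads) per first
// data group, with one element of each data set held in a thread register.
__global__ void batchedReduceKernel(const float *in, float *out,
                                    int high_dim, int leading_dim);  // placeholder

void launchBatchedReduce(const float *in, float *out,
                         int high_dim, int leading_dim) {
    const int groupSize = 2;                        // data sets per first data group
    const int blkSize   = 4;                        // data sets (rows) per data block
    dim3 grid((high_dim + blkSize - 1) / blkSize);  // one thread block per data block
    dim3 block(32 * (blkSize / groupSize));         // one warp per first data group
    batchedReduceKernel<<<grid, block>>>(in, out, high_dim, leading_dim);
}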
In the above embodiment, the input matrix is divided into more than one data block, so that a large calculation amount can be split into smaller calculation amounts which can be processed in batches. When each data block is processed, the parallel granularity of the tasks is adjusted, and the resource occupancy rate and the parallel efficiency can be balanced, so that the data processing efficiency is greatly improved on the whole.
Step S204: determining an instruction stream corresponding to the target task; the instruction stream includes more than one kind of target instruction with a determined trigger order.
An instruction stream is a sequence of instructions that a computer program needs to execute. The instruction stream mentioned in the embodiments of the present application includes more than one kind of target instruction with a determined trigger order. After a target instruction is triggered, the computer device executes the target operation corresponding to that target instruction. Different kinds of target instructions may correspond to different target operations. For example, when the target task is a reduction task, the corresponding instruction stream includes a shuffle instruction and a sum instruction, and the target operations corresponding to the target instructions may specifically include a shuffle operation and a sum operation.
It can be understood that each target instruction in the instruction stream is an instruction related to a target task, and execution of the target task can be achieved after different kinds of target instructions are triggered to execute. For example, when the target task is a summation task, the target instructions in the corresponding instruction stream may be a shuffle instruction and a summation instruction. When the target task is a max task, the corresponding target instruction may specifically include a shuffle instruction and a max instruction. When the target task is a compute variance task, the corresponding target instructions may specifically include a shuffle instruction, a multiply instruction, a sum instruction, and a divide instruction, etc. It is to be understood that the target instructions included in the instruction stream are related to specific target tasks, and the embodiments of the present application do not limit this.
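To illustrate how different target tasks correspond to different instruction streams, the following is a minimal CUDA sketch, with functor and function names of our own choosing (not the patent's code), in which the same shuffle-based loop realizes different target tasks by swapping the combining operation:

// Different target tasks (sum, max, ...) share the shuffle loop and differ
// only in the combining operation executed after each shuffle.
struct SumOp { __device__ float operator()(float a, float b) const { return a + b; } };
struct MaxOp { __device__ float operator()(float a, float b) const { return fmaxf(a, b); } };

template <typename Op>
__device__ float warpAllReduce(float v, Op op) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v = op(v, __shfl_xor_sync(0xffffffff, v, offset, 32));  // shuffle, then combine
    return v;  // every lane of the warp ends up holding the reduced value
}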
Specifically, after determining the target task, the computer device may determine the instruction stream required for executing the target task, where the instruction stream includes more than one kind of target instruction with a determined trigger order. A determined trigger order means that the firing order of each of the more than one kinds of target instructions is fixed. For example, when the instruction stream includes a first instruction and a second instruction, the first instruction must be issued before the second instruction, i.e., the second instruction can be issued only after the first instruction has been issued.
For example, when the target task is a reduction task, the corresponding instruction stream includes a shuffle instruction (SHFL instruction) and a sum instruction (FADD instruction), and the destination operand of the SHFL instruction is the same as a source operand of the FADD instruction, so the two instructions have a dependency relationship.
In one embodiment, the instruction stream includes a first instruction and a second instruction, the first instruction being triggered before the second instruction; the trigger object of the second instruction further comprises first target operation data obtained by executing the first instruction, wherein the execution time of the first instruction is more than one time period.
In one embodiment, the target task comprises a reduction task; the corresponding first instruction is a shuffle instruction (SHFL instruction) and the second instruction is a sum instruction (FADD instruction). During each cyclic issue of the instruction stream, the computer device first issues the SHFL instruction and then performs the corresponding shuffle operation to obtain the corresponding first target operation data.
In the above embodiment, the instruction stream includes a first instruction and a second instruction whose issue has a dependency relationship: the first instruction is triggered before the second instruction, and the trigger object of the second instruction further includes the first target operation data obtained by executing the first instruction. In this way, the whole instruction stream jointly supports the execution of the target task through the dependencies among different instructions.
Step S206: cyclically triggering the various kinds of target instructions according to the trigger order, and, each time a target instruction is triggered, taking each data set of the more than one data sets in turn as the trigger object, until the triggered target instructions have all been executed to obtain the operation result corresponding to the target task.
Specifically, the computer device may cyclically trigger the various kinds of target instructions in the trigger order, and each time a target instruction is triggered, each data set of the more than one data sets may in turn be treated as the trigger object. That is, the computer device alternately triggers the target instruction for each data set of the more than one data sets and then executes the corresponding target operation, and does not stop issuing target instructions until the triggered target instructions have all been executed to obtain the operation results corresponding to the target task.
In one embodiment, the instruction stream includes a first instruction and a second instruction. For one data set, when the computer device triggers a target instruction taking each element in the data set as the trigger object, the first target operation data obtained after the first instruction in the trigger order has been executed can simultaneously serve as the trigger object of the second instruction in the trigger order.
In one embodiment, the triggering manner of the target instructions is described by taking more than one data set, specifically two data sets, as an example: the computer device may allocate a corresponding thread bundle for the two data sets and configure two sets of registers for the thread bundle, respectively used for storing the two data sets. For convenience of description, the two data sets may be called a first data set and a second data set. One element in the first data set (called a first element) is assigned to a thread in the corresponding thread bundle, and one element in the second data set (called a second element) is assigned to the same thread. The computer device may store the first element and the second element in the corresponding first and second registers, respectively. In this way, in one cycle of the triggering process, each thread in the thread bundle may trigger the first instruction with one element of the first data set as the trigger object, trigger the first instruction with one element of the second data set as the trigger object, then trigger the second instruction with another element of the first data set as the trigger object, and trigger the second instruction with another element of the second data set as the trigger object. Each time a thread triggers a target instruction, it may read the trigger object from the corresponding register and perform the corresponding operation based on that trigger object. The target instructions are triggered cyclically in this way until the triggered target instructions have all been executed to obtain the operation result corresponding to each data set, which also corresponds to the target task.
In one embodiment, one cycle of triggering the target instructions may refer to the following code fragment:
SHFL.XOR R4, R3, 0x10, 0x1F    // first instruction 1: shuffle for the first data set
SHFL.XOR R6, R5, 0x10, 0x1F    // first instruction 2: shuffle for the second data set
FADD R3, R3, R4                // second instruction 1: sum for the first data set
FADD R5, R5, R6                // second instruction 2: sum for the second data set
As can be seen from the above instruction fragment, the destination register R4 of the SHFL instruction for the first data set is used as a source operand of the FADD (FADD R3, R3, R4) two cycles later, and in the intervening cycle one more SHFL instruction for the second data set can be issued, thereby increasing instruction-level parallelism, reducing the instruction-issue stalls caused by operand dependencies, and enabling the instructions to be issued in a pipelined manner.
The following example shows how the instruction triggering manner mentioned in the embodiment of the present application improves processing efficiency. Suppose the FADD can be issued 2 cycles after the SHFL. In the conventional manner, issuing two groups of such instruction streams requires 6 cycles: cycle 1 SHFL, cycle 2 stall, cycle 3 FADD, cycle 4 SHFL, cycle 5 stall, cycle 6 FADD. In the embodiment of the present application, only 4 cycles are required: cycle 1 SHFL (first data set), cycle 2 SHFL (second data set), cycle 3 FADD (first data set), cycle 4 FADD (second data set). This can greatly improve instruction-issue efficiency and thus the processing efficiency for the more than one data sets; the more data sets need to be processed, the more significant the efficiency improvement.
In one embodiment, when each target instruction is triggered, sequentially taking each data set of the more than one data sets as a trigger object specifically includes: in each cycle triggering process, when a first instruction is triggered, each data set in more than one data set is sequentially used as a triggering object, and first target operation data respectively corresponding to each data set is obtained by executing the first instruction; and when the second instruction is triggered, sequentially taking the first target operation data and the corresponding data set respectively corresponding to each data set as trigger objects, and executing the second instruction to obtain second target operation data respectively corresponding to each data set, wherein the second target operation data is used for updating the corresponding data set.
Specifically, in each cyclic triggering pass, when the computer device triggers the first instruction, each data set of the more than one data sets may in turn be taken as the trigger object, and the first target operation data corresponding to each data set is obtained by executing the first instruction. When the second instruction is triggered, the computer device may in turn take the first target operation data corresponding to each data set together with the corresponding data set as the trigger object, obtain the second target operation data corresponding to each data set by executing the second instruction, and update the corresponding data set with the second target operation data. In this way, the next triggering pass can proceed based on the second target operation data obtained in the current pass, and triggering and updating continue pass by pass until the operation results corresponding to the data sets are obtained.
In one embodiment, the instruction stream includes a shuffle instruction and a sum instruction, and the computer device may cross-trigger the shuffle instruction and the sum instruction corresponding to each data set, processing the data sets in parallel until the reduction result corresponding to the reduction task is obtained. For each data set, the computer device can, through the thread corresponding to the data set, read the data in a certain register, trigger the shuffle instruction and execute the shuffle operation, then trigger and execute the sum instruction on the first target operation data of the shuffle operation and the data in the register corresponding to the thread to obtain second target operation data, and replace the original data in the register with the second target operation data. The shuffle operation and the sum operation are executed cyclically in this way until the sum of all the data in the data set, i.e., the reduction result corresponding to the data set, is obtained.
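To make the cross-triggering concrete, the following is a minimal CUDA sketch of this loop for two data sets at once; the function name warpReduceSum2 and the use of __shfl_xor_sync are our illustrative assumptions, not the patent's verbatim kernel:

// Warp-level sum reduction over two independent data sets (rows) at once.
// Interleaving the two rows lets the two shuffles issue back to back, so the
// dependent add of one row overlaps the shuffle latency of the other row.
__device__ void warpReduceSum2(float &a, float &b) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        float ta = __shfl_xor_sync(0xffffffff, a, offset, 32);  // SHFL, data set 1
        float tb = __shfl_xor_sync(0xffffffff, b, offset, 32);  // SHFL, data set 2
        a += ta;                                                // FADD, data set 1
        b += tb;                                                // FADD, data set 2
    }
    // After the loop, every lane holds the full sums of both data sets.
}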
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a comparison between an instruction-issue code fragment in a conventional manner and an instruction-issue code fragment in the present application, in one embodiment. The left side of fig. 4 is the instruction-issue code fragment in the conventional manner. As can be seen from the left side of fig. 4, in the conventional scheme only one data set is processed at a time, and the destination operand R3 of each SHFL instruction is a source operand of the next FADD instruction, so the FADD must wait until the SHFL has completely executed before issuing, which results in low instruction execution efficiency.
In the above embodiment, by adjusting the task parallel granularity, target instructions for other data sets can be issued during the issue-wait periods between different kinds of target instructions for one data set, so the resource occupancy can be balanced and the parallel efficiency of instruction issue is greatly improved.
In one embodiment, a computer device may obtain a target task and determine an instruction stream corresponding to the target task, where the instruction stream includes more than one target instruction determined by a trigger sequence. The computer device may then obtain more than one data set corresponding to the target task in parallel and assign a respective thread block to the more than one data set, wherein one of the elements in each data set is assigned to one of the threads in the thread block. In this way, the computer device can circularly trigger various target instructions according to the triggering sequence through each thread in the thread block, and when each target instruction is triggered, each data set in more than one data set is taken as a triggering object in sequence until each triggered target instruction is executed to obtain an operation result corresponding to the target task.
In one embodiment, the data processing method mentioned in the embodiments of the present application is implemented by running on a GPU, and the target task may specifically be a reduction task. The reduction task may specifically be an operator in a machine learning model; that is, when the computer device performs specific business processing through the machine learning model, reduction operations can be performed during the processing. By performing the corresponding reduction operations with the data processing method mentioned in the embodiments of the present application, the processing efficiency can be greatly improved. The machine learning model may specifically be applied in the field of natural language processing and may be a language representation model, such as a BERT model.
According to the above data processing method, in a scenario where a large amount of data needs to be processed, more than one data set corresponding to the target task is acquired in parallel, and the instruction stream corresponding to the target task is determined, the instruction stream comprising more than one kind of target instruction with a determined trigger order. Then, when the more than one data sets are processed, the various kinds of target instructions can be triggered cyclically according to the trigger order, and each time a target instruction is triggered, each data set of the more than one data sets is taken in turn as the trigger object, until the triggered target instructions have all been executed to obtain the operation result corresponding to the target task. In this way, within one cyclic triggering pass, the same kind of target instruction is triggered alternately for the different data sets, and then the next kind of target instruction is triggered alternately for the different data sets, until every kind of target instruction in the instruction stream has been triggered in turn, thereby realizing pipelined issuing of instructions. The parallel granularity of the target task is thus adjusted by increasing instruction-level parallelism, rather than executing the target task for a single data set in one go, which balances resource occupancy against parallel efficiency. For the more than one data sets as a whole, this parallel, alternating processing mode can greatly improve data processing efficiency.
In one embodiment, performing the target operation on more than one data set is a two-phase process. The following describes in detail how the two-stage process is implemented. Referring to fig. 5, step S206, that is, cyclically triggering various target instructions according to a triggering sequence, and when each target instruction is triggered, sequentially taking each data set of more than one data sets as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to a target task, specifically includes the following steps:
step S502, each data set is divided into more than one group of data subsets, and one of the data subsets in each data set is acquired as a second data group.
In one embodiment, the computer device may divide each data set into more than one set of data subsets, wherein each data subset includes a preset number of elements. The computer device may obtain one of the subsets of data in each of the data sets as a second data set.
For example, the more than one data sets include a first data set and a second data set. Each data set includes 64 elements, and the computer device may treat every 32 elements in each data set as one data subset. That is, the first data set includes 2 first data subsets and the second data set includes 2 second data subsets. The computer device may take one of the first data subsets together with one of the second data subsets as a second data group, i.e., 2 second data groups can be formed.
Step S504, a corresponding thread bundle is respectively allocated to each second data group.
In particular, the computer device may each assign a respective thread bundle to each second data group. Each thread in the thread bundle corresponds to a preset number of registers, and each register in the preset number of registers is used for storing one element in one data subset.
For example, suppose the second data group includes a first data subset and a second data subset, and each data subset includes 32 elements. The computer device may assign a thread bundle to the second data group, where each thread in the bundle corresponds to 2 registers, and the 2 registers of each thread are used to store one element of the first data subset and one element of the second data subset.
In an embodiment, the step of dividing each data set into more than one group of data subsets and acquiring one data subset in each data set as a second data group specifically includes: determining the number of threads corresponding to a thread bundle, dividing each data set into more than one group of data subsets based on that number of threads, and acquiring one of the data subsets in each data set as a second data group. The step of respectively assigning a corresponding thread bundle to each second data group specifically includes: acquiring the same number of thread bundles as there are second data groups, and distributing the second data groups to the thread bundles one by one.
The number of threads is the number of threads included in one thread bundle; for example, if a thread bundle includes 32 threads, the corresponding number of threads is 32. Specifically, the computer device can determine the number of threads corresponding to a thread bundle and divide each data set into more than one group of data subsets based on that number, i.e., every 32 elements of a data set constitute one data subset; if fewer than 32 elements remain at the end, they also form one data subset. Further, the computer device may acquire one of the data subsets in each data set as a second data group. The computer device then acquires the same number of thread bundles as there are second data groups and distributes the second data groups to the thread bundles one by one.
In the above embodiment, the computer device may divide each data set into more than one group of data subsets according to the number of threads corresponding to the thread bundle, so that one of the data subsets in each data set may be commonly allocated to one thread bundle for processing, and each target instruction in the instruction stream may be triggered according to the thread bundle.
Step S506, for each thread bundle, cyclically triggering various target instructions according to the triggering order, and when each target instruction is triggered, sequentially taking each data subset in the second data group corresponding to the thread bundle as a triggering object until the triggered target instructions are executed by the thread bundle to obtain intermediate operation results corresponding to each data subset in the corresponding second data group.
Specifically, each thread in the thread bundle circularly triggers various target instructions according to a triggering sequence, and when each target instruction is triggered, each thread sequentially takes each data subset in the second data group corresponding to the thread bundle as a triggering object until the triggered target instructions are executed by the thread bundle to obtain intermediate operation results corresponding to each data subset in the corresponding second data group.
In one embodiment, an instruction stream includes a first instruction and a second instruction. For each thread bundle, in each loop triggering process, when a thread in the thread bundle triggers the first instruction, one element in each data subset in the second data group may be sequentially used as a trigger object, and the first target operation data corresponding to each data subset is obtained by executing the operation corresponding to the first instruction. When the second instruction is triggered, the computer device may sequentially use, as a trigger object, the first destination operation data corresponding to each data subset and another element in the corresponding data subset, obtain, by executing an operation corresponding to the second instruction, second destination operation data corresponding to each data subset, and update the corresponding data subset with the second destination operation data. And then, the next round of triggering can be carried out in the next round of triggering process based on the second target operation data obtained in the current round, and the triggering and updating are carried out continuously in a round until intermediate operation results corresponding to the data subsets are obtained.
In one embodiment, each thread bundle may select one representative thread to store the intermediate operation result corresponding to each data subset in the corresponding second data group to the corresponding sharing position. In one embodiment, each Warp selects a representative thread to write to the shared memory before performing a synchronization operation __ synchreads (), where the threads in all the bundles wait for each other to have reached the synchronization operation before continuing.
In one embodiment, the computer device may store the intermediate operation results saved at the shared locations into the registers used by the thread bundle in the second stage. In the second stage, the thread bundle may determine the operation result corresponding to each data set based on the intermediate operation results corresponding to the data subsets stored in the registers.
Step S508, determining the operation results corresponding to the data sets according to the intermediate operation result corresponding to each data subset in the second data group corresponding to each thread bundle.
Specifically, the computer device may execute the target operation corresponding to the target task again according to the intermediate operation result corresponding to each data subset in the second data group corresponding to each thread bundle, so as to obtain the operation results corresponding to each data set.
In one embodiment, for each data set, the computer device may obtain intermediate operation results corresponding to each data subset in the data set, trigger various target instructions in the trigger order cyclically based on the intermediate operation results, and execute target operations corresponding to the target instructions after triggering the target instructions until operation results corresponding to the data set are obtained. In this way, the computer device can obtain operation results corresponding to the respective data sets.
In an embodiment, in step S508, that is, the step of determining the operation result corresponding to each data set according to the intermediate operation result corresponding to each data subset in the second data group corresponding to each thread bundle specifically includes: forming an intermediate array corresponding to the corresponding data set by using the intermediate operation result corresponding to each data subset in each data set; and circularly triggering various target instructions according to the triggering sequence, and when each target instruction is triggered, sequentially taking each intermediate array of more than one intermediate array as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to the target task.
Specifically, the computer device may construct an intermediate array corresponding to the corresponding data set from intermediate operation results corresponding to each data subset in each data set. In this way, after each data subset in each data set is subjected to a round of target operation corresponding to the target task, corresponding intermediate operation results can be obtained, and the intermediate operation results corresponding to the data subsets can form a one-dimensional intermediate array. That is, there are a predetermined number of data sets, and accordingly, a predetermined number of intermediate arrays are generated. Furthermore, the computer device can perform the second stage target operation on the preset number of intermediate arrays to obtain the corresponding operation result. That is, the computer device may allocate corresponding thread bundles to the preset number of intermediate arrays, and then cyclically trigger various target instructions according to the trigger sequence through the allocated thread bundles, and when each target instruction is triggered, each intermediate array of more than one intermediate array is sequentially used as a trigger object until each triggered target instruction is executed to obtain an operation result corresponding to the target task.
The following explains the two-stage process by taking the target task as a reduction task and the corresponding target operation as a reduction operation: for each data subset in each data set, the computer device can perform a reduction-sum on the elements stored by the 32 threads of one Warp by using a warpAllReduceSum function, obtaining the intermediate operation result corresponding to that data subset. The computer device can then write the intermediate operation results corresponding to the data subsets into the corresponding shared memories respectively. The computer device can further write the intermediate-result data in each shared memory into the registers of one Warp, and then run the warpAllReduceSum function once more to obtain the reduction result of the reduction operation over the data sets.
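A hedged CUDA sketch of this two-stage process follows; warpAllReduceSum matches the function named above, while blockAllReduceSum, the shared-memory layout and the 32-slot bound are our assumptions for illustration:

// Stage-1 primitive: butterfly reduction across the 32 lanes of a warp.
__device__ float warpAllReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset, 32);
    return v;  // every lane holds the warp-wide sum
}

// Two-stage reduction over a whole thread block, as described above.
__device__ float blockAllReduceSum(float v) {
    __shared__ float partial[32];      // per-warp intermediate results (shared location)
    __shared__ float total;
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    v = warpAllReduceSum(v);           // stage 1: intermediate result of each warp
    if (lane == 0) partial[warp] = v;  // one representative thread writes
    __syncthreads();                   // all warps wait for each other

    if (warp == 0) {                   // stage 2: first warp reduces the intermediates
        int numWarps = (blockDim.x + 31) >> 5;
        float w = (lane < numWarps) ? partial[lane] : 0.0f;  // 0 pads unused lanes
        w = warpAllReduceSum(w);
        if (lane == 0) total = w;
    }
    __syncthreads();
    return total;                      // the operation result for the data set
}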
In the above embodiment, when more than one data set is processed in parallel, the target task may be completed by executing a two-stage loop to trigger a target instruction in the instruction stream and executing a corresponding target operation process. Resource occupancy and parallelism efficiency can be balanced.
In one embodiment, the computer device may further fuse intermediate operation results corresponding to each data subset in a data set in other manners to obtain an operation result corresponding to the data set. For example, the computer device may directly perform a summation operation or a multiplication operation on intermediate operation results corresponding to each data subset in a data set to obtain an operation result corresponding to the target task, which is not limited in this embodiment of the present application.
In an embodiment, in the second-stage warp-reduce process, after the operation result corresponding to each data set is obtained, the thread bundle that processes the intermediate operation results may store each operation result to a designated shared location, which may be a shared memory. After a thread bundle writes to the shared memory, one synchronization operation must be executed before subsequent tasks can continue.
In the above embodiment, when more than one data set is processed in parallel, each data set may be split into data subsets to be processed, and then intermediate operation results corresponding to each data subset are fused to obtain an operation result corresponding to the target task.
In one embodiment, the data processing method further includes a branch-decision step, which specifically includes: when the number of elements in a data subset is less than the number of threads in the thread bundle, determining the valid threads and invalid threads in the thread bundle. A valid thread is a thread to which an element to be processed has been allocated; an invalid thread is a thread to which no element to be processed has been allocated; the elements to be processed are the elements of the data subsets in the second data group. In this case, step 506, that is, the step of cyclically triggering the various target instructions for each thread bundle according to the trigger order, taking each data subset in the thread bundle's second data group in turn as the trigger object whenever a target instruction is triggered, until the thread bundle has executed the triggered target instructions and obtained the intermediate operation results corresponding to the data subsets in its second data group, specifically includes: for each thread bundle, cyclically triggering the various target instructions according to the trigger order; when a valid thread in the thread bundle triggers a target instruction, it uses its allocated element to be processed as the trigger object; when an invalid thread in the thread bundle triggers a target instruction, it uses a preset value as the trigger object.
The preset value is a value that does not affect the operation result. For example, when the target task is a max reduction, the preset value is the smallest value the computer device can represent; when the target task is a sum reduction, the preset value is zero. Specifically, since threads execute in units of thread bundles at run time, when the number of elements in a data subset is smaller than the number of threads in a thread bundle, the threads allocated to the data subset are padded up to a full thread bundle. For example, when a warp-reduce operation is to be performed on each data set, since warp reduce executes in units of 32 threads, the size of the thread block corresponding to the data set must be padded up to a multiple of 32; that is, (blk, leading_dim) is adjusted to (blk, (leading_dim + 31) / 32 × 32), rounding the thread count up to an integer multiple of 32 (for example, leading_dim = 50 is padded to 64). A padded thread may be called an invalid thread, that is, a thread outside the boundary. In other words, a thread in the thread bundle to which an element to be processed is allocated is a valid thread, and a thread to which no element is allocated is an invalid thread.
It will be appreciated that an invalid thread cannot fetch valid data from its register at run time; therefore, the boundary condition requires a branch decision. A thread in the Warp can execute this decision step first: when the thread is a valid thread, it triggers each target instruction directly based on the element stored in its register; when the thread is an invalid thread, it uses a preset value that does not affect the reduction result as the trigger object, for example the value 0 for a sum reduction. This decision step can be implemented with if and else statements, so the if and else instructions are executed once before the reduction operation is performed.
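A sketch of this branch decision follows; the kernel name, data, num_elements, and warp_sums are illustrative assumptions, not taken from the patent:

```cuda
#include <cfloat>

// Valid threads load their assigned element; invalid (out-of-boundary)
// threads use a preset neutral value so the reduction result is unaffected:
// 0 for a sum reduction (use -FLT_MAX for a max reduction).
__global__ void warpReduceWithBoundary(const float* data, int num_elements,
                                       float* warp_sums) {
    int element_idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (element_idx < num_elements) ? data[element_idx] : 0.0f;

    float sum = warpAllReduceSum(v);   // all 32 lanes of the warp participate

    if ((threadIdx.x & 31) == 0)       // one representative lane per warp
        warp_sums[element_idx / 32] = sum;  // warp_sums holds one slot
                                            // per warp launched
}
```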
In a conventional scheme, one branch decision is made before the thread bundle performs the corresponding target operation on one data set, so N branch decisions are made before the corresponding target operations are performed on N data sets (N being a positive integer greater than 1), which brings considerable resource overhead and processing time. In the manner described in the embodiments of the present application, the parallel granularity is adjusted to more than one data set, for example M data sets (M being a positive integer greater than 1 and less than N). One branch decision is then made before the thread bundle performs the corresponding target operations on M data sets, so only N/M branch decisions are made for N data sets. Resource consumption and processing time can therefore be greatly reduced, and processing efficiency improved.
In the above embodiment, since threads execute in units of thread bundles at run time and an invalid thread in a thread bundle has no element to be processed allocated to it, the invalid thread can be assigned a preset value that does not affect the reduction result, which ensures that its participation in the operation does not affect the intermediate operation result corresponding to the data subset.
In one embodiment, the data processing method further includes a synchronization-wait step, which specifically includes: storing the operation result corresponding to each of the more than one data sets to the corresponding shared location; and after the synchronization operation is executed, remaining in a waiting state until the wait-end condition is satisfied and the next target task execution process begins.
The wait-end condition may be that all threads in the thread block participating in the target task have executed up to the synchronization operation. Specifically, the computer device may store the operation result corresponding to each of the more than one data sets to the corresponding designated shared location, which may be a shared memory. After the graphics processor finishes writing to the shared memory, it needs to execute one synchronization operation, that is, the operation corresponding to the __syncthreads() instruction, and enters the next target task execution process only after all other parallel threads have also executed the synchronization operation.
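In CUDA terms, this write-then-synchronize pattern might look like the following kernel fragment; K, dataset_idx, and result are illustrative assumptions:

```cuda
// Fragment inside a kernel: each warp's representative thread writes its
// data set's result to the shared location; __syncthreads() then holds
// every thread until all threads of the block have reached the barrier.
constexpr int K = 2;                    // number of data sets per block (assumed)
__shared__ float shared_results[K];
if ((threadIdx.x & 31) == 0)            // representative lane of the warp
    shared_results[dataset_idx] = result;
__syncthreads();                        // wait-end condition: all threads in
                                        // the thread block reach this barrier
// ...the next target task execution process continues after the barrier...
```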
In the above embodiment, the processing progress of different threads can be coordinated through synchronous operation, so as to better manage and control the execution of the target task.
In one embodiment, the target task includes a reduction task and the operation result includes a reduction result. The data processing method further includes a normalization step, which specifically includes: obtaining in parallel the reduction result corresponding to each of the more than one data sets; and, for each data set, dividing each element in the data set by the reduction result corresponding to that data set to obtain the normalized distribution result corresponding to that data set.
Specifically, the computer device may obtain in parallel the reduction result corresponding to each of the more than one data sets, and then, for each data set, divide each element in the data set by the reduction result corresponding to that data set to obtain the normalization result corresponding to each element. The normalization results corresponding to the elements of a data set constitute the normalized distribution result corresponding to that data set.
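A minimal sketch of this normalization step, assuming the reduction result of each data set (row) has already been written out; the kernel name, row_sums, and n are illustrative:

```cuda
// Each thread divides its elements by the reduction result S of its row,
// yielding the normalized distribution result y_i = x_i / S.
__global__ void normalizeRows(float* x, const float* row_sums, int n) {
    int row = blockIdx.x;                            // one block per data set
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        x[row * n + i] /= row_sums[row];
    }
}
```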
In one embodiment, the computer device issues the various target instructions in the instruction stream through threads and performs the corresponding operations, obtaining the reduction result corresponding to each data set. Following the description above, the reduction result of a data set with elements xⱼ can be written as S = Σⱼ xⱼ. The computer device then stores the reduction result corresponding to each data set in the shared memory, where it can be accessed by other threads after one synchronization operation, and executes one division to obtain the normalization result yᵢ = xᵢ / S for each element.
In the above embodiment, for each data set, each element in the data set is divided by the reduction result corresponding to that data set, thereby obtaining the normalized distribution result corresponding to that data set.
In one embodiment, the data processing method is executed by a graphics processor disposed on a computer device, and the data processing method is applied to a language representation model; the reduction task is a subtask of a normalization task, and the normalization task is one of the computation tasks in the language representation model.
Specifically, the data processing method mentioned in the embodiments of the present application is executed by a graphics processor disposed on a computer device and is applied to a language representation model. When the language representation model performs business processing, a plurality of computation tasks need to be executed, one of which is a normalization task, and the reduction task is a subtask of the normalization task.
For example, a computer device may provide online services by deploying a BERT model, and Softmax is an important operator in the BERT model whose execution includes a reduction subtask. If this Softmax operator runs too long on the GPU, the overall inference task of BERT becomes inefficient. With the data processing method provided by the embodiments of the present application, the reduction task can be executed efficiently, improving the computation efficiency of the Softmax operator, and thereby improving the business processing efficiency of BERT, increasing online response speed, and reducing service latency.
In a specific application scenario, by deploying the BERT model, various online services can be provided, such as a question-answering service, a reading-comprehension service, an information retrieval service, a chatbot service (Dialog System or Chatbot), a text summarization service, or a sentence similarity comparison service. A user may trigger a service request corresponding to a certain service through a user terminal, such as an information retrieval request corresponding to the information retrieval service, and the computer device performs the corresponding business processing and feedback through the BERT model. During business processing, the BERT model has scenarios that require normalization; for example, it needs to normalize feature vectors extracted during intermediate processing. At this point, the feature vectors undergo the reduction operation by the data processing method mentioned in the embodiments of the present application and then a division operation to obtain the normalization result, after which subsequent computation tasks can proceed.
In the above embodiment, the GPU may process the computation tasks in the language representation model in parallel, where the computation tasks include a normalization task and the normalization task includes a reduction task. By processing the batched reduction tasks in an interleaved, parallel manner, the reduction tasks can be executed efficiently, the computation efficiency of the Softmax operator is improved, the business processing efficiency of the language representation model is improved, the online response speed when online services are provided through the language representation model is increased, and service latency is reduced.
In one embodiment, before the target task is acquired, the data processing method further includes: acquiring a service request and generating more than one computation task based on the service request, the target task being among the more than one computation task. After the step of dividing, for each data set, each element in the data set by the reduction result corresponding to that data set to obtain the corresponding normalized distribution result, the data processing method further includes: processing the computation tasks of the subsequent stage based on the normalized distribution results corresponding to the data sets to obtain the business processing result corresponding to the service request; and feeding back the business processing result in response to the service request.
Specifically, the computer device may receive a service request triggered by a user terminal and generate more than one computation task based on it, the more than one computation task including the target task. It can be understood that the service request may specifically be a search request, an image recognition request, or a data acquisition request; the user terminal may trigger different service requests for different online services, which is not limited in the embodiments of the present application. The computer device then generates more than one computation task based on the service request; a computation task may be a feature extraction task, a reduction task, a normalization task, a classification task, or the like, depending on the specific service request, which is likewise not limited in the embodiments of the present application.
Further, for each first data group, the computer device may cyclically trigger the various target instructions in the order in which the instructions are arranged in the instruction stream, taking each data set in the first data group in turn as the trigger object whenever a target instruction is triggered, until the triggered target instructions have all been executed and the reduction result corresponding to the reduction task is obtained. In this way, the computer device obtains the reduction result corresponding to each data set.
Further, the computer device may obtain in parallel the reduction result corresponding to each data set in the first data group and, for each data set, divide each element in the data set by the reduction result corresponding to that data set to obtain the corresponding normalized distribution result. The computation tasks of the subsequent stage are then processed based on the normalized distribution results to obtain the business processing result corresponding to the service request, and the computer device feeds back the corresponding business processing result in response to the service request. For example, when the service request is a search request, the computer device may feed back the search results to the user terminal; when the service request is a data acquisition request, it may feed back the corresponding target data.
In one embodiment, the computer device may obtain the normalized distribution result corresponding to each data set in the input matrix, process the computation tasks of the subsequent stage based on these results to obtain the business processing result corresponding to the service request, and feed the business processing result back to the user terminal.
For example, the computer device may provide a question-answering service, reading-comprehension service, information retrieval service, chatbot service (Dialog System or Chatbot), text summarization service, or sentence similarity comparison service, and the like, based on a machine learning model (such as a BERT model). The user can trigger a service request corresponding to a certain service through the user terminal, such as an information retrieval request corresponding to the information retrieval service that carries a query text. The computer device can then process the candidate contents in the database through the machine learning model: for example, it extracts features from the candidate texts, performs normalization based on the extracted features, and then classifies, obtaining the probability that each candidate text belongs to the target category, so that target texts are selected from the candidates based on these probabilities and fed back to the user terminal.
In one specific application scenario, referring to fig. 6, fig. 6 is a design overview of a Transformer inference engine system adopted by an online service in one embodiment. As shown in fig. 6, the application service may specifically be WeChat Reading, Tencent XiaoWei, the WeChat dialogue open platform, and the like. To provide the application service, the computer device needs to deploy a corresponding hardware structure or platform, such as an Intel-based central processing unit (CPU), an NVIDIA-based graphics processor (GPU), and a container orchestration engine based on the open-source Kubernetes. On top of the hardware platform, implementations of operators may be provided, such as MatMul, Softmax, AddBiasGeLU, AddBiasTranspose, and LookupTable based on the Transformer/BERT Encoder structure (a neural network model structure). These operators constitute a corresponding framework layer, for example one that uses pybind11 (a tool for calling C++ code from Python) to expose the operator library to Python, together with a model-serving layer that loads models and provides query interfaces to the upper-layer application services.
In the above embodiment, after the normalized distribution result corresponding to each data set is obtained, the computation tasks of the subsequent stage are processed based on these results according to the received service request, so that the business processing result corresponding to the service request is obtained and fed back, reducing the latency of the upper-layer service response.
The following describes in detail, by way of comparison, how the data processing method mentioned in the embodiments of the present application improves data processing efficiency.
Taking the Softmax operator in a neural network algorithm as an example, in a conventional scheme such as that of NVIDIA, Softmax computation is typically parallelized as follows. The computer device divides the input matrix into several parts of size blk_size along the high dimension (high dim). Each data block is processed by one thread block, which comprises a number of threads, each thread processing one or more elements. The thread block for a pending data block (blk_size) is dispatched by hardware to an SM of the GPU to run. Within the SM are a number of SP units, each SP running one thread of the thread block. The threads in the thread block are scheduled in units of Warp, the most basic parallel granularity of the GPU; each Warp contains 32 threads, which work together and execute the same instructions.
The trickiest part is how to perform the reduction operation along the leading dim, which requires communication and synchronization of data across different SP registers of the GPU. The traditional method uses a block-reduce algorithm (a reduction processing algorithm), which is a two-stage process. As shown in fig. 7(A), which is an execution schematic of the Softmax operator in a conventional scheme in one embodiment, one arrow indicates one thread. An element of a data set (i.e., one row of leading dim) is stored in a thread's register; 32 threads form a thread bundle, and the threads corresponding to registers 0-31 in fig. 7(A) belong to one thread bundle. In this scheme, the reduction summation of the elements stored by the 32 threads in the Warp can be realized directly through a warpAllReduceSum function, and the result is first written into the shared memory. Since the number of threads in an NVIDIA GPU thread block does not exceed 1024, the number of Warps in a thread block does not exceed 32. The computer device can then load the shared-memory data into a Warp's registers and run the warpAllReduceSum function once more to obtain the reduction result corresponding to the data set. Note that after each Warp selects a representative thread to write to the shared memory, it must execute a __syncthreads() operation and wait until all threads in the thread block have arrived; __syncthreads is very time-consuming, which makes this step expensive.
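For reference, the conventional two-stage block reduce described above is commonly implemented along these lines; a sketch reusing the warpAllReduceSum from earlier, not the patent's verbatim code:

```cuda
// Two-stage block-wide sum reduction: intra-warp shuffle reduction, then a
// shared-memory exchange and a second warp reduction over per-warp partials.
__device__ float blockReduceSum(float val) {
    static __shared__ float shared[32];   // at most 32 warps per thread block
    int lane = threadIdx.x & 31;          // lane index within the warp
    int wid  = threadIdx.x >> 5;          // warp index within the block

    val = warpAllReduceSum(val);          // stage 1: intra-warp reduction
    if (lane == 0) shared[wid] = val;     // representative thread writes
    __syncthreads();                      // the costly barrier noted above

    // Stage 2: the first warp reduces the per-warp partial sums.
    int num_warps = (blockDim.x + 31) >> 5;
    val = (lane < num_warps) ? shared[lane] : 0.0f;
    if (wid == 0) val = warpAllReduceSum(val);
    return val;                           // warp 0 holds the block's sum
}
```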
The problem now solved is the reduction operation blockReduceSum over each row of leading dim elements in this (blk_size, leading_dim) two-dimensional matrix; Softmax is then implemented as blk_size one-dimensional reductions, as shown in fig. 7(A). Within the box is a blockReduce process that reduces the elements stored in each thread's registers; the final reduction result is stored in the register of one thread. Outside the box, the value of that register is stored into the shared memory, where it can be accessed by other threads after one synchronization operation, and one division is executed to obtain the normalized result.
However, the execution efficiency of the above normalization scheme is not optimal, and there is much room for improvement.
First, executing the reduction through the warpAllReduceSum function is inefficient: the destination operand of the shuffle instruction (the __shfl_xor_sync instruction) in the loop is the same as the source operand of the add instruction, so the two instructions have a dependency. After the shuffle instruction is issued, the add instruction cannot issue until the corresponding shuffle operation completes and its destination operand is available; this stalls the instruction pipeline, leading to low instruction execution efficiency.
Next, since warp reduce executes in units of 32 threads, the size of the thread block must be an integer multiple of 32. The registers of threads outside the boundary (also called invalid threads) hold no data; such threads do not participate in the computation and are assigned a preset value that does not affect the reduction result. The boundary condition therefore requires a branch decision, so the Warp's threads must execute the if and else instructions once, and this branch decision incurs non-trivial overhead.
Finally, a __syncthreads() operation is required after each Warp selects a representative thread to write to the shared memory, which incurs a large amount of thread-sync overhead. A loop of length blk_size requires 2 × blk_size thread synchronizations (__syncthreads), a non-trivial synchronization cost.
The present application provides a data processing method that solves the above problems in the conventional scheme and improves the running efficiency of the Softmax operator on the GPU by processing more than one data set in parallel in a single reduction operation. That is, the blk_size loop is partitioned into two nested loops of lengths blk_size/K and K, and the K-length inner loop is unrolled; by allocating K times the shared memory and registers, K data sets are processed in parallel. Fig. 7(B) shows the execution principle of the Softmax operator in the present application, using K = 2 as an example: two data sets are reduced at the same time, and the shuffle and add instructions of the two independent reduction chains are issued alternately, so that while one chain waits for its shuffle result, instructions of the other chain can issue. This interleaving hides instruction latency and avoids pipeline stalls, increases instruction-level parallelism, and reduces the number of branch decisions and synchronization operations by a factor of K, thereby improving execution efficiency.
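The core of the interleaving can be sketched as follows, with K = 2 matching the example in fig. 7(B); this is a hedged illustration rather than the patent's exact kernel. The two reduction chains are independent, so their shuffle and add instructions can overlap in the pipeline:

```cuda
// Interleaved warp all-reduce over two independent data sets. While the
// add of chain a waits on its shuffle result, the shuffle and add of
// chain b can issue, so the pipeline does not stall on the
// shuffle -> add dependency within a single chain.
__device__ __forceinline__ void warpAllReduceSum2(float& a, float& b) {
    for (int mask = 16; mask > 0; mask >>= 1) {
        float ta = __shfl_xor_sync(0xffffffff, a, mask);
        float tb = __shfl_xor_sync(0xffffffff, b, mask); // independent of ta
        a += ta;    // can issue while b's shuffle is still in flight
        b += tb;
    }
}
```

Called once per outer-loop iteration, such a fused reduction also halves the number of branch decisions and __syncthreads() barriers compared with reducing the two data sets one after the other, consistent with the K-fold savings described above.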
The data processing method mentioned in the embodiments of the present application was applied to accelerate the computation of BERT's Softmax operator and obtained significant performance improvements. The performance of Softmax was tested on two typical GPU architectures and compared with conventional approaches: the closed-source deep learning acceleration operator library cuDNN v7 and the open-source implementation of NVIDIA. We tested the inference throughput for sequences of length 10-500 with batch sizes of 1 and 20. Fig. 8(A) compares the Softmax computation implemented by the present application on one processor with the Softmax computation of the conventional schemes, and fig. 8(B) shows the corresponding speedup. As shown in fig. 8(A) and 8(B), on a Tesla P40 (an inference workload processor), the Softmax computation method provided by the embodiments of the present application achieves a speedup of 1.1× to 14× over the cuDNN approach and 1.1× to 4.1× over the NVIDIA approach. Fig. 8(C) compares the Softmax computation implemented by the present application on another processor with the conventional schemes, and fig. 8(D) shows the corresponding speedup. As shown in fig. 8(C) and 8(D), on a Tesla V100 (another inference workload processor), the method achieves a speedup of 1.1× to 6× over cuDNN and 1.2× to 4.6× over NVIDIA.
It should be understood that, although the steps in the flowcharts of fig. 2 and 5 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, these steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in fig. 2 or fig. 5 may comprise multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, a data processing apparatus 900 is provided, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an obtaining module 901, a determining module 902 and an instruction triggering module 903, wherein:
an obtaining module 901, configured to obtain a target task, and obtain more than one data set corresponding to the target task in parallel. A determining module 902, configured to determine an instruction stream corresponding to a target task; more than one target instruction determined by the trigger sequence is included in the instruction stream. The instruction triggering module 903 is configured to cyclically trigger various target instructions according to a triggering sequence, and when each target instruction is triggered, each data set in more than one data set is sequentially used as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to a target task.
In one embodiment, the data processing apparatus 900 further comprises a partitioning module 904, wherein: the obtaining module 901 is further configured to acquire the input matrix to be processed corresponding to the target task; the partitioning module 904 is configured to divide the input matrix into more than one data block along the high-dimensional direction according to a preset partition size; and the partitioning module 904 is further configured to divide the data block into at least one first data group, the first data group including a preset number of data sets, the preset number being greater than one.
In one embodiment, the instruction stream includes a first instruction and a second instruction, the first instruction being triggered before the second instruction; the trigger object of the second instruction further comprises first target operation data obtained by executing the first instruction, wherein the execution time of the first instruction is more than one time period. The instruction triggering module 903 is further configured to, in each cycle triggering process, sequentially use each data set of the more than one data sets as a triggering object when triggering the first instruction, and obtain first target operation data corresponding to each data set by executing the first instruction; and when the second instruction is triggered, sequentially taking the first target operation data and the corresponding data set respectively corresponding to each data set as trigger objects, and executing the second instruction to obtain second target operation data respectively corresponding to each data set, wherein the second target operation data is used for updating the corresponding data set.
In one embodiment, the instruction triggering module 903 is further configured to divide each data set into more than one group of data subsets, and obtain one of the data subsets in each data set as a second data group; allocating corresponding thread bundles to each second data group; for each thread bundle, circularly triggering various target instructions according to a triggering sequence, and when each target instruction is triggered, sequentially taking each data subset in a second data group corresponding to the thread bundle as a triggering object until the triggered target instructions are executed by the thread bundle to obtain intermediate operation results corresponding to the data subsets in the corresponding second data group; and determining the operation results respectively corresponding to the data sets according to the intermediate operation results corresponding to each data subset in the second data group corresponding to each thread bundle.
In one embodiment, the instruction triggering module 903 is further configured to determine the number of threads corresponding to the thread bundle, divide each data set into more than one group of data subsets based on the number of threads, and acquire one of the data subsets in each data set as a second data group. The instruction triggering module is further configured to acquire as many thread bundles as there are second data groups and allocate the second data groups to the thread bundles one by one.
In one embodiment, the instruction triggering module 903 is further configured to form an intermediate array corresponding to the corresponding data set from the intermediate operation result corresponding to each data subset in each data set; and circularly triggering various target instructions according to the triggering sequence, and when each target instruction is triggered, sequentially taking each intermediate array of more than one intermediate array as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to the target task.
In one embodiment, the determining module 902 is further configured to determine a valid thread and an invalid thread in the thread bundle when the number of elements of the data subset is less than the number of threads of the thread bundle; the effective thread is the thread which is distributed with the element to be processed; the invalid thread is a thread which is not allocated with the element to be processed; wherein the element to be processed is an element in each data subset in the second data group;
the instruction triggering module 903 is further configured to, for an active thread in a thread bundle, sequentially use, as a triggering object, an element in each data subset in the correspondingly allocated second data group when each target instruction is triggered; and for the invalid threads in the thread bundle, when each target instruction is triggered, taking a preset numerical value as a corresponding trigger object.
In one embodiment, the data processing apparatus 900 further comprises a storage module 905 and an execution module 906, wherein: the storage module 905 is configured to store the operation result corresponding to each of the more than one data sets to the corresponding shared location; and the execution module 906 is configured to keep a waiting state after the synchronization operation is executed, until the wait-end condition is satisfied and the next target task execution process begins.
In one embodiment, the target task includes a reduction task and the operation result includes a reduction result. The data processing apparatus 900 further comprises a normalization processing module 907, wherein the obtaining module 901 is further configured to obtain in parallel the reduction result corresponding to each of the more than one data sets. The normalization processing module 907 is configured to, for each data set, divide each element in the data set by the reduction result corresponding to that data set to obtain the normalized distribution result corresponding to that data set.
Referring to fig. 10, in one embodiment, the data processing apparatus 900 further includes a traffic processing module 908 and a feedback module 909, wherein: the obtaining module 901 is further configured to obtain a service request, and generate more than one computing task based on the service request; the target task is included in the more than one computing task. The service processing module 908 is configured to perform processing on a subsequent stage of a calculation task based on the normalized distribution result corresponding to each data set, so as to obtain a service processing result corresponding to the service request. And a feedback module 909, configured to feed back a service processing result in response to the service request.
In one embodiment, the data processing apparatus is executed by a graphics processor disposed on a computer device, and the data processing apparatus is applied to a language representation model; the reduction task is a subtask of a normalization task, and the normalization task is one of the computation tasks in the language representation model.
In a scenario where a large amount of data needs to be processed, the data processing apparatus acquires more than one data set corresponding to the target task in parallel and determines the instruction stream corresponding to the target task, the instruction stream including more than one kind of target instruction with a determined trigger order. When the more than one data set is processed, the various target instructions are triggered cyclically according to the trigger order, and each of the data sets is taken in turn as the trigger object whenever a target instruction is triggered, until the triggered target instructions have been executed and the operation result corresponding to the target task is obtained. In this way, within one cycle of triggering, the same kind of target instruction is triggered alternately for different data sets, and then the next kind of target instruction is triggered alternately for different data sets, until every kind of target instruction in the instruction stream has been triggered in sequence, realizing pipelined issuing of instructions. The parallel granularity of the target task is thus adjusted by increasing instruction-level parallelism rather than executing the target task for one data set at a time, balancing resource occupancy against parallel efficiency; for the more than one data set as a whole, the parallel, alternating processing greatly improves data processing efficiency.
For specific limitations of the data processing apparatus, reference may be made to the above limitations of the data processing method, which are not described herein again. The various modules in the data processing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device may specifically be a graphics processor for providing computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as operation results. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A method of data processing, the method comprising:
acquiring a target task, and acquiring more than one data set corresponding to the target task in parallel;
determining an instruction stream corresponding to the target task; the instruction stream comprises more than one target instruction determined by the trigger sequence;
and circularly triggering various target instructions according to the triggering sequence, and when each target instruction is triggered, sequentially taking each data set in more than one data set as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to the target task.
2. The method of claim 1, wherein prior to the concurrently acquiring more than one data set corresponding to the target task, the method further comprises:
acquiring an input matrix to be processed corresponding to the target task;
dividing the input matrix into more than one data block along the high-dimensional direction according to a preset partition size;
dividing the data block into at least one first data group; the first data group comprises a preset number of data sets, the preset number being greater than one.
3. The method of claim 1, wherein the instruction stream includes a first instruction and a second instruction, the first instruction being triggered before the second instruction; the trigger object of the second instruction further comprises first target operation data obtained by executing the first instruction, wherein the execution time of the first instruction is more than one time period.
4. The method according to claim 3, wherein the sequentially taking each of the more than one data sets as a trigger object when triggering each target instruction comprises:
in each cycle triggering process, when the first instruction is triggered, sequentially taking each data set of the more than one data sets as a triggering object, and executing the first instruction to obtain first target operation data respectively corresponding to each data set;
and when the second instruction is triggered, sequentially taking the first target operation data and the corresponding data set respectively corresponding to each data set as trigger objects, and executing the second instruction to obtain second target operation data respectively corresponding to each data set, wherein the second target operation data is used for updating the corresponding data set.
5. The method according to claim 1, wherein the cyclically triggering various target instructions according to the triggering order, and when each target instruction is triggered, sequentially taking each data set of the more than one data sets as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to the target task, comprises:
dividing each data set into more than one group of data subsets, and acquiring one data subset in each data set as a second data group;
each second data group is respectively allocated with a corresponding thread bundle;
for each thread bundle, circularly triggering various target instructions according to the triggering sequence, and when each target instruction is triggered, sequentially taking each data subset in a second data group corresponding to the thread bundle as a triggering object until the triggered target instructions are executed by the thread bundle to obtain intermediate operation results corresponding to the data subsets in the corresponding second data group;
and determining the operation results respectively corresponding to the data sets according to the intermediate operation results corresponding to each data subset in the second data group corresponding to each thread bundle.
6. The method of claim 5, wherein dividing each data set into more than one group of data subsets and obtaining one of the data subsets in each data set as a second data group comprises:
determining the number of threads corresponding to the thread bundle, and dividing each data set into more than one group of data subsets based on the number of threads;
acquiring one data subset in each data set as a second data group;
the allocating a corresponding thread bundle to each of the second data groups includes:
and acquiring the thread bundles with the same number as the second data groups, and distributing the second data groups to the thread bundles one by one.
7. The method of claim 5, wherein determining the operation result corresponding to each data set according to the intermediate operation result corresponding to each data subset in the second data set corresponding to each thread bundle comprises:
forming an intermediate array corresponding to the corresponding data set by using the intermediate operation result corresponding to each data subset in each data set;
and circularly triggering various target instructions according to the triggering sequence, and when each target instruction is triggered, sequentially taking each intermediate array of the more than one intermediate arrays as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to the target task.
8. The method of claim 5, further comprising:
when the number of elements of the data subset is less than the number of threads of the thread bundle, determining valid threads and invalid threads in the thread bundle; the effective thread is a thread which is distributed with elements to be processed; the invalid thread is a thread which is not allocated with an element to be processed; wherein the element to be processed is an element in each data subset in the second data group;
when each target instruction is triggered, sequentially taking each data subset in the second data group corresponding to the thread bundle as a trigger object, including:
for the effective threads in the thread bundle, when each target instruction is triggered, sequentially taking the elements in each data subset in the second data group which are correspondingly distributed as trigger objects;
and for the invalid threads in the thread bundle, when each target instruction is triggered, taking a preset numerical value as a corresponding trigger object.
9. The method of claim 1, further comprising:
storing the operation result corresponding to each data set in the more than one data sets to the corresponding sharing position respectively;
and after the synchronous operation is executed, keeping a waiting state until a waiting ending condition is met, and entering a next target task execution process.
10. The method of any of claims 1-9, wherein the target task comprises a reduction task, and the operation result comprises a reduction result; the method further comprising:
obtaining, in parallel, the reduction result corresponding to each data set in the more than one data sets;
and for each data set, dividing each element in the data set by the reduction result corresponding to the corresponding data set to obtain a normalized distribution result corresponding to the corresponding data set.
11. The method of claim 10, wherein prior to obtaining the target task, the method further comprises:
acquiring a service request, and generating more than one computing task based on the service request; the more than one computing tasks comprise target tasks;
after dividing each element in the data set by the reduction result corresponding to the corresponding data set to obtain the normalized distribution result corresponding to the corresponding data set, for each data set, the method further includes:
processing the calculation task of the subsequent stage based on the normalized distribution result corresponding to each data set to obtain a service processing result corresponding to the service request;
and responding to the service request to feed back the service processing result.
12. The method of claim 10, wherein the method is performed by a graphics processor deployed on a computer device, and wherein the method is applied to a language representation model; the reduction task is a subtask of a normalization task, and the normalization task is one of the computation tasks in the language representation model.
13. A data processing apparatus comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a target task and acquiring more than one data set corresponding to the target task in parallel;
the determining module is used for determining an instruction stream corresponding to the target task; the instruction stream comprises more than one target instruction determined by the trigger sequence;
and the instruction triggering module is used for triggering various target instructions circularly according to the triggering sequence, and when each target instruction is triggered, each data set in more than one data set is sequentially used as a triggering object until each triggered target instruction is executed to obtain an operation result corresponding to the target task.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 12.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 12.
CN202010290113.5A 2020-04-14 2020-04-14 Data processing method, data processing device, computer equipment and storage medium Pending CN111488177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010290113.5A CN111488177A (en) 2020-04-14 2020-04-14 Data processing method, data processing device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010290113.5A CN111488177A (en) 2020-04-14 2020-04-14 Data processing method, data processing device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111488177A true CN111488177A (en) 2020-08-04

Family

ID=71794884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010290113.5A Pending CN111488177A (en) 2020-04-14 2020-04-14 Data processing method, data processing device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111488177A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100508A (en) * 2020-11-16 2020-12-18 智者四海(北京)技术有限公司 Method and device for distributing questions to users
CN112948387A (en) * 2021-03-04 2021-06-11 北京深演智能科技股份有限公司 Data processing method, data processing device, storage medium and processor
CN113778518A (en) * 2021-08-31 2021-12-10 中科曙光国际信息产业有限公司 Data processing method, data processing device, computer equipment and storage medium
CN113778518B (en) * 2021-08-31 2024-03-26 中科曙光国际信息产业有限公司 Data processing method, device, computer equipment and storage medium
WO2023077436A1 (en) * 2021-11-05 2023-05-11 Nvidia Corporation Thread specialization for collaborative data transfer and computation
CN115934768A (en) * 2022-12-01 2023-04-07 摩尔线程智能科技(北京)有限责任公司 Data processing method, display adapter, electronic device and storage medium
CN116483536A (en) * 2023-04-24 2023-07-25 上海芷锐电子科技有限公司 Data scheduling method, computing chip and electronic equipment
CN116483536B (en) * 2023-04-24 2024-05-10 上海芷锐电子科技有限公司 Data scheduling method, computing chip and electronic equipment
CN117093270A (en) * 2023-08-18 2023-11-21 摩尔线程智能科技(北京)有限责任公司 Instruction sending method, device, equipment and storage medium
CN118034785A (en) * 2024-04-11 2024-05-14 清华大学 Instruction compression method, device, accelerator and storage medium
CN118034785B (en) * 2024-04-11 2024-06-11 清华大学 Instruction compression method, device, accelerator and storage medium
CN118312766A (en) * 2024-06-06 2024-07-09 鼎道智芯(上海)半导体有限公司 Data processing method and device

Similar Documents

Publication Publication Date Title
CN111488177A (en) Data processing method, data processing device, computer equipment and storage medium
US8400458B2 (en) Method and system for blocking data on a GPU
US20200042856A1 (en) Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
CN110674936A (en) Neural network processing method and device, computer equipment and storage medium
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN111651207B (en) Neural network model operation chip, method, device, equipment and medium
CN109918184A (en) Picture processing system, method and relevant apparatus and equipment
CN110689121A (en) Method for realizing neural network model splitting by using multi-core processor and related product
KR102479264B1 (en) Dynamic batching for inference system for transformer-based generation tasks
US11875425B2 (en) Implementing heterogeneous wavefronts on a graphics processing unit (GPU)
US12079734B1 (en) Compilation time reduction for memory and compute bound neural networks
Finis Anderson et al. Applying parallel design techniques to template matching with GPUs
CN115860066A (en) Neural network reasoning pipeline multiplexing method based on batch processing
CN102831102A (en) Method and system for carrying out matrix product operation on computer cluster
Souza et al. Online multimedia retrieval on CPU–GPU platforms with adaptive work partition
CN118313458A (en) Data processing method, data processor, electronic device, and storage medium
Wang et al. Parallelizing convolutional neural networks for action event recognition in surveillance videos
Peng et al. Adaptive runtime exploiting sparsity in tensor of deep learning neural network on heterogeneous systems
Sharma et al. NETRA: An architecture for a large scale multiprocessor vision system
Zhou et al. Training and Serving System of Foundation Models: A Comprehensive Survey
US11960982B1 (en) System and method of determining and executing deep tensor columns in neural networks
Wei et al. Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
KR20230084103A (en) Selective batching for inference system for transformer-based generation tasks
CN113535349A (en) Data batch processing method and device and storage medium
Silva et al. An efficient GPU parallelization of the Jaya optimization algorithm and its application for solving large systems of nonlinear equations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40027467

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination