CN114742035B - Text processing method and network model training method based on attention mechanism optimization - Google Patents

Text processing method and network model training method based on attention mechanism optimization

Info

Publication number
CN114742035B
CN114742035B (application CN202210555349.6A)
Authority
CN
China
Prior art keywords
sentence
sample
text
attention
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210555349.6A
Other languages
Chinese (zh)
Other versions
CN114742035A (en)
Inventor
李敏
曾锦乐
吴志华
蓝翔
邢冯
刘益群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210555349.6A priority Critical patent/CN114742035B/en
Publication of CN114742035A publication Critical patent/CN114742035A/en
Priority to PCT/CN2022/135493 priority patent/WO2023221454A1/en
Application granted granted Critical
Publication of CN114742035B publication Critical patent/CN114742035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text processing method based on attention mechanism optimization, a network model training method, an apparatus, a device, a medium and a program product, and relates to the technical field of artificial intelligence, in particular to the technical fields of natural language processing and deep learning. The specific implementation scheme comprises the following steps: dividing M text sentences in the text to be processed to obtain N sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups comprises at least one text sentence; determining an attention distribution characteristic of at least one text sentence in each sentence group; and carrying out parallel operation based on the attention distribution characteristics of each sentence group to obtain an output result for the text to be processed.

Description

Text processing method and network model training method based on attention mechanism optimization
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of natural language processing and deep learning, and can be applied to scenes such as text processing.
Background
Text processing has wide application in language understanding tasks, question and answer tasks, machine translation, natural language reasoning, and the like. However, in some scenarios, text processing suffers from low processing efficiency and poor utilization of computing resources.
Disclosure of Invention
The disclosure provides a text processing method based on attention mechanism optimization, a network model training method, an apparatus, a device, a medium and a program product.
According to an aspect of the present disclosure, there is provided a text processing method based on attention mechanism optimization, including: dividing M text sentences in a text to be processed to obtain N sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups comprises at least one text sentence; determining an attention distribution characteristic of at least one text sentence in each sentence group; and carrying out parallel operation based on the attention distribution characteristics of each sentence group to obtain an output result for the text to be processed.
According to another aspect of the present disclosure, there is provided a network model training method based on attention mechanism optimization, including: dividing M sample sentences in a sample to be processed to obtain N sample sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, and each sample sentence group of the N sample sentence groups comprises at least one sample sentence; using the N sample sentence groups as input data of a target network model to be trained to obtain the attention distribution characteristics of at least one sample sentence in each sample sentence group; performing parallel operation based on the attention distribution characteristics of each sample sentence group to obtain an output result for the sample to be processed; and adjusting model parameters of the target network model to be trained according to the output result and a preset result label to obtain a trained target network model.
According to another aspect of the present disclosure, there is provided a text processing apparatus based on attention mechanism optimization, including: a first processing module, configured to divide M text sentences in the text to be processed to obtain N sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups comprises at least one text sentence; a second processing module, configured to determine an attention distribution characteristic of at least one text sentence in each of the sentence groups; and a third processing module, configured to perform parallel operation based on the attention distribution characteristics of each sentence group to obtain an output result for the text to be processed.
According to another aspect of the present disclosure, there is provided a network model training apparatus based on attention mechanism optimization, including: a fourth processing module, configured to divide M sample sentences in a sample to be processed to obtain N sample sentence groups, where N is an integer greater than 0 and M is an integer not less than N, where each sample sentence group of the N sample sentence groups includes at least one sample sentence; a fifth processing module, configured to use the N sample sentence groups as input data of a target network model to be trained, to obtain attention distribution characteristics of at least one sample sentence in each sample sentence group; a sixth processing module, configured to perform parallel operation based on the attention distribution characteristics of each sample sentence group, to obtain an output result for the sample to be processed; and a seventh processing module, configured to adjust model parameters of the target network model to be trained according to the output result and a preset result label, to obtain a trained target network model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text processing method or the network model training method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described text processing method or network model training method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described text processing method or network model training method.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a system architecture of a text processing method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a text processing method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a text processing method according to yet another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a network model training method according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a process diagram of determining an attention computing function according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a schematic diagram of a text processing process according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a text processing apparatus according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a network model training apparatus according to an embodiment of the present disclosure;
fig. 9 schematically illustrates a block diagram of an electronic device for text processing according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression such as "at least one of A, B and C" is used, it should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Embodiments of the present disclosure provide a text processing method based on attention mechanism optimization. The method comprises the following steps: dividing M text sentences in the text to be processed to obtain N sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, each sentence group of the N sentence groups comprises at least one text sentence, the attention distribution characteristics of at least one text sentence in each sentence group are determined, and parallel operation is performed based on the attention distribution characteristics of each sentence group to obtain an output result aiming at the text to be processed.
Fig. 1 schematically illustrates a system architecture of a text processing method and apparatus based on attention mechanism optimization according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
The system architecture 100 according to this embodiment may include a requesting terminal 101, a network 102, and a server 103. The network 102 is used as a medium for providing a communication link between the requesting terminal 101 and the server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others. The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, network service, and middleware service.
The requesting terminal 101 interacts with the server 103 through the network 102 to receive or transmit data and the like. The requesting terminal 101 is used, for example, for initiating a text processing request to the server 103, and is also used, for example, for sending the text to be processed to the server 103.
The server 103 may be a server providing various services, and may be, for example, a background processing server (merely an example) that performs text processing in accordance with a text processing request transmitted by the requesting terminal 101.
For example, in response to a text processing request acquired from the requesting terminal 101, the server 103 divides M text sentences in the text to be processed to obtain N sentence groups, where N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups includes at least one text sentence; determines the attention distribution characteristics of at least one text sentence in each sentence group; performs parallel operation based on the attention distribution characteristics of each sentence group to obtain an output result for the text to be processed; and returns the output result to the requesting terminal 101.
It should be noted that the text processing method provided by the embodiment of the present disclosure may be executed by the server 103. Accordingly, the text processing apparatus provided by the embodiments of the present disclosure may be provided in the server 103. The text processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 103 and is capable of communicating with the requesting terminal 101 and/or the server 103. Accordingly, the text processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 103 and is capable of communicating with the requesting terminal 101 and/or the server 103.
It should be understood that the number of requesting terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of requesting terminals, networks, and servers, as desired for implementation.
Embodiments of the present disclosure provide a text processing method based on attention mechanism optimization, and a text processing method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 3 in conjunction with the system architecture of fig. 1. The text processing method of the embodiment of the present disclosure may be performed by the server 103 shown in fig. 1, for example.
Fig. 2 schematically illustrates a flow chart of a text processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the text processing method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S230.
In operation S210, dividing M text sentences in the text to be processed to obtain N sentence groups, where N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups includes at least one text sentence.
In operation S220, an attention distribution characteristic of at least one text sentence in each sentence group is determined.
In operation S230, parallel operation is performed based on the attention distribution characteristics of each sentence group, resulting in an output result for the text to be processed.
An example flow of each operation of the text processing method of this embodiment is illustrated below.
Illustratively, M text sentences in the text to be processed are divided to obtain N sentence groups. In one exemplary manner, the M text sentences are divided according to the character sequence length of each text sentence in the M text sentences to obtain N sentence groups. N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups comprises at least one text sentence. Each sentence group corresponds to a preset character sequence length interval, and the character sequence length interval can be determined by a character sequence length threshold, for example.
Dividing the M text sentences into N sentence groups according to the character sequence length of the text sentences improves the efficiency of the attention distribution operation on the text to be processed, which effectively improves both the text processing efficiency and the text processing effect.
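For illustration only, a minimal Python sketch of this grouping step is given below; the interval bounds, function name and sample data are assumptions introduced for demonstration and are not prescribed by the disclosure.

```python
from collections import defaultdict

# Illustrative interval upper bounds; the actual thresholds are a design choice, not fixed by the disclosure.
LENGTH_BOUNDS = [128, 256, 384, 512]

def group_sentences_by_length(sentences):
    """Divide M text sentences into sentence groups keyed by character-sequence-length interval."""
    groups = defaultdict(list)
    for sentence in sentences:
        # Smallest interval upper bound that can hold this sentence; sentences longer than the
        # last bound fall into the last group (truncation/padding is out of scope here).
        bound = next((b for b in LENGTH_BOUNDS if len(sentence) <= b), LENGTH_BOUNDS[-1])
        groups[bound].append(sentence)
    return dict(groups)

sentences = ["short sentence", "a" * 200, "b" * 40, "c" * 500]
for bound, members in sorted(group_sentences_by_length(sentences).items()):
    print(f"group with cur_s={bound}: {len(members)} sentence(s)")
```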
An attention distribution characteristic of at least one text sentence in each sentence group is determined. For example, an attention operation function that matches each sentence group may be determined, and the attention operation functions matched with the sentence groups are executed in parallel to obtain the attention distribution characteristic of at least one text sentence in each sentence group. For example, the attention distribution characteristics between characters of each text sentence are determined from the character features of each text sentence in the corresponding sentence group using the attention operation function matched with each sentence group.
The attention operation function may be, for example, a kernel function, which may include, for example, a matrix multiplication function, a matrix point multiplication function, a matrix row average function, a matrix row variance function, and the like. The character features of each text sentence may include, for example, character encoding features and character position features, and the attention distribution features between the characters of each text sentence in the corresponding sentence group may be determined using a kernel function that matches each sentence group.
The output result for the text to be processed can be obtained according to the character features of each text sentence in the N sentence groups and the attention distribution features among the characters. The output results may include, for example, processing results in language understanding tasks, question and answer tasks, machine translation, natural language reasoning, text prediction, and the like. For example, the output result may be a translation result for the text to be processed, or may be a semantic understanding result for the text to be processed, which is not limited in this embodiment.
According to the embodiments of the present disclosure, M text sentences in the text to be processed are divided to obtain N sentence groups, the attention distribution characteristics of at least one text sentence in each sentence group are determined, and parallel operation is performed based on the attention distribution characteristics of each sentence group to obtain an output result for the text to be processed. By grouping the text sentences in the text to be processed and determining the attention distribution characteristics of at least one text sentence in each sentence group, the text processing efficiency can be effectively improved, and the text processing effect can be effectively ensured. Redundant calculation in the text processing process can be effectively reduced, and the utilization rate of computing resources in the text processing process is improved.
Fig. 3 schematically illustrates a flow chart of a text processing method according to another embodiment of the present disclosure.
As shown in fig. 3, the text processing method 300 of the embodiment of the present disclosure may include, for example, operations S210, operations S310 to S320, and operation S230.
In operation S210, dividing M text sentences in the text to be processed to obtain N sentence groups, where N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups includes at least one text sentence.
In operation S310, an attention operation function matching each sentence group is determined.
In operation S320, for the target sentence group, the attention distribution characteristics between characters of each text sentence are determined from character characteristics of each text sentence in the target sentence group using the attention operation function matched with the target sentence group, the target sentence group being any sentence group of the N sentence groups.
In operation S230, parallel operation is performed based on the attention distribution characteristics of each sentence group, resulting in an output result for the text to be processed.
An example flow of each operation of the text processing method of the present embodiment is illustrated below.
Illustratively, the M text sentences are divided according to the character sequence length of each text sentence in the M text sentences and according to a preset mapping relation between each sentence group and a character sequence length interval, so as to obtain N sentence groups.
An attention operation function that matches each of the N sentence groups is determined. In one example, a kernel function matching the character sequence length interval may be determined as the attention operation function according to the character sequence length interval corresponding to each sentence group.
Illustratively, the attention operation functions matched with the N sentence groups are executed in parallel, resulting in the attention distribution characteristics of at least one text sentence in each sentence group. For a target sentence group among the N sentence groups, a target thread block for executing the attention distribution operation is determined according to the target kernel function matched with the target sentence group, and the target kernel function is executed in parallel by at least one thread in the target thread block, so as to obtain the attention distribution characteristics from the character features of each text sentence in the target sentence group.
According to the character sequence length intervals corresponding to the sentence groups, a kernel function matched with the character sequence length intervals is determined to serve as an attention operation function, so that the attention distribution operation speed of a text processing process is improved, the text processing efficiency can be effectively improved, and the computing resource utilization rate of the text processing process can be effectively improved.
Taking the case where the attention operation function is a kernel function as an example, a kernel function is an operation function in CUDA (Compute Unified Device Architecture) that can be executed in parallel. Kernel functions are organized in the form of a thread grid (Grid); a single thread grid may include multiple thread blocks (Blocks), and a single thread block may include multiple threads (Threads). For example, a single thread block may include 128, 256, 384, 512, or 1024 threads. A thread block may be used as an execution unit of a kernel function, where the threads in the thread block share the same memory resources.
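For illustration, the following sketch shows the grid/block/thread organization described above using Numba's CUDA bindings; Numba, the kernel body and the block size of 256 are assumptions made only to demonstrate the launch configuration and are not part of the disclosed method.

```python
import numpy as np
from numba import cuda  # assumed tooling; any CUDA-capable stack exposes the same grid/block/thread model

@cuda.jit
def scale_kernel(x, out):
    # Each thread processes one element; threads within a block share the block's memory resources.
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] * 2.0

x = np.arange(1024, dtype=np.float32)
out = np.zeros_like(x)
threads_per_block = 256                                     # one of the block sizes mentioned above
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block
scale_kernel[blocks_per_grid, threads_per_block](x, out)    # launch over a thread grid of thread blocks
```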
Illustratively, a kernel function matching the character sequence length interval is determined as the attention operation function matching the corresponding sentence group according to the character sequence length interval corresponding to each sentence group. The character sequence length intervals may include, for example, intervals of the types (0, 128], (128, 256], (256, 384], (384, 512] and the like, which is not limited in this embodiment.
And determining the attention distribution characteristics among characters of each text sentence according to the character characteristics of each text sentence in the target sentence group by using a kernel function matched with the target sentence group. The character features of each text sentence may include, for example, character encoding features and character position features. The kernel functions may include, for example, matrix multiplication functions, matrix point multiplication functions, matrix row average functions, matrix row variance functions, and the like.
For any target text sentence, the query feature matrix of the target text sentence may be determined from the character feature matrix and the first parameter matrix of the target text sentence using a matrix multiplication function; the key feature matrix may be determined from the character feature matrix and the second parameter matrix; and the value feature matrix may be determined from the character feature matrix and the third parameter matrix.
The attention evaluation matrix of the target text sentence is obtained from the query feature matrix and the key feature matrix of the target text sentence using a matrix point multiplication function. For example, the query feature matrix may be partitioned into blocks along the row direction based on a preset sliding window and a preset sliding step length to obtain at least one query feature sub-matrix, and the attention evaluation matrix of the target text sentence is obtained from the at least one query feature sub-matrix and the key feature matrix using a matrix point multiplication function. The attention evaluation matrix obtained after the division operation can be normalized using a matrix row variance function to obtain an attention weight matrix.
The attention distribution matrix based on the self-attention mechanism can be obtained according to the value characteristic matrix and the attention weight matrix of the target text sentence by utilizing a matrix point multiplication function. The attention distribution matrix may indicate a degree of relatedness between any character in the target text sentence and other characters.
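The three preceding paragraphs describe a scaled dot-product self-attention computation. A minimal NumPy sketch follows; the softmax-style normalization and the division by the square root of the feature dimension are assumptions about how the row-wise normalization and the division operation are realized, and the parameter matrices are random placeholders.

```python
import numpy as np

def self_attention(char_features, w_q, w_k, w_v):
    """char_features: (seq_len, d) character feature matrix of one target text sentence."""
    q = char_features @ w_q                  # query feature matrix (matrix multiplication function)
    k = char_features @ w_k                  # key feature matrix
    v = char_features @ w_v                  # value feature matrix
    scores = q @ k.T                         # attention evaluation matrix (point multiplication function)
    scores = scores / np.sqrt(q.shape[-1])   # assumed realization of the "division operation"
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise normalization -> attention weight matrix
    return weights @ v                       # attention distribution matrix

d = 64
x = np.random.randn(40, d)                                   # a sentence of 40 characters
w_q, w_k, w_v = (np.random.randn(d, d) for _ in range(3))    # first/second/third parameter matrices
attn = self_attention(x, w_q, w_k, w_v)
print(attn.shape)  # (40, 64)
```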
Parallel operation is then carried out according to the character features and the attention distribution characteristics of at least one text sentence of each sentence group to obtain an output result for the text to be processed, where the output result may include, for example, output text and a text confidence probability.
By determining an attention operation function that matches each sentence group and using it to determine the attention distribution characteristics of at least one text sentence in the corresponding sentence group, the complexity and redundant calculation of the attention distribution operation can be effectively reduced, and the efficiency of the attention distribution operation can be effectively improved. The method can effectively improve the utilization rate of computing resources, effectively improve the text processing efficiency, and provide credible data support for language understanding tasks, question-answering tasks, machine translation, natural language reasoning, text prediction and other tasks.
Fig. 4 schematically illustrates a flow chart of a network model training method according to an embodiment of the present disclosure.
As shown in FIG. 4, training method 400 may include operations S410-S440, for example.
In operation S410, dividing M sample sentences in the samples to be processed to obtain N sample sentence groups, where M is an integer not less than N, N is an integer greater than 0, and each sample sentence group of the N sample sentence groups includes at least one sample sentence.
In operation S420, the N sample sentence groups are used as input data of the target network model to be trained, and the attention distribution characteristics of at least one sample sentence in each sample sentence group are obtained.
In operation S430, parallel operation is performed based on the attention distribution characteristics of each sample sentence group, to obtain an output result for the sample to be processed.
In operation S440, the model parameters of the target network model to be trained are adjusted according to the output result and the preset result label, so as to obtain a trained target network model.
An example flow of each operation of the model training method of the present embodiment is illustrated below.
For example, the M sample sentences may be divided according to the character sequence length of each sample sentence in the M sample sentences, to obtain N sample sentence groups, where each sample sentence group corresponds to a preset character sequence length interval.
An attention operation function matching each sample sentence group can be determined. For example, a kernel function matching the character sequence length interval may be determined as the attention operation function based on the character sequence length interval corresponding to each sample sentence group. The attention operation functions matched with the sample sentence groups can be executed in parallel to obtain the attention distribution characteristics of at least one sample sentence in each sample sentence group.
Illustratively, for a target sample sentence group, the attention operation function matching the target sample sentence group is invoked by the target network model to be trained, and the attention distribution characteristics among the characters of each sample sentence are determined from the character features of each sample sentence in the target sample sentence group using the attention operation function. The target sample sentence group may be any sample sentence group of the N sample sentence groups.
Using the target network model to be trained, an output result for the sample to be processed is obtained according to the character features and the attention distribution characteristics of at least one sample sentence in each sample sentence group. A loss function value is determined according to the output result for the sample to be processed and a preset result label, and the model parameters of the target network model to be trained are adjusted according to the loss function value to obtain the trained target network model.
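For illustration, a sketch of one training step is given below using PyTorch-style APIs; the framework choice, the cross-entropy loss, and the model and optimizer objects are assumptions for demonstration and are not specified by the disclosure.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, sample_groups, labels):
    """sample_groups: one padded tensor per sample sentence group; labels: preset result labels."""
    model.train()
    outputs = [model(group) for group in sample_groups]   # each group uses its matched attention kernel inside the model
    output = torch.cat(outputs, dim=0)                    # output result for the sample to be processed
    loss = nn.functional.cross_entropy(output, labels)    # compare output result with the preset result label
    optimizer.zero_grad()
    loss.backward()                                       # backpropagation of the loss function value
    optimizer.step()                                      # adjust the model parameters of the target network model
    return loss.item()
```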
Illustratively, the target network model may include a plurality of encoder layers connected in sequence and a plurality of decoder layers connected in sequence, and an implicit layer vector may be transferred between a last encoder layer and each decoder layer. Each encoder layer may include at least a self-attention mechanism layer and a feedforward neural network layer, and each decoder layer may include at least a self-attention mechanism layer and a feedforward neural network layer.
The encoder layer can be utilized to encode the M sample sentences in the sample to be processed, so that each sample sentence is mapped from a natural language representation into a numerical vector, obtaining the character features of each sample sentence. The self-attention mechanism layer can be utilized to obtain the attention distribution characteristics among the characters of the corresponding sample sentence from the character features of each sample sentence. The character features and the attention distribution features of each sample sentence may be passed to the decoder layer as an implicit layer vector, and the decoder layer processes the implicit layer vector to obtain an output result for the sample to be processed. During processing at the encoder and/or decoder layers, a kernel function may be called for data operations, for example to perform mathematical-level calculations on the data.
A text processing model may be derived based on the trained target network model. For example, the text to be processed may be used as input data of a text processing model, so as to obtain character features of each text sentence in the text to be processed. And calling an attention operation function matched with at least one sentence group in the text to be processed through the text processing model to obtain the attention distribution characteristic of at least one text sentence in each sentence group. Each sentence group corresponds to a preset character sequence length interval, and the attention operation function is determined according to the character sequence length interval corresponding to each sentence group. And obtaining an output result aiming at the text to be processed according to the character characteristics and the attention distribution characteristics of each text sentence by using the text processing model.
According to the embodiment of the disclosure, sample sentences of samples to be processed are divided to obtain at least one sample sentence group, attention distribution characteristics of at least one sample sentence in each sample sentence group are determined by utilizing a target network model to be trained, and parallel operation is performed based on the attention distribution characteristics of each sample sentence group to obtain an output result aiming at the samples to be processed. The method can effectively improve the operation efficiency of the attention distribution operation, can effectively improve the convergence rate of the network model training, effectively ensure the generalization performance of the trained target network model, is beneficial to improving the text processing efficiency, and can provide credible data support for diversified natural language processing tasks.
Fig. 5 schematically illustrates a process diagram of determining an attention computing function according to an embodiment of the present disclosure.
Assume that a training sample batch has a batch size of n, where the batch size indicates the number of samples passed to the program for network model training at a single time. As shown in fig. 5, taking n=6 as an example, the training sample batch includes 6 sample sentences, and the character sequence lengths of the 6 sample sentences are 40, 120, 178, 200, 340 and 512, respectively.
The sample sentences in the training sample batch can be divided according to the character sequence length of each sample sentence to obtain at least one sample sentence group, and each sample sentence group corresponds to a preset character sequence length interval respectively. For example, the n sample sentences may be divided according to a character sequence length section to which the character sequence lengths of the n sample sentences belong, to obtain at least one sample sentence group. The number of sample sentences in each sample sentence group may be different, and there may be a difference in the actual address of the input data corresponding to each sample sentence group.
For example, the sample sentences having character sequence lengths of 40, 120, 178, 200, 340 and 512 are divided to obtain 4 sample sentence groups. The character sequence length intervals corresponding to the respective sample sentence groups are (0, 128], (128, 256], (256, 384] and (384, 512].
A kernel function matched with the character sequence length interval is determined as the attention operation function according to the character sequence length interval corresponding to each sample sentence group. For example, the kernel functions matching the respective sample sentence groups may include the kernel functions fmha_128_kernel, fmha_256_kernel, fmha_384_kernel and fmha_512_kernel.
The sample sentence group corresponding to the kernel function fmha_128_kernel has batch_size=2 and a maximum character sequence length cur_s=128; the group corresponding to fmha_256_kernel has batch_size=2 and cur_s=256; the group corresponding to fmha_384_kernel has batch_size=1 and cur_s=384; and the group corresponding to fmha_512_kernel has batch_size=1 and cur_s=512.
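The example above can be reproduced with the short sketch below; the kernel names follow fig. 5, while the selection logic itself is only an assumed way of expressing the mapping from length interval to kernel.

```python
lengths = [40, 120, 178, 200, 340, 512]   # character sequence lengths of the 6 sample sentences

KERNELS = {128: "fmha_128_kernel", 256: "fmha_256_kernel",
           384: "fmha_384_kernel", 512: "fmha_512_kernel"}

for cur_s, kernel in sorted(KERNELS.items()):
    members = [n for n in lengths if cur_s - 128 < n <= cur_s]   # sentences falling into this length interval
    print(f"{kernel}: batch_size={len(members)}, cur_s={cur_s}")
# Expected batch sizes for the four groups: 2, 2, 1, 1.
```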
The method can effectively improve the operation speed of the attention distribution operation, is beneficial to improving the text processing efficiency and effectively improves the utilization rate of computing resources in the text processing process.
Fig. 6 schematically illustrates a schematic diagram of a text processing process according to an embodiment of the present disclosure.
For example, whether execution of a monitored function is completed may be determined based on a preset monitoring event. In response to completion of execution of the monitored function, the attention distribution characteristics are determined from the character features of each text sentence in the target sentence group using the attention operation function. The monitored function includes an operation function whose execution order precedes the attention operation function.
Multiple attention operation functions that have no data processing dependency on each other may be executed in parallel; for other operation functions that do have data processing dependencies, the execution order between the operation functions is determined from those dependencies. A data processing dependency may arise, for example, when the computation of one matrix needs to rely on other matrices.
As shown in fig. 6, as the attention operation functions matched with the sentence groups, there is no data processing dependency among the kernel functions fmha_128_kernel, fmha_256_kernel, fmha_384_kernel and fmha_512_kernel, so these kernel functions may be executed in parallel.
Whether execution of the monitored function is completed may be determined based on the preset monitoring event; the monitored function may be, for example, a kernel function Kernel A whose execution order precedes the attention operation functions. In response to completion of execution of Kernel A, the operation of determining the attention distribution matrix using the attention operation functions is performed. A kernel function Kernel C, whose execution order follows the attention operation functions, may be executed after the attention operation functions have completely finished.
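A CPU-side sketch of this ordering using Python's standard concurrency primitives is given below as an assumption for illustration; on a GPU the same ordering would typically be expressed with stream events. Kernel A is the monitored function, the four independent attention kernels run in parallel once it completes, and Kernel C runs only after all of them finish.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def kernel_a():                  # monitored function: its execution order precedes the attention operation functions
    return "A done"

def fmha_kernel(name, group):    # stand-in for fmha_*_kernel; the real work would run on the GPU
    return f"{name} processed {len(group)} sentence(s)"

def kernel_c(results):           # depends on every attention kernel having finished
    return f"C combined {len(results)} results"

groups = {"fmha_128_kernel": [40, 120], "fmha_256_kernel": [178, 200],
          "fmha_384_kernel": [340], "fmha_512_kernel": [512]}

with ThreadPoolExecutor() as pool:
    pool.submit(kernel_a).result()                                          # monitoring event: wait for Kernel A
    futures = [pool.submit(fmha_kernel, n, g) for n, g in groups.items()]   # no data dependency -> parallel
    wait(futures)                                                           # Kernel C waits for all attention kernels
    print(kernel_c([f.result() for f in futures]))
```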
The method can fully ensure the accuracy of the calculation result in the text processing process on the basis of effectively improving the utilization rate of the calculation resources and the text processing efficiency.
Fig. 7 schematically illustrates a block diagram of a text processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the text processing apparatus 700 of the embodiment of the present disclosure includes, for example, a first processing module 710, a second processing module 720, and a third processing module 730.
A first processing module 710, configured to divide M text sentences in the text to be processed to obtain N sentence groups, where N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups includes at least one text sentence; a second processing module 720 for determining a concentration profile of at least one text sentence in each sentence packet; and a third processing module 730, configured to perform parallel operation based on the attention distribution characteristics of each sentence group, so as to obtain an output result for the text to be processed.
According to the embodiments of the present disclosure, M text sentences in the text to be processed are divided to obtain N sentence groups, the attention distribution characteristics of at least one text sentence in each sentence group are determined, and parallel operation is performed based on the attention distribution characteristics of each sentence group to obtain an output result for the text to be processed. By grouping the text sentences in the text to be processed and determining the attention distribution characteristics of at least one text sentence in each sentence group, the text processing efficiency can be effectively improved, and the text processing effect can be effectively ensured. Redundant calculation in the text processing process can be effectively reduced, and the utilization rate of computing resources is improved.
According to an embodiment of the present disclosure, the first processing module includes: the first processing sub-module is used for dividing the M text sentences according to the character sequence length of each text sentence in the M text sentences to obtain N sentence groups, wherein each sentence group corresponds to a preset character sequence length interval.
According to an embodiment of the present disclosure, the second processing module includes: a second processing sub-module, configured to determine an attention operation function matched with each sentence group; and a third processing sub-module, configured to determine, for a target sentence group, the attention distribution features between characters in each text sentence according to the character features of each text sentence in the target sentence group by using the attention operation function that matches the target sentence group, the target sentence group being any sentence group of the N sentence groups.
According to an embodiment of the present disclosure, the second processing sub-module includes: a first processing unit, configured to determine, according to the character sequence length interval corresponding to each sentence group, a kernel function that matches the character sequence length interval as the attention operation function, and the third processing sub-module includes: a second processing unit, configured to determine a target thread block for executing the attention distribution operation according to the target kernel function matched with the target sentence group; and a third processing unit, configured to execute the target kernel function in parallel by utilizing at least one thread in the target thread block, so as to obtain the attention distribution feature according to the character features of each text sentence in the target sentence group.
According to an embodiment of the present disclosure, the second processing module further includes: a fourth processing sub-module, configured to determine, based on a preset monitoring event, whether execution of a monitored function is completed; and the third processing sub-module is configured to: in response to completion of execution of the monitored function, determine the attention distribution feature from the character features of each text sentence in the target sentence group using the attention operation function, the monitored function including an operation function whose execution order is before the attention operation function.
Fig. 8 schematically illustrates a block diagram of a network model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, a network model training apparatus 800 of an embodiment of the present disclosure includes, for example, a fourth processing module 810, a fifth processing module 820, a sixth processing module 830, and a seventh processing module 840.
A fourth processing module 810, configured to divide M sample sentences in the samples to be processed to obtain N sample sentence groups, where N is an integer greater than 0, M is an integer not less than N, and each sample sentence group of the N sample sentence groups includes at least one sample sentence; a fifth processing module 820, configured to take the N sample sentence groups as input data of a target network model to be trained, and obtain attention distribution characteristics of at least one sample sentence in each sample sentence group; a sixth processing module 830, configured to perform parallel operation based on the attention distribution characteristics of each sample sentence group, so as to obtain an output result for the sample to be processed; and a seventh processing module 840, configured to adjust model parameters of the target network model to be trained according to the output result and the preset result label, to obtain a trained target network model.
According to the embodiment of the disclosure, sample sentences of samples to be processed are divided to obtain at least one sample sentence group, attention distribution characteristics of at least one sample sentence in each sample sentence group are determined by utilizing a target network model to be trained, and parallel operation is performed based on the attention distribution characteristics of each sample sentence group to obtain an output result aiming at the samples to be processed. The method can effectively improve the operation efficiency of the attention distribution operation, can effectively improve the convergence rate of the network model training, effectively ensure the generalization performance of the trained target network model, is beneficial to improving the text processing efficiency, and can provide credible data support for diversified natural language processing tasks.
According to an embodiment of the present disclosure, the fourth processing module includes: and the fifth processing submodule is used for dividing the M sample sentences according to the character sequence length of each sample sentence in the M sample sentences to obtain N sample sentence groups, wherein each sample sentence group corresponds to a preset character sequence length interval respectively.
According to an embodiment of the present disclosure, the fifth processing module includes: a sixth processing sub-module, configured to determine an attention computation function that matches each sample statement group; a seventh processing sub-module, configured to call, for the target sample sentence group, an attention operation function matched with the target sample sentence group through a target network model to be trained; and an eighth processing sub-module for determining, according to character features of each sample sentence in the target sample sentence group, attention distribution features between characters in each sample sentence by using an attention operation function, the target sample sentence group being any sample sentence group in the N sample sentence groups.
It should be noted that, in the technical solution of the present disclosure, the related processes of information collection, storage, use, processing, transmission, provision, disclosure and the like all conform to the rules of relevant laws and regulations, and do not violate the public welcome.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 schematically illustrates a block diagram of an electronic device for text processing according to an embodiment of the disclosure.
Fig. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic device 900 is intended to represent various forms of digital computers, such as laptops, desktops, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running deep learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as the text processing method. For example, in some embodiments, the text processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable model training apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted within the various flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. A text processing method based on attention mechanism optimization, comprising:
dividing M text sentences in a text to be processed to obtain N sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups comprises at least one text sentence;
determining an attention distribution feature of at least one text sentence in each sentence group; and
performing a parallel operation based on the attention distribution features of each sentence group to obtain an output result for the text to be processed;
wherein said determining an attention distribution feature of at least one text sentence in each sentence group comprises:
determining an attention operation function matching each sentence group, comprising: determining, according to a character sequence length interval corresponding to each sentence group, a kernel function matched with the character sequence length interval to serve as the attention operation function; and
for a target sentence group, determining, by using the attention operation function matched with the target sentence group, attention distribution features between characters in each text sentence in the target sentence group according to character features of each text sentence in the target sentence group, comprising: determining a target thread block for executing an attention distribution operation according to the target kernel function matched with the target sentence group; and executing the target kernel function in parallel by using at least one thread in the target thread block, so as to obtain the attention distribution features according to the character features of each text sentence in the target sentence group; wherein the target sentence group is any sentence group in the N sentence groups.
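By way of non-limiting illustration only, the Python/NumPy sketch below outlines how the grouping-and-dispatch logic of claim 1 could be arranged: sentences are divided into groups by character sequence length interval, each group is matched to a kernel-style attention routine, and the matched routine produces per-sentence attention features. The interval boundaries and all names (attention_short, attention_long, pick_kernel, process) are assumptions made for the example and are not part of the claim.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-interval "kernels": identical math here, but in practice each
# could be tuned (tiling, block size) for its character-sequence-length range.
def attention_short(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def attention_long(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

KERNELS = [((0, 32), attention_short), ((32, 512), attention_long)]  # assumed intervals

def pick_kernel(length):
    """Match a sentence's character-sequence length to an attention kernel."""
    for (lo, hi), fn in KERNELS:
        if lo < length <= hi:
            return fn
    raise ValueError(f"no kernel covers length {length}")

def process(sentence_features):
    """sentence_features: list of (L_i, d) arrays, one per text sentence."""
    # Divide the M sentences into N groups keyed by the matched kernel.
    groups = {}
    for feats in sentence_features:
        groups.setdefault(pick_kernel(feats.shape[0]), []).append(feats)
    # For each group, use the matched kernel to obtain per-sentence attention features.
    outputs = []
    for kernel, members in groups.items():
        outputs.extend(kernel(f, f, f) for f in members)  # self-attention per sentence
    return outputs

features = [np.random.randn(10, 8), np.random.randn(100, 8)]
attention_features = process(features)
```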
2. The method of claim 1, wherein the dividing the M text sentences in the text to be processed into N sentence groups comprises:
dividing the M text sentences according to the character sequence length of each text sentence in the M text sentences to obtain N sentence groups,
wherein each sentence group corresponds to a preset character sequence length interval.
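As a minimal, hedged illustration of claim 2 only, the following Python sketch assigns sentences to groups according to preset character sequence length intervals; the specific interval values and the helper name group_by_length are assumptions made for the example.

```python
# Assumed preset intervals; each sentence group corresponds to exactly one interval.
INTERVALS = [(0, 16), (16, 32), (32, 64), (64, 128)]

def group_by_length(sentences):
    """Divide sentences into groups by character-sequence length."""
    groups = {interval: [] for interval in INTERVALS}
    for sentence in sentences:
        for lo, hi in INTERVALS:
            if lo < len(sentence) <= hi:
                groups[(lo, hi)].append(sentence)
                break
    return {interval: members for interval, members in groups.items() if members}

print(group_by_length(["a short sentence", "b" * 50, "c" * 20]))
```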
3. The method of claim 1, further comprising:
determining, based on a preset listening event, whether execution of a listening object function is completed; and
wherein the determining, by using the attention operation function matched with the target sentence group, attention distribution features between characters in each text sentence according to character features of each text sentence in the target sentence group comprises:
in response to completion of execution of the listening object function, determining the attention distribution features from the character features of each text sentence in the target sentence group by using the attention operation function,
wherein the listening object function comprises an operation function whose execution order precedes the attention operation function.
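Purely as an illustration of the control flow in claim 3, the sketch below uses a Python threading event as a stand-in for the preset listening event; the real mechanism (for example, a device-side event) is not specified by the claim, and the names listening_object_function and attention_when_ready are hypothetical.

```python
import threading

# Stand-in for the "preset listening event".
listening_event = threading.Event()

def listening_object_function(features):
    """An operation whose execution order precedes the attention operation."""
    projected = [2.0 * f for f in features]  # stand-in computation
    listening_event.set()                    # report completion via the listening event
    return projected

def attention_when_ready(features, attention_fn, timeout=5.0):
    """Run the attention operation only after the listening object function completes."""
    if not listening_event.wait(timeout):
        raise TimeoutError("listening object function did not complete in time")
    return attention_fn(features)

projected = listening_object_function([0.1, 0.4, 0.5])
result = attention_when_ready(projected, attention_fn=lambda xs: sum(xs) / len(xs))
```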
4. A network model training method based on attention mechanism optimization comprises the following steps:
dividing M sample sentences in a sample to be processed to obtain N sample sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, and each sample sentence group of the N sample sentence groups comprises at least one sample sentence;
using the N sample sentence groups as input data of a target network model to be trained, to obtain an attention distribution feature of at least one sample sentence in each sample sentence group;
performing a parallel operation based on the attention distribution features of each sample sentence group to obtain an output result for the sample to be processed; and
adjusting, according to the output result and a preset result label, model parameters of the target network model to be trained to obtain a trained target network model;
wherein the using the N sample sentence groups as input data of a target network model to be trained, to obtain an attention distribution feature of at least one sample sentence in each sample sentence group comprises:
determining an attention operation function matching each sample sentence group, comprising: determining, according to a character sequence length interval corresponding to each sample sentence group, a kernel function matched with the character sequence length interval to serve as the attention operation function;
for a target sample sentence group, calling, through the target network model to be trained, the attention operation function matched with the target sample sentence group; and
determining, by using the attention operation function, attention distribution features between characters in each sample sentence according to character features of each sample sentence in the target sample sentence group, comprising: determining a target thread block for executing an attention distribution operation according to the attention operation function matched with the target sample sentence group; and executing the attention operation function in parallel by using at least one thread in the target thread block, so as to obtain the attention distribution features according to the character features of each sample sentence in the target sample sentence group; wherein the target sample sentence group is any sample sentence group in the N sample sentence groups.
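Purely as a hedged illustration of the training flow in claim 4 (not the claimed implementation), the sketch below assumes PyTorch is available and uses its standard multi-head attention as a stand-in for the matched attention operation function; the model structure, dimensions, and the randomly generated groups and labels are placeholders.

```python
import torch
import torch.nn as nn

class GroupedAttentionModel(nn.Module):
    """Toy stand-in for the target network model: per-group self-attention plus a head."""
    def __init__(self, d_model=64, n_heads=4, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, group):                      # group: (batch, length, d_model)
        ctx, _ = self.attn(group, group, group)    # attention within each sample sentence
        return self.head(ctx.mean(dim=1))          # pooled prediction per sentence

model = GroupedAttentionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder sample sentence groups: sentences in a group share one padded length.
groups = [torch.randn(8, 16, 64), torch.randn(4, 64, 64)]
labels = [torch.randint(0, 2, (8,)), torch.randint(0, 2, (4,))]  # preset result labels

for group, label in zip(groups, labels):
    logits = model(group)            # output result for the group
    loss = loss_fn(logits, label)    # compare with the preset result label
    optimizer.zero_grad()
    loss.backward()                  # adjust model parameters of the model being trained
    optimizer.step()
```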
5. The method of claim 4, wherein the dividing the M sample sentences in the sample to be processed into N sample sentence groups comprises:
dividing the M sample sentences according to the character sequence length of each sample sentence in the M sample sentences to obtain N sample sentence groups,
wherein each sample sentence group corresponds to a preset character sequence length interval.
6. A text processing apparatus optimized based on an attention mechanism, comprising:
a first processing module, configured to divide M text sentences in a text to be processed to obtain N sentence groups, wherein N is an integer greater than 0, M is an integer not less than N, and each sentence group of the N sentence groups comprises at least one text sentence;
a second processing module, configured to determine an attention distribution feature of at least one text sentence in each sentence group; and
a third processing module, configured to perform a parallel operation based on the attention distribution features of each sentence group to obtain an output result for the text to be processed;
wherein the second processing module comprises:
a second processing sub-module, configured to determine an attention operation function matching each sentence group; and
a third processing sub-module, configured to determine, for a target sentence group, attention distribution characteristics between characters in each text sentence according to character characteristics of each text sentence in the target sentence group by using an attention operation function that is matched with the target sentence group, where the target sentence group is any sentence group in the N sentence groups;
Wherein the second processing sub-module comprises:
a first processing unit, configured to determine, according to a character sequence length interval corresponding to each sentence group, a kernel function matched with the character sequence length interval to serve as the attention operation function;
wherein the third processing sub-module comprises:
a second processing unit, configured to determine a target thread block for executing an attention distribution operation according to the target kernel function matched with the target sentence group; and
a third processing unit, configured to execute the target kernel function in parallel by using at least one thread in the target thread block, so as to obtain the attention distribution features according to the character features of each text sentence in the target sentence group.
7. The apparatus of claim 6, wherein the first processing module comprises:
a first processing sub-module, configured to divide the M text sentences according to a character sequence length of each text sentence in the M text sentences, to obtain the N sentence groups,
wherein each sentence group corresponds to a preset character sequence length interval.
8. The apparatus of claim 6, wherein the second processing module further comprises:
a fourth processing sub-module, configured to determine, based on a preset listening event, whether execution of a listening object function is completed; and
the third processing sub-module is configured to:
in response to completion of execution of the listening object function, determining the attention distribution feature from character features of each of the text sentences in the target sentence group using the attention operation function,
wherein the listening object function comprises an operation function whose execution order precedes the attention operation function.
9. A network model training device based on attention mechanism optimization, comprising:
a fourth processing module, configured to divide M sample sentences in a sample to be processed to obtain N sample sentence groups, where N is an integer greater than 0 and M is an integer not less than N, where each sample sentence group of the N sample sentence groups includes at least one sample sentence;
a fifth processing module, configured to use the N sample sentence groups as input data of a target network model to be trained, to obtain attention distribution characteristics of at least one sample sentence in each sample sentence group;
a sixth processing module, configured to perform parallel operation based on the attention distribution characteristics of each sample sentence group, to obtain an output result for the sample to be processed; and
a seventh processing module, configured to adjust model parameters of the target network model to be trained according to the output result and a preset result label, to obtain a trained target network model;
wherein the fifth processing module comprises:
a sixth processing sub-module, configured to determine an attention operation function matching each sample sentence group, including determining, according to a character sequence length interval corresponding to each sample sentence group, a kernel function matched with the character sequence length interval to serve as the attention operation function;
a seventh processing sub-module, configured to call, for a target sample sentence group, the attention operation function matched with the target sample sentence group through the target network model to be trained; and
an eighth processing sub-module, configured to determine, by using the attention operation function, attention distribution features between characters in each sample sentence according to character features of each sample sentence in the target sample sentence group, including determining a target thread block for executing an attention distribution operation according to the attention operation function matched with the target sample sentence group, and executing the attention operation function in parallel by using at least one thread in the target thread block, so as to obtain the attention distribution features according to the character features of each sample sentence in the target sample sentence group; wherein the target sample sentence group is any sample sentence group in the N sample sentence groups.
10. The apparatus of claim 9, wherein the fourth processing module comprises:
a fifth processing sub-module, configured to divide the M sample sentences according to the character sequence length of each sample sentence in the M sample sentences, to obtain the N sample sentence groups,
wherein each sample sentence group corresponds to a preset character sequence length interval.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text processing method of any one of claims 1-3 or the network model training method of any one of claims 4-5.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text processing method of any one of claims 1-3 or the network model training method of any one of claims 4-5.
CN202210555349.6A 2022-05-19 2022-05-19 Text processing method and network model training method based on attention mechanism optimization Active CN114742035B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210555349.6A CN114742035B (en) 2022-05-19 2022-05-19 Text processing method and network model training method based on attention mechanism optimization
PCT/CN2022/135493 WO2023221454A1 (en) 2022-05-19 2022-11-30 Text processing method based on attention mechanism optimization, and network model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210555349.6A CN114742035B (en) 2022-05-19 2022-05-19 Text processing method and network model training method based on attention mechanism optimization

Publications (2)

Publication Number Publication Date
CN114742035A CN114742035A (en) 2022-07-12
CN114742035B true CN114742035B (en) 2023-07-07

Family

ID=82287181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210555349.6A Active CN114742035B (en) 2022-05-19 2022-05-19 Text processing method and network model training method based on attention mechanism optimization

Country Status (2)

Country Link
CN (1) CN114742035B (en)
WO (1) WO2023221454A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114742035B (en) * 2022-05-19 2023-07-07 北京百度网讯科技有限公司 Text processing method and network model training method based on attention mechanism optimization

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222118A (en) * 2021-05-19 2021-08-06 北京百度网讯科技有限公司 Neural network training method, apparatus, electronic device, medium, and program product

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11138392B2 (en) * 2018-07-26 2021-10-05 Google Llc Machine translation using neural network models
CN111368536A (en) * 2018-12-07 2020-07-03 北京三星通信技术研究有限公司 Natural language processing method, apparatus and storage medium therefor
CN111241304B (en) * 2020-01-16 2024-02-06 平安科技(深圳)有限公司 Answer generation method based on deep learning, electronic device and readable storage medium
CN111680159B (en) * 2020-06-11 2023-08-29 华东交通大学 Data processing method and device and electronic equipment
CN111898374B (en) * 2020-07-30 2023-11-07 腾讯科技(深圳)有限公司 Text recognition method, device, storage medium and electronic equipment
US11829282B2 (en) * 2020-08-27 2023-11-28 Microsoft Technology Licensing, Llc. Automatic generation of assert statements for unit test cases
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT
CN113505218B (en) * 2021-09-07 2021-12-21 科大讯飞(苏州)科技有限公司 Text extraction method, text extraction system, electronic device and storage device
CN114417794B (en) * 2022-03-29 2022-09-09 北京大学 Training method and device for scale problem generation model and computer equipment
CN114742035B (en) * 2022-05-19 2023-07-07 北京百度网讯科技有限公司 Text processing method and network model training method based on attention mechanism optimization

Also Published As

Publication number Publication date
WO2023221454A1 (en) 2023-11-23
CN114742035A (en) 2022-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant