CN113836266A - BERT-based natural language processing method and related equipment - Google Patents


Info

Publication number
CN113836266A
CN113836266A
Authority
CN
China
Prior art keywords
matrix
natural language
query
value
attention
Prior art date
Legal status
Pending
Application number
CN202111119670.1A
Other languages
Chinese (zh)
Inventor
Cheng Jiefeng (成杰峰)
Peng Yi (彭奕)
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202111119670.1A
Publication of CN113836266A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/3347 Query execution using vector based model

Abstract

The application relates to the technical field of artificial intelligence, and provides a BERT-based natural language processing method and related equipment. The method includes the following steps: acquiring word vectors of a natural language text; loading the word vectors of the natural language text into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors, and storing the query matrix, the key matrix and the value matrix into a memory; and acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program and loading them into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix and the value matrix. The embodiment of the application helps improve the efficiency of natural language processing.

Description

BERT-based natural language processing method and related equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a natural language processing method based on BERT and a related device.
Background
Natural Language Processing (NLP) is one of the research hot spots in the field of artificial intelligence, and enabling a computer to read human language is the key point of NLP technology. With growing research and development, NLP technology has made breakthrough progress and now appears in many subdivided fields such as intelligent question answering, machine translation, and spam filtering. NLP technology generally relies on NLP models. BERT (Bidirectional Encoder Representations from Transformers), introduced by the Google research team, is the most widely used NLP model in recent years and performs well, but its model parameters are very numerous (on the order of hundreds of millions), so BERT needs to consume hundreds of milliseconds in a batch processing scenario. It can be seen that current BERT-based natural language processing suffers from low processing efficiency.
Disclosure of Invention
In view of the above problems, the present application provides a natural language processing method based on BERT and related devices, which are beneficial to improving the efficiency of natural language processing.
In order to achieve the above object, a first aspect of embodiments of the present application provides a BERT-based natural language processing method, including:
acquiring a word vector of a natural language text;
loading the word vectors of the natural language text into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors of the natural language text, and storing the query matrix, the key matrix and the value matrix into a memory;
and acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program, and loading them into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix and the value matrix.
With reference to the first aspect, in one possible implementation manner, loading the word vectors of the natural language text into a graphics processor, so that the graphics processor obtains a query matrix, a key matrix, and a value matrix in a parallel processing manner based on the word vectors of the natural language text, includes:
loading the word vectors of the natural language text into the graphics processor, so that the graphics processor constructs corresponding query vectors, key vectors and value vectors based on the word vectors of the natural language text;
acquiring a query weight matrix, a key weight matrix and a value weight matrix from a memory;
loading the query weight matrix, the key weight matrix, and the value weight matrix into the graphics processor, such that the graphics processor computes the query matrix, the key matrix, and the value matrix in parallel based on the query vector and the query weight matrix, the key vector and the key weight matrix, and the value vector and the value weight matrix.
With reference to the first aspect, in one possible implementation manner, loading the query matrix, the key matrix, and the value matrix into the graphics processor, so that the graphics processor obtains the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix, and the value matrix, including:
loading the query matrix, the key matrix and the value matrix into the graphics processor, so that the graphics processor calculates attention weights in parallel based on the query matrix, the key matrix and the value matrix;
and multiplying the attention weight by the value matrix to obtain a first attention feature of the natural language text.
With reference to the first aspect, in one possible implementation, the query matrix, the key matrix, and the value matrix are the inputs of each attention head in a multi-head attention mechanism, and the first attention feature is the output of each attention head; the method further includes:
splicing the first attention features to obtain attention splicing features;
and obtaining a second attention feature of the natural language text according to the attention splicing feature.
With reference to the first aspect, in one possible implementation manner, obtaining the second attention feature of the natural language text according to the attention splicing feature includes:
performing linear mapping on the attention splicing feature to obtain the second attention feature;
or, alternatively,
smoothing the splices of the attention splicing feature to obtain a smoothed attention splicing feature;
and performing linear mapping on the smoothed attention splicing feature to obtain the second attention feature.
With reference to the first aspect, in one possible implementation, before obtaining a word vector of a natural language text, the method further includes:
performing word segmentation processing on the natural language text to obtain a word segmentation list;
and performing one-hot encoding on the words in the word segmentation list to obtain the word vectors of the natural language text, and storing the word vectors of the natural language text in a memory.
With reference to the first aspect, in a possible implementation manner, smoothing the splice of the attention splicing feature to obtain a smoothed attention splicing feature includes:
performing the operations of average value calculation, average value normalization, splice determination and mean filtering on any two adjacent first attention features in the attention splicing feature to obtain the smoothed attention splicing feature;
wherein performing the operations of average value calculation, average value normalization, splice determination and mean filtering includes:
for a first attention feature A and a first attention feature B in any two adjacent first attention features, respectively calculating a first average value of the first attention feature A and a second average value of the first attention feature B, wherein the first attention feature B is the attention feature spliced after the first attention feature A;
normalizing the first average value to the interval [1, M] to obtain a first target value, and normalizing the second average value to the interval [1, N] to obtain a second target value, wherein M represents the total number of columns of the first attention feature A, N represents the total number of columns of the first attention feature B, and M and N are integers greater than 1;
selecting, from the first attention feature A, the first-target-value columns of features adjacent in sequence to the first attention feature B, selecting, from the first attention feature B, the second-target-value columns of features adjacent in sequence to the first attention feature A, and determining the selected columns of features as the splice of the first attention feature A and the first attention feature B;
and performing mean filtering on the features smaller than a preset value in the splice of the first attention feature A and the first attention feature B, to complete the smoothing of the splice of the first attention feature A and the first attention feature B.
A second aspect of an embodiment of the present application provides a BERT-based natural language processing apparatus, which includes an obtaining unit and a processing unit;
an acquisition unit configured to acquire a word vector of a natural language text;
the processing unit is used for loading the word vectors of the natural language text into the graphics processor according to the first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors of the natural language text, and storing the query matrix, the key matrix and the value matrix into the memory;
and the processing unit is further used for acquiring the query matrix, the key matrix and the value matrix from the memory according to the second CUDA program, and loading them into the graphics processor, so that the graphics processor obtains the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix and the value matrix.
A third aspect of embodiments of the present application provides an electronic device, which includes an input device, an output device, a processor adapted to implement one or more instructions, and a computer storage medium storing one or more instructions adapted to be loaded by the processor to perform the following steps:
acquiring a word vector of a natural language text;
loading the word vectors of the natural language text into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors of the natural language text, and storing the query matrix, the key matrix and the value matrix into a memory;
and acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program, and loading them into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix and the value matrix.
A fourth aspect of embodiments of the present application provides a computer storage medium having one or more instructions stored thereon, the one or more instructions adapted to be loaded by a processor and to perform the following steps:
acquiring a word vector of a natural language text;
loading the word vectors of the natural language text into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors of the natural language text, and storing the query matrix, the key matrix and the value matrix into a memory;
and acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program, and loading them into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix and the value matrix.
The above scheme of the present application includes at least the following beneficial effects: word vectors of a natural language text are acquired; the word vectors are loaded into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors, and the three matrices are stored into a memory; and the query matrix, the key matrix and the value matrix are acquired from the memory according to a second CUDA program and loaded into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the three matrices. Thus, when the BERT model performs natural language processing, the computation of the query matrix, the key matrix and the value matrix in the self-attention mechanism is parallelized at the GPU level, and the multiple operators that compute the attention weight based on the three matrices are merged into one operator. This reduces the number of data accesses between the memory of the electronic device and the GPU, helps reduce the inference time of the model, and further improves the efficiency of natural language processing.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic diagram of an application environment provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a natural language processing method based on BERT according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a Transformer encoder according to an embodiment of the present application;
FIG. 4 is a schematic illustration of a splice of an attention splice feature provided by an embodiment of the present application;
fig. 5 is a schematic flowchart of another BERT-based natural language processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a natural language processing apparatus based on BERT according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as appearing in the specification, claims and drawings of this application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Furthermore, the terms "first," "second," and "third," etc. are used to distinguish between different objects and are not used to describe a particular order.
An embodiment of the present application provides a BERT-based natural language processing method, which can be implemented in the application environment shown in fig. 1. Referring to fig. 1, the application environment includes an electronic device and at least one terminal connected to the electronic device through a network. The at least one terminal includes a terminal used by a user, which receives a natural language text input by the user and submits it to the electronic device; the electronic device executes the BERT-based natural language processing method, so that the natural language text input by the user is finally processed into a machine-level language. A terminal may also receive input from a developer, so that a BERT model and a CUDA (Compute Unified Device Architecture) program are deployed on the electronic device. When the BERT model is called to process a natural language text, the CUDA program enables the GPU to process part of the operations in parallel, which reduces the number of data or variable accesses between the memory of the electronic device and the GPU, helps reduce the inference time of the model, and further improves the efficiency of natural language processing.
For example, the electronic device may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. Any of the at least one terminal may be a smartphone, a computer, a wearable device, a vehicle-mounted device, or the like.
Based on the application environment shown in fig. 1, the BERT-based natural language processing method provided by the embodiment of the present application is described in detail below with reference to other drawings.
Referring to fig. 2, fig. 2 is a flowchart illustrating a BERT-based natural language processing method according to an embodiment of the present application. The method is applied to an electronic device and, as shown in fig. 2, includes steps 201 to 203:
201: a word vector of a natural language text is obtained.
In the specific embodiment of the present application, a BERT model is used to process natural language text. For a natural language text input by a user, the electronic device inputs the text into the BERT model, and the bottom encoder of the BERT model is called to encode each word of the text to obtain word vectors; the word vectors are usually stored in the memory of the electronic device, so acquiring the word vectors of the natural language text means acquiring them from the memory. Optionally, the word vectors may be obtained by one-hot encoding or word embedding. The natural language text may be a sentence input by a user in scenarios such as an intelligent question answering system, machine translation, or intelligent consultation.
Illustratively, prior to obtaining the word vectors for the natural language text, the method further comprises:
performing word segmentation processing on the natural language text to obtain a word segmentation list;
and performing one-hot encoding on the words in the word segmentation list to obtain the word vectors of the natural language text, and storing the word vectors of the natural language text in a memory.
Specifically, for example, if the currently input natural language text is "I eat lunch today at the company", word segmentation yields "I / today / at / company / eat / lunch", and the words obtained after segmentation are stored in a list to obtain the word segmentation list ["I", "today", "at", "company", "eat", "lunch"]. For each word in the word segmentation list, the position of the word in a pre-constructed vocabulary is looked up, the position corresponding to the word is encoded as 1, and all other positions as 0. For example, with the vocabulary ["I", "you", "is", "at", "eat", "weather", "today", "company", "lunch"], the vector of "I" in the natural language text is [1,0,0,0,0,0,0,0,0] and that of "today" is [0,0,0,0,0,0,1,0,0]; in this way, a word vector can be obtained for each word of the natural language text. In this embodiment, one-hot encoding is used to process the natural language text, which to some extent expands the features and suits a model with many parameters such as BERT.
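As a minimal illustration of this step, the following CUDA kernel sketch builds one-hot word vectors on the GPU from vocabulary indices; the kernel name, memory layout and launch configuration are illustrative assumptions, not part of the patent:

```cuda
#include <cuda_runtime.h>

// One thread per token: zero the row, then set the column of the token's
// vocabulary index to 1 to form its one-hot word vector.
__global__ void buildOneHot(const int* tokenIds, float* oneHot,
                            int numTokens, int vocabSize) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < numTokens) {
        for (int j = 0; j < vocabSize; ++j)
            oneHot[t * vocabSize + j] = 0.0f;
        oneHot[t * vocabSize + tokenIds[t]] = 1.0f;
    }
}

// Example launch for 6 tokens over a 9-word vocabulary, matching the
// word segmentation example above:
//   buildOneHot<<<1, 32>>>(dTokenIds, dOneHot, 6, 9);
```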
202: and loading the word vectors of the natural language text into the GPU according to the first CUDA program so that the GPU obtains a query matrix, a key matrix and a value matrix in a parallel processing mode based on the word vectors of the natural language text, and storing the query matrix, the key matrix and the value matrix into a memory.
In a specific embodiment of the present application, the first CUDA program refers to program code written to obtain the query matrix, the key matrix and the value matrix through parallel computation; it instructs loading the word vectors of the natural language text into the GPU, so that the GPU obtains the query matrix, the key matrix and the value matrix in a parallel processing manner based on the word vectors of the natural language text, and stores the query matrix, the key matrix and the value matrix into the memory. CUDA is a general-purpose parallel computing architecture introduced by NVIDIA, which enables GPUs to solve complex computational problems. It should be understood that when the BERT model performs natural language processing, Transformer encoders are usually used to encode the natural language text. As shown in fig. 3, a Transformer encoder is composed of a self-attention layer and a feedforward neural network layer; for the input word vectors, the self-attention layer performs self-attention feature computation through a multi-head attention mechanism. In the multi-head attention mechanism, the computation of the query matrix, the key matrix and the value matrix is usually performed serially, that is, the query matrix, the key matrix and the value matrix are computed in sequence, and the matrix multiplications here are the more critical step of the multi-head attention mechanism. Because serial computation involves multiple accesses of data or variables between the GPU and the memory of the electronic device, it increases the time overhead to some extent and affects the efficiency of BERT model inference on the natural language text.
Exemplarily, loading a word vector of a natural language text into a GPU, so that the GPU obtains a query matrix, a key matrix, and a value matrix in a parallel processing manner based on the word vector of the natural language text, including:
loading the word vectors of the natural language text into a GPU (Graphics Processing Unit), so that the GPU constructs corresponding query vectors, key vectors and value vectors based on the word vectors of the natural language text;
acquiring a query weight matrix, a key weight matrix and a value weight matrix from a memory;
and loading the query weight matrix, the key weight matrix and the value weight matrix into the GPU, so that the GPU computes the query matrix, the key matrix and the value matrix in parallel based on the query vector and the query weight matrix, the key vector and the key weight matrix, and the value vector and the value weight matrix.
Specifically, in the multi-head attention mechanism the query vector q, the key vector k and the value vector v are usually made equal (q = k = v), and each of the subspaces corresponding to the multiple heads has a preset query weight matrix W_q, key weight matrix W_k and value weight matrix W_v, where the parameters of the weight matrices used by each head differ. For example, across the n heads the query weight matrix can be represented as

W_q ∈ {W_q^(1), W_q^(2), …, W_q^(n)}

that is, passing the query vector q through the n query weight matrices W_q yields n query matrices, and the same holds for the key weight matrix W_k and the value weight matrix W_v. It should be understood that for each query matrix, key matrix and value matrix, the serial computation approach is generally expressed as:
Q = q * W_q;
K = k * W_k;
V = v * W_v;
where Q represents the query matrix, K represents the key matrix, and V represents the value matrix. According to the first CUDA program, the GPU instead adopts parallel computation, specifically expressed as:
[Q, K, V] = q * [W_q, W_k, W_v];
By the nature of matrix multiplication, since the weight matrices W_q, W_k and W_v are independent of one another, the results for the query matrix Q, the key matrix K and the value matrix V are also independent, so the computation result is unaffected while the number of accesses between the GPU and the memory is reduced, thereby improving the processing efficiency of the model.
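A minimal CUDA sketch of this fused computation is given below, assuming row-major layouts, a single shared input x (since q = k = v) and a concatenated weight matrix [W_q | W_k | W_v]; all names and dimensions are illustrative assumptions, not the patent's actual first CUDA program:

```cuda
// One naive GEMM writes Q, K and V in a single pass: qkv = x * wQkv,
// where wQkv is the [d x 3d] concatenation of W_q, W_k and W_v.
__global__ void fusedQkv(const float* x,    // [seqLen x d] word vectors
                         const float* wQkv, // [d x 3d] concatenated weights
                         float* qkv,        // [seqLen x 3d] output [Q | K | V]
                         int seqLen, int d) {
    int row = blockIdx.y * blockDim.y + threadIdx.y; // token index
    int col = blockIdx.x * blockDim.x + threadIdx.x; // column in [Q|K|V]
    if (row < seqLen && col < 3 * d) {
        float acc = 0.0f;
        for (int k = 0; k < d; ++k)
            acc += x[row * d + k] * wQkv[k * 3 * d + col];
        qkv[row * 3 * d + col] = acc;
    }
}
```

In practice this single multiplication would typically be dispatched through a library GEMM; the point is only that one kernel launch and one store replace three.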
203: acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program, and loading them into the GPU, so that the GPU obtains the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix and the value matrix.
In this embodiment of the application, the second CUDA program refers to program code written to obtain the attention feature in a single parallel computation; it instructs acquiring the query matrix, the key matrix and the value matrix from the memory and loading them into the GPU, so that the GPU obtains the first attention feature of the natural language text in a parallel processing manner based on the three matrices. It should be understood that, based on the query matrix Q, the key matrix K and the value matrix V obtained above in the BERT model, the GPU calls the Transformer encoder to compute the self-attention feature, usually using the following formulas:
e = Score(Q, K) = (Q * K^T) / √d;
α = Softmax(e);
Attention Values = α * V;
where Score(Q, K) represents the attention score, d represents the dimensionality of the key vectors, Softmax normalizes the attention scores of all words, and Attention Values represents the computed attention features. At the GPU level, computing α as above usually requires four passes, each with the following steps: a variable is fetched from the memory and loaded into the GPU, the GPU computes based on the variable, and the GPU stores the result back into the memory and assigns it to a variable. It is easy to see that each pass involves one fetch operation and one store operation, so the four passes involve 8 fetch and store operations in total, and the time consumed by these accesses is large.
Illustratively, loading the query matrix, the key matrix, and the value matrix into a graphics processor to cause the graphics processor to obtain a first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix, and the value matrix includes:
loading the query matrix, the key matrix and the value matrix into the graphics processor, so that the graphics processor calculates attention weights in parallel based on the query matrix, the key matrix and the value matrix;
and multiplying the attention weight by the value matrix to obtain a first attention feature of the natural language text.
Specifically, parallel processing here means merging the multiple operations that compute the attention weight α into a single operator. Using the parallel computing characteristics of the GPU, several consecutive operations are implemented in one computation at the CUDA level, so the electronic device only needs to perform one operation of fetching variables from the memory and loading them into the GPU, and one operation of storing the first attention feature back into the memory. This reduces the number of accesses and significantly accelerates model inference.
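The sketch below illustrates this operator fusion under stated assumptions: one thread block per query row computes the scaled scores, the softmax and the weighting by V inside a single kernel, so intermediate results stay in shared memory instead of round-tripping through the electronic device's memory. The serial softmax and all names are simplifications for clarity, not the patent's actual second CUDA program:

```cuda
// Fused attention: out[i] = softmax(Q[i]·K^T / sqrt(d)) · V, one block per row i.
// Launch: fusedAttention<<<seqLen, 128, seqLen * sizeof(float)>>>(Q, K, V, out, seqLen, d);
__global__ void fusedAttention(const float* Q, const float* K, const float* V,
                               float* out, int seqLen, int d) {
    int i = blockIdx.x;                    // query row handled by this block
    extern __shared__ float e[];           // scores for row i, kept on-chip

    // 1) attention scores e_j = Q_i · K_j / sqrt(d)
    for (int j = threadIdx.x; j < seqLen; j += blockDim.x) {
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        e[j] = s * rsqrtf((float)d);
    }
    __syncthreads();

    // 2) numerically stable softmax over the row (serial for clarity)
    if (threadIdx.x == 0) {
        float m = e[0], z = 0.0f;
        for (int j = 1; j < seqLen; ++j) m = fmaxf(m, e[j]);
        for (int j = 0; j < seqLen; ++j) { e[j] = expf(e[j] - m); z += e[j]; }
        for (int j = 0; j < seqLen; ++j) e[j] /= z;
    }
    __syncthreads();

    // 3) first attention feature: out_i = sum_j alpha_j * V_j
    for (int k = threadIdx.x; k < d; k += blockDim.x) {
        float acc = 0.0f;
        for (int j = 0; j < seqLen; ++j) acc += e[j] * V[j * d + k];
        out[i * d + k] = acc;
    }
}
```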
It should be noted that the query matrix, the key matrix and the value matrix are the inputs of each attention head in the multi-head attention mechanism, and the first attention feature is the output of each attention head; the method further includes:
splicing the first attention features to obtain attention splicing features;
and obtaining a second attention feature of the natural language text according to the attention splicing feature.
Illustratively, obtaining the second attention feature of the natural language text according to the attention splicing feature includes:
performing linear mapping on the attention splicing feature to obtain the second attention feature;
or, alternatively,
smoothing the splices of the attention splicing feature to obtain a smoothed attention splicing feature;
and performing linear mapping on the smoothed attention splicing feature to obtain the second attention feature.
The second attention feature refers to the feature finally output by the self-attention layer in the Transformer encoder; this feature is usually used as the input of the feedforward neural network, and the second attention feature and the first attention features are both matrices. Assuming the multi-head attention mechanism uses 8 attention heads, the computation results of the attention heads are the first attention features Z_0, Z_1, …, Z_7, and the first attention features Z_0, Z_1, …, Z_7 are spliced to obtain the attention splicing feature. The linear mapping may be a multiplication by a preset additional weight matrix, which is obtained by joint training in the model. For an attention splicing feature obtained by splicing multiple matrices, the splices are not smooth enough, which is unfavorable for the subsequent linear mapping; therefore, each splice of the attention splicing feature needs to be smoothed.
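As a hedged illustration of this step, the kernel below applies the linear mapping to the spliced heads, assuming the h first attention features are already laid out contiguously (head after head within each row); concatProject, Wo and the dimensions are assumptions for the sketch, not the patent's notation:

```cuda
// Second attention feature = [Z_0 | Z_1 | ... | Z_{h-1}] * Wo.
__global__ void concatProject(const float* Z,   // [seqLen x h*d] spliced heads
                              const float* Wo,  // [h*d x dModel] additional weight matrix
                              float* out,       // [seqLen x dModel] second attention feature
                              int seqLen, int hd, int dModel) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < seqLen && col < dModel) {
        float acc = 0.0f;
        for (int k = 0; k < hd; ++k)
            acc += Z[row * hd + k] * Wo[k * dModel + col];
        out[row * dModel + col] = acc;
    }
}
```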
Exemplarily, smoothing the splicing position of the attention splicing feature to obtain the smoothed attention splicing feature includes:
performing the operations of average value calculation, average value normalization, splice determination and mean filtering on any two adjacent first attention features in the attention splicing feature to obtain the smoothed attention splicing feature;
wherein performing the operations of average value calculation, average value normalization, splice determination and mean filtering includes:
for a first attention feature A and a first attention feature B in any two adjacent first attention features, respectively calculating a first average value of the first attention feature A and a second average value of the first attention feature B, wherein the first attention feature B is the attention feature spliced after the first attention feature A;
normalizing the first average value to the interval [1, M] to obtain a first target value, and normalizing the second average value to the interval [1, N] to obtain a second target value, wherein M represents the total number of columns of the first attention feature A, N represents the total number of columns of the first attention feature B, and M and N are integers greater than 1;
selecting, from the first attention feature A, the first-target-value columns of features adjacent in sequence to the first attention feature B, selecting, from the first attention feature B, the second-target-value columns of features adjacent in sequence to the first attention feature A, and determining the selected columns of features as the splice of the first attention feature A and the first attention feature B;
and performing mean filtering on the features smaller than a preset value in the splice of the first attention feature A and the first attention feature B, to complete the smoothing of the splice of the first attention feature A and the first attention feature B.
In particular, since the average value of a first attention feature may be a number smaller than 1, the average value needs to be normalized to an integer between 1 and the total number of columns of that first attention feature in order to facilitate the subsequent selection of column features. As shown in fig. 4, assuming the first target value is 3 and the second target value is 2, the last 3 columns of features of the first attention feature A and the first 2 columns of features of the first attention feature B are selected, and these 5 columns of features form the splice of the first attention feature A and the first attention feature B. A new matrix can be formed from these 5 columns of features and subjected to mean filtering: when a feature greater than or equal to the preset value appears in the window or template, that feature does not participate in the mean filtering, and only features smaller than the preset value are mean-filtered. The preset value can be set according to an empirical value; for example, features greater than or equal to 0.8 do not participate in the mean filtering, because such features are usually the more salient ones, and filtering them may cause information loss. In this embodiment, smoothing the splices of the attention splicing feature makes any two adjacent first attention features tend to be flat, which facilitates the computation of the subsequent linear mapping; meanwhile, not mean-filtering the features of the splice that are greater than or equal to the preset value retains the salient features in the attention, so the overall computation result is not greatly affected.
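A sketch of the mean-filtering step on the splice region is given below, assuming the selected boundary columns have been copied into a small rows × cols matrix and using a 1×3 horizontal window; the window size, the names and the reading of "does not participate" (salient values are neither replaced nor averaged into neighbors) are illustrative assumptions:

```cuda
// Mean-filter only the non-salient entries (below the preset value) of the
// splice region; entries >= threshold neither get filtered nor contribute
// to their neighbors' averages, so salient attention information is kept.
__global__ void smoothSplice(const float* joint, float* smoothed,
                             int rows, int cols, float threshold) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows || c >= cols) return;

    float v = joint[r * cols + c];
    if (v >= threshold) {                  // salient feature: keep unchanged
        smoothed[r * cols + c] = v;
        return;
    }
    float sum = 0.0f; int n = 0;           // 1x3 window, clamped at the borders
    for (int dc = -1; dc <= 1; ++dc) {
        int cc = c + dc;
        if (cc >= 0 && cc < cols && joint[r * cols + cc] < threshold) {
            sum += joint[r * cols + cc]; ++n;
        }
    }
    smoothed[r * cols + c] = (n > 0) ? sum / n : v;
}
```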
It can be seen that, in the embodiment of the application, word vectors of a natural language text are acquired; the word vectors are loaded into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors, and the three matrices are stored into a memory; and the query matrix, the key matrix and the value matrix are acquired from the memory according to a second CUDA program and loaded into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the three matrices. Thus, when the BERT model performs natural language processing, the computation of the query matrix, the key matrix and the value matrix in the self-attention mechanism is parallelized at the GPU level, and the multiple operators that compute the attention weight based on the three matrices are merged into one operator, which reduces the number of data accesses between the memory of the electronic device and the GPU, helps reduce the inference time of the model, and further improves the efficiency of natural language processing.
Referring to fig. 5, fig. 5 is a flowchart illustrating another BERT-based natural language processing method according to an embodiment of the present application; as shown in fig. 5, the method includes steps 501 to 505:
501: acquiring a word vector of a natural language text;
502: loading word vectors of the natural language text into a GPU according to a first CUDA program, so that the GPU constructs corresponding query vectors, key vectors and value vectors based on the word vectors of the natural language text;
503: acquiring a query weight matrix, a key weight matrix and a value weight matrix from a memory;
504: loading the query weight matrix, the key weight matrix and the value weight matrix into the GPU, so that the GPU computes the query matrix, the key matrix and the value matrix in parallel based on the query vector and the query weight matrix, the key vector and the key weight matrix, and the value vector and the value weight matrix, and stores the query matrix, the key matrix and the value matrix into the memory;
505: and acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program, and loading the query matrix, the key matrix and the value matrix into the GPU, so that the GPU obtains the first attention characteristic of the natural language text in a parallel processing mode based on the query matrix, the key matrix and the value matrix.
The specific implementation of steps 501 to 505 has been described in the embodiment shown in fig. 2 and can achieve the same or similar beneficial effects; to avoid repetition, it is not described again here.
Based on the description of the embodiments of the natural language processing method based on BERT, please refer to fig. 6, fig. 6 is a schematic structural diagram of a natural language processing apparatus based on BERT according to the embodiments of the present application, and as shown in fig. 6, the apparatus includes an obtaining unit 601 and a processing unit 602;
an obtaining unit 601, configured to obtain a word vector of a natural language text;
the processing unit 602 is configured to load a word vector of a natural language text into the graphics processor according to the first CUDA program, so that the graphics processor obtains a query matrix, a key matrix, and a value matrix based on the word vector of the natural language text by using a parallel processing manner, and stores the query matrix, the key matrix, and the value matrix in the memory;
the processing unit 602 is further configured to obtain the query matrix, the key matrix, and the value matrix from the memory according to the second CUDA program, and load the query matrix, the key matrix, and the value matrix into the graphics processor, so that the graphics processor obtains the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix, and the value matrix.
It can be seen that the BERT-based natural language processing apparatus shown in fig. 6 acquires word vectors of a natural language text; loads the word vectors into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors, and stores the three matrices into a memory; and acquires the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program and loads them into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the three matrices. Thus, when the BERT model performs natural language processing, the computation of the query matrix, the key matrix and the value matrix in the self-attention mechanism is parallelized at the GPU level, and the multiple operators that compute the attention weight based on the three matrices are merged into one operator, which reduces the number of data accesses between the memory of the electronic device and the GPU, helps reduce the inference time of the model, and further improves the efficiency of natural language processing.
In one possible implementation, in loading the word vector of the natural language text into the graphics processor, so that the graphics processor obtains the query matrix, the key matrix, and the value matrix in a parallel processing manner based on the word vector of the natural language text, the processing unit 602 is specifically configured to:
loading the word vectors of the natural language text into a graphic processor so that the graphic processor constructs corresponding query vectors, key vectors and value vectors based on the word vectors of the natural language text;
acquiring a query weight matrix, a key weight matrix and a value weight matrix from a memory;
loading the query weight matrix, the key weight matrix, and the value weight matrix into the graphics processor, such that the graphics processor computes the query matrix, the key matrix, and the value matrix in parallel based on the query vector and the query weight matrix, the key vector and the key weight matrix, and the value vector and the value weight matrix.
In one possible implementation, in loading the query matrix, the key matrix, and the value matrix into the graphics processor, so that the graphics processor obtains the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix, and the value matrix, the processing unit 602 is specifically configured to:
loading the query matrix, the key matrix and the value matrix into the graphics processor, so that the graphics processor calculates attention weights in parallel based on the query matrix, the key matrix and the value matrix;
and multiplying the attention weight by the value matrix to obtain a first attention feature of the natural language text.
In one possible implementation, the query matrix, the key matrix, and the value matrix are inputs of each attention mechanism in a multi-head attention mechanism, the first attention feature is an output of each attention mechanism, and the processing unit 602 is further configured to:
splicing the first attention features to obtain attention splicing features;
and obtaining a second attention feature of the natural language text according to the attention splicing feature.
In a possible implementation, in terms of obtaining the second attention feature of the natural language text according to the attention splicing feature, the processing unit 602 is specifically configured to:
perform linear mapping on the attention splicing feature to obtain the second attention feature;
or, alternatively,
smooth the splices of the attention splicing feature to obtain a smoothed attention splicing feature;
and perform linear mapping on the smoothed attention splicing feature to obtain the second attention feature.
In a possible implementation, the processing unit 602 is further configured to:
performing word segmentation processing on the natural language text to obtain a word segmentation list;
and performing one-hot encoding on the words in the word segmentation list to obtain the word vectors of the natural language text, and storing the word vectors of the natural language text in a memory.
In a possible implementation manner, in terms of smoothing the splices of the attention splicing feature to obtain a smoothed attention splicing feature, the processing unit 602 is specifically configured to:
perform the operations of average value calculation, average value normalization, splice determination and mean filtering on any two adjacent first attention features in the attention splicing feature to obtain the smoothed attention splicing feature;
in terms of performing the operations of average value calculation, average value normalization, splice determination and mean filtering, the processing unit 602 is specifically configured to:
for a first attention feature A and a first attention feature B in any two adjacent first attention features, respectively calculating a first average value of the first attention feature A and a second average value of the first attention feature B, wherein the first attention feature B is the attention feature spliced after the first attention feature A;
normalizing the first average value to the interval [1, M] to obtain a first target value, and normalizing the second average value to the interval [1, N] to obtain a second target value, wherein M represents the total number of columns of the first attention feature A, N represents the total number of columns of the first attention feature B, and M and N are integers greater than 1;
selecting, from the first attention feature A, the first-target-value columns of features adjacent in sequence to the first attention feature B, selecting, from the first attention feature B, the second-target-value columns of features adjacent in sequence to the first attention feature A, and determining the selected columns of features as the splice of the first attention feature A and the first attention feature B;
and performing mean filtering on the features smaller than a preset value in the splice of the first attention feature A and the first attention feature B, to complete the smoothing of the splice of the first attention feature A and the first attention feature B.
According to an embodiment of the present application, the units of the BERT-based natural language processing apparatus shown in fig. 6 may be separately or entirely combined into one or several other units, or one (or more) of them may be further split into multiple functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the BERT-based natural language processing apparatus may also include other units, and in practical applications these functions may also be realized with the assistance of other units and through the cooperation of multiple units.
According to another embodiment of the present application, the BERT-based natural language processing apparatus shown in fig. 6 may be constructed, and the BERT-based natural language processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 or fig. 5 on a general-purpose computing device, such as a computer, that includes processing elements such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device via the computer-readable recording medium.
Based on the description of the method embodiment and the device embodiment, the embodiment of the application further provides an electronic device. Referring to fig. 7, the electronic device includes at least a processor 701, an input device 702, an output device 703, and a computer storage medium 704. The processor 701, the input device 702, the output device 703, and the computer storage medium 704 within the electronic device may be connected by a bus or other means.
The computer storage medium 704 may be stored in the memory of the electronic device; the computer storage medium 704 is used for storing a computer program comprising program instructions, and the processor 701 is used for executing the program instructions stored by the computer storage medium 704. The processor 701 (or CPU, Central Processing Unit) is the computing core and control core of the electronic device, and is adapted to implement one or more instructions, and in particular to load and execute the one or more instructions so as to implement the corresponding method flow or corresponding function.
In one embodiment, the processor 701 of the electronic device provided in the embodiment of the present application may be configured to perform a series of BERT-based natural language processing steps:
acquiring a word vector of a natural language text;
loading the word vectors of the natural language text into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors of the natural language text, and storing the query matrix, the key matrix and the value matrix into a memory;
and acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program, and loading them into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix and the value matrix.
It can be seen that the electronic device shown in fig. 7 acquires word vectors of a natural language text; loads the word vectors into a graphics processor according to a first CUDA program, so that the graphics processor obtains a query matrix, a key matrix and a value matrix in a parallel processing manner based on the word vectors, and stores the three matrices into a memory; and acquires the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program and loads them into the graphics processor, so that the graphics processor obtains a first attention feature of the natural language text in a parallel processing manner based on the three matrices. Thus, when the BERT model performs natural language processing, the computation of the query matrix, the key matrix and the value matrix in the self-attention mechanism is parallelized at the GPU level, and the multiple operators that compute the attention weight based on the three matrices are merged into one operator, which reduces the number of data accesses between the memory of the electronic device and the GPU, helps reduce the inference time of the model, and further improves the efficiency of natural language processing.
In another embodiment, the loading of the word vector of the natural language text into the graphics processor by the processor 701 is performed to enable the graphics processor to obtain the query matrix, the key matrix, and the value matrix in a parallel processing manner based on the word vector of the natural language text, and the method includes:
loading the word vectors of the natural language text into the graphics processor, so that the graphics processor constructs corresponding query vectors, key vectors and value vectors based on the word vectors of the natural language text;
acquiring a query weight matrix, a key weight matrix and a value weight matrix from a memory;
loading the query weight matrix, the key weight matrix, and the value weight matrix into the graphics processor, such that the graphics processor computes the query matrix, the key matrix, and the value matrix in parallel based on the query vector and the query weight matrix, the key vector and the key weight matrix, and the value vector and the value weight matrix.
In another embodiment, the loading of the query matrix, the key matrix, and the value matrix into the graphics processor by the processor 701 is performed to enable the graphics processor to obtain the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix, and the value matrix, and includes:
loading the query matrix, the key matrix and the value matrix into the graphics processor, so that the graphics processor calculates attention weights in parallel based on the query matrix, the key matrix and the value matrix;
and multiplying the attention weight by the value matrix to obtain a first attention feature of the natural language text.
In yet another embodiment, the query matrix, the key matrix and the value matrix are the inputs of each attention head in the multi-head attention mechanism, and the first attention feature is the output of each attention head; the processor 701 is further configured to:
splicing the first attention features to obtain attention splicing features;
and obtaining a second attention feature of the natural language text according to the attention splicing feature.
In another embodiment, when the processor 701 obtains the second attention feature of the natural language text according to the attention splicing feature, the operation includes:
performing linear mapping on the attention splicing feature to obtain the second attention feature;
alternatively,
smoothing the splicing position of the attention splicing feature to obtain a smoothed attention splicing feature;
and performing linear mapping on the smoothed attention splicing feature to obtain the second attention feature.
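Illustratively, the linear mapping of the spliced first attention features reduces to one matrix multiplication with an output weight matrix. In the sketch below, the H heads are assumed to be already concatenated into a [seq_len, H*d_k] row-major buffer; the kernel name, the layout and the one-thread-per-element scheme are assumptions made for clarity.

// Linear mapping of the spliced heads: out = concat * wo.
__global__ void output_projection_kernel(const float* concat, // [seq_len, H*d_k] spliced heads
                                         const float* wo,     // [H*d_k, d_model] output weights
                                         float* out,          // [seq_len, d_model] second attention feature
                                         int seq_len, int hd_k, int d_model) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= seq_len || col >= d_model) return;

    float acc = 0.0f;
    for (int i = 0; i < hd_k; ++i)
        acc += concat[row * hd_k + i] * wo[i * d_model + col];
    out[row * d_model + col] = acc;
}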
In yet another embodiment, before obtaining the word vector of the natural language text, the processor 701 is further configured to:
performing word segmentation processing on the natural language text to obtain a word segmentation list;
and performing one-hot encoding on the words in the word segmentation list to obtain the word vectors of the natural language text, and storing the word vectors of the natural language text into the memory.
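A host-side sketch of this preprocessing is given below. A trivial whitespace splitter stands in for a real word segmenter (an assumption made only to keep the sketch self-contained), and each segmented word is one-hot encoded against a vocabulary and kept in host memory.

#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Segment a text into words and one-hot encode each word against a vocabulary.
std::vector<std::vector<float>> one_hot_encode(
        const std::string& text,
        const std::unordered_map<std::string, int>& vocab) {
    std::vector<std::vector<float>> word_vectors;    // stored in (host) memory
    std::istringstream words(text);
    std::string word;
    while (words >> word) {                          // simplified word segmentation
        std::vector<float> v(vocab.size(), 0.0f);
        auto it = vocab.find(word);
        if (it != vocab.end()) v[it->second] = 1.0f; // the single "hot" position
        word_vectors.push_back(std::move(v));
    }
    return word_vectors;
}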
In another embodiment, when the processor 701 smooths the splicing position of the attention splicing feature to obtain the smoothed attention splicing feature, the operation includes:
carrying out operations of average value calculation, average value normalization, splicing position determination and mean filtering on any two adjacent first attention features in the attention splicing feature to obtain the smoothed attention splicing feature;
the processor 701 performs the operations of average value calculation, average value normalization, splicing position determination and mean filtering as follows:
for a first attention feature A and a first attention feature B in any two adjacent first attention features, respectively calculating a first average value of the first attention feature A and a second average value of the first attention feature B, wherein the first attention feature B is the attention feature spliced after the first attention feature A;
normalizing the first average value to an interval [1, M] to obtain a first target value, and normalizing the second average value to an interval [1, N] to obtain a second target value, wherein M represents the total number of columns of the first attention feature A, N represents the total number of columns of the first attention feature B, and M and N are integers greater than 1;
selecting, from the first attention feature A, the first-target-value number of columns of features adjacent to the first attention feature B, selecting, from the first attention feature B, the second-target-value number of columns of features adjacent to the first attention feature A, and determining the selected columns of features as the splicing position of the first attention feature A and the first attention feature B;
and performing mean filtering on the features smaller than a preset value within the splicing position of the first attention feature A and the first attention feature B, thereby completing the smoothing of the splicing position of the first attention feature A and the first attention feature B.
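Since the embodiment leaves the exact normalization function, the preset value and the filter window open, the following host-side sketch fills those gaps with explicitly assumed choices: a sigmoid squash for the normalization into [1, M] and [1, N], and a three-tap 1-D mean filter. It is an illustration of the described steps, not a definitive implementation.

#include <algorithm>
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<float>>;      // row-major feature matrix

static float mean_of(const Matrix& m) {              // average value calculation
    float sum = 0.0f; size_t n = 0;
    for (const auto& row : m) for (float x : row) { sum += x; ++n; }
    return n ? sum / n : 0.0f;
}

static int normalize_to(float mean, int cols) {      // average value normalization
    float s = 1.0f / (1.0f + std::exp(-mean));       // assumed sigmoid squash into (0, 1)
    return 1 + (int)std::lround(s * (cols - 1));     // target value in [1, cols]
}

// Smooth the splicing position between adjacent heads A and B (B spliced after A).
void smooth_splice(Matrix& a, Matrix& b, float preset) {
    int m = normalize_to(mean_of(a), (int)a[0].size());  // first target value
    int n = normalize_to(mean_of(b), (int)b[0].size());  // second target value

    // Mean-filter entries below the preset value inside columns [c0, c1).
    auto filter = [preset](Matrix& mat, int c0, int c1) {
        for (auto& row : mat)
            for (int c = c0; c < c1; ++c)
                if (row[c] < preset) {
                    int lo = std::max(c0, c - 1), hi = std::min(c1 - 1, c + 1);
                    float sum = 0.0f;
                    for (int j = lo; j <= hi; ++j) sum += row[j];
                    row[c] = sum / (hi - lo + 1);    // assumed three-tap mean filter
                }
    };
    filter(a, (int)a[0].size() - m, (int)a[0].size());   // last m columns of A
    filter(b, 0, n);                                     // first n columns of B
}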
By way of example, the electronic device includes, but is not limited to, a processor 701, an input device 702, an output device 703 and a computer storage medium 704, and may further include a memory, a power supply, an application client module and the like. The input device 702 may be a keyboard, a touch screen, a radio frequency receiver, etc., and the output device 703 may be a speaker, a display, a radio frequency transmitter, etc. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of an electronic device and does not limit the electronic device, which may include more or fewer components than those shown, combine some components, or have different components.
It should be noted that, since the steps in the BERT based natural language processing method are implemented when the processor 701 of the electronic device executes the computer program, the embodiments of the BERT based natural language processing method are all applicable to the electronic device, and all can achieve the same or similar beneficial effects.
An embodiment of the present application further provides a computer storage medium (memory), which is a storage device in an electronic device and is used to store programs and data. It is understood that the computer storage medium herein may include a storage medium built into the terminal, and may also include an extended storage medium supported by the terminal. The computer storage medium provides a storage space that stores the operating system of the terminal. Also stored in this storage space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor 701. The computer storage medium may be a high-speed RAM memory, or a non-volatile memory such as at least one disk memory; optionally, it may be at least one computer storage medium located remotely from the processor 701. In one embodiment, the one or more instructions stored in the computer storage medium may be loaded and executed by the processor 701 to implement the corresponding steps of the BERT-based natural language processing method described above.
Illustratively, the computer program of the computer storage medium includes computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that, since the computer program of the computer storage medium implements the steps of the BERT based natural language processing method when being executed by the processor, all the embodiments of the BERT based natural language processing method are applicable to the computer storage medium, and can achieve the same or similar beneficial effects.
The foregoing embodiments have been described in detail to illustrate the principles and implementations of the present application; the above description of the embodiments is provided only to help understand the method and core concept of the present application. Meanwhile, for a person skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A BERT-based natural language processing method, the method comprising:
acquiring a word vector of a natural language text;
loading the word vector of the natural language text into a graphic processor according to a first CUDA program, so that the graphic processor obtains a query matrix, a key matrix and a value matrix in a parallel processing mode based on the word vector of the natural language text, and storing the query matrix, the key matrix and the value matrix into a memory;
and acquiring the query matrix, the key matrix and the value matrix from the memory according to a second CUDA program, and loading the query matrix, the key matrix and the value matrix into the graphics processor, so that the graphics processor acquires the first attention feature of the natural language text in a parallel processing mode based on the query matrix, the key matrix and the value matrix.
2. The method of claim 1, wherein loading the word vector of the natural language text into a graphics processor to cause the graphics processor to obtain a query matrix, a key matrix, and a value matrix in a parallel processing manner based on the word vector of the natural language text comprises:
loading the word vectors of the natural language text into a graphics processor to cause the graphics processor to construct corresponding query vectors, key vectors, and value vectors based on the word vectors of the natural language text;
acquiring a query weight matrix, a key weight matrix and a value weight matrix from a memory;
loading the query weight matrix, the key weight matrix, and the value weight matrix into the graphics processor to cause the graphics processor to compute the query matrix, the key matrix, and the value matrix in parallel based on the query vector and the query weight matrix, the key vector and the key weight matrix, and the value vector and the value weight matrix.
3. The method of claim 1 or 2, wherein loading the query matrix, the key matrix, and the value matrix into the graphics processor to cause the graphics processor to obtain the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix, and the value matrix comprises:
loading the query matrix, key matrix, and value matrix into the graphics processor to cause the graphics processor to compute attention weights in parallel based on the query matrix, the key matrix, and the value matrix;
and multiplying the attention weight by the value matrix to obtain a first attention feature of the natural language text.
4. The method of claim 3, wherein the query matrix, the key matrix, and the value matrix are inputs to each of a plurality of attention mechanisms, wherein the first attention feature is an output of each of the plurality of attention mechanisms, and wherein the method further comprises:
splicing the first attention features to obtain attention splicing features;
and obtaining a second attention feature of the natural language text according to the attention splicing feature.
5. The method of claim 4, wherein the deriving a second attention feature of the natural language text from the attention-splicing feature comprises:
performing linear mapping on the attention splicing feature to obtain a second attention feature;
alternatively,
smoothing the splicing position of the attention splicing feature to obtain the attention splicing feature after smoothing;
and performing linear mapping on the attention splicing feature after the smoothing processing to obtain the second attention feature.
6. The method of claim 1, wherein prior to obtaining the word vector for the natural language text, the method further comprises:
performing word segmentation processing on the natural language text to obtain a word segmentation list;
and performing one-hot encoding on the words in the word segmentation list to obtain the word vectors of the natural language text, and storing the word vectors of the natural language text into a memory.
7. A BERT-based natural language processing apparatus, comprising an acquisition unit and a processing unit;
the acquiring unit is used for acquiring word vectors of natural language texts;
the processing unit is used for loading the word vectors of the natural language texts into a graphic processor according to a first CUDA program, so that the graphic processor obtains a query matrix, a key matrix and a value matrix in a parallel processing mode based on the word vectors of the natural language texts, and stores the query matrix, the key matrix and the value matrix into a memory;
the processing unit is further configured to obtain the query matrix, the key matrix, and the value matrix from a memory according to a second CUDA program, and load the query matrix, the key matrix, and the value matrix into the graphics processor, so that the graphics processor obtains the first attention feature of the natural language text in a parallel processing manner based on the query matrix, the key matrix, and the value matrix.
8. The apparatus according to claim 7, wherein in loading the word vector of the natural language text into a graphics processor, so that the graphics processor obtains a query matrix, a key matrix, and a value matrix in a parallel processing manner based on the word vector of the natural language text, the processing unit is specifically configured to:
loading the word vectors of the natural language text into a graphics processor to cause the graphics processor to construct corresponding query vectors, key vectors, and value vectors based on the word vectors of the natural language text;
acquiring a query weight matrix, a key weight matrix and a value weight matrix from a memory;
loading the query weight matrix, the key weight matrix, and the value weight matrix to the graphics processor to cause the graphics processor to compute the query matrix, the key matrix, and the value matrix in parallel based on the query vector and the query weight matrix, the key vector and the key weight matrix, and the value vector and the value weight matrix.
9. An electronic device comprising an input device and an output device, further comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to perform the method of any of claims 1-6.
10. A computer storage medium having stored thereon one or more instructions adapted to be loaded by a processor and to perform the method of any of claims 1-6.
CN202111119670.1A 2021-09-23 2021-09-23 BERT-based natural language processing method and related equipment Pending CN113836266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111119670.1A CN113836266A (en) 2021-09-23 2021-09-23 BERT-based natural language processing method and related equipment


Publications (1)

Publication Number Publication Date
CN113836266A 2021-12-24

Family

ID=78969760




Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321777A1 (en) * 2014-06-20 2016-11-03 Tencent Technology (Shenzhen) Company Limited Data parallel processing method and apparatus based on multiple graphic processing units
CN112052683A (en) * 2020-09-03 2020-12-08 平安科技(深圳)有限公司 Text matching method and device, computer equipment and storage medium
CN112328767A (en) * 2020-11-11 2021-02-05 重庆邮电大学 Question-answer matching method based on BERT model and comparative aggregation framework
CN113282707A (en) * 2021-05-31 2021-08-20 平安国际智慧城市科技股份有限公司 Data prediction method and device based on Transformer model, server and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169530A (en) * 2022-06-29 2022-10-11 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN115169530B (en) * 2022-06-29 2023-09-26 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN107273503B (en) Method and device for generating parallel text in same language
CN111444340B (en) Text classification method, device, equipment and storage medium
US11328180B2 (en) Method for updating neural network and electronic device
CN107526725A (en) The method and apparatus for generating text based on artificial intelligence
CN110472002B (en) Text similarity obtaining method and device
CN109101624A (en) Dialog process method, apparatus, electronic equipment and storage medium
CN110162766B (en) Word vector updating method and device
JP7417679B2 (en) Information extraction methods, devices, electronic devices and storage media
CN116910572B (en) Training method and device for three-dimensional content generation model based on pre-training language model
CN110705273B (en) Information processing method and device based on neural network, medium and electronic equipment
CN111027292B (en) Method and system for generating limited sampling text sequence
CN113723115B (en) Open domain question-answer prediction method based on pre-training model and related equipment
CN115424013A (en) Model training method, image processing apparatus, and medium
CN114913590A (en) Data emotion recognition method, device and equipment and readable storage medium
CN113836266A (en) BERT-based natural language processing method and related equipment
CN112132269B (en) Model processing method, device, equipment and storage medium
CN115774992A (en) Information processing method, information processing apparatus, electronic device, storage medium, and program product
CN114742045A (en) Semantic text similarity calculation method and device and storage medium
CN112749364B (en) Webpage generation method, device, equipment and storage medium based on artificial intelligence
CN113536809A (en) Semantic-based unsupervised common sense question-answering method and system
CN113449079B (en) Text abstract generating method and device, electronic equipment and storage medium
CN113591472A (en) Lyric generation method, lyric generation model training method and device and electronic equipment
CN112818688A (en) Text processing method, device, equipment and storage medium
CN114638365B (en) Machine reading understanding reasoning method and device, electronic equipment and storage medium
CN112132281B (en) Model training method, device, server and medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination