CN114065771A - Pre-training language processing method and device

Publication number: CN114065771A
Authority: CN (China)
Prior art keywords: vector, word sequence, attention, span, self
Legal status: Pending
Application number: CN202011155558.9A
Other languages: Chinese (zh)
Inventors: 蒋子航, 周大权, 陈云鹏, 冯佳时, 颜水成
Current Assignee: Eto Singapore Ltd Private
Original Assignee: Eto Singapore Ltd Private

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The application relates to the technical field of pre-training language representation and discloses a pre-training language processing method, medium and device. The pre-training language processing method comprises the following steps: receiving a word sequence comprising a plurality of words; embedding the word sequence to obtain an initial word sequence vector; and performing at least one encoding process based on the initial word sequence vector to obtain a final word sequence vector. In each encoding process, mixed attention processing is performed at least once to generate word sequence features from a linear output result obtained from the initial word sequence vector and attention weights generated based on local span perception of the initial word sequence vector, and feed-forward processing is performed at least once to process the word sequence features into the final word sequence vector for output. According to the technical scheme, the mixed attention processing is obtained by combining span-based dynamic convolution processing with self-attention operation, so that local dependency relationships can be learned by extracting local span perception, the capability of encoding local information is enhanced, and the global dependency modeling of the self-attention operation is retained; the method therefore has lower training cost, fewer model parameters and better training performance.

Description

Pre-training language processing method and device
Technical Field
The present application relates to the field of natural language processing, and in particular, to a pre-training language processing method and a pre-training language processing apparatus based on span dynamic convolution.
Background
Natural Language Understanding (NLU) is regarded as one of the AI-complete problems and is a fundamental task of Natural Language Processing (NLP). Natural language understanding means enabling a computer to recognize and understand text information at the semantic level. In natural language understanding, natural language is input into the computer in the form of text. The input text generally takes words (tokens) as its basic units, and sentences and passages are formed by combinations of words. Therefore, the key to having a computer understand words, sentences, passages, etc. is the proper encoding (encode) of words. The vector generated by the encoder contains the semantic information carried by a word in its context, so the vector can be further applied to subtasks in natural language understanding, such as language modeling, sentiment analysis, text classification, machine translation and the like.
When vectors are generated with an encoder, Bidirectional Encoder Representations from Transformers (BERT) is a commonly used encoder. BERT is an unsupervised, deep, bidirectional pre-training language representation method for NLP pre-training: it is trained on large-scale unlabeled corpora to obtain semantic representations of text, and these semantic representations are then fine-tuned on a specific natural language processing task to obtain the resulting vectors. These vectors are ultimately applied in the natural language processing task. However, BERT relies heavily on global self-attention blocks, which occupy excessive memory and incur high computational cost. The problem to be solved is therefore to provide a new pre-training language representation method that can learn both global and local context information while reducing computational redundancy.
Disclosure of Invention
The embodiment of the application provides a pre-training language processing method, a pre-training language processing device, pre-training language processing equipment and a computer readable medium.
In a first aspect, an embodiment of the present application provides a method for pre-training language processing, and the method includes:
receiving a sequence of words, the sequence of words comprising a plurality of words;
embedding the word sequence to obtain an initial word sequence vector;
performing at least one encoding process based on the initial word sequence vector to obtain a final word sequence vector;
wherein, in each of the encoding processes,
performing mixed attention processing at least once, generating word sequence features from a linear output result obtained from the initial word sequence vector and attention weights generated based on local span perception of the initial word sequence vector;
and performing feed-forward processing at least once, processing the word sequence features into a final word sequence vector and outputting the final word sequence vector.
According to the pre-training language processing method, span-based dynamic convolution processing and self-attention operation are combined into mixed attention processing, so that local dependency relationships can be learned by extracting local span perception, the capability of encoding local information is enhanced, and the global dependency modeling of the self-attention operation is retained; the method therefore has lower training cost, fewer model parameters and better training performance.
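For illustration only (this sketch is not part of the claims), the claimed sequence of steps can be expressed in PyTorch roughly as follows; all class and parameter names are hypothetical, and the encoder layer is left abstract:

from torch import nn

class PreTrainingLanguageProcessor(nn.Module):
    """Hypothetical sketch of the claimed flow; not the patented implementation."""
    def __init__(self, vocab_size, d_model, num_layers, make_encoder_layer):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)            # embedding processing
        self.layers = nn.ModuleList([make_encoder_layer() for _ in range(num_layers)])

    def forward(self, word_ids):                                       # word_ids: (batch, seq_len)
        x = self.embedding(word_ids)                                   # initial word sequence vector
        for layer in self.layers:                                      # at least one encoding process
            x = layer(x)       # mixed attention processing + feed-forward processing
        return x               # final word sequence vector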
In a possible implementation of the first aspect, the hybrid attention processing further includes: performing linear transformation on the initial word sequence vector to generate a first vector, a second vector and a third vector, and performing depth separable convolution operation on the initial word sequence vector to generate a fourth vector, wherein the fourth vector is used for representing the local span perception; performing a span-based dynamic convolution operation based on the first vector, the second vector and the fourth vector to generate span-based word sequence features; performing a self-attention operation based on the first vector, the second vector and the third vector, generating a self-attention-based word sequence feature; and combining the span-based word sequence characteristics and the self-attention-based word sequence characteristics to obtain the word sequence characteristics.
In a possible implementation of the first aspect, the span-based dynamic convolution operation further includes: generating the attention weight by performing a bit multiplication operation on the first vector and the fourth vector, and generating a dynamic convolution kernel based on the attention weight, as shown in the following formula:
f(Q,Ks)=softmax(Wf(Q⊙Ks)),
performing light-weight convolution operation on the dynamic convolution kernel and the second vector to obtain the span-based word sequence characteristics, which are shown as the following formula:
SDConv(Q,Ks,V;Wf,i)=LConv(V,softmax(Wf(Q⊙Ks)),i);
wherein Wf ∈ R^(dk×k) represents the learnable weight used to generate the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; Ks represents the fourth vector; ⊙ denotes the element-wise (bit) product operation; LConv represents the lightweight convolution operation; SDConv represents the span-based dynamic convolution operation; and i denotes the position i of the convolution kernel of the span-based dynamic convolution operation.
The mixed attention processing is calculated by the following formula to obtain the word sequence characteristics:
Mixed-Attn(K,Q,Ks,V;Wf)=Cat(Self-Attn(Q,K,V),SDConv(Q,Ks,V;Wf));
wherein Wf ∈ R^(dk×k) represents the learnable weight used to generate the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; K represents the third vector; Ks represents the fourth vector; ⊙ denotes the element-wise (bit) product operation; SDConv represents the span-based dynamic convolution operation; i represents the position i of the convolution kernel of the span-based dynamic convolution operation; Mixed-Attn represents the mixed attention processing; Self-Attn represents the self-attention operation; and Cat represents the merge operation.
In a possible implementation of the first aspect, the hybrid attention processing further includes: when the linear transformation is performed on the initial word sequence vector, reducing the dimension of the d-dimensional initial word sequence vector to d/γ, and reducing the number of attention heads by γ; wherein d represents the embedding dimension of the initial word sequence vector, γ represents the reduction ratio, and γ > 1.
In a possible implementation of the first aspect, the feed-forward processing further includes: and dividing the word sequence characteristics into a plurality of groups in an embedding dimension, wherein each group is respectively subjected to linear processing, and merging the linear processing results to obtain the final word sequence vector.
In a possible implementation of the first aspect, the span-based dynamic convolution operation further includes: the convolution kernel of the depth separable convolution operation and the convolution kernel of the span-based dynamic convolution operation are the same size.
In one possible implementation of the first aspect described above, the encoding process is performed at least once in an iterative manner.
In a second aspect, an embodiment of the present application provides a pre-training language processing apparatus, including:
a receiving module that receives a word sequence, the word sequence including a plurality of words;
the embedding module is used for embedding the word sequence to obtain an initial word sequence vector;
the encoding module is used for implementing at least one encoding process based on the initial word sequence vector to obtain a final word sequence vector;
wherein, in each of the encoding modules,
the mixed attention sub-module is used for executing at least one time of mixed attention processing and generating word sequence characteristics according to a linear output result obtained by the initial word sequence vector and attention weights generated based on local span perception of the initial word sequence vector;
and the feedforward submodule executes at least one feedforward process and is used for processing the word sequence characteristics into a final word sequence vector and outputting the final word sequence vector.
In a possible implementation of the second aspect, the hybrid attention sub-module further includes: a linear transformation layer, configured to perform linear transformation on the initial word sequence vector to generate a first vector, a second vector, and a third vector, perform depth separable convolution on the initial word sequence vector to generate a fourth vector, where the fourth vector is used to represent the local span sensing; a span-based dynamic convolution layer to perform a span-based dynamic convolution operation based on the first vector, the second vector, and the fourth vector to generate span-based word sequence features; a self-attention layer, which performs self-attention operation based on the first vector, the second vector and the third vector to generate a self-attention-based word sequence feature; and the merging layer is used for merging the span-based word sequence characteristics and the self-attention-based word sequence characteristics to obtain the word sequence characteristics.
In a possible implementation of the second aspect, the span-based dynamic convolution layer further includes the following operations: generating the attention weight by performing a bit multiplication operation on the first vector and the fourth vector, and generating a dynamic convolution kernel based on the attention weight, as shown in the following formula:
f(Q,Ks)=softmax(Wf(Q⊙Ks)),
performing light-weight convolution operation on the dynamic convolution kernel and the second vector to obtain the span-based word sequence characteristics, which are shown as the following formula:
SDConv(Q,Ks,V;Wf,i)=LConv(V,softmax(Wf(Q⊙Ks)),i);
wherein Wf ∈ R^(dk×k) represents the learnable weight used to generate the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; Ks represents the fourth vector; ⊙ denotes the element-wise (bit) product operation; LConv represents the lightweight convolution operation; SDConv represents the span-based dynamic convolution operation; and i denotes the position i of the convolution kernel of the span-based dynamic convolution operation.
In a possible implementation of the second aspect, the mixed attention processing is calculated by the following formula to obtain the word sequence characteristics:
Mixed-Attn(K,Q,Ks,V;Wf)=Cat(Self-Attn(Q,K,V),SDConv(Q,Ks,V;Wf));
wherein Wf ∈ R^(dk×k) represents the learnable weight used to generate the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; K represents the third vector; Ks represents the fourth vector; ⊙ denotes the element-wise (bit) product operation; SDConv represents the span-based dynamic convolution operation; i represents the position i of the convolution kernel of the span-based dynamic convolution operation; Mixed-Attn represents the mixed attention processing; Self-Attn represents the self-attention operation; and Cat represents the merge operation.
In a possible implementation of the second aspect, the hybrid attention sub-module further includes the following operations: when the linear transformation is performed on the initial word sequence vector, reducing the dimension of the d-dimensional initial word sequence vector to d/γ, and reducing the number of attention heads by γ; wherein d represents the embedding dimension of the initial word sequence vector, γ represents the reduction ratio, and γ > 1.
In a possible implementation of the second aspect, the feed-forward sub-module further includes: and dividing the word sequence characteristics into a plurality of groups in an embedding dimension, wherein each group is respectively subjected to linear processing, and merging the linear processing results to obtain the final word sequence vector.
In a possible implementation of the second aspect, the span-based dynamic convolution layer further includes: the convolution kernel of the depth separable convolution operation and the convolution kernel of the span-based dynamic convolution operation are the same size.
In one possible implementation of the second aspect described above, the encoding process is performed at least once in an iterative manner.
In a third aspect, an embodiment of the present application provides a pre-training language representation apparatus, including:
a memory for storing instructions for execution by one or more processors of the system, an
The processor, which is one of the processors of the system, is configured to execute the instructions to implement any one of the pre-training language processing methods in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium encoded with a computer program, where the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform any one of the pre-training language processing methods in the first aspect.
Drawings
FIG. 1 illustrates an architectural diagram of a pre-trained language representation model, according to some embodiments of the present application;
FIG. 2 illustrates a block diagram of an electronic device, according to some embodiments of the present application;
FIG. 3 illustrates a flow diagram of a pre-training language representation method, according to some embodiments of the present application;
FIG. 4 illustrates a schematic diagram of a process for generating self-attention weights from attention operations, in accordance with some embodiments of the present application;
FIG. 5 illustrates a schematic diagram of a self-attentive operational process, according to some embodiments of the present application;
FIG. 6 illustrates a schematic diagram of a process for generating convolution kernels for dynamic convolution processing, according to some embodiments of the present application;
FIG. 7 illustrates a schematic diagram of a dynamic convolution process, according to some embodiments of the present application;
FIG. 8 illustrates a schematic diagram of a process for span-based dynamic convolution to generate a span-based convolution kernel, according to some embodiments of the present application;
FIG. 9 illustrates a schematic diagram of a span-based dynamic convolution process, according to some embodiments of the present application;
FIG. 10 illustrates a schematic diagram of a hybrid attention process, according to some embodiments of the present application;
FIG. 11 illustrates a graph of results of span-based dynamic convolution with different convolution kernel sizes on a GLUE development set, according to some embodiments of the present application;
FIG. 12(a) illustrates a graph of a mean attention visualization of a pre-training language representation method BERT, according to some embodiments of the present application;
FIG. 12(b) illustrates a graph of a mean attention visualization of the pre-training language representation method ConvBERT, according to some embodiments of the present application;
FIG. 13 illustrates a schematic diagram of a pre-training language representation apparatus, according to some embodiments of the present application.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
The illustrative embodiments of the present application include, but are not limited to, a pre-training language representation method, a pre-training language representation apparatus, a pre-training language representation device, and a computer readable medium.
In the present application, the pre-training language processing method is, in a specific embodiment, a pre-training language representation (language representation).
FIG. 1 provides an architectural diagram of a pre-trained language representation model, according to some embodiments of the present application. As shown in fig. 1, the pre-training language representation model receives a group of word sequences as input, performs a linear transformation on the word sequences through an embedding process, then adds a position code of each word in the word sequences to obtain an output X (i.e., an initial word sequence vector corresponding to the present application), and after completing the above basic processing, enters a loop process, i.e., a process of a multi-layer coding operation. In the loop process, an attention process is first performed, and specifically, the attention process may be a self-attention operation, a multi-head self-attention operation, or an improved attention process based on the self-attention operation. And performing residual error processing and summation processing on the output O of attention processing and the input X of attention processing, and then performing Normalization operation, wherein the Normalization is Layer Normalization and adopts the following formula:
LN(x) = (x − μ) / σ,
where μ represents the mean and σ represents the standard deviation. The normalized result is then input into a fully connected layer for feed-forward processing, which can be understood as two consecutive linear transformations with an activation function in between. This completes the coding operation of one loop; after N loops of the coding operation, one pre-training language representation is complete. The parameters of a Transformer are initialized by the pre-training representation model, the Transformer is used as a feature extractor, and fine-tuning is then performed on specific application tasks, such as word segmentation, part-of-speech tagging, semantic role labeling, dialogue slot filling and information extraction, which can be cast as standard sequence labeling problems, or text classification and extractive text summarization, which can be regarded as single-sentence or document classification problems, and so on.
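As a minimal, non-limiting sketch of one such coding loop (attention, residual summation, layer normalization, feed-forward), the structure can be written in PyTorch as below; the attention module is left abstract, and the GELU activation and inner width d_ff are assumptions, since the text only specifies two linear transformations with an activation in between:

from torch import nn

class EncoderBlock(nn.Module):
    """Sketch of one coding loop in Fig. 1: attention -> add & normalize -> feed-forward -> add & normalize."""
    def __init__(self, attention: nn.Module, d_model: int, d_ff: int):
        super().__init__()
        self.attention = attention                      # self-attention or an improved (mixed) attention
        self.norm1 = nn.LayerNorm(d_model)              # layer normalization: (x - mu) / sigma, plus affine
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                               # x: initial word sequence vector X
        o = self.attention(x)                           # output O of the attention processing
        x = self.norm1(x + o)                           # residual sum, then normalization
        return self.norm2(x + self.ffn(x))              # feed-forward, residual, normalization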
It is to be understood that the above description of the solution for pre-training the language representation model shown in fig. 1 is only exemplary and not limiting.
Fig. 2 illustrates a block diagram of an electronic device 100, according to some embodiments of the present application. Specifically, as shown in FIG. 2, electronic device 100 includes one or more processors 104, system control logic 108 coupled to at least one of processors 104, system memory 112 coupled to system control logic 108, non-volatile memory (NVM)116 coupled to system control logic 108, and network interface 120 coupled to system control logic 108.
In some embodiments, the processor 104 may include one or more single-core or multi-core processors. In some embodiments, the processor 104 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where the electronic device 100 employs an eNB (enhanced base station) or RAN (radio access network) controller, the processor 104 may be configured to perform various consistent embodiments.
In some embodiments, system control logic 108 may include any suitable interface controllers to provide any suitable interface to at least one of processors 104 and/or any suitable device or component in communication with system control logic 108.
In some embodiments, system control logic 108 may include one or more memory controllers to provide an interface to system memory 112. System memory 112 may be used to load and store data and/or instructions. Memory 112 of electronic device 100 may comprise any suitable volatile memory in some embodiments, such as suitable Dynamic Random Access Memory (DRAM). In some embodiments, the system memory 112 may be used to load or store instructions implementing the pre-trained language representation described above, or the system memory 112 may be used to load or store instructions implementing an application program that is pre-trained using the pre-trained language representation model described above.
NVM/memory 116 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, NVM/memory 116 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one of an HDD (Hard disk drive), CD (Compact Disc) drive, DVD (Digital Versatile Disc) drive. NVM/memory 116 may also be used to store the pre-trained language representation models used by the pre-trained language representation methods described above.
NVM/memory 116 may comprise a portion of a storage resource on a device on which electronic device 100 is installed, or it may be accessible by, but not necessarily a part of, the device. For example, NVM/storage 116 may be accessed over a network via network interface 120.
In particular, system memory 112 and NVM/storage 116 may each include: a temporary copy and a permanent copy of instructions 124. The instructions 124 may include: the instructions that, when executed by at least one of the processors 104, cause the electronic device 100 to implement the pre-training language representation methods provided herein. In some embodiments, the instructions 124, hardware, firmware, and/or software components thereof may additionally/alternatively be disposed in the system control logic 108, the network interface 120, and/or the processor 104.
Network interface 120 may include a transceiver to provide a radio interface for electronic device 100 to communicate with any other suitable device (e.g., front end module, antenna, etc.) over one or more networks. In some embodiments, the network interface 120 may be integrated with other components of the electronic device 100. For example, network interface 120 may be integrated with at least one of processor 104, system memory 112, NVM/storage 116, and a firmware device (not shown) having instructions that, when executed by at least one of processors 104, electronic device 100 implements the pre-training language representation methods provided herein.
The network interface 120 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 120 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In some embodiments, at least one of the processors 104 may be packaged together with logic for one or more controllers of the system control logic 108 to form a System In Package (SiP). In some embodiments, at least one of the processors 104 may be integrated on the same die with logic for one or more controllers of the system control logic 108 to form a system on a chip (SoC).
The electronic device 100 may further include: input/output (I/O) devices 132. I/O device 132 may include a user interface to enable a user to interact with electronic device 100; the design of the peripheral component interface enables peripheral components to also interact with the electronic device 100. In some embodiments, the electronic device 100 further comprises a sensor for determining at least one of environmental conditions and location information associated with the electronic device 100.
FIG. 3 provides a pre-training language representation method, according to some embodiments of the present application. As shown in fig. 3, the pre-training language representation method includes the following steps:
step 201: a word sequence comprising a plurality of words is received.
Specifically, for example, a "he can a can" may be received, each word in a sentence is referred to as a word (token), and the sentence may be expressed as a word sequence including a plurality of words.
Step 202: and embedding the word sequence to obtain an initial word sequence vector.
Specifically, a representation vector x of each word in the input word sequence is obtained by adding the word embedding (Embedding) and the word position code (Positional Embedding), and the representation vectors of all words in the word sequence are combined to obtain the initial word sequence vector X.
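A minimal sketch of this embedding step is given below, assuming a learned positional embedding (the text does not specify whether the position code is learned or fixed); names are hypothetical:

import torch
from torch import nn

class WordSequenceEmbedding(nn.Module):
    """Sketch: word embedding plus word position coding."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)     # word embedding
        self.pos_emb = nn.Embedding(max_len, d_model)          # word position coding

    def forward(self, word_ids):                               # word_ids: (batch, n)
        positions = torch.arange(word_ids.size(1), device=word_ids.device)
        return self.token_emb(word_ids) + self.pos_emb(positions)   # initial word sequence vector X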
Step 203: and performing at least one encoding process based on the initial word sequence vector to obtain a final word sequence vector.
Specifically, in each encoding process, performing at least one mixed attention process, generating a word sequence feature from a linear output result obtained from an initial word sequence vector and an attention weight generated based on local span perception of the initial word sequence vector; and executing at least one feedforward process, processing the word sequence characteristics into a final word sequence vector and outputting the final word sequence vector.
The following describes in detail the hybrid attention process in the present application, with respect to the improvement made from the attention-focused operation, in conjunction with the drawings.
Fig. 4 illustrates a schematic diagram of a process for generating self-attention weights by the self-attention operation, according to some embodiments of the present application. As shown in fig. 4, for each word the self-attention operation must query the entire input word sequence to obtain that word's dependency relationships with all the remaining words and generate the self-attention weights, i.e. it is based on global dependency relationships, which creates a large amount of computational redundancy. FIG. 5 illustrates a schematic diagram of a self-attention operation process, according to some embodiments of the present application. As shown in FIG. 5, the input is an initial word sequence vector X ∈ R^(n×d), where d is the embedding dimension and n is the number of words in the word sequence. A linear transformation is performed on the input X to generate a first vector, a second vector and a third vector; specifically, the first vector may be a query vector Q, the second vector a value vector V, and the third vector a key vector K, where Q, V, K ∈ R^(n×d). Suppose there are H attention heads; then Q, V and K are evenly divided into segments of dimension dk = d/H. The self-attention operation gives an output of the form:
Self-Attn(Q, K, V) = softmax(QK^T/√dk)V. (1)
the effect of the Softmax function: the input vector is converted to a vector with each element between 0 and 1. In connection with the application scenario, the Softmax function gives, for each word in the sequence, the other words and the attention coefficient of this word, the sum of these non-negative coefficients being 1.
FIG. 6 illustrates a schematic diagram of a process for generating convolution kernels for a dynamic convolution process, according to some embodiments of the present application. As shown in fig. 6, the dynamic convolution generates convolution kernels only by acquiring the current word, and thus causes the same word with different meanings to generate the same convolution kernel, for example, a "can" located in front of "a" and a "can" located behind "a" both generate the same convolution kernel, which results in that different meanings of the same word in different contexts cannot be distinguished, and an erroneous result is generated.
Let W ∈ R^(d×k) denote the convolution kernel of a lightweight convolution (LConv); then the depth separable convolution at position i and channel c can be expressed as:
DWConv(X, W, i, c) = Σ_{j=1..k} W_{c,j} · X_{i+j−⌈(k+1)/2⌉, c}. (2)
by connecting weights along the channel dimension, the convolution kernel can be simplified to W ∈ RkResulting in a lightweight convolution operation as described by:
Figure BDA0002742666160000093
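The following sketch illustrates the lightweight convolution of equations (2)-(3), with a single kernel fully tied across channels as in the simplification above; it is an illustration, not the patented implementation:

import torch
import torch.nn.functional as F

def light_conv(X, W):
    """Lightweight convolution (Eqs. 2-3, sketch): X is (n, d); W is a shared kernel of size k.

    The kernel is softmax-normalized over its k positions and applied as a depthwise
    convolution with the same (tied) weights for every channel.
    """
    n, d = X.shape
    k = W.numel()
    w = F.softmax(W, dim=0)                             # normalize the kernel weights
    kernel = w.repeat(d, 1, 1)                          # (d, 1, k): tied across channels
    x = X.t().unsqueeze(0)                              # (1, d, n) layout expected by conv1d
    y = F.conv1d(x, kernel, padding=k // 2, groups=d)   # depthwise convolution
    return y.squeeze(0).t()                             # back to (n, d)

out = light_conv(torch.randn(10, 16), torch.randn(9))   # n = 10 words, d = 16, kernel size k = 9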
the trained convolution kernel parameters are fixed for any input word and are not beneficial to capturing the diversity of words, so that the dynamic convolution capable of generating the convolution kernel parameters with a specific input word as a condition is provided. FIG. 7 is a schematic diagram illustrating a dynamic convolution process, such as that shown in FIG. 7, followed by a linear processing and Gate Linear Unit (GLU) process to generate a dynamic convolution kernel from a current word and apply it to convolution operations of words adjacent to the current word to generate a new representation embedding, according to some embodiments of the present application. Dynamic convolution can better utilize input information and generate convolution kernels based on input words than a standard convolution kernel that is fixed after training. Specifically, the position-dependent convolution kernel W ═ f (X) at position ii) Where f is with a learnable weight
Figure BDA0002742666160000094
Followed by a softmax function. The dynamic convolution operation can be expressed as follows:
DConv(X,Wf,i)=LConv(X,softmax(WfXi),i). (4)
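A sketch of the dynamic convolution of equation (4) is given below; the shape assumed for Wf (k×d) follows from the requirement that Wf·Xi yield a k-dimensional kernel:

import torch
import torch.nn.functional as F

def dynamic_conv(X, W_f):
    """Dynamic convolution (Eq. 4, sketch): a kernel softmax(W_f X_i) is generated from each
    input word X_i and applied to the k words around position i.

    X: (n, d) word representations; W_f: (k, d) learnable kernel-generating weight (assumed shape).
    """
    k = W_f.size(0)
    kernels = F.softmax(X @ W_f.t(), dim=-1)            # (n, k): one kernel per position
    pad = k // 2
    Xp = F.pad(X, (0, 0, pad, pad))                     # pad along the word dimension
    windows = Xp.unfold(0, k, 1)                        # (n, d, k): local window around each word
    return torch.einsum('ndk,nk->nd', windows, kernels) # position-wise convolution

out = dynamic_conv(torch.randn(10, 16), torch.randn(9, 16))   # (10, 16)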
as can be seen from fig. 6, the convolution kernel depends only on the input single word, and ignores the local context information, and when the same word has different meanings in different contexts, the same convolution kernel is still generated, which impairs the performance of the model.
FIG. 8 illustrates a schematic diagram of a process for span-based dynamic convolution to generate a span-based convolution kernel, according to some embodiments of the present application. As shown in fig. 8, the present application provides a span-based dynamic convolution, which obtains a dependency relationship by obtaining a span perception composed of a current word and words adjacent to the current word, that is, based on a local span perception, and can better utilize a local dependency relationship and distinguish different meanings of the same word in different contexts. For example, span-based dynamic convolution generates different convolution kernels for different positions of the word "can" in the word sequence, in combination with local span perception.
As shown in fig. 9, span-based dynamic convolution first collects the local span perception of a word using a depth-wise separable convolution (DWConv) and then dynamically generates the convolution kernel. The depth separable convolution operation concentrates the information of several adjacent words at the position of the middle word, realizing the generation of a span convolution kernel. Generating the kernel based on this local context helps the convolution kernel capture local dependencies more effectively. In addition, in order to make the span-based dynamic convolution compatible with self-attention, the input X is linearly transformed to generate a first vector and a second vector; specifically, the first vector may be a query vector Q and the second vector a value vector V, where Q, V ∈ R^(n×d). Meanwhile, a depth separable convolution operation is performed on the input X to generate a fourth vector Ks that represents the local span perception. Then, a bit-multiplication operation (Multiply) is performed on Q and Ks to generate an attention weight, and a dynamic convolution kernel is generated from the attention weight through a linear processing, as shown in the following formula:
f(Q,Ks)=softmax(Wf(Q⊙Ks)), (5)
and a lightweight convolution operation is performed with the dynamic convolution kernel on the second vector to obtain the span-based word sequence feature, as shown in the following formula:
SDConv(Q,Ks,V;Wf,i)=LConv(V,softmax(Wf(Q⊙Ks)),i); (6)
wherein Wf ∈ R^(dk×k) represents the learnable weight used to generate the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; Ks represents the fourth vector; ⊙ denotes the element-wise (bit) product operation; LConv represents the lightweight convolution operation; SDConv represents the span-based dynamic convolution operation; and i denotes the position i of the convolution kernel of the span-based dynamic convolution operation.
And finally, carrying out linear processing on the obtained result. Unless otherwise noted, the same size convolution kernel is maintained throughout for both depth separable convolutions and span based dynamic convolutions.
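For illustration, a minimal sketch of the span-based dynamic convolution of equations (5)-(6): a depthwise separable convolution produces the local span perception Ks, a linear layer plus softmax generates the per-position kernel from Q ⊙ Ks, and the kernel is applied to the local window of V. Module and parameter names are hypothetical:

import torch
import torch.nn.functional as F
from torch import nn

class SpanDynamicConv(nn.Module):
    """Sketch of span-based dynamic convolution (Eqs. 5-6)."""
    def __init__(self, d: int, kernel_size: int = 9):
        super().__init__()
        self.k = kernel_size
        self.span_conv = nn.Sequential(                 # depth separable convolution producing Ks
            nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2, groups=d),   # depthwise
            nn.Conv1d(d, d, 1))                                                  # pointwise
        self.W_f = nn.Linear(d, kernel_size)            # linear processing that generates the kernel

    def forward(self, Q, V, X):
        # Q, V: (n, d) linear transforms of X;  X: (n, d) initial word sequence vector
        Ks = self.span_conv(X.t().unsqueeze(0)).squeeze(0).t()       # (n, d) local span perception
        kernels = F.softmax(self.W_f(Q * Ks), dim=-1)                # Eq. 5: softmax(W_f(Q ⊙ Ks))
        pad = self.k // 2
        windows = F.pad(V, (0, 0, pad, pad)).unfold(0, self.k, 1)    # (n, d, k) windows of V
        return torch.einsum('ndk,nk->nd', windows, kernels)          # Eq. 6: per-position LConv on V

sdconv = SpanDynamicConv(d=16, kernel_size=9)
X, Q, V = (torch.randn(10, 16) for _ in range(3))
out = sdconv(Q, V, X)                                                # (10, 16)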
Building on span-based dynamic convolution, the technical solution of the present application combines it with the self-attention operation to propose mixed attention processing. FIG. 10 shows a schematic diagram of a mixed attention processing procedure according to some embodiments of the present application. As shown in fig. 10, the span-based dynamic convolution operation and the self-attention operation share the same Q and V, but different key values (keys) are used to generate the self-attention-based word sequence feature and the span-based word sequence feature, respectively, where d represents the embedding dimension of the input X and γ represents the reduction rate. The mixed attention processing calculates the word sequence features by the following formula:
Mixed-Attn(K,Q,Ks,V;Wf)=Cat(Self-Attn(Q,K,V),SDConv(Q,Ks,V;Wf)); (7)
in some embodiments, the initial word sequence vector is linearly transformed by reducing the dimension of the initial word sequence vector to d/γ and reducing the number of attention heads by γ. Specifically, as is well known, some number of attention heads is redundant, and in the technical solution of the present application, the number of attention heads is reduced when span-based dynamic convolution is introduced, and in the original converter architecture of BERT, the initial word sequence vector of dimension d is projected to the space of dimension d, which is the same as Q, K, and V, through linear transformation. In contrast, as shown in fig. 10, in the present application, the input initial word sequence vector is projected into a lower-dimensional space, specifically, the initial word sequence vector in the dimension d is projected into a lower-dimensional space with the dimension d/γ, and the number of attention heads is reduced by γ, so that the calculation cost of the self-attention operation is greatly saved, and the attention heads generate more compact and useful attention information. The self-attention process and the span-based dynamic convolution process share the same Q and V, but generate a self-attention-based word sequence feature and a span-based word sequence feature using different keys as references, and subject the results to a merge operation (Concat) to output the word sequence feature. Hybrid attention processing integrates span-based dynamic convolution operations and self-attention processing to better model global and local dependencies and reduce computational redundancy.
In some embodiments, after the span-based word sequence features and the self-attention-based word sequence features are obtained through the self-attention operation and the span-based dynamic convolution processing, a dimension-expanding linear transformation is applied to each of them, so that their dimension d/γ is expanded to d/2, and the merging operation is then performed.
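A simplified, single-head sketch of the mixed attention processing of equation (7) is given below; the final linear layer stands in for the dimension-expanding transforms and merge described above, the reduction ratio γ = 2 and kernel size 9 are example values, and all names are hypothetical:

import math
import torch
import torch.nn.functional as F
from torch import nn

class MixedAttention(nn.Module):
    """Sketch of Eq. 7: Cat(Self-Attn(Q, K, V), SDConv(Q, Ks, V)), single head for clarity."""
    def __init__(self, d: int, gamma: int = 2, kernel_size: int = 9):
        super().__init__()
        dh = d // gamma                                  # reduced dimension d / gamma
        self.q, self.k, self.v = nn.Linear(d, dh), nn.Linear(d, dh), nn.Linear(d, dh)
        self.span_conv = nn.Sequential(                  # depth separable convolution -> Ks
            nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2, groups=d),
            nn.Conv1d(d, dh, 1))
        self.kernel_gen = nn.Linear(dh, kernel_size)     # W_f
        self.out = nn.Linear(2 * dh, d)                  # stands in for the expand-and-merge step
        self.ks = kernel_size

    def forward(self, X):                                # X: (n, d)
        Q, K, V = self.q(X), self.k(X), self.v(X)
        attn = F.softmax(Q @ K.t() / math.sqrt(Q.size(-1)), dim=-1) @ V          # self-attention branch
        Ks = self.span_conv(X.t().unsqueeze(0)).squeeze(0).t()                   # local span perception
        kernels = F.softmax(self.kernel_gen(Q * Ks), dim=-1)                     # softmax(W_f(Q ⊙ Ks))
        pad = self.ks // 2
        windows = F.pad(V, (0, 0, pad, pad)).unfold(0, self.ks, 1)               # (n, dh, k)
        sdconv = torch.einsum('ndk,nk->nd', windows, kernels)                    # span-based branch
        return self.out(torch.cat([attn, sdconv], dim=-1))                       # merge (Cat)

out = MixedAttention(d=64, gamma=2)(torch.randn(12, 64))                          # (12, 64)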
In some embodiments, the word sequence features are divided into multiple groups in the embedding dimension, each group is subjected to linear processing, and the results of the linear processing are combined to obtain the final word sequence vector. Specifically, it is found in practice that a large number of parameters actually come from the feed-forward processing. In order to reduce the parameters and the computational cost without reducing the language expression capability, the feed-forward processing in the present application is performed in a grouped manner: the word sequence features are divided into a plurality of independent groups in the embedding dimension, different parameters are set for different groups, independent linear processing is then performed on each group, and the linear processing results are combined to obtain and output the final word sequence vector. Compared with the fully connected feed-forward processing in the prior art, the grouped feed-forward processing in the present application is more efficient, and the resulting performance reduction is negligible.
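A sketch of such grouped feed-forward processing; the number of groups, the expansion factor and the activation are assumptions, since the text only specifies independent linear processing per group followed by merging:

import torch
from torch import nn

class GroupedFeedForward(nn.Module):
    """Sketch of grouped feed-forward processing (hypothetical names)."""
    def __init__(self, d: int, groups: int = 2, expansion: int = 4):
        super().__init__()
        assert d % groups == 0
        dg = d // groups
        self.groups = groups
        self.inner = nn.ModuleList([nn.Linear(dg, dg * expansion) for _ in range(groups)])
        self.outer = nn.ModuleList([nn.Linear(dg * expansion, dg) for _ in range(groups)])
        self.act = nn.GELU()

    def forward(self, x):                                   # x: (n, d) word sequence features
        chunks = x.chunk(self.groups, dim=-1)               # split in the embedding dimension
        outs = [w_out(self.act(w_in(c)))                    # independent linear processing per group
                for w_in, w_out, c in zip(self.inner, self.outer, chunks)]
        return torch.cat(outs, dim=-1)                      # merge to the final word sequence vector

out = GroupedFeedForward(d=64, groups=2)(torch.randn(12, 64))   # (12, 64)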
By stacking the mixed attention processing and the grouped feed-forward processing in an iterative manner, the ConvBERT model is constructed to implement the pre-training language representation method of the present application. The advantages of ConvBERT over the prior art are explained below from several angles in combination with experimental data. In the experiments, the ConvBERT model was trained on the open-source dataset OpenWebText (32 GB) and evaluated on the General Language Understanding Evaluation (GLUE) benchmark and the question answering task SQuAD.
1) The effect of convolution kernel size. Fig. 11 illustrates the results of span-based dynamic convolution with different convolution kernel sizes on the GLUE development set, according to some embodiments of the present application. As shown in fig. 11, a larger convolution kernel achieves a better effect as long as the receptive field does not cover the entire input word sequence. However, when the convolution kernel is large enough that the receptive field covers the entire input word sequence, the advantage of using a larger kernel diminishes. In the experiments that follow, unless otherwise stated, the convolution kernel size of all dynamic convolutions was set to 9, which provides the best results.
2) Different approaches to integrating convolution into the self-attention operation. Table 1 shows a comparison of ConvBERT with different convolutions on the GLUE development set. As shown in table 1, directly adding a conventional depth separable convolution in parallel with the self-attention module compromises performance, and adding dynamic convolution provides little improvement over the baseline architecture in terms of the average GLUE score. Using span-based dynamic convolution, which generates convolution kernels from a span of input words, further captures local dependencies and significantly improves performance.
TABLE 1
3) Evaluation results on GLUE. Table 2 lists the results of comparing models with similar size and pre-training computational cost on the GLUE test set. As can be seen from table 2, the small and base models in this application outperform all other baseline models of similar size while requiring a much lower pre-training cost. For example, ConvBERTbase achieves better performance than the strong baseline ELECTRAbase while requiring less than 1/4 of its training cost.
TABLE 2
4) Evaluation results on SQuAD. Table 3 lists the results of comparing models with similar size and pre-training computational cost on the SQuAD test set. Among the small models, ConvBERTsmall and ConvBERTmedium-small outperform the baseline ELECTRAsmall and obtain results comparable to BERTbase. The results of MobileBERT are much higher because it uses knowledge distillation and searches its model architecture and hyper-parameters on the SQuAD development set. The base-size ConvBERT model in this application is much less expensive to train and performs better than all other models of similar size.
TABLE 3
5) Comparison of mean attention visualizations. Fig. 12(a) illustrates a visualization of the mean attention of the pre-training language representation method BERT, and fig. 12(b) illustrates a visualization of the mean attention of the pre-training language representation method ConvBERT, according to some embodiments of the present application; the example is drawn randomly from the MRPC development set. The input X is "[CLS] he present the foods ##er ##vic ##e pie Business doesn't the company's long-term growth strategy. [SEP]". As shown in fig. 12(a), the average attention map of BERT shows a diagonal pattern: many attention heads actually learn local dependencies, yet attention weights are still computed between all pairs of words (token pairs). This shows that the self-attention operation in BERT contains a large number of redundant computations, and this is precisely the motivation for the improvement to the self-attention operation proposed in the present application. Since the self-attention operation computes attention weights between all word pairs, as in equation (1), there is no need to compute the many attention weights between word pairs that lie beyond the local span, as they contribute much less than local attention and lead to unnecessary computational overhead and model redundancy. The ConvBERT model in the present application derives local dependencies through a depth separable convolution operation; as can be seen from fig. 12(b), the self-attention operation in ConvBERT focuses more on global information while the dynamic convolution captures local dependency information, so the ConvBERT model in the present application learns local dependencies more effectively than the BERT model.
Experiments show that the ConvBERT model has lower training cost, fewer model parameters and better training performance than the BERT model, and that ConvBERT is significantly superior to BERT and its variants on various downstream tasks. The ConvBERTbase model achieves a GLUE score of 86.1, which is 0.4 higher than ELECTRAbase, while using less than 1/4 of its training cost. Unlike the trend of increasing model complexity to improve performance, ConvBERT moves in a direction that makes the model more efficient and saves training cost, which benefits the rational use of limited computational resources.
According to the pre-training language processing method, span-based dynamic convolution processing and self-attention operation are combined into mixed attention processing, so that local dependency relationships can be learned by extracting local span perception, the capability of encoding local information is enhanced, and the global dependency modeling of the self-attention operation is retained; the method therefore has lower training cost, fewer model parameters and better training performance.
According to some embodiments of the present application, a pre-training language representation apparatus 300 is provided, and fig. 13 illustrates a schematic structural diagram of a pre-training language representation apparatus according to some embodiments of the present application. As shown in fig. 13, the apparatus 300 for pre-training language representation using the ConvBERT model is as follows:
a receiving module 301, configured to receive a word sequence, where the word sequence includes a plurality of words;
an embedding module 302, which performs embedding processing on the word sequence to obtain an initial word sequence vector;
the encoding module 303 performs at least one encoding process based on the initial word sequence vector to obtain a final word sequence vector;
wherein, in each encoding module 303, including,
a mixed attention sub-module 3031, which performs at least one mixed attention process for generating word sequence features according to a linear output result obtained from the initial word sequence vector and an attention weight generated based on the local span perception of the initial word sequence vector;
and a feedforward submodule 3032, configured to perform at least one feedforward process, and configured to process the word sequence features into a final word sequence vector and output the final word sequence vector.
In some embodiments, hybrid attention submodule 3031, further comprises: the linear transformation layer is used for performing linear transformation on the basis of the initial word sequence vector to generate a first vector, a second vector and a third vector, performing depth separable convolution operation on the initial word sequence vector to generate a fourth vector, and the fourth vector is used for expressing local span perception; a span-based dynamic convolution layer to perform a span-based dynamic convolution operation based on the first vector, the second vector, and the fourth vector to generate a span-based word sequence feature; the self-attention layer is used for performing self-attention operation based on the first vector, the second vector and the third vector and generating word sequence characteristics based on self-attention; and the merging layer is used for merging the span-based word sequence characteristics and the self-attention-based word sequence characteristics to obtain the word sequence characteristics.
In some embodiments, the span-based dynamic convolutional layer, further comprises the operations of: generating an attention weight by performing a bit multiplication operation on the first vector and the fourth vector, and generating a dynamic convolution kernel based on the attention weight, as shown in the following formula:
f(Q,Ks)=softmax(Wf(Q⊙Ks)),
and performing light convolution operation on the dynamic convolution kernel and the second vector to obtain a span-based word sequence characteristic, which is shown as the following formula:
SDConv(Q,Ks,V;Wf,i)=LConv(V,softmax(Wf(Q⊙Ks)),i);
wherein Wf ∈ R^(dk×k) represents the learnable weight used to generate the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; Ks represents the fourth vector; ⊙ denotes the element-wise (bit) product operation; LConv represents the lightweight convolution operation; SDConv represents the span-based dynamic convolution operation; and i denotes the position i of the convolution kernel of the span-based dynamic convolution operation.
In some embodiments, the mixed attention process is calculated by the following formula to obtain word sequence features:
Mixed-Attn(K,Q,Ks,V;Wf)=Cat(Self-Attn(Q,K,V),SDConv(Q,Ks,V;Wf));
wherein Wf ∈ R^(dk×k) represents the learnable weight used to generate the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; K represents the third vector; Ks represents the fourth vector; ⊙ denotes the element-wise (bit) product operation; SDConv represents the span-based dynamic convolution operation; i represents the position i of the convolution kernel of the span-based dynamic convolution operation; Mixed-Attn represents the mixed attention processing; Self-Attn represents the self-attention operation; and Cat represents the merge operation.
In some embodiments, the hybrid attention sub-module 3031 further includes operations to: when the initial word sequence vector is linearly transformed, reduce the dimension of the d-dimensional initial word sequence vector to d/γ, and reduce the number of attention heads by γ; where d represents the embedding dimension of the initial word sequence vector, γ represents the reduction ratio, and γ > 1.
In some embodiments, the feed forward sub-module further comprises: and dividing the word sequence characteristics into a plurality of groups in the embedding dimension, wherein each group is respectively subjected to linear processing, and merging the linear processing results to obtain a final word sequence vector.
In some embodiments, the span-based dynamic convolutional layer, further comprises: the convolution kernel of the depth separable convolution operation and the convolution kernel of the span-based dynamic convolution operation are the same size.
In some embodiments, the encoding process is performed at least once in an iterative manner.
According to some embodiments of the present application, a pre-training language representation apparatus is provided. It can be understood that the electronic device represented by the pre-training language corresponds to the pre-training language representation method provided by the present application, and the technical details in the above specific description of the pre-training language representation method provided by the present application are still applicable to the electronic device represented by the pre-training language, and the specific description is referred to above and is not repeated herein.
According to some embodiments of the present application, a computer-readable storage medium encoded with a computer program is provided. It is understood that the computer readable storage medium encoded with the computer program corresponds to the pre-training language representation method provided in the present application, and the technical details in the above detailed description of the pre-training language representation method provided in the present application are still applicable to the computer readable storage medium encoded with the computer program, and the detailed description is please refer to the above, and is not repeated herein.
It will be appreciated that exemplary applications of the pre-training language representation provided by embodiments of the present application include, but are not limited to, pre-training language representations in the field of natural language processing.
It will be appreciated that as used herein, the terms "module," "unit" may refer to or include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality, or may be part of such hardware components.
It is to be appreciated that in various embodiments of the present application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, the like, and/or any combination thereof.
It is to be appreciated that the pre-training language representation methods provided herein can be implemented on a variety of electronic devices including, but not limited to, a server, a distributed server cluster of multiple servers, a cell phone, a tablet, a laptop, a desktop, a wearable device, a head-mounted display, a mobile email device, a portable game console, a portable music player, a reader device, a personal digital assistant, a virtual reality or augmented reality device, a television or other electronic device having one or more processors embedded or coupled therein, and the like.
In particular, the pre-training language representation method provided by the application is suitable for edge devices. Edge computing is a distributed open platform (framework) that integrates network, computing, storage and application core capabilities at the network edge close to the object or data source, provides edge intelligent services nearby, and can meet key requirements in real-time business, data optimization, application intelligence, security, privacy protection and the like.
The embodiments disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable storage, as well as propagated signals (e.g., carrier waves, infrared signals, digital signals) used to transmit information over the Internet in electrical, optical, acoustical or other form. Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some features of the structures or methods may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure is not meant to imply that such feature is required in all embodiments; in some embodiments, that feature may not be included or may be combined with other features.
It should be noted that, in the apparatus embodiments of the present application, each unit/module is a logical unit/module. Physically, a logical unit/module may be one physical unit/module, a part of a physical unit/module, or a combination of multiple physical units/modules; the physical implementation of the logical unit/module itself is not the most important aspect, and it is the combination of the functions implemented by these logical units/modules that solves the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above apparatus embodiments do not introduce units/modules that are less closely related to solving the technical problem presented by the present application; this does not indicate that no other units/modules exist in the above apparatus embodiments.
It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (18)

1. A pre-training language processing method, characterized by comprising:
receiving a sequence of words, the sequence of words comprising a plurality of words;
embedding the word sequence to obtain an initial word sequence vector;
performing at least one encoding process based on the initial word sequence vector to obtain a final word sequence vector;
wherein, in each of the encoding processes,
performing mixed attention processing at least once, generating word sequence features from a linear output result obtained from the initial word sequence vector and from attention weights generated based on local span perception of the initial word sequence vector;
and performing feed-forward processing at least once, processing the word sequence features into a final word sequence vector and outputting the final word sequence vector.
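For illustration only, the following PyTorch sketch shows how the flow of claim 1 might be organized: a word sequence is embedded into an initial vector, passed through one or more encoder blocks that each apply mixed attention followed by feed-forward processing, and returned as the final word sequence vector. The module names (EncoderBlock, PretrainEncoder), the residual connections with layer normalization, and all hyper-parameters are assumptions added for readability and are not recited in the claims.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoding process: mixed attention, then feed-forward (claim 1)."""
    def __init__(self, d_model, mixed_attention, feed_forward):
        super().__init__()
        self.mixed_attention = mixed_attention   # produces the word sequence features
        self.feed_forward = feed_forward         # processes features into the output vector
        self.norm1 = nn.LayerNorm(d_model)       # assumed residual/norm wiring
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.mixed_attention(x))
        return self.norm2(x + self.feed_forward(x))

class PretrainEncoder(nn.Module):
    """Embedding followed by at least one encoding process (claims 1 and 8)."""
    def __init__(self, vocab_size, d_model, blocks):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)  # word sequence -> initial vector
        self.blocks = nn.ModuleList(blocks)

    def forward(self, word_ids):
        x = self.embedding(word_ids)   # initial word sequence vector
        for block in self.blocks:      # iterative encoding
            x = block(x)
        return x                       # final word sequence vector
```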
2. The method of claim 1, wherein the mixed attention process further comprises:
performing a linear transformation on the initial word sequence vector to generate a first vector, a second vector and a third vector, and performing a depthwise separable convolution operation on the initial word sequence vector to generate a fourth vector, wherein the fourth vector is used for representing the local span perception;
performing a span-based dynamic convolution operation based on the first vector, the second vector and the fourth vector to generate span-based word sequence features;
performing a self-attention operation based on the first vector, the second vector and the third vector, generating a self-attention-based word sequence feature;
and combining the span-based word sequence features and the self-attention-based word sequence features to obtain the word sequence features.
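As a non-authoritative illustration of claim 2, the sketch below decomposes the mixed attention computation: linear projections produce the first, second and third vectors (Q, V, K), a depthwise separable convolution produces the span-aware fourth vector (Ks), a self-attention branch and a span-based dynamic convolution branch operate on a reduced dimension, and their outputs are concatenated. The kernel size, head count, reduction ratio and the SpanDynamicConv module (sketched after claim 3 below) are assumptions, not claim language.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    def __init__(self, d_model, n_heads=4, kernel_size=9, reduction=2):
        super().__init__()
        d = d_model // reduction                 # bottleneck dimension (see claim 5)
        self.q = nn.Linear(d_model, d)           # first vector
        self.v = nn.Linear(d_model, d)           # second vector
        self.k = nn.Linear(d_model, d)           # third vector
        # depthwise separable convolution -> span-aware fourth vector Ks
        self.span_conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),   # depthwise
            nn.Conv1d(d_model, d, 1),                              # pointwise
        )
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.sdconv = SpanDynamicConv(d, kernel_size)              # sketched after claim 3
        self.out = nn.Linear(2 * d, d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        q, v, k = self.q(x), self.v(x), self.k(x)
        ks = self.span_conv(x.transpose(1, 2)).transpose(1, 2)     # fourth vector
        attn_out, _ = self.self_attn(q, k, v)                      # self-attention branch
        conv_out = self.sdconv(q, ks, v)                           # span-based dynamic conv branch
        return self.out(torch.cat([attn_out, conv_out], dim=-1))   # merge the two feature sets
```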
3. The method of claim 2, wherein the span-based dynamic convolution operation further comprises:
generating the attention weight by performing an element-wise multiplication operation on the first vector and the fourth vector, and generating a dynamic convolution kernel based on the attention weight, as shown in the following formula:
f(Q, Ks) = softmax(Wf(Q ⊙ Ks));
performing a lightweight convolution operation with the dynamic convolution kernel on the second vector to obtain the span-based word sequence features, as shown in the following formula:
SDConv(Q, Ks, V; Wf, i) = LConv(V, softmax(Wf(Q ⊙ Ks)), i);
wherein f(Q, Ks) represents the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; Ks represents the fourth vector; ⊙ indicates the element-wise product operation; LConv represents the lightweight convolution operation; SDConv represents the span-based dynamic convolution operation; i represents the position i of the convolution kernel of the span-based dynamic convolution operation.
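A minimal sketch of the span-based dynamic convolution of claim 3, assuming PyTorch tensors of shape (batch, sequence length, dimension) and an unfold-based implementation of the lightweight convolution LConv; the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanDynamicConv(nn.Module):
    def __init__(self, d, kernel_size=9):
        super().__init__()
        self.kernel_size = kernel_size
        self.wf = nn.Linear(d, kernel_size)   # Wf: one weight per kernel position

    def forward(self, q, ks, v):              # each: (batch, seq_len, d)
        b, n, d = v.shape
        k = self.kernel_size
        # f(Q, Ks) = softmax(Wf(Q ⊙ Ks)): a dynamic kernel for every output position
        kernel = F.softmax(self.wf(q * ks), dim=-1)             # (b, n, k)
        # gather a window of V around each position (lightweight convolution)
        v_pad = F.pad(v.transpose(1, 2), (k // 2, k // 2))       # (b, d, n + k - 1)
        windows = v_pad.unfold(dimension=2, size=k, step=1)      # (b, d, n, k)
        # weight each window with its dynamic kernel and sum over the kernel axis
        out = torch.einsum('bdnk,bnk->bnd', windows, kernel)
        return out                                               # span-based features (b, n, d)
```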
4. The method of claim 2, wherein the mixed attention process is calculated by the following formula to obtain the word sequence features:
Mixed-Attn(K,Q,Ks,V;Wf)=Cat(Self-Attn(Q,K,V),SDConv(Q,Ks,V;Wf));
wherein f(Q, Ks) represents the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; K represents the third vector; Ks represents the fourth vector; ⊙ indicates the element-wise product operation; SDConv represents the span-based dynamic convolution operation; i represents the position i of the convolution kernel of the span-based dynamic convolution operation; Mixed-Attn represents the mixed attention processing; Self-Attn represents the self-attention operation; Cat represents the merge operation.
5. The method of claim 2, wherein the mixed attention processing further comprises:
when the linear transformation is performed on the initial word sequence vector, reducing the d-dimensional initial word sequence vector to dimension d/γ, and reducing the number of attention heads by the ratio γ;
wherein d represents the embedding dimension of the initial word sequence vector, γ represents the reduction ratio, and γ > 1.
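A small numeric illustration of the bottleneck in claim 5, assuming d = 768, γ = 2 and 12 attention heads before reduction (typical values for a base-size model, not taken from the claims):

```python
import torch.nn as nn

d, gamma, heads = 768, 2, 12
q_proj = nn.Linear(d, d // gamma)    # projection output reduced from 768 to 384, i.e. d/γ
reduced_heads = heads // gamma       # attention heads reduced from 12 to 6 by the same ratio γ
```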
6. The method of claim 1, wherein the feed-forward processing further comprises:
dividing the word sequence features into a plurality of groups along the embedding dimension, applying linear processing to each group separately, and merging the linear processing results to obtain the final word sequence vector.
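The grouped feed-forward step of claim 6 could be sketched as below: the features are split into g groups along the embedding dimension, each group is processed by its own linear layers, and the group outputs are concatenated. The group count, expansion ratio and activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupedFeedForward(nn.Module):
    def __init__(self, d_model, groups=2, expansion=4):
        super().__init__()
        self.groups = groups
        d_g = d_model // groups
        self.up = nn.ModuleList([nn.Linear(d_g, d_g * expansion) for _ in range(groups)])
        self.down = nn.ModuleList([nn.Linear(d_g * expansion, d_g) for _ in range(groups)])
        self.act = nn.GELU()

    def forward(self, x):                                # (batch, seq_len, d_model)
        chunks = torch.chunk(x, self.groups, dim=-1)     # split along the embedding dimension
        outs = [down(self.act(up(c)))                    # linear processing per group
                for c, up, down in zip(chunks, self.up, self.down)]
        return torch.cat(outs, dim=-1)                   # merged final word sequence vector
```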
7. The method of claim 3, wherein, in the span-based dynamic convolution operation, the convolution kernel of the depthwise separable convolution operation and the convolution kernel of the span-based dynamic convolution operation have the same size.
8. The method of claim 1, wherein the encoding process is performed at least once in an iterative manner.
9. A pre-training language processing device, characterized by comprising:
a receiving module that receives a word sequence, the word sequence including a plurality of words;
the embedding module is used for embedding the word sequence to obtain an initial word sequence vector;
the encoding module is used for implementing at least one encoding process based on the initial word sequence vector to obtain a final word sequence vector;
wherein, in each of the encoding modules,
a mixed attention sub-module, which performs mixed attention processing at least once and is used for generating word sequence features from a linear output result obtained from the initial word sequence vector and from attention weights generated based on local span perception of the initial word sequence vector;
and a feed-forward sub-module, which performs feed-forward processing at least once and is used for processing the word sequence features into a final word sequence vector and outputting the final word sequence vector.
10. The apparatus of claim 9, wherein the mixed attention sub-module further comprises:
a linear transformation layer, configured to perform a linear transformation on the initial word sequence vector to generate a first vector, a second vector and a third vector, and to perform a depthwise separable convolution on the initial word sequence vector to generate a fourth vector, wherein the fourth vector is used for representing the local span perception;
a span-based dynamic convolution layer to perform a span-based dynamic convolution operation based on the first vector, the second vector, and the fourth vector to generate span-based word sequence features;
a self-attention layer, which performs self-attention operation based on the first vector, the second vector and the third vector to generate a self-attention-based word sequence feature;
and the merging layer is used for merging the span-based word sequence characteristics and the self-attention-based word sequence characteristics to obtain the word sequence characteristics.
11. The apparatus of claim 10, wherein the span-based dynamic convolution layer is further configured to perform the following operations:
generating the attention weight by performing an element-wise multiplication operation on the first vector and the fourth vector, and generating a dynamic convolution kernel based on the attention weight, as shown in the following formula:
f(Q, Ks) = softmax(Wf(Q ⊙ Ks));
performing a lightweight convolution operation with the dynamic convolution kernel on the second vector to obtain the span-based word sequence features, as shown in the following formula:
SDConv(Q, Ks, V; Wf, i) = LConv(V, softmax(Wf(Q ⊙ Ks)), i);
wherein f(Q, Ks) represents the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; Ks represents the fourth vector; ⊙ indicates the element-wise product operation; LConv represents the lightweight convolution operation; SDConv represents the span-based dynamic convolution operation; i represents the position i of the convolution kernel of the span-based dynamic convolution operation.
12. The apparatus of claim 10, wherein the mixed attention process is computed by the following formula to obtain the word sequence features:
Mixed-Attn(K,Q,Ks,V;Wf)=Cat(Self-Attn(Q,K,V),SDConv(Q,Ks,V;Wf));
wherein f(Q, Ks) represents the attention weight; f represents the linear processing followed by the softmax function; Q represents the first vector; V represents the second vector; K represents the third vector; Ks represents the fourth vector; ⊙ indicates the element-wise product operation; SDConv represents the span-based dynamic convolution operation; i represents the position i of the convolution kernel of the span-based dynamic convolution operation; Mixed-Attn represents the mixed attention processing; Self-Attn represents the self-attention operation; Cat represents the merge operation.
13. The apparatus of claim 10, wherein the mixed attention sub-module is further configured to perform the following operations:
when the linear transformation is performed on the initial word sequence vector, reducing the d-dimensional initial word sequence vector to dimension d/γ, and reducing the number of attention heads by the ratio γ;
wherein d represents the embedding dimension of the initial word sequence vector, γ represents the reduction ratio, and γ > 1.
14. The apparatus of claim 9, wherein the feed-forward sub-module is further configured to:
divide the word sequence features into a plurality of groups along the embedding dimension, apply linear processing to each group separately, and merge the linear processing results to obtain the final word sequence vector.
15. The apparatus of claim 11, wherein, in the span-based dynamic convolution layer, the convolution kernel of the depthwise separable convolution operation and the convolution kernel of the span-based dynamic convolution operation have the same size.
16. The apparatus of claim 9, wherein the encoding process is performed at least once in an iterative manner.
17. A pre-training language processing device, comprising:
a memory for storing instructions for execution by one or more processors of the system, an
A processor, being one of the processors of the system, for executing the instructions to implement the image generation method of any of claims 1-8.
18. A computer-readable storage medium encoded with a computer program, having instructions stored thereon, which, when executed on a computer, cause the computer to perform the pre-training language processing method of any one of claims 1-8.
CN202011155558.9A 2020-08-01 2020-10-26 Pre-training language processing method and device Pending CN114065771A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202007350V 2020-08-01
SG10202007350V 2020-08-01

Publications (1)

Publication Number Publication Date
CN114065771A true CN114065771A (en) 2022-02-18

Family

ID=80233100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011155558.9A Pending CN114065771A (en) 2020-08-01 2020-10-26 Pre-training language processing method and device

Country Status (1)

Country Link
CN (1) CN114065771A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925814A (en) * 2022-05-26 2022-08-19 山东大学 Pre-training language model fine-tuning method and system based on attention guide mechanism
CN116822632A (en) * 2023-08-28 2023-09-29 腾讯科技(深圳)有限公司 Reasoning method and device of text data, storage medium and electronic equipment
CN116822632B (en) * 2023-08-28 2024-01-05 腾讯科技(深圳)有限公司 Reasoning method and device of text data, storage medium and electronic equipment
CN117556787A (en) * 2024-01-11 2024-02-13 西湖大学 Method and system for generating target text sequence for natural language text sequence
CN117556787B (en) * 2024-01-11 2024-04-26 西湖大学 Method and system for generating target text sequence for natural language text sequence

Similar Documents

Publication Publication Date Title
CN107293296B (en) Voice recognition result correction method, device, equipment and storage medium
WO2018105194A1 (en) Method and system for generating multi-relevant label
CN114065771A (en) Pre-training language processing method and device
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
EP3596666A1 (en) Multi-task multi-modal machine learning model
CN107979764A (en) Video caption generation method based on semantic segmentation and multilayer notice frame
CN111930894B (en) Long text matching method and device, storage medium and electronic equipment
CN112069302A (en) Training method of conversation intention recognition model, conversation intention recognition method and device
CN114358203B (en) Training method and device for image description sentence generation module and electronic equipment
JP7417679B2 (en) Information extraction methods, devices, electronic devices and storage media
CN113434664B (en) Text abstract generation method, device, medium and electronic equipment
JP7431833B2 (en) Language sequence labeling methods, devices, programs and computing equipment
CN111178039B (en) Model training method and device, and text processing method and device
CN115455171B (en) Text video mutual inspection rope and model training method, device, equipment and medium
CN110442711A (en) Text intelligence cleaning method, device and computer readable storage medium
CN113918681A (en) Reading understanding method and system based on fragment extraction, electronic device and storage medium
CN117441169A (en) Multi-resolution neural network architecture search space for dense prediction tasks
US20230394306A1 (en) Multi-Modal Machine Learning Models with Improved Computational Efficiency Via Adaptive Tokenization and Fusion
CN111767697B (en) Text processing method and device, computer equipment and storage medium
CN113052090A (en) Method and apparatus for generating subtitle and outputting subtitle
US20220292877A1 (en) Systems, methods, and storage media for creating image data embeddings to be used for image recognition
CN117296060A (en) Lightweight transformer for high resolution images
KR20210058059A (en) Unsupervised text summarization method based on sentence embedding and unsupervised text summarization device using the same
CN113762459A (en) Model training method, text generation method, device, medium and equipment
CN113569094A (en) Video recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination