CN115688804A - Representation generation based on embedding vector sequence abstraction - Google Patents
- Publication number
- CN115688804A (application number CN202110869638.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- embedding
- level
- sequence
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present disclosure presents methods, apparatuses, and computer program products for representation generation based on embedding vector sequence abstraction. Text may be obtained. A word-block-level embedding vector sequence of the text may be generated. An embedding vector sequence abstraction may be performed on the word-block-level embedding vector sequence to obtain an abstract embedding vector sequence of the text. A text representation of the text may be generated based on the abstract embedding vector sequence.
Description
Background
Natural Language Processing (NLP) is a technology for communicating with computers using natural language. It aims to enable computers to understand and use natural language so that humans and machines can communicate, allowing computers to perform various language-related tasks in place of a human. Tasks performed using NLP techniques may be referred to as NLP tasks. Examples of NLP tasks include Click-Through Rate (CTR) estimation tasks, Question Answering (QA) tasks, Machine Reading Comprehension (MRC) tasks, and the like. An NLP task can be performed by a neural network model, for example, a neural network model based on a fully-connected layer structure, a neural network model based on a Transformer layer structure, or the like.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose methods, apparatuses, and computer program products for representation generation based on embedding vector sequence abstraction. Text may be obtained. A word-block-level embedding vector sequence of the text may be generated. An embedding vector sequence abstraction may be performed on the word-block-level embedding vector sequence to obtain an abstract embedding vector sequence of the text. A text representation of the text may be generated based on the abstract embedding vector sequence.
It should be noted that one or more of the above aspects include features that are specifically pointed out in the following detailed description and claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate, but not to limit, the disclosed aspects.
Fig. 1 illustrates an exemplary process for representation generation based on embedding vector sequence abstraction in accordance with an embodiment of the present disclosure.
Figs. 2A-2C illustrate exemplary processes for performing embedding vector sequence abstraction in accordance with an embodiment of the present disclosure.
Fig. 3 illustrates another exemplary process for representation generation based on embedding vector sequence abstraction in accordance with an embodiment of the present disclosure.
Fig. 4 illustrates another exemplary process for embedding vector sequence abstraction in accordance with an embodiment of the present disclosure.
Fig. 5 illustrates an exemplary process for representation generation for multimodal input in accordance with an embodiment of the present disclosure.
Fig. 6 illustrates an exemplary process for performing self-attention computations in accordance with an embodiment of the present disclosure.
Fig. 7 illustrates an exemplary process for representation generation based on embedding vector sequence abstraction for multimodal input in accordance with an embodiment of the present disclosure.
Fig. 8 is a flow diagram of an exemplary method for representation generation based on embedding vector sequence abstraction in accordance with an embodiment of the present disclosure.
Fig. 9 illustrates an exemplary apparatus for representation generation based on embedding vector sequence abstraction in accordance with an embodiment of the present disclosure.
Fig. 10 illustrates an exemplary apparatus for representation generation based on embedding vector sequence abstraction in accordance with an embodiment of the present disclosure.
Detailed Description
The present disclosure will now be discussed with reference to several exemplary embodiments. It should be understood that the discussion of these embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and is not intended to suggest any limitation on the scope of the present disclosure.
A neural network model based on the Transformer layer structure may be, for example, a Bidirectional Encoder Representations from Transformers (BERT) model, a Generative Pre-trained Transformer (GPT) model, a Robustly Optimized BERT Pretraining Approach (RoBERTa) model, a Decoding-enhanced BERT with Disentangled Attention (DeBERTa) model, or the like. Text, or vectors corresponding to the text, for a particular NLP task may be provided to a model based on the Transformer layer structure. The model may perform self-attention computations to generate a text representation of the text through a self-attention mechanism. Further, a task result corresponding to the NLP task may be obtained, based on the generated text representation, by a task output layer corresponding to the particular NLP task. Because a model based on the Transformer layer structure is usually a complex model that relies on a deep network with a huge number of parameters, it can generate accurate text representations and obtain accurate task results. At the same time, however, online prediction using such complex models is very time-consuming, which results in high prediction latency, especially for long text inputs comprising a large number of word blocks.
Embodiments of the present disclosure propose representation generation based on embedding vector sequence abstraction (embedding sequence abstraction). Herein, an embedding vector (embedding) may refer to a set of information generated based on raw data, in a form that facilitates processing by a neural network model. For a given text, a text representation of the text may be generated by a neural network model based on the Transformer layer structure according to embodiments of the present disclosure. The neural network model used to generate the text representation of the text may be referred to herein as a text representation generation model. The text representation generation model may first generate a word-block-level embedding vector sequence of the text through a lower set of Transformer layers. The word-block-level embedding vector sequence may include a set of embedding vectors corresponding to a set of word blocks (tokens) in the text. Herein, a word block may broadly refer to a basic unit of language that constitutes text in different languages. An embedding vector sequence abstraction may be performed on the word-block-level embedding vector sequence to obtain an abstract embedding vector sequence of the text. The process of abstracting, from the word-block-level embedding vector sequence of the text, embedding vectors that contain the key information in the text may be referred to herein as embedding vector sequence abstraction. The embedding vectors obtained by the abstraction may be combined into an abstract embedding vector sequence, which includes a smaller number of embedding vectors than the word-block-level embedding vector sequence. The abstract embedding vector sequence may be provided to an upper set of Transformer layers in the text representation generation model for further processing to generate the text representation of the text.
The number of Transformer units included in each Transformer layer corresponds to the number of embedding vectors. When the number of embedding vectors in the abstract embedding vector sequence to be processed is reduced, the number of Transformer units may be reduced accordingly, and the amount of self-attention computation that needs to be performed is reduced as well. Thus, the latency of generating the text representation and the latency of obtaining the task result may be reduced. This is particularly beneficial for long text. Long text may include a large number of sentences, some of which may include many word blocks, so the text as a whole may include a large number of word blocks. Accordingly, a word-block-level embedding vector sequence corresponding to long text will include a large number of word-block-level embedding vectors, and existing models based on the Transformer layer structure suffer high latency when processing long text. However, some NLP tasks, such as the CTR estimation task and the QA task, have strict real-time requirements. Through embedding vector sequence abstraction, the number of embedding vectors provided to the upper Transformer layers can be substantially reduced, thereby substantially reducing the latency of generating the text representation and of obtaining the task result. This makes it practical to use a model based on the Transformer layer structure for such NLP tasks, even NLP tasks with long text inputs.
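The latency argument above can be made concrete with a rough cost model. The sketch below is illustrative only and not from the patent: it uses a simplified operation count for scaled dot-product attention (score matrix plus weighted sum) to show why shortening the sequence reduces self-attention work quadratically; the sequence lengths and hidden size are arbitrary example values.

```python
# Illustrative cost model, not part of the patent: approximate
# multiply-accumulate count of one self-attention layer, counting only
# the QK^T score computation and the attention-weighted value sum.
def attention_ops(seq_len: int, hidden: int) -> int:
    return 2 * seq_len * seq_len * hidden

full = attention_ops(512, 768)        # word-block-level sequence of 512 vectors
abstracted = attention_ops(64, 768)   # abstract sequence of 64 vectors

# Shrinking the sequence 8x cuts self-attention cost ~64x (quadratic).
assert full // abstracted == 64
```

This is why the savings are largest for long text: the cost grows with the square of the sequence length, while the abstraction step itself is comparatively cheap.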
In one aspect, embodiments of the present disclosure propose performing embedding vector sequence abstraction in a variety of ways to obtain the abstract embedding vector sequence. In one embodiment, the abstract embedding vector sequence may be obtained by selecting a set of representative word-block-level embedding vectors from the word-block-level embedding vector sequence. The set of representative word-block-level embedding vectors corresponds to a set of representative word blocks in the input text. Important word blocks in the input text that contain the key semantic information of the input text may be referred to herein as representative word blocks. The number of embedding vectors included in the set of representative word-block-level embedding vectors will be significantly smaller than the number of embedding vectors included in the word-block-level embedding vector sequence. In another embodiment, the abstract embedding vector sequence may be obtained by generating, based on the word-block-level embedding vector sequence, a segment-level embedding vector corresponding to each text segment in the text. A segment-level embedding vector characterizes information at the level of a text segment, and can represent more refined, abstract information in the text than word-block-level embedding vectors. The number of segment-level embedding vectors will be much smaller than the number of embedding vectors included in the word-block-level embedding vector sequence. Further, the two embodiments above may be performed in combination.
For example, the abstract embedding vector sequence may be obtained by first selecting a set of representative word-block-level embedding vectors from the word-block-level embedding vector sequence, and then generating a segment-level embedding vector corresponding to each text segment in the text based on that representative set. An abstract embedding vector sequence obtained in any of the above ways sufficiently captures the key, dominant information in the text, so the text representation obtained by processing it will be accurate. Meanwhile, because the abstract embedding vector sequence includes only a small number of embedding vectors, the latency of generating the text representation is greatly reduced compared with an existing model based on the Transformer layer structure without embedding vector sequence abstraction.
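The segment-level variant can be sketched as a pooling step. The patent does not fix the pooling operator, so mean-pooling is an assumption here, and the function name, shapes, and segment ids are hypothetical:

```python
import numpy as np

def segment_level_embeddings(token_embs: np.ndarray, segment_ids) -> np.ndarray:
    """Collapse word-block-level embedding vectors into one vector per
    text segment. A minimal sketch: the pooling operator (mean) is an
    assumption, not specified by the source."""
    segment_ids = np.asarray(segment_ids)
    pooled = []
    for seg in sorted(set(segment_ids.tolist())):
        mask = segment_ids == seg
        pooled.append(token_embs[mask].mean(axis=0))  # one vector per segment
    return np.stack(pooled)

embs = np.random.rand(7, 16)      # 7 word-block-level vectors, hidden size 16
seg_ids = [0, 0, 0, 0, 1, 1, 1]   # two text segments
abstract = segment_level_embeddings(embs, seg_ids)
assert abstract.shape == (2, 16)  # the sequence shrank from 7 vectors to 2
```

The combined variant would simply run a representative-selection step first and pool only the surviving vectors.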
In another aspect, embodiments of the present disclosure propose representation generation for multimodal input. An input having both textual features and non-textual features may be referred to herein as a multimodal input. Herein, a textual feature may refer to a set of information generated based on text, in a form that facilitates processing by a neural network model. For example, the aforementioned word-block-level embedding vector sequence generated based on text is an example of a textual feature. Non-textual features may refer to sets of information generated based on non-textual information, in a form that facilitates processing by a neural network model; they may broadly include features other than textual features, such as count features related to counting information, identifier features related to identifier information, and the like. A comprehensive representation of both textual features and non-textual features may be generated by a single neural network model according to embodiments of the present disclosure. The generated comprehensive representation may be further provided to a task output layer corresponding to a particular NLP task to obtain a task result corresponding to that NLP task. Since non-textual features may contain rich information related to the NLP task, taking them into account when performing the NLP task can yield more accurate task results. The neural network model used to generate the comprehensive representation of textual and non-textual features may be referred to herein as a comprehensive representation generation model, and may itself be a model based on the Transformer layer structure.
The comprehensive representation generation model can deeply learn the interaction between textual features and non-textual features using the set of Transformer layers it includes, and thus can generate an accurate comprehensive representation that fuses textual information and non-textual information. Performing an NLP task with such a comprehensive representation helps to obtain more accurate task results.
Fig. 1 illustrates an exemplary process 100 for representation generation based on embedding vector sequence abstraction in accordance with an embodiment of the present disclosure. In process 100, text 102 may be obtained, and a text representation 112 of the text 102 may be generated by a text representation generation model 110. In the text representation generation model 110, a word-block-level embedding vector sequence 122 of the text 102 may be generated, and an embedding vector sequence abstraction may be performed on the generated word-block-level embedding vector sequence 122 to obtain an abstract embedding vector sequence 132 of the text 102. The number of embedding vectors in the abstract embedding vector sequence 132 may be smaller than the number of embedding vectors in the word-block-level embedding vector sequence 122. Subsequently, the text representation 112 of the text 102 may be generated based on the abstract embedding vector sequence 132.
Different NLP tasks may have different forms of input text. The CTR estimation task is used below as an example to describe the text 102. The CTR estimation task aims to estimate the probability that a user clicks on specific content. The text 102 for the CTR estimation task may include, for example, information related to a query input by the user, the user's search and click history, the title of a web page retrieved based on the query, and the title, address, etc. of the specific content. This information may be organized in the form of text segments, where each text segment may be a sentence that includes one or more word blocks. The text 102 may be formed by combining the individual text segments.
The text 102 may be pre-processed to obtain an initial word-block-level embedding vector sequence 104 suitable for processing by the text representation generation model 110. The embedding vector sequence provided to the representation generation model may be referred to herein as the initial word-block-level embedding vector sequence. The initial word-block-level embedding vector sequence 104 may include a set of initial word-block-level embedding vectors corresponding to a set of word blocks in the text 102. For each word block, an initial word-block-level embedding vector corresponding to that word block may be generated based on at least one of a token embedding vector (token embedding), a segment embedding vector (segment embedding), and a position embedding vector (position embedding) of the word block. Accordingly, the initial word-block-level embedding vector sequence 104 may be generated based on at least one of a set of token embedding vectors, a set of segment embedding vectors, and a set of position embedding vectors corresponding to the set of word blocks in the text 102. A predefined classification code, such as "[CLS]", may be added before the text 102. Further, within the text 102, a predefined delimiter code, such as "[SEP]", may be inserted after the last word block of each text segment to separate different text segments. An initial classification embedding vector for the predefined code "[CLS]" may be generated based on at least one of its token embedding vector, segment embedding vector, and position embedding vector. Similarly, an initial delimiter embedding vector for the predefined code "[SEP]" may be generated based on at least one of its token embedding vector, segment embedding vector, and position embedding vector.
The initial classification embedding vector and the initial delimiter embedding vectors may, together with the set of initial word-block-level embedding vectors corresponding to the text 102, form the initial word-block-level embedding vector sequence 104. Thus, the initial word-block-level embedding vector sequence 104 may include, for example, a set of initial word-block-level embedding vectors corresponding to the text 102, an initial classification embedding vector, and at least one initial delimiter embedding vector, where the number of initial delimiter embedding vectors may correspond to the number of text segments.
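The construction of the initial sequence can be sketched in the BERT style the passage describes (summing token, segment, and position embedding vectors). This is a minimal sketch under that assumption; the table sizes, the token ids, and the choice to sum all three embedding types are hypothetical illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_segments, max_len, hidden = 100, 2, 32, 16
tok_table = rng.standard_normal((vocab, hidden))      # token embeddings
seg_table = rng.standard_normal((n_segments, hidden)) # segment embeddings
pos_table = rng.standard_normal((max_len, hidden))    # position embeddings

def initial_embeddings(token_ids, segment_ids):
    # Sum of token, segment, and position embedding vectors per position
    # (BERT-style; the patent allows any subset of the three).
    positions = np.arange(len(token_ids))
    return tok_table[token_ids] + seg_table[segment_ids] + pos_table[positions]

# Hypothetical encoding of "[CLS] w1 w2 [SEP] w3 [SEP]": two text segments,
# id 1 for [CLS] and id 2 for [SEP] are arbitrary choices.
ids     = [1, 10, 11, 2, 20, 2]
seg_ids = [0,  0,  0, 0,  1, 1]
seq = initial_embeddings(ids, seg_ids)
assert seq.shape == (6, hidden)  # one initial embedding vector per position
```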
The text representation generation model 110 may be a neural network model that includes K Transformer layers: a lower group of K1 Transformer layers and an upper group of K2 Transformer layers, where K1 + K2 = K. The structure and function of each Transformer layer may be similar to those of a Transformer layer in known models based on the Transformer layer structure, such as the BERT model. For example, at each Transformer layer, the respective embedding vectors may be updated by performing self-attention computations through a self-attention mechanism. According to embodiments of the present disclosure, the amount of self-attention computation may be reduced by reducing the number of embedding vectors, thereby reducing the latency of generating a text representation and obtaining a task result. In one embodiment, the number of embedding vectors may be reduced by performing an embedding vector sequence abstraction on the embedding vector sequence.
As shown in Fig. 1, in the text representation generation model 110, a word-block-level embedding vector sequence 122 may first be generated through the K1 Transformer layers 120, by a self-attention mechanism, based on the initial word-block-level embedding vector sequence 104. Since the word-block-level embedding vector sequence 122 includes a set of embedding vectors corresponding to a set of word blocks in the text 102, it carries word-block-level information of the text 102. Subsequently, an embedding vector sequence abstraction may be performed on the word-block-level embedding vector sequence 122 by the embedding vector sequence abstraction unit 130 to obtain an abstract embedding vector sequence 132 of the text 102. The embedding vector sequence abstraction can be performed in a number of ways; exemplary processes will be described later in conjunction with Figs. 2A-2C. The number of embedding vectors in the abstract embedding vector sequence 132 will be smaller than the number of embedding vectors in the word-block-level embedding vector sequence 122. Next, a target embedding vector sequence 142 may be generated through the K2 Transformer layers 140, by a self-attention mechanism, based on the abstract embedding vector sequence 132. The number of Transformer units included in each Transformer layer corresponds to the number of embedding vectors; since the number of embedding vectors in the abstract embedding vector sequence 132 is reduced, the number of Transformer units in each of the Transformer layers 140 may be reduced accordingly. At least one embedding vector may be selected from the target embedding vector sequence 142 as the text representation 112 of the text 102. Taking the CTR estimation task as an example, the classification embedding vector may be selected from the target embedding vector sequence 142 as the text representation 112.
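The overall K1-layers / abstraction / K2-layers pipeline can be sketched as follows. This is illustrative only: `toy_layer` is a stand-in for a real Transformer layer, and the abstraction step here simply truncates the sequence as a placeholder for the selection or pooling methods the disclosure describes; all names and sizes are hypothetical.

```python
import numpy as np

def toy_layer(x: np.ndarray) -> np.ndarray:
    # Placeholder for one Transformer layer: any shape-preserving map.
    return np.tanh(x)

def pyramid_forward(x: np.ndarray, k1: int, k2: int, keep: int) -> np.ndarray:
    for _ in range(k1):        # lower K1 layers process the full sequence
        x = toy_layer(x)
    # Embedding vector sequence abstraction. A real implementation would
    # select by self-attention weight or pool per segment; truncation is
    # only a stand-in to show the sequence-length reduction.
    x = x[:keep]
    for _ in range(k2):        # upper K2 layers process the short sequence
        x = toy_layer(x)
    return x

x = np.random.rand(128, 16)    # 128 word-block-level embedding vectors
out = pyramid_forward(x, k1=6, k2=6, keep=16)
assert out.shape == (16, 16)   # upper layers only ever see 16 vectors
```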
The text representation 112 may be provided to a task output layer (not shown) for a particular NLP task to obtain a task result corresponding to that NLP task. Continuing with the CTR estimation example, the text representation 112 may be provided to a task output layer for the CTR estimation task, which may output the estimated probability that the user clicks on the specific content.
As can be seen in Fig. 1, the K Transformer layers in the text representation generation model 110 may be divided into two groups by the embedding vector sequence abstraction unit 130, namely the lower K1 Transformer layers 120 and the upper K2 Transformer layers 140. The embedding vector sequence abstraction is performed between the two groups of Transformer layers to reduce the number of embedding vectors provided to the upper K2 Transformer layers 140. Accordingly, the number of Transformer units required in each of the K2 Transformer layers 140 is also reduced, which reduces the amount of self-attention computation and thereby the latency of generating the text representation and obtaining the task result. Since the number of Transformer units in the upper K2 Transformer layers 140 is smaller than in the lower K1 Transformer layers 120, the text representation generation model 110 has a structure resembling a pyramid, and may also be referred to as a pyramid model. Accordingly, the process of generating a text representation by the text representation generation model 110 may also be referred to as a pyramid distillation process, where distillation may be regarded as a process in which knowledge or information migrates from the bottom layers of the model to the upper layers.
The position of the embedding vector sequence abstraction unit 130 in the text representation generation model 110 may be determined, according to the actual application requirements, based on the accuracy and real-time requirements of text representation generation. Placing the unit 130 higher in the model, i.e., making the number K2 of upper Transformer layers 140 smaller than the number K1 of lower Transformer layers 120, favors the accuracy of text representation generation. Placing the unit 130 lower in the model, i.e., making K2 greater than K1, favors the real-time performance of text representation generation.
It should be appreciated that the process described above in connection with Fig. 1 for representation generation based on embedding vector sequence abstraction is merely exemplary. Steps in the process may be replaced or modified in any manner, and the process may include more or fewer steps, depending on the requirements of the actual application. For example, where the text representation generation model 110 can process raw text, the generation of the initial word-block-level embedding vector sequence 104 may be omitted from process 100, such that the text 102 is provided directly to the text representation generation model 110. Further, the particular order or hierarchy of steps in process 100 is merely exemplary; the process may be performed in an order different from that described.
Figs. 2A-2C illustrate exemplary processes 200a-200c, respectively, for performing embedding vector sequence abstraction in accordance with an embodiment of the present disclosure. For ease of description, processes 200a-200c are each described with respect to the same word-block-level embedding vector sequence 202. However, it should be understood that, depending on the actual application requirements, any one of processes 200a-200c may be selected to perform the embedding vector sequence abstraction on a given word-block-level embedding vector sequence. In process 200a, the abstraction is performed by selecting a set of representative word-block-level embedding vectors from the word-block-level embedding vector sequence 202. In process 200b, the abstraction is performed by generating, based on the word-block-level embedding vector sequence 202, a segment-level embedding vector corresponding to each text segment in the text. Process 200c combines the operations of processes 200a and 200b: the abstraction is performed by first selecting a set of representative word-block-level embedding vectors from the word-block-level embedding vector sequence 202, and then generating a segment-level embedding vector corresponding to each text segment based on that representative set.
The word-block-level embedding vector sequence 202 may correspond to the word-block-level embedding vector sequence 122 in Fig. 1. It may be generated, through a self-attention mechanism, based on an initial word-block-level embedding vector sequence of the text. For example, the sequence 202 may be generated by the K1 Transformer layers 120 based on the initial word-block-level embedding vector sequence 104 of the text 102. For convenience of description, assume that the text includes two text segments, where the first text segment includes M word blocks and the second text segment includes N word blocks (N = M or N ≠ M). Accordingly, the word-block-level embedding vector sequence 202 may include word-block-level embedding vectors E1,1^K1 through E1,M^K1 corresponding to the word blocks in the first text segment, where the superscript K1 indicates that the embedding vector is obtained through K1 self-attention computations, i.e., through the operations of K1 Transformer layers. The sequence 202 may also include word-block-level embedding vectors E2,1^K1 through E2,N^K1 corresponding to the word blocks in the second text segment. Additionally, the sequence 202 may include 1 classification embedding vector and 2 delimiter embedding vectors corresponding to the 2 text segments. Thus, the word-block-level embedding vector sequence 202 may include (M + N + 3) embedding vectors in total.
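The (M + N + 3) count can be checked with a one-line helper (illustrative only; the function name is hypothetical):

```python
def sequence_length(m: int, n: int) -> int:
    # M + N word-block-level vectors, plus 1 classification ([CLS]) vector
    # and 2 delimiter ([SEP]) vectors, one per text segment.
    return m + n + 1 + 2

assert sequence_length(10, 12) == 25  # (M + N + 3) with M=10, N=12
```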
In process 200a in Fig. 2A, an embedding vector sequence abstraction may be performed on the word-block-level embedding vector sequence 202 by the embedding vector sequence abstraction unit 210 to obtain a representative word-block-level embedding vector sequence 212. The embedding vector sequence abstraction unit 210 and the representative word-block-level embedding vector sequence 212 may correspond to the embedding vector sequence abstraction unit 130 and the abstract embedding vector sequence 132 of Fig. 1, respectively.
The embedding vector sequence abstraction unit 210 may include a summarization layer 220. A summarization operation may be performed by the summarization layer 220 to select a set of representative word-block-level embedding vectors from the word-block-level embedding vector sequence 202. The set of representative word-block-level embedding vectors corresponds to a set of important representative word blocks in the input text that contain the key semantic information of the input text. In one implementation, the set of representative word-block-level embedding vectors may be selected from the sequence 202 based on the self-attention weight of each word-block-level embedding vector in the sequence 202. As an example, the selected set may include three word-block-level embedding vectors corresponding to representative word blocks in the first text segment and three word-block-level embedding vectors corresponding to representative word blocks in the second text segment. The selected set of representative word-block-level embedding vectors may be combined into the representative word-block-level embedding vector sequence 212.
Where the word-block-level embedding vector sequence 202 further includes predefined coding embedding vectors, for example the classification embedding vector and the separator embedding vectors, the selected representative set of word-block-level embedding vectors may be combined with the classification embedding vector and the separator embedding vectors into the representative word-block-level embedding vector sequence 212. The classification embedding vector may precede the selected representative set. In addition, the separator embedding vector corresponding to each text segment may be located after the last representative word-block-level embedding vector of that text segment, so that the representative word-block-level embedding vectors corresponding to different text segments in the representative word-block-level embedding vector sequence 212 are separated from each other.
Thus, the representative word-block-level embedding vector sequence 212 may include, for example, the representative set of word-block-level embedding vectors, 1 classification embedding vector, and 2 separator embedding vectors. The representative set typically includes a significantly smaller number of embedding vectors than the word-block-level embedding vector sequence 202. For example, in process 200a, the representative set may include 6 embedding vectors, so that the representative word-block-level embedding vector sequence 212 includes 9 embedding vectors. As previously described, the word-block-level embedding vector sequence 202 includes (M + N + 3) embedding vectors. When each text segment includes many word blocks, i.e., when M and N are large, the number of embedding vectors in the representative word-block-level embedding vector sequence 212 will be much smaller than the number in the word-block-level embedding vector sequence 202. Thus, forming a representative word-block-level embedding vector sequence by selecting a representative set from the word-block-level embedding vector sequence may significantly reduce the number of embedding vectors.
In process 200a of fig. 2A, the representative word-block-level embedding vector sequence 212 includes word-block-level embedding vectors corresponding to word blocks in both the first and the second text segments; that is, it includes word-block-level embedding vectors for each text segment of the input text. However, it should be understood that in some cases the representative word-block-level embedding vector sequence 212 may not include word-block-level embedding vectors for certain text segments, e.g., segments that contain no important, representative words. Additionally, it should be understood that in process 200a, for ease of description, the word-block-level embedding vector sequence 202 includes word-block-level embedding vectors corresponding to only two text segments; embodiments of the present disclosure are not so limited, and word-block-level embedding vector sequences covering more than two text segments may be processed in a similar manner.
Furthermore, in connection with fig. 1, it can be seen that the embedding vector sequence abstraction unit 130 is located above the K1 Transformer layers 120; that is, the selection of the representative set of word-block-level embedding vectors performed by the summarization layer 220 occurs after the operations of the K1 Transformer layers 120. However, it should be understood that a representative set of word-block-level embedding vectors may also be selected before any Transformer layer operation is performed. For example, a representative set may be selected from the initial word-block-level embedding vector sequence 104 by a variety of known techniques, such as TextRank, semantic-similarity-based clustering, Term Frequency-Inverse Document Frequency (TF-IDF), and so on.
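As one example of such a pre-selection technique, TF-IDF scoring of word blocks can be sketched in a few lines. This is a hedged, self-contained illustration of the generic TF-IDF formula, not the patent's method; the example segments and the choice of k = 3 are invented for demonstration.

```python
import math
from collections import Counter

# Two illustrative text segments, already split into word blocks.
segments = [
    "deep learning models encode long text into embedding vectors".split(),
    "pooling long embedding sequences reduces computation for long text".split(),
]

df = Counter()                       # document frequency of each word block
for seg in segments:
    df.update(set(seg))

def tfidf_top_k(segment, k):
    """Score word blocks by TF-IDF and keep the k highest-scoring ones."""
    tf = Counter(segment)
    scores = {w: (tf[w] / len(segment)) * math.log(len(segments) / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

representative_words = [tfidf_top_k(seg, k=3) for seg in segments]
```

Word blocks that occur in every segment (here "long", "text", "embedding") get an inverse-document-frequency of zero and are naturally excluded in favor of segment-specific terms.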
Turning to fig. 2B, in process 200b, the word-block-level embedding vector sequence 202 may be subjected to embedding vector sequence abstraction by the embedding vector sequence abstraction unit 230 to obtain a segment-level embedding vector sequence 232. The embedding vector sequence abstraction unit 230 and the segment-level embedding vector sequence 232 may correspond to the embedding vector sequence abstraction unit 130 and the abstract embedding vector sequence 132 of fig. 1, respectively.
The word-block-level embedding vector sequence 202 may include word-block-level embedding vector subsequences corresponding to the respective text segments in the text. For example, the word-block-level embedding vector sequence 202 may include a subsequence of M word-block-level embedding vectors corresponding to the first text segment and a subsequence of N word-block-level embedding vectors corresponding to the second text segment. For each text segment, a set of segment-level embedding vectors corresponding to that text segment may be generated based on the word-block-level embedding vectors corresponding to that text segment in the word-block-level embedding vector sequence 202. For example, for the first text segment, a set of segment-level embedding vectors may be generated based on the M word-block-level embedding vectors of the first text segment; similarly, for the second text segment, a set of segment-level embedding vectors may be generated based on the N word-block-level embedding vectors of the second text segment.
A set of pooling operations may be performed on each word-block-level embedding vector subsequence by a pooling layer 240 to generate a corresponding set of segment-level embedding vectors. For example, at 242, a max pooling operation may be performed on the word-block-level embedding vector subsequence corresponding to the first text segment to obtain a segment-level embedding vector corresponding to the first text segment. At 244, an average pooling operation may be performed on the same subsequence to obtain another segment-level embedding vector corresponding to the first text segment. These two segment-level embedding vectors may compose a set of segment-level embedding vectors corresponding to the first text segment. Similarly, at 246, a max pooling operation may be performed on the word-block-level embedding vector subsequence corresponding to the second text segment, and at 248, an average pooling operation may be performed on that subsequence, so as to obtain two segment-level embedding vectors that compose a set corresponding to the second text segment. The set of segment-level embedding vectors corresponding to the first text segment and the set corresponding to the second text segment may be combined into the segment-level embedding vector sequence 232.
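The max and average pooling operations above amount to element-wise reductions over each segment's sub-sequence. A minimal numpy sketch, with illustrative segment lengths and hidden size:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
segment1 = rng.normal(size=(8, D))   # word-block-level vectors of the first text segment
segment2 = rng.normal(size=(6, D))   # word-block-level vectors of the second text segment

def pool_segment(vectors):
    """Collapse a sub-sequence of word-block-level vectors into two segment-level vectors."""
    return vectors.max(axis=0), vectors.mean(axis=0)

seg1_max, seg1_avg = pool_segment(segment1)
seg2_max, seg2_avg = pool_segment(segment2)

# 14 word-block-level vectors become 4 segment-level vectors.
segment_level = np.stack([seg1_max, seg1_avg, seg2_max, seg2_avg])
```

Both reductions are over the sequence axis (`axis=0`), so the segment-level vectors keep the same hidden size D as the word-block-level vectors, which lets the upper Transformer layers consume them unchanged.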
Where the word-block-level embedding vector sequence 202 further includes predefined coding embedding vectors, for example the classification embedding vector and the separator embedding vectors, the sets of segment-level embedding vectors may be combined with the classification embedding vector and the separator embedding vectors into the segment-level embedding vector sequence 232. The classification embedding vector may precede the sets of segment-level embedding vectors. In addition, the separator embedding vector corresponding to each text segment may be located after the last segment-level embedding vector of that text segment, so that the segment-level embedding vectors corresponding to different text segments in the segment-level embedding vector sequence 232 are separated from each other.
Thus, the segment-level embedding vector sequence 232 may include, for example, 2 segment-level embedding vectors corresponding to the first text segment, 2 segment-level embedding vectors corresponding to the second text segment, 1 classification embedding vector, and 2 separator embedding vectors. That is, where the input text includes 2 text segments, the segment-level embedding vector sequence 232 may include, for example, 7 embedding vectors. As previously described, the word-block-level embedding vector sequence 202 may include (M + N + 3) embedding vectors. When each text segment includes many word blocks, i.e., when M and N are large, the number of embedding vectors in the segment-level embedding vector sequence 232 is much smaller than the number in the word-block-level embedding vector sequence 202. Thus, generating segment-level embedding vectors can significantly reduce the number of embedding vectors.
It should be appreciated that the process described above in connection with fig. 2B for performing pooling operations on word-block-level embedding vector subsequences to generate segment-level embedding vectors is merely exemplary. For example, in process 200b, both max pooling and average pooling are employed for each text segment; however, depending on the actual application requirements, only one of the two may be employed to generate the segment-level embedding vectors. Various other pooling operations, such as attention pooling, may also be employed, and segment-level embedding vectors may even be generated by operations other than pooling. Further, it should be understood that in process 200b, for ease of description, the word-block-level embedding vector sequence 202 includes word-block-level embedding vectors corresponding to only two text segments; embodiments of the present disclosure are not so limited, and sequences covering more than two text segments may be processed in a similar manner. When the input text includes many text segments, such that the segment-level embedding vector sequence still includes a significant number of embedding vectors, further pooling may be performed on the segment-level embedding vector sequence to further reduce the number of embedding vectors. For example, a pooling operation may be performed on the segment-level embedding vectors of several text segments to obtain an embedding vector jointly corresponding to those text segments.
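The attention pooling alternative mentioned above can be sketched as a learned softmax-weighted sum. This is a generic illustration under assumed shapes, not the patent's formulation; the scoring vector `w` stands in for a learned parameter and is randomly initialized here.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16
segment = rng.normal(size=(8, D))   # word-block-level vectors of one text segment
w = rng.normal(size=D)              # scoring vector (would be learned; random here)

def attention_pool(vectors, w):
    """Weight each word-block vector by a softmax score and sum (attention pooling)."""
    scores = vectors @ w
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()          # softmax over the word blocks
    return alphas @ vectors         # weighted sum -> one segment-level vector

pooled = attention_pool(segment, w)
```

Unlike max or average pooling, the weighting here is content-dependent, so word blocks the model scores as more informative contribute more to the segment-level vector.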
Turning to fig. 2C, process 200c may include the operations of both process 200a and process 200b. In process 200c, the word-block-level embedding vector sequence 202 may be subjected to embedding vector sequence abstraction by the embedding vector sequence abstraction unit 250 to obtain a segment-level embedding vector sequence 252. The embedding vector sequence abstraction unit 250 and the segment-level embedding vector sequence 252 may correspond to the embedding vector sequence abstraction unit 130 and the abstract embedding vector sequence 132 of fig. 1, respectively.
The embedding vector sequence abstraction unit 250 may include a summarization layer 260. A summarization operation may be performed by the summarization layer 260 to select a representative set of word-block-level embedding vectors from the word-block-level embedding vector sequence 202. The summarization layer 260 may perform operations similar to those of the summarization layer 220 in fig. 2A. As an example, the selected representative set may include three word-block-level embedding vectors corresponding to word blocks in the first text segment and three word-block-level embedding vectors corresponding to word blocks in the second text segment. The selected representative set of word-block-level embedding vectors may be combined into a representative word-block-level embedding vector sequence 262.
Where the word-block-level embedding vector sequence 202 further includes predefined coding embedding vectors, for example the classification embedding vector and the separator embedding vectors, the selected representative set of word-block-level embedding vectors may be combined with the classification embedding vector and the separator embedding vectors into the representative word-block-level embedding vector sequence 262. The classification embedding vector may precede the selected representative set. In addition, the separator embedding vector corresponding to each text segment may be located after the last representative word-block-level embedding vector of that text segment.
Embedding vector sequence abstraction may be further performed on the representative word-block-level embedding vector sequence 262 to further reduce the number of embedding vectors. The representative set of word-block-level embedding vectors may include at least one representative word-block-level embedding vector subset corresponding to at least one text segment in the input text. For example, the representative set in the representative word-block-level embedding vector sequence 262 may include a representative subset of three word-block-level embedding vectors corresponding to the first text segment and a representative subset of three word-block-level embedding vectors corresponding to the second text segment.
For each text segment, a set of segment-level embedding vectors corresponding to that text segment may be generated based on the representative word-block-level embedding vector subset corresponding to that text segment. For example, for the first text segment, a set of segment-level embedding vectors may be generated based on the three representative word-block-level embedding vectors of the first text segment; similarly, for the second text segment, a set of segment-level embedding vectors may be generated based on the three representative word-block-level embedding vectors of the second text segment.
A set of pooling operations may be performed on each representative word-block-level embedding vector subset by a pooling layer 270 in the embedding vector sequence abstraction unit 250 to generate a corresponding set of segment-level embedding vectors. The pooling layer 270 may perform operations similar to those of the pooling layer 240 in fig. 2B. For example, at 272, a max pooling operation may be performed on the representative subset corresponding to the first text segment to obtain a segment-level embedding vector corresponding to the first text segment. At 274, an average pooling operation may be performed on the same subset to obtain another segment-level embedding vector corresponding to the first text segment. These two segment-level embedding vectors may compose a set of segment-level embedding vectors corresponding to the first text segment. Similarly, at 276, a max pooling operation may be performed on the representative subset corresponding to the second text segment, and at 278, an average pooling operation may be performed on that subset, so as to obtain two segment-level embedding vectors that compose a set corresponding to the second text segment. The set corresponding to the first text segment and the set corresponding to the second text segment may be combined into the segment-level embedding vector sequence 252. The classification embedding vector may be located at the beginning of the segment-level embedding vector sequence 252.
In addition, the separator embedding vector corresponding to each text segment may be located after the last segment-level embedding vector of that text segment.
In process 200c, the number of embedding vectors may first be reduced by selecting a representative set of word-block-level embedding vectors from the word-block-level embedding vector sequence 202. Segment-level embedding vectors corresponding to each text segment may then be generated based on the representative set to further reduce the number of embedding vectors. Performing these two types of embedding vector sequence abstraction operations in combination can greatly reduce the number of embedding vectors while retaining the key, primary information.
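The two-stage abstraction of process 200c, select representative word blocks first, then pool them, can be sketched end to end. A minimal numpy illustration with invented sizes (two segments of 40 and 60 word blocks, 3 representatives each, max + average pooling), not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 16
seg_lengths = [40, 60]                         # M and N word blocks (illustrative)
tokens = [rng.normal(size=(n, D)) for n in seg_lengths]
weights = [rng.random(n) for n in seg_lengths] # assumed per-word-block attention weights

abstracted = []
for vecs, w in zip(tokens, weights):
    top = np.sort(np.argsort(w)[-3:])          # stage 1: keep 3 representative word blocks
    rep = vecs[top]
    abstracted.append(rep.max(axis=0))         # stage 2: pool them into
    abstracted.append(rep.mean(axis=0))        #          segment-level vectors

abstract_sequence = np.stack(abstracted)       # 100 word-block vectors -> 4 segment-level vectors
```

The combination compounds the reduction: summarization shrinks each segment to a constant number of vectors regardless of M and N, and pooling then collapses each segment to a fixed-size set of segment-level vectors.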
Referring back to fig. 1, the K Transformer layers in the text representation generation model 110 may be divided into two groups, namely the lower K1 Transformer layers 120 and the upper K2 Transformer layers 140. Embedding vector sequence abstraction can be performed between the two groups of Transformer layers to reduce the number of embedding vectors. According to embodiments of the present disclosure, when embedding vector sequence abstraction is performed through both a summarization layer and a pooling layer, self-attention calculations may additionally be performed between the summarization layer and the pooling layer. For example, in an embedding vector sequence abstraction unit, there may be a group of Transformer layers between the summarization layer and the pooling layer. Fig. 3 illustrates another exemplary process 300 for representation generation based on embedding vector sequence abstraction in accordance with an embodiment of the disclosure. In process 300, a text 302 and an initial word-block-level embedding vector sequence 304 of the text 302 may be obtained. A text representation 312 of the text 302 may be generated based on the initial word-block-level embedding vector sequence 304 by a text representation generation model 310.
The text representation generation model 310 may have a structure similar to that of the text representation generation model 110 of fig. 1. The difference is that in the text representation generation model 310 the Transformer layers are divided into three groups: in addition to the groups of Transformer layers below and above the embedding vector sequence abstraction unit 330, there is also a group of Transformer layers inside the embedding vector sequence abstraction unit 330. Assuming that the text representation generation model 310 has K Transformer layers in total, there may be K3 Transformer layers 320 below the embedding vector sequence abstraction unit 330, K4 Transformer layers 334 inside the embedding vector sequence abstraction unit 330, and K5 Transformer layers 340 above the embedding vector sequence abstraction unit 330, where K3 + K4 + K5 = K. The K4 Transformer layers 334 inside the embedding vector sequence abstraction unit 330 may be located between the summarization layer and the pooling layer.
A word-block-level embedding vector sequence 322 may be generated by the K3 Transformer layers 320 based on the initial word-block-level embedding vector sequence 304. The word-block-level embedding vector sequence 322 may be subjected to embedding vector sequence abstraction by the embedding vector sequence abstraction unit 330 to obtain an abstract embedding vector sequence 332 of the text 302. An exemplary process of performing the embedding vector sequence abstraction will be described later in conjunction with fig. 4. Next, a target embedding vector sequence 342 may be generated by the K5 Transformer layers 340 through a self-attention mechanism based on the abstract embedding vector sequence 332. At least one embedding vector may be selected from the target embedding vector sequence 342 as the text representation 312 of the text 302.
Fig. 4 illustrates another exemplary process 400 for performing embedding vector sequence abstraction according to an embodiment of the disclosure. Process 400 may correspond to the embedding vector sequence abstraction performed by the embedding vector sequence abstraction unit 330 in fig. 3.
In process 400, a word-block-level embedding vector sequence 402 may be subjected to embedding vector sequence abstraction by an embedding vector sequence abstraction unit 410 to obtain a segment-level embedding vector sequence 412. The embedding vector sequence abstraction unit 410 and the segment-level embedding vector sequence 412 may correspond to the embedding vector sequence abstraction unit 330 and the abstract embedding vector sequence 332 of fig. 3, respectively.
The embedding vector sequence abstraction unit 410 may include a summarization layer 420. A summarization operation may be performed by the summarization layer 420 to select a representative set of word-block-level embedding vectors from the word-block-level embedding vector sequence 402. The summarization layer 420 may perform operations similar to those of the summarization layer 220 in fig. 2A. As an example, the selected representative set may include three word-block-level embedding vectors corresponding to word blocks in the first text segment and three word-block-level embedding vectors corresponding to word blocks in the second text segment. The selected representative set of word-block-level embedding vectors may be combined into a representative word-block-level embedding vector sequence 422.
Where the word-block-level embedding vector sequence 402 further includes predefined coding embedding vectors, for example the classification embedding vector and the separator embedding vectors, the classification embedding vector may precede the selected representative set of word-block-level embedding vectors, and the separator embedding vector corresponding to each text segment may be located after the last representative word-block-level embedding vector of that text segment.
A set of self-attention calculations may be performed on the representative word-block-level embedding vector sequence 422 by the K4 Transformer layers 430 to generate a representative word-block-level embedding vector sequence 432. The representative word-block-level embedding vector sequence 432 may include a classification embedding vector, 2 separator embedding vectors, a representative subset of three word-block-level embedding vectors corresponding to the first text segment, and a representative subset of three word-block-level embedding vectors corresponding to the second text segment.
Embedding vector sequence abstraction may be further performed on the representative word-block-level embedding vector sequence 432 to further reduce the number of embedding vectors. A set of pooling operations may be performed on each representative word-block-level embedding vector subset by a pooling layer 440 in the embedding vector sequence abstraction unit 410 to generate a corresponding set of segment-level embedding vectors. The pooling layer 440 may perform operations similar to those of the pooling layer 240 in fig. 2B. For example, at 442, a max pooling operation may be performed on the representative subset corresponding to the first text segment to obtain a segment-level embedding vector corresponding to the first text segment. At 444, an average pooling operation may be performed on the same subset to obtain another segment-level embedding vector corresponding to the first text segment. These two segment-level embedding vectors may compose a set of segment-level embedding vectors corresponding to the first text segment. Similarly, at 446, a max pooling operation may be performed on the representative subset corresponding to the second text segment, and at 448, an average pooling operation may be performed on that subset, so as to obtain two segment-level embedding vectors that compose a set corresponding to the second text segment.
The set of segment-level embedding vectors corresponding to the first text segment and the set corresponding to the second text segment may be combined into the segment-level embedding vector sequence 412. The classification embedding vector may be located at the beginning of the segment-level embedding vector sequence 412. In addition, the separator embedding vector corresponding to each text segment may be located after the last segment-level embedding vector of that text segment.
In process 400, after the representative word-block-level embedding vector sequence 422 is obtained by the summarization layer 420, a set of self-attention calculations may be performed on the sequence 422 by the K4 Transformer layers 430 to generate the representative word-block-level embedding vector sequence 432, which is then used to perform the pooling operations. Through the self-attention calculations, the representative word-block-level embedding vectors can bidirectionally learn attention information from each other and thereby absorb richer context information, yielding more accurate representative word-block-level embedding vectors. Pooling such representative word-block-level embedding vectors may in turn yield more accurate segment-level embedding vectors.
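The bidirectional self-attention performed by the K4 Transformer layers can be sketched as a single-head attention step over the representative sequence. This is a simplified illustration under assumed shapes (sequence length 9, hidden size 16, random projection matrices standing in for learned weights), omitting multi-head splitting, residual connections, and feed-forward sublayers:

```python
import numpy as np

rng = np.random.default_rng(4)
L, D = 9, 16                                   # representative sequence length, hidden size
X = rng.normal(size=(L, D))                    # representative word-block-level sequence
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

def self_attention(X, Wq, Wk, Wv):
    """Single-head bidirectional self-attention: every vector attends to every other."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # scaled dot-product scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax
    return A @ V

Y = self_attention(X, Wq, Wk, Wv)
```

Because the attention matrix is full (no causal mask), each representative vector in the output mixes in context from every other representative vector, which is the bidirectional information flow the paragraph describes.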
Currently, models based on the Transformer-layer structure, such as the BERT model, can only handle text features; for example, current BERT models can only process embedding vector sequences generated based on input text. However, some NLP tasks also involve non-text features. The non-text features may include, for example, count features related to count information, identifier features related to identifier information, and the like. Taking the CTR (click-through rate) prediction task as an example, the count features may include features related to a user's previous CTR, features related to a user's previous search frequency, features related to the number of previous clicks on a content item, and so forth; the identifier features may include features related to category identifiers of content, features related to category identifiers of queries, and so forth. Embodiments of the present disclosure propose representation generation for multimodal input. A comprehensive representation of both text features and non-text features may be generated by a comprehensive representation generation model according to embodiments of the present disclosure. The generated comprehensive representation may be further provided to a task output layer corresponding to a particular NLP task to obtain a task result for that NLP task. Since non-text features may contain rich information related to the NLP task, taking them into account when performing the NLP task may yield more accurate task results.
Fig. 5 illustrates an exemplary process 500 for representation generation for multimodal input in accordance with an embodiment of the disclosure. In process 500, a multimodal input including both text features and non-text features may be provided to a comprehensive representation generation model 510. For example, an initial word-block-level embedding vector sequence 504 corresponding to a text 502 and an initial non-text embedding vector 508 corresponding to non-text features 506-1 through 506-J may be provided to the comprehensive representation generation model 510, where J ≥ 1 is the number of non-text features. The comprehensive representation generation model 510 may utilize the set of Transformer layers it includes to deeply learn the interactions between the text features and the non-text features and generate a comprehensive representation 512 of the text features and non-text features.
The initial word-block-level embedding vector sequence 504 and the initial non-text embedding vector 508 may be provided to the comprehensive representation generation model 510 simultaneously. The comprehensive representation generation model 510 may include K Transformer layers 520.
A target embedding vector sequence 522 corresponding to the text 502 and a target multimodal embedding vector 524 corresponding to the non-text features 506-1 through 506-J may be generated by the K Transformer layers 520. Since the initial non-text embedding vector is a strong embedding vector generated based on a set of non-text features, it may be excluded when generating the target embedding vector sequence, to avoid the text information being diluted by the strong information carried by the initial non-text embedding vector. Thus, only the initial word-block-level embedding vector sequence 504 may be considered in generating the target embedding vector sequence 522. For the non-text features, in contrast, the target multimodal embedding vector 524 may be generated based on both the text 502 and the initial non-text embedding vector 508, so that the text information and the non-text information are fused. For example, the target multimodal embedding vector 524 may be generated based on both the initial word-block-level embedding vector sequence 504 and the initial non-text embedding vector 508. Since the initial non-text embedding vector 508 has the same hidden size as the text embedding vectors in the initial word-block-level embedding vector sequence 504, self-attention calculations can be performed between the initial non-text embedding vector 508 and the text embedding vectors.
For example, the target embedding vector sequence 522 may be generated by iteratively performing a self-attention calculation through each of the K converter layers 520 based on the initial word block level embedding vector sequence 504. The target multimodal embedding vector 524 may be generated by iteratively performing a self-attention calculation through each of the K converter layers 520 based on both the initial word block level embedding vector sequence 504 and the initial non-text embedding vector 508.
FIG. 6 illustrates an exemplary process 600 for performing a self-attention calculation in accordance with an embodiment of the disclosure. In process 600, the previous embedding vector sequence 602 may be a text embedding vector sequence, corresponding to text, that is provided to the k-th (1 ≤ k ≤ K) converter layer 610, i.e., converter layer k. The previous embedding vector sequence 602 may include L (L ≥ 1) text embedding vectors, denoted here as t_1^(k-1) through t_L^(k-1), where the superscript indicates the converter layer that produced them. When k = 1, the previous embedding vector sequence 602 may correspond to the initial word block level embedding vector sequence 504 in FIG. 5. The current embedding vector sequence 612 may be a text embedding vector sequence, corresponding to the text, that is output by converter layer k; it includes text embedding vectors t_1^k through t_L^k. When k = K, the current embedding vector sequence 612 may correspond to the target embedding vector sequence 522 in FIG. 5.
The previous multimodal embedding vector 604, denoted m^(k-1), may be a multimodal embedding vector provided to converter layer k. When k = 1, the previous multimodal embedding vector may correspond to the initial non-text embedding vector 508 in FIG. 5. The current multimodal embedding vector 614, denoted m^k, may be a multimodal embedding vector output by converter layer k. When k = K, the current multimodal embedding vector may correspond to the target multimodal embedding vector 524 in FIG. 5.
Converter layer k may include a set of converters, such as converters 610-1 through 610-(L+1).
Each of converters 610-1 through 610-L may correspond to a previous text embedding vector in the previous embedding vector sequence 602. For example, converter 610-1 may correspond to previous text embedding vector t_1^(k-1), converter 610-2 may correspond to previous text embedding vector t_2^(k-1), and so on. Converters 610-1 through 610-L may generate the current embedding vector sequence 612 based on the previous embedding vector sequence 602. The text embedding vectors can learn self-attention information from each other bidirectionally. Since the multimodal embedding vector includes strong non-textual information, it may be ignored when generating the current text embedding vectors, so that the textual information is not diluted by the strong non-textual information included in the multimodal embedding vector. Accordingly, the text embedding vectors do not learn self-attention information from the multimodal embedding vector. In particular, the converter corresponding to each previous text embedding vector may generate a current text embedding vector based on some or all of the previous text embedding vectors in the previous embedding vector sequence 602, as indicated by the dashed arrows in FIG. 6. For example, converter 610-1, corresponding to previous text embedding vector t_1^(k-1), may generate current text embedding vector t_1^k based on some or all of the previous text embedding vectors in the previous embedding vector sequence 602. Current text embedding vectors t_1^k through t_L^k may be combined into the current embedding vector sequence 612.
Converter 610-(L+1) may correspond to the previous multimodal embedding vector m^(k-1). The multimodal embedding vector can learn self-attention information from the text embedding vectors unidirectionally. For example, converter 610-(L+1) may generate the current multimodal embedding vector 614, m^k, based on some or all of the previous text embedding vectors in the previous embedding vector sequence 602 and the previous multimodal embedding vector m^(k-1), as indicated by the solid arrows in FIG. 6.
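The asymmetric attention pattern of FIG. 6 — text embedding vectors attending bidirectionally only to each other, while the multimodal embedding vector additionally attends to the text — can be sketched as follows. This is an illustrative simplification (single head, no learned projections or feed-forward sublayer); the function and variable names are assumptions, not the patented implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def converter_layer(prev_text, prev_mm):
    """One simplified converter layer k.

    prev_text: (L, H) previous text embedding vectors t_1^(k-1)..t_L^(k-1)
    prev_mm:   (H,)   previous multimodal embedding vector m^(k-1)
    Returns (curr_text, curr_mm).
    """
    H = prev_text.shape[1]
    # Dashed arrows: text vectors attend only to the text vectors, so the
    # strong non-textual information cannot dilute the textual information.
    attn_t = softmax(prev_text @ prev_text.T / np.sqrt(H))  # (L, L)
    curr_text = attn_t @ prev_text                          # (L, H)
    # Solid arrows: the multimodal vector attends to the text vectors and
    # to itself, fusing textual and non-textual information one-way.
    keys = np.vstack([prev_text, prev_mm])                  # (L+1, H)
    attn_m = softmax(keys @ prev_mm / np.sqrt(H))           # (L+1,)
    curr_mm = attn_m @ keys                                 # (H,)
    return curr_text, curr_mm
```

Iterating such a layer K times over the initial sequences would yield the target embedding vector sequence and target multimodal embedding vector; note that `curr_text` is computed without `prev_mm`, matching the one-way information flow described above.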
Referring back to FIG. 5, the target embedding vector sequence 522 may be generated by iteratively performing the self-attention calculation shown in FIG. 6 based on the initial word block level embedding vector sequence 504. The target multimodal embedding vector 524 may be generated by iteratively performing the self-attention calculation shown in FIG. 6 based on the initial word block level embedding vector sequence 504 and the initial non-text embedding vector 508.
After the target embedding vector sequence 522 is generated, at least one embedding vector may be selected from it as a text representation 526 of the text 502. Taking a click-through rate (CTR) prediction task as an example, a classification embedding vector may be selected from the target embedding vector sequence 522 as the text representation 526.
The text representation 526 and the target multimodal embedding vector 524 may be combined into the integrated representation 512 by a concatenation unit 530. The integrated representation 512 may be provided to a task output layer (not shown) for a particular NLP task to obtain a task result corresponding to that NLP task.
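A concatenation unit such as 530 may be as simple as vector concatenation, producing a representation whose dimension is the sum of its parts; a hypothetical task output layer for, e.g., CTR prediction could then be a small classification head over it. A minimal sketch with assumed names:

```python
import numpy as np

def concatenation_unit(text_repr, target_mm):
    """Combine the selected text representation and the target
    multimodal embedding vector into one joint representation."""
    return np.concatenate([text_repr, target_mm])

def task_output_layer(joint_repr, weight, bias):
    """Hypothetical task output layer: a logistic-regression head,
    e.g., predicting a click-through probability."""
    z = float(weight @ joint_repr + bias)
    return 1.0 / (1.0 + np.exp(-z))
```

The `weight`/`bias` parameters would be learned per task; they are placeholders here.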
In process 500, the integrated representation generation model 510 may use the set of converter layers it includes to learn deep interactions between textual and non-textual features, and thus may generate an accurate integrated representation that fuses textual and non-textual information. Performing an NLP task with such an integrated representation helps to obtain more accurate task results.
It should be appreciated that the process for representation generation for multimodal input described above in connection with FIG. 5 is merely exemplary. Depending on actual application requirements, the steps in that process may be replaced or modified in any manner, and the process may include more or fewer steps. For example, in process 500, all non-text features are combined into a single initial non-text embedding vector; however, depending on the requirements of the actual application, the non-text features may instead be combined into multiple initial non-text embedding vectors. Further, the particular order or hierarchy of steps in process 500 is merely exemplary, and the process for representation generation for multimodal input may be performed in an order different from that described.
FIG. 7 illustrates an exemplary process 700 for representation generation, based on embedding vector sequence abstraction, for multimodal input in accordance with an embodiment of the disclosure. Process 700 may be similar to process 500 in FIG. 5, except that in process 700 the number of text embedding vectors can be reduced through embedding vector sequence abstraction. In process 700, a multimodal input including both textual features and non-textual features may be provided to an integrated representation generation model 710. For example, an initial word block level embedding vector sequence 704 corresponding to text 702 and an initial non-text embedding vector 708 corresponding to non-text features 706-1 through 706-J may be provided to the integrated representation generation model 710. The integrated representation generation model 710 may use the set of converter layers it includes to learn deep interactions between the textual features and the non-textual features and to generate an integrated representation 712 of those features.
The initial word block level embedding vector sequence 704 and the initial non-text embedding vector 708 may be provided to the integrated representation generation model 710 simultaneously. The integrated representation generation model 710 may include K converter layers. Similar to the text representation generation model 110 in FIG. 1, the K converter layers may include K1 lower converter layers 720 and K2 upper converter layers 740.
The integrated representation generation model 710 may generate a target embedding vector sequence 742 corresponding to the text 702 based on the initial word block level embedding vector sequence 704. In addition, it may generate a target multimodal embedding vector 744 based on the text 702 and the initial non-text embedding vector 708, so that textual information and non-textual information can be fused. In the integrated representation generation model 710, a word block level embedding vector sequence 722 may first be generated based on the initial word block level embedding vector sequence 704, and an intermediate multimodal embedding vector 724 may be generated based on the initial word block level embedding vector sequence 704 and the initial non-text embedding vector 708. For example, the word block level embedding vector sequence 722 may be generated by iteratively performing a self-attention calculation, e.g., process 600 shown in FIG. 6, through each of the K1 converter layers 720 based on the initial word block level embedding vector sequence 704. Likewise, the intermediate multimodal embedding vector 724 may be generated by iteratively performing such a self-attention calculation through each of the K1 converter layers 720 based on the initial word block level embedding vector sequence 704 and the initial non-text embedding vector 708.
Subsequently, the word block level embedding vector sequence 722 may be subjected to embedding vector sequence abstraction by an embedding vector sequence abstraction unit 730 to obtain an abstract embedding vector sequence 732 for the text 702. The embedding vector sequence abstraction can be performed in a number of ways, for example, by the processes shown in FIGS. 2A-2C. Next, the target embedding vector sequence 742 may be generated by iteratively performing a self-attention calculation, e.g., process 600 shown in FIG. 6, through each of the K2 converter layers 740 based on the abstract embedding vector sequence 732. Similarly, the target multimodal embedding vector 744 may be generated by iteratively performing such a self-attention calculation through each of the K2 converter layers 740 based on the abstract embedding vector sequence 732 and the intermediate multimodal embedding vector 724.
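Assuming the abstraction pools each text segment's word block level vectors into a single segment-level vector (one of the options described for FIGS. 2A-2C), the overall K1-layers → abstraction → K2-layers flow could be sketched as below; `segment_bounds` and the layer callables are illustrative placeholders:

```python
import numpy as np

def mean_pool_abstraction(token_vecs, segment_bounds):
    """Embedding vector sequence abstraction by pooling: reduce the L
    word block level vectors to one vector per text segment."""
    return np.stack([token_vecs[s:e].mean(axis=0) for s, e in segment_bounds])

def generate_target_sequence(initial_vecs, lower_layers, upper_layers, segment_bounds):
    """K1 lower layers -> abstraction -> K2 upper layers (layer internals elided)."""
    x = initial_vecs
    for layer in lower_layers:                    # K1 converter layers over L tokens
        x = layer(x)
    x = mean_pool_abstraction(x, segment_bounds)  # L vectors -> one per segment
    for layer in upper_layers:                    # K2 converter layers over fewer vectors
        x = layer(x)
    return x
```

Because the upper K2 layers operate on the shorter abstract sequence, their self-attention cost shrinks accordingly, which is the motivation for placing the abstraction between the two groups of layers.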
At least one embedding vector may be selected from the target embedding vector sequence 742 as a text representation 746 of the text 702. The text representation 746 and the target multimodal embedding vector 744 may be combined into the integrated representation 712 by a concatenation unit 750. The integrated representation 712 may be provided to a task output layer (not shown) for a particular NLP task to obtain a task result corresponding to that NLP task.
It should be appreciated that the process for representation generation based on embedding vector sequence abstraction for multimodal input described above in connection with FIG. 7 is merely exemplary. Depending on actual application requirements, the steps in that process may be replaced or modified in any manner, and the process may include more or fewer steps. For example, in process 700, all non-text features are combined into a single initial non-text embedding vector; however, depending on the requirements of the actual application, the non-text features may instead be combined into multiple initial non-text embedding vectors. Where the non-text features are combined into multiple initial non-text embedding vectors, those vectors may themselves be subjected to embedding vector sequence abstraction, e.g., a digest operation and/or a pooling operation. Further, the particular order or hierarchy of steps in process 700 is merely exemplary, and the process may be performed in an order different from that described.
Further, it should be appreciated that in process 700, the K converter layers in the integrated representation generation model 710 are divided into two groups, namely the K1 lower converter layers 720 and the K2 upper converter layers 740, and the embedding vector sequence abstraction is performed between the two groups to reduce the number of embedding vectors. However, depending on the requirements of the actual application, the converter layers may instead be divided into three groups. For example, in addition to the groups of converter layers below and above the embedding vector sequence abstraction unit 730, a group of converter layers may also reside inside the embedding vector sequence abstraction unit 730, as shown in FIG. 3. In that case, when the embedding vector sequence abstraction is performed by both a digest layer and a pooling layer, a set of self-attention calculations may be performed between the digest layer and the pooling layer.
Fig. 8 is a flow diagram of an example method 800 for representation generation based on embedded vector sequence abstraction in accordance with an embodiment of the present disclosure.
At 810, text can be obtained.
At 820, a word block level embedding vector sequence of the text may be generated.
At 830, embedding vector sequence abstraction may be performed on the word block level embedding vector sequence to obtain an abstract embedding vector sequence for the text.
At 840, a textual representation of the text may be generated based on the abstract embedded vector sequence.
In one embodiment, the generating the sequence of word-block-level embedded vectors may include: generating an initial sequence of word-block-level embedding vectors for the text based on at least one of a set of word-block embedding vectors, a set of segment embedding vectors, and a set of position embedding vectors corresponding to a set of word blocks in the text; and generating the sequence of word block level embedded vectors by a self-attention mechanism based on the initial sequence of word block level embedded vectors.
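The first step — summing word block, segment, and position embeddings per position, as in BERT-style encoders — might look like the following sketch; the table names and shapes are assumptions for illustration:

```python
import numpy as np

def initial_word_block_embeddings(token_ids, segment_ids,
                                  token_table, segment_table, position_table):
    """Initial word block level embedding vector sequence: for each of the
    L positions, the element-wise sum of the word block embedding, the
    segment embedding, and the position embedding."""
    L = len(token_ids)
    return (token_table[token_ids]      # (L, H) word block embeddings
            + segment_table[segment_ids]  # (L, H) segment embeddings
            + position_table[:L])         # (L, H) position embeddings
```

Each table would be a learned lookup matrix; the self-attention mechanism of the lower converter layers then refines this initial sequence.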
In one embodiment, the number of embedding vectors of the sequence of abstract embedding vectors may be less than the number of embedding vectors of the sequence of block-level embedding vectors.
In one embodiment, the performing embedded vector sequence abstraction may include: selecting a representative word block level embedding vector set from the word block level embedding vector sequence; and combining the set of representative word block level embedding vectors into the sequence of abstract embedding vectors.
The sequence of block-level embedded vectors may further include at least one predefined coded embedded vector. The combination may include: combining the set of representative word-block-level embedding vectors and the at least one predefined coded embedding vector into the sequence of abstract embedding vectors.
In one embodiment, the text may include at least one text segment. The performing embedding vector sequence abstraction may include: for each text segment, generating a set of segment-level embedding vectors corresponding to the text segment based on a word block level embedding vector subsequence, in the word block level embedding vector sequence, that corresponds to the text segment; and combining at least one set of segment-level embedding vectors corresponding to the at least one text segment into the abstract embedding vector sequence.
The sequence of block-level embedded vectors may further include at least one predefined coded embedded vector. The combination may include: combining the at least one set of segment-level embedding vectors and the at least one predefined encoding embedding vector into the sequence of abstract embedding vectors.
In one embodiment, the performing the embedded vector sequence abstraction may include: selecting a set of representative word block level embedding vectors from the sequence of word block level embedding vectors, the set of representative word block level embedding vectors comprising at least one representative subset of word block level embedding vectors corresponding to at least one text segment in the text; for each text segment, generating a set of segment-level embedding vectors corresponding to the text segment based on a representative subset of word-block-level embedding vectors corresponding to the text segment; and combining at least one set of segment-level embedding vectors corresponding to the at least one text segment into the abstract embedding vector sequence.
The selecting a representative set of word-block-level embedding vectors may include: selecting the representative set of word block-level embedding vectors from the sequence of word block-level embedding vectors based on a self-attentive weight of each word block-level embedding vector in the sequence of word block-level embedding vectors.
The generating the set of segment-level embedded vectors may include: generating the set of segment-level embedding vectors by performing a set of pooling operations on the subsequence of block-level embedding vectors or the representative subset of block-level embedding vectors.
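The two abstraction ingredients just described — selecting representative word block level vectors by self-attention weight, and pooling a subsequence into a segment-level vector — could be sketched as below. How the weights are obtained and the top-k threshold are assumptions, not specified here:

```python
import numpy as np

def select_representatives(token_vecs, self_attn_weights, top_k):
    """Keep the top_k word block level embedding vectors with the largest
    self-attention weights, preserving their original order."""
    keep = np.sort(np.argsort(self_attn_weights)[-top_k:])
    return token_vecs[keep]

def segment_level_vector(subseq, op="mean"):
    """Pool a (sub)sequence of word block level vectors into a single
    segment-level embedding vector (mean or max pooling)."""
    return subseq.max(axis=0) if op == "max" else subseq.mean(axis=0)
```

Selection and pooling can also be composed, as in the embodiment above where pooling is applied only to the representative subset for each text segment.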
In one embodiment, the generating the text representation may include: generating a target embedding vector sequence by a self-attention mechanism based on the abstract embedding vector sequence; and selecting at least one embedding vector from the target sequence of embedding vectors as the text representation.
In one embodiment, the method 800 may further include: task results corresponding to the NLP task are obtained based on the textual representation.
In one embodiment, the method 800 may further include: obtaining at least one non-textual feature; generating an initial non-text embedding vector based on the at least one non-text feature; and generating a target multi-modal embedding vector based on the text and the initial non-text embedding vector.
The generating an initial non-text embedding vector may include: combining the at least one non-text feature into the initial non-text embedding vector by a fully-connected operation.
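Such a fully-connected operation might concatenate the J non-text feature vectors and project them to the hidden size H of the text embeddings — which is what allows the self-attention calculation to mix them with text vectors. A sketch with assumed names:

```python
import numpy as np

def combine_non_text_features(features, weight, bias):
    """Fully-connected operation: concatenate J non-text feature vectors
    and project them to the text embeddings' hidden size H, producing
    the single initial non-text embedding vector."""
    x = np.concatenate(features)   # (total feature dim,)
    return weight @ x + bias       # (H,)
```

The `weight` matrix and `bias` vector would be learned jointly with the rest of the model.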
The generating the target multimodal embedding vector may include: generating an intermediate multi-modal embedding vector based on an initial word block level embedding vector sequence of the text and the initial non-text embedding vector; and generating the target multi-modal embedding vector based on the sequence of abstract embedding vectors and the intermediate multi-modal embedding vector.
The generating the intermediate multi-modal embedding vector may include: generating the intermediate multi-modal embedding vector by iteratively performing a self-attentive computation based on the initial sequence of word-block-level embedding vectors and the initial non-text embedding vector. The generating the target multimodal embedding vector may include: generating the target multi-modal embedding vector by iteratively performing a self-attention calculation based on the sequence of abstract embedding vectors and the intermediate multi-modal embedding vector.
The self-attention calculation may include: a current multi-modal embedding vector is generated based on the sequence of previous embedding vectors and the previous multi-modal embedding vector.
The method 800 may further include: obtaining a task result corresponding to an NLP task based on the text representation and the target multi-modal embedding vector.
It should be understood that method 800 may also include any steps/processes for embedded vector sequence abstract based representation generation according to embodiments of the present disclosure described above.
Fig. 9 illustrates an example apparatus 900 for representation generation based on embedded vector sequence abstraction in accordance with an embodiment of this disclosure.
The apparatus 900 may include: a text obtaining module 910, configured to obtain a text; an embedded vector sequence generating module 920, configured to generate a block-level embedded vector sequence of the text; an embedded vector sequence abstraction module 930, configured to perform embedded vector sequence abstraction on the block-level embedded vector sequence to obtain an abstract embedded vector sequence of the text; and a text representation generation module 940 for generating a text representation of the text based on the abstract embedded vector sequence. Furthermore, apparatus 900 may also include any other module configured for embedded vector sequence abstract based representation generation according to embodiments of the present disclosure described above.
Fig. 10 illustrates an example apparatus 1000 for representation generation based on embedding vector sequence abstractions in accordance with an embodiment of this disclosure.
The apparatus 1000 may include: at least one processor 1010; and a memory 1020 storing computer-executable instructions. The computer-executable instructions, when executed, may cause the at least one processor 1010 to: the method includes obtaining text, generating a sequence of block-level embedded vectors for the text, performing an embedded vector sequence abstraction on the sequence of block-level embedded vectors to obtain a sequence of abstract embedded vectors for the text, and generating a text representation for the text based on the sequence of abstract embedded vectors.
It should be understood that processor 1010 may also perform any other steps/processes of the methods for embedded vector sequence abstract based representation generation according to embodiments of the present disclosure described above.
Embodiments of the present disclosure propose a computer program product for representation generation based on embedding vector sequence abstraction, comprising a computer program to be executed by at least one processor to: obtain a text; generate a word block level embedding vector sequence of the text; perform embedding vector sequence abstraction on the word block level embedding vector sequence to obtain an abstract embedding vector sequence of the text; and generate a text representation of the text based on the abstract embedding vector sequence. Furthermore, the computer program may also be executed to implement any other steps/processes of the methods for representation generation based on embedding vector sequence abstraction according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any operations of a method for embedded vector sequence abstraction-based representation generation in accordance with embodiments of the present disclosure as described above.
It should be understood that all operations in the methods described above are exemplary only, and the present disclosure is not limited to any operations in the methods or the order of the operations, but rather should encompass all other equivalent variations under the same or similar concepts. In addition, the articles "a" and "an" as used in this specification and the appended claims should generally be construed to mean "one" or "one or more" unless specified otherwise or clear from context to be directed to a singular form.
It should also be understood that all of the modules in the above described apparatus may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and design constraints imposed on the system as a whole. As an example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented with a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a state machine, gated logic units, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented using software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software should be viewed broadly as meaning instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. The computer-readable medium may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as being separate from the processor in the aspects presented in this disclosure, the memory may also be located internal to the processor, such as a cache or registers.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.
Claims (20)
1. A method for representation generation based on embedded vector sequence abstraction, comprising:
obtaining a text;
generating a word block level embedded vector sequence of the text;
performing embedding vector sequence abstraction on the word block level embedding vector sequence to obtain an abstract embedding vector sequence of the text; and
generating a text representation of the text based on the abstract embedded vector sequence.
2. The method of claim 1, wherein the generating a sequence of block-level embedded vectors comprises:
generating an initial sequence of word-block-level embedding vectors for the text based on at least one of a set of word-block embedding vectors, a set of segment embedding vectors, and a set of position embedding vectors corresponding to a set of word blocks in the text; and
generating the sequence of word-block-level embedded vectors by a self-attention mechanism based on the initial sequence of word-block-level embedded vectors.
3. The method of claim 1, wherein a number of embedding vectors of the sequence of abstract embedding vectors is less than a number of embedding vectors of the sequence of block-level embedding vectors.
4. The method of claim 1, wherein the performing embedded vector sequence abstraction comprises:
selecting a representative word block level embedding vector set from the word block level embedding vector sequence; and
combining the set of representative word block level embedding vectors into the abstract embedding vector sequence.
5. The method of claim 4, wherein the word block level embedding vector sequence further comprises at least one predefined encoding embedding vector, and the combining comprises:
combining the set of representative word block level embedding vectors and the at least one predefined encoding embedding vector into the abstract embedding vector sequence.
6. The method of claim 1, wherein the text comprises at least one text segment, and the performing embedded vector sequence abstraction comprises:
for each text segment, generating a set of segment-level embedding vectors corresponding to the text segment based on a word block level embedding vector subsequence, in the word block level embedding vector sequence, that corresponds to the text segment; and
combining at least one set of segment-level embedding vectors corresponding to the at least one text segment into the abstract embedding vector sequence.
7. The method of claim 6, wherein the word block level embedding vector sequence further comprises at least one predefined encoding embedding vector, and the combining comprises:
combining the at least one set of segment-level embedding vectors and the at least one predefined encoding embedding vector into the abstract embedding vector sequence.
8. The method of claim 1, wherein the performing embedding vector sequence abstraction comprises:
selecting a set of representative word block level embedding vectors from the sequence of word block level embedding vectors, the set of representative word block level embedding vectors comprising at least one representative subset of word block level embedding vectors corresponding to at least one text segment in the text;
for each text segment, generating a set of segment-level embedding vectors corresponding to the text segment based on a representative subset of block-level embedding vectors corresponding to the text segment; and
combining at least one set of segment-level embedding vectors corresponding to the at least one text segment into the abstract embedding vector sequence.
9. The method of claim 4 or 8, wherein said selecting a representative set of word-block-level embedding vectors comprises:
selecting the set of representative word block-level embedding vectors from the sequence of word block-level embedding vectors based on a self-attention weight of each word block-level embedding vector in the sequence of word block-level embedding vectors.
10. The method of claim 6 or 8, wherein the generating a set of segment-level embedding vectors comprises:
generating the set of segment-level embedding vectors by performing a set of pooling operations on the subsequence of block-level embedding vectors or the representative subset of block-level embedding vectors.
11. The method of claim 1, wherein the generating a textual representation comprises:
generating a target embedding vector sequence by a self-attention mechanism based on the abstract embedding vector sequence; and
selecting at least one embedding vector from the target sequence of embedding vectors as the text representation.
12. The method of claim 1, further comprising:
task results corresponding to a Natural Language Processing (NLP) task are obtained based on the textual representation.
13. The method of claim 1, further comprising:
obtaining at least one non-textual feature;
generating an initial non-text embedding vector based on the at least one non-text feature; and
generating a target multi-modal embedding vector based on the text and the initial non-text embedding vector.
14. The method of claim 13, wherein the generating an initial non-text embedding vector comprises:
combining the at least one non-text feature into the initial non-text embedding vector through a fully-connected operation.
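The fully-connected combination of claim 14 can be sketched as a single linear map over the concatenated non-text features; the weight values and dimensions below are illustrative placeholders (in practice the weights would be learned):

```python
import numpy as np

rng = np.random.default_rng(3)

features = np.array([0.3, 1.7, -0.5])  # e.g. three scalar non-text features
W = rng.normal(size=(8, 3))            # fully-connected weights (learned in practice)
b = np.zeros(8)                        # bias term

# One initial non-text embedding vector of dimension 8.
non_text_embedding = W @ features + b
```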
15. The method of claim 13, wherein the generating a target multi-modal embedding vector comprises:
generating an intermediate multi-modal embedding vector based on the initial word-block-level embedding vector sequence of the text and the initial non-text embedding vector; and
generating the target multi-modal embedding vector based on the sequence of abstract embedding vectors and the intermediate multi-modal embedding vector.
16. The method of claim 15, wherein,
the generating an intermediate multi-modal embedding vector comprises: generating the intermediate multi-modal embedding vector by iteratively performing a self-attention calculation based on the initial word-block-level embedding vector sequence and the initial non-text embedding vector; and/or
the generating the target multi-modal embedding vector comprises: generating the target multi-modal embedding vector by iteratively performing a self-attention calculation based on the sequence of abstract embedding vectors and the intermediate multi-modal embedding vector.
17. The method of claim 16, wherein the self-attention calculation comprises:
generating a current multi-modal embedding vector based on a previous embedding vector sequence and a previous multi-modal embedding vector.
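One way to read claims 16-17 (an assumption, not the claimed architecture) is an iterated attention step in which the previous multi-modal vector queries the embedding sequence to produce the current multi-modal vector:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def refine(seq: np.ndarray, mm_vec: np.ndarray, steps: int = 3) -> np.ndarray:
    """Iteratively update the multi-modal vector by attending over seq.

    seq: (n, d) previous embedding vector sequence.
    mm_vec: (d,) previous multi-modal embedding vector.
    """
    d = seq.shape[1]
    for _ in range(steps):
        attn = softmax(seq @ mm_vec / np.sqrt(d))  # attention over the sequence
        mm_vec = attn @ seq                        # current multi-modal vector
    return mm_vec

rng = np.random.default_rng(4)
seq = rng.normal(size=(6, 8))
mm = refine(seq, rng.normal(size=8))
```

Each update is a convex combination of the sequence vectors, so the multi-modal vector stays inside the per-dimension range of the sequence.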
18. The method of claim 13, further comprising:
obtaining a task result corresponding to an NLP task based on the text representation and the target multi-modal embedding vector.
19. An apparatus for representation generation based on embedding vector sequence abstractions, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to:
obtain a text,
generate a word-block-level embedding vector sequence of the text,
perform embedding vector sequence abstraction on the word-block-level embedding vector sequence to obtain an abstract embedding vector sequence of the text, and
generate a text representation of the text based on the abstract embedding vector sequence.
20. A computer program product for representation generation based on embedding vector sequence abstraction, comprising a computer program to be executed by at least one processor for:
obtaining a text;
generating a word-block-level embedding vector sequence of the text;
performing embedding vector sequence abstraction on the word-block-level embedding vector sequence to obtain an abstract embedding vector sequence of the text; and
generating a text representation of the text based on the abstract embedding vector sequence.
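Taken together, claims 1 and 20 describe a pipeline of embedding, abstraction, and representation generation. A toy end-to-end sketch under assumed components (whitespace tokenization as word blocks, a random embedding table, fixed-size segment mean-pooling as the abstraction, and mean of the abstract sequence as the representation — none of these specifics come from the claims):

```python
import numpy as np

rng = np.random.default_rng(5)

# Step 1: obtain a text and split it into toy word blocks.
text = "representation generation based on embedding sequence abstraction"
word_blocks = text.split()

# Step 2: generate the word-block-level embedding vector sequence.
vocab = {w: i for i, w in enumerate(sorted(set(word_blocks)))}
table = rng.normal(size=(len(vocab), 8))
block_seq = np.stack([table[vocab[w]] for w in word_blocks])

# Step 3: embedding vector sequence abstraction — pool each 3-block
# segment into one segment-level vector, yielding a shorter sequence.
segments = [block_seq[i:i + 3] for i in range(0, len(block_seq), 3)]
abstract_seq = np.stack([s.mean(axis=0) for s in segments])

# Step 4: generate the text representation from the abstract sequence.
text_rep = abstract_seq.mean(axis=0)
```

The point of the abstraction step is that downstream computation (e.g. the self-attention of claim 11) runs on the shorter abstract sequence rather than the full word-block sequence.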
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110869638.9A CN115688804A (en) | 2021-07-30 | 2021-07-30 | Representation generation based on embedding vector sequence abstraction |
PCT/US2022/032591 WO2023009220A1 (en) | 2021-07-30 | 2022-06-08 | Representation generation based on embedding sequence abstraction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110869638.9A CN115688804A (en) | 2021-07-30 | 2021-07-30 | Representation generation based on embedding vector sequence abstraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115688804A true CN115688804A (en) | 2023-02-03 |
Family
ID=82492873
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110869638.9A Pending CN115688804A (en) | 2021-07-30 | 2021-07-30 | Representation generation based on embedding vector sequence abstraction |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115688804A (en) |
WO (1) | WO2023009220A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117556787A (en) * | 2024-01-11 | 2024-02-13 | 西湖大学 | Method and system for generating target text sequence for natural language text sequence |
CN117556787B (en) * | 2024-01-11 | 2024-04-26 | 西湖大学 | Method and system for generating target text sequence for natural language text sequence |
Also Published As
Publication number | Publication date |
---|---|
WO2023009220A1 (en) | 2023-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111190997B (en) | Question-answering system implementation method using neural network and machine learning ordering algorithm | |
KR101754473B1 (en) | Method and system for automatically summarizing documents to images and providing the image-based contents | |
CN110532554A (en) | Chinese abstract generation method, system and storage medium | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
WO2014160282A1 (en) | Classifying resources using a deep network | |
CN112925904B (en) | Lightweight text classification method based on Tucker decomposition | |
CN112307182B (en) | Question-answering system-based pseudo-correlation feedback extended query method | |
CN115455171B (en) | Text video mutual inspection rope and model training method, device, equipment and medium | |
CN114997288B (en) | Design resource association method | |
CN114328807A (en) | Text processing method, device, equipment and storage medium | |
CN111723295A (en) | Content distribution method, device and storage medium | |
CN116821307B (en) | Content interaction method, device, electronic equipment and storage medium | |
CN105955953A (en) | Word segmentation system | |
CN117708274A (en) | Search question-answering system based on large model, method and electronic equipment thereof | |
CN116975271A (en) | Text relevance determining method, device, computer equipment and storage medium | |
CN114298055B (en) | Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium | |
CN110874408B (en) | Model training method, text recognition device and computing equipment | |
CN112270189B (en) | Question type analysis node generation method, system and storage medium | |
Cao et al. | A joint model for text and image semantic feature extraction | |
CN113535960A (en) | Text classification method, device and equipment | |
CN115688804A (en) | Representation generation based on embedding vector sequence abstraction | |
Zhang et al. | Extractive Document Summarization based on hierarchical GRU | |
Jiang et al. | A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems | |
Bao et al. | HTRM: a hybrid neural network algorithm based on tag-aware | |
Naveen et al. | Abstractive text summarizer: A comparative study on dot product attention and cosine similarity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||