WO2022156561A1

WO2022156561A1 - Method and device for natural language processing

Info

Publication number: WO2022156561A1
Application number: PCT/CN2022/071285
Authority: WO
Inventors: 张鹏; 张静; 魏俊秋
Original assignee: 华为技术有限公司
Priority date: 2021-01-20
Filing date: 2022-01-11
Publication date: 2022-07-28
Also published as: CN112883149B; CN112883149A

Abstract

A method and device for natural language processing, for use in the field of artificial intelligence. The method comprises: acquiring an input sequence (1201), which comprises an initial vector expression corresponding to at least one word in a first corpus; and producing an output sequence with the input sequence serving as an input for a self-attention model (1202), the output sequence expressing semantic information of each word in the first corpus. The self-attention model comprises multiple layers of networks, each layer of network comprises multiple self-attention modules, used for calculating degrees of relevance on the basis of an input vector, that is, the degrees of relevance between each word and neighboring words in the first corpus, and merging the degrees of relevance with the input vector to produce first sequences; the first sequences outputted by the multiple self-attention modules and outputs of the multiple self-attention modules in the preceding layer of network are merged to produce outputs of the current layer of network. The method allows, during natural language processing, the efficient interpretation of improved sematic information for each word in the corpus thereof.

Description

A natural language processing method and device

This application claims the priority of the Chinese patent application with the application number CN202110077612.0 and the application title "A Natural Language Processing Method and Device", which was filed with the China Patent Office on January 20, 2021, the entire contents of which are incorporated by reference in in this application.

technical field

The present application relates to the field of artificial intelligence, and in particular, to a natural language processing method and device.

Background technique

Self-attention (SA) model is the main part of the field of natural language processing and has a very wide range of applications, such as machine translation, pre-trained language models, etc. The self-attention model effectively encodes a sequence of words into several vector representations by calculating the dependencies between words, so that the output word vector representation contains its contextual semantic information. Therefore, how to interpret the better semantic information of each word through the self-attention model and convert the word into a better vector to represent it has become an urgent problem to be solved.

SUMMARY OF THE INVENTION

The present application provides a natural language processing method and device, which are used to more efficiently interpret each network layer by adding the hidden state reuse of the upper layer of the network in each network layer of the self-attention model in the process of natural language processing. The better semantic information of a word in its corpus.

In view of this, in the first aspect, the present application provides a method for natural language processing, including: first acquiring an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in the first corpus, usually one word corresponds to an initial vector representation Vector representation; the input sequence is used as the input of the self-attention model, and the output sequence is obtained, and the output sequence includes at least one word in the first corpus after natural language processing (NLP) by the self-attention model. Vector representation, the output sequence represents the semantic information of each word in the first corpus in the first corpus, and the semantic information combines the context information of the first corpus, which can accurately represent the exact meaning of each word in the first corpus; Among them, the self-attention model includes a multi-layer network, the input of any layer of the network except the first layer in the multi-layer network is the output of the previous layer of network, each layer of network includes multiple self-attention modules, each self-attention module The attention module is used to calculate the degree of association based on the input vector, that is, the degree of association between each word in the first corpus and at least one adjacent word, and fuse the degree of association and the input vector to obtain the first sequence, each layer of the network. The output of is obtained by fusing the first sequence output by multiple self-attention modules and the first sequence output by multiple self-attention modules in the previous layer of network.

Therefore, in the embodiment of the present application, the self-attention model is realized through the mechanism of the neural network of frequent differential equations, and the neural differentiation through a layer of self-attention modules (that is, multiple calls and summations in the self-attention model) ), to achieve multi-layer self-attention fitting, and obtain an output sequence that can more accurately express semantics. In addition, the state reuse mechanism is introduced into the normal differential equation, and the hidden state information inside the previous layer is reused during the fitting process of the normal differential equation to each layer (except the first layer), thereby improving the calculation speed. That is, the current network layer can reuse the output of the SA module of the previous layer of the network, so that the self-attention model can quickly obtain more accurate output results, reduce the computational complexity of the model, and improve the training and inference of the self-attention model. efficiency.

In a possible implementation, the self-attention model further includes a feature extraction network, and the above method may further include: the feature extraction network extracts features according to multiple adjacent initial vector representations in the input sequence to obtain a local feature sequence, and The local feature sequence is used as the input of the multi-layer network.

Therefore, in the embodiment of the present application, local features can be extracted from the input sequence, so that the local features can be used as the input of the multi-layer network, so that the multi-layer network can refer to the local information to calculate the correlation degree, and improve the attention to the local information. Make the interpretation of the output sequence better for local information.

In a possible implementation, the self-attention model further includes a fusion network, and the above method may further include: the fusion network fuses the local feature sequence and the output result of the multi-layer network to obtain an output sequence.

Therefore, in the embodiment of the present application, local features can be extracted from the input sequence, and the local features and the output result of the SA network can be fused to obtain the output sequence, so that the local features are fused in the output sequence. In this way, the final self-attention result focuses on local features, and a sequence that can more accurately represent the semantics of each word is obtained.

In a possible implementation manner, the above-mentioned fusion network fuses the local feature sequence and the output result of the multi-layer network, which may specifically include: the fusion network calculates the similarity between the local feature sequence and the output result, and fuses the similarity and the local feature. sequence to get the output sequence.

In the embodiments of the present application, the similarity and local features can be fused by calculating the similarity, so as to obtain an output sequence, so that the output sequence contains local information, and the represented semantics is more accurate.

In a possible implementation manner, the self-attention model further includes a classification network, and the above method may further include: using the input of the classification network as an output sequence, and outputting the category corresponding to the first corpus. Therefore, the method provided by the embodiment of the present application can be applied to the classification scene, and the classification of the corpus can be realized by adding a classification network to the self-attention model.

In a possible implementation, the self-attention model further includes a translation network, and the above method may further include: using the output sequence as the input of the translation network, and outputting a second corpus, the language type of the first corpus is the first category, and the first The language category of the second corpus is the second category, and the first category and the second category are different language categories. Therefore, the method provided by the embodiment of the present application can also be applied to translation scenarios, that is, adding a translation network to the self-attention model can realize the translation of corpus, and translate the corpus corresponding to the input sequence into corpus of different languages.

In a possible implementation, the parameters of multiple self-attention modules in each layer of the multi-layer network are the same. Therefore, in the embodiments of the present application, the self-attention model can be implemented through the same parameters of the SA module, so that the self-attention model occupies a reduced amount of storage, and the efficiency of training and forward inference of the self-attention model can be improved. By introducing ODE into the SA mechanism, and realizing the SA model by means of ODE, the problem of parameter redundancy caused by the multi-layer overlapping in some SA models is solved, and the parameters of the original multi-layer network can be achieved by using the parameters of one layer of network. effect that can be achieved.

In a possible implementation, before acquiring the input sequence, a training set may also be used to train the self-attention model, the training set may include at least one corpus, each corpus includes at least one word, and each corpus is converted into To include a sequence of initial vector representations for each word, this sequence is then fed into the self-attention model, resulting in an output sequence. Then, the gradient value can be calculated by the adjoint ODE algorithm, and the parameters of the self-attention model can be updated based on the gradient, so that the output of the self-attention model is closer to the label corresponding to the corpus. Therefore, in the embodiment of the present application, the numerical solution method is used to obtain the gradient, which greatly reduces memory consumption and gradient error compared with the backpropagation algorithm.

In a second aspect, the present application provides a self-attention model, the self-attention model includes a multi-layer network, the input of any layer of the multi-layer network is the output of the previous layer of network, and each layer of the network includes multiple self-attention models. Attention module, each self-attention module is used to calculate the correlation degree based on the input vector, and fuse the correlation degree and the input vector to obtain the first sequence, and the output of each layer of the network is the output of the fusion of multiple self-attention modules. A sequence is obtained from the first sequence output by multiple self-attention modules in the previous layer of network, and the degree of association represents the degree of association between each word in the first corpus and at least one adjacent word; the input sequence is used as the self-attention module. The input of the attention model, the output sequence is obtained. The output sequence includes the vector representation corresponding to at least one word in the first corpus after the natural language processing NLP by the self-attention model, and the output sequence indicates that each word in the first corpus is in Semantic information in the first corpus.

In a possible implementation, the self-attention model further includes a feature extraction network, which is used to extract features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, and combine the local feature sequence as the input to the multi-layer network.

In a possible implementation, the self-attention model further includes a fusion network for fusing the local feature sequence and the output result of the multi-layer network to obtain an output sequence.

In a possible implementation, the fusion network is specifically used to calculate the similarity between the local feature sequence and the output result, and fuse the similarity and the local feature sequence to obtain the output sequence.

In a possible implementation, the self-attention model further includes a classification network, the input of the classification network is an output sequence, and the category corresponding to the first corpus is output. Therefore, the method provided by the embodiment of the present application can be applied to the classification scene, and the classification of the corpus can be realized by adding a classification network to the self-attention model.

In a possible implementation, the self-attention model further includes a translation network, the language type of the first corpus is the first type, the input of the translation network is the output sequence, and outputs the second corpus, and the language type of the second corpus is the first type The second category, the first category and the second category are different language types. Therefore, the method provided by the embodiment of the present application can also be applied to translation scenarios, that is, adding a translation network to the self-attention model can realize the translation of corpus, and translate the corpus corresponding to the input sequence into corpus of different languages.

In a third aspect, an embodiment of the present application provides a natural language processing apparatus, and the natural language processing apparatus has a function of implementing the natural language processing method of the first aspect. This function can be implemented by hardware or by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions.

In a fourth aspect, an embodiment of the present application provides a natural language processing apparatus, including: a processor and a memory, wherein the processor and the memory are interconnected through a line, and the processor invokes program codes in the memory to execute any one of the above-mentioned first aspects The processing-related functions in the natural language processing method shown in item. Optionally, the natural language processing device may be a chip.

In a fifth aspect, an embodiment of the present application provides a natural language processing device. The natural language processing device may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface. The instructions are executed by a processing unit, and the processing unit is configured to perform processing-related functions as in the first aspect or any of the optional embodiments of the first aspect.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.

In a seventh aspect, an embodiment of the present application provides a computer program product including instructions, which, when run on a computer, enables the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.

Description of drawings

Fig. 1 is a schematic diagram of a main frame of artificial intelligence applied by the application;

2 is a schematic diagram of a system architecture provided by the application;

3 is a schematic diagram of another system architecture provided by the present application;

4A is a schematic structural diagram of a self-attention model provided by an embodiment of the present application;

4B is a schematic structural diagram of a sequence conversion provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;

6A is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;

6B is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;

FIG. 9 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a converged network provided by an embodiment of the application;

12 is a schematic flowchart of a natural language processing method provided by an embodiment of the present application;

13 is a schematic structural diagram of a natural language processing apparatus according to an embodiment of the present application;

14 is a schematic structural diagram of another natural language processing apparatus provided by an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a chip according to an embodiment of the present application.

Detailed ways

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

First, the overall workflow of the artificial intelligence system will be described. Please refer to Figure 1. Figure 1 shows a schematic structural diagram of the main frame of artificial intelligence. The above-mentioned artificial intelligence theme framework is explained in two dimensions (vertical axis). Among them, the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom". The "IT value chain" reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.

(1) Infrastructure

The infrastructure provides computing power support for artificial intelligence systems, realizes communication with the outside world, and supports through the basic platform. Communicate with the outside through sensors; computing power is provided by intelligent chips, such as central processing unit (CPU), network processor (neural-network processing unit, NPU), graphics processor (English: graphics processing unit, GPU), Application specific integrated circuit (ASIC) or field programmable gate array (field programmable gate array, FPGA) and other hardware acceleration chips) are provided; the basic platform includes distributed computing framework and network related platform guarantee and support, It can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with external parties to obtain data, and these data are provided to the intelligent chips in the distributed computing system provided by the basic platform for calculation.

(2) Data

The data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.

Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.

Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.

Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.

(4) General ability

After the above-mentioned data processing, some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.

(5) Smart products and industry applications

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall solution of artificial intelligence, and the productization of intelligent information decision-making to achieve landing applications. Its application areas mainly include: intelligent terminals, intelligent transportation, Smart healthcare, autonomous driving, safe city, etc.

The embodiments of this application involve related applications of neural networks and natural language processing (NLP). concept is introduced.

Corpus: Also known as free text, it can be words, words, sentences, fragments, articles, and any combination thereof. For example, "The weather is really nice today" is a corpus.

The self-attention model refers to effectively encoding a sequence of data (such as natural corpus "Your mobile phone is very good.") into several multi-dimensional vectors, which are convenient for numerical operations. The similarity information between each element is called self-attention.

ODENet (neural ordinary differential equations networks, ODENet) is an implementation for time-dependent neural networks or multi-step/layer neural networks. ODENets can fit the given neural network in the For each different continuous time point or the output under each asynchronous/layer, a set of parameters is used to fit the output of the original neural network under multiple continuous time points or multiple steps/layers, which has high parameter efficiency.

Loss function (loss function): also known as cost function (cost function), a measure that compares the difference between the prediction output of the machine learning model for the sample and the real value (also known as the supervision value) of the sample, that is, used to measure The difference between the predicted output of a machine learning model for a sample and the true value of the sample. The loss function may generally include loss functions such as mean square error, cross entropy, logarithm, and exponential. For example, the mean squared error can be used as a loss function, defined as

Specifically, a specific loss function can be selected according to the actual application scenario.

Gradient: The vector of derivatives of the loss function with respect to the parameters.

Stochastic gradient: The number of samples in machine learning is large, so the loss function of each calculation is calculated from the data obtained by random sampling, and the corresponding gradient is called stochastic gradient.

Back propagation (BP): An algorithm that calculates the gradient of model parameters and updates model parameters according to the loss function.

Ordinary differential equation adjoint solution (adjoint ODE) algorithm: is a reverse update algorithm for training ordinary differential equations (ordinary differential equation, ODE). algorithm, which greatly reduces memory consumption and gradient error.

Neural machine translation: Neural machine translation is a typical task of natural language processing. The task is the technique of outputting a sentence in the target language given a sentence in the source language. In the commonly used neural machine translation model, the words in the sentences of the source language and the target language are encoded into vector representations, and the association between words and words and sentences and sentences is calculated in the vector space to perform translation tasks.

Pre-trained language model (PLM): It is a natural language sequence encoder that encodes each word in the natural language sequence into a vector representation for prediction tasks. The training of PLM consists of two stages, namely the pre-training stage and the fine-tuning stage. In the pre-training stage, the model is trained on large-scale unsupervised text for language model tasks, thereby learning word representations. In the fine-tuning stage, the model is initialized with the parameters learned in the pre-training stage, and can be successfully trained on downstream tasks such as text classification or sequence labeling with fewer steps. The semantic information obtained by pre-training is successfully transferred to downstream tasks.

Embedding: refers to the feature representation of the sample.

The natural language processing method provided by the embodiments of the present application may be executed on a server, and may also be executed on a terminal device. The terminal device can be a mobile phone with image processing function, tablet personal computer (TPC), media player, smart TV, laptop computer (LC), personal digital assistant (PDA) ), a personal computer (PC), a camera, a video camera, a smart watch, a wearable device (WD), or an autonomous vehicle, etc., which are not limited in this embodiment of the present application.

Referring to FIG. 2 , an embodiment of the present application provides a system architecture 200 . The system architecture includes a database 230 and a client device 240 . The data collection device 260 is used to collect data and store it in the database 230 , and the training module 202 generates the target model/rule 201 based on the data maintained in the database 230 . The following will describe in more detail how the training module 202 obtains the target model/rule 201 based on the data. The target model/rule 201 is the neural network mentioned in the following embodiments of the present application. For details, please refer to the relevant descriptions in FIGS. 4A-12 below. .

The computing module may include a training module 202, and the target model/rule obtained by the training module 202 may be applied to different systems or devices. In FIG. 2, the execution device 210 configures a transceiver 212, which can be a wireless transceiver, an optical transceiver, or a wired interface (such as an I/O interface), etc., to perform data interaction with external devices, and a "user" can Data is input to transceiver 212 through client device 240. For example, client device 240 may send target tasks to execution device 210, request the execution device to train a neural network, and send execution device 210 a database for training.

The execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .

The calculation module 211 uses the target model/rule 201 to process the input data. Specifically, the calculation module 211 is used to: obtain an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in the first corpus; use the input sequence as the input of the self-attention model to obtain an output sequence, where the output sequence includes The vector representation corresponding to at least one word in the first corpus after the attention model performs natural language processing (NLP), the output sequence represents the semantic information of each word in the first corpus in the first corpus, and the self-attention model is obtained through the training set. Obtained from training, the training set includes at least one corpus, and each corpus includes at least one word; wherein, the self-attention model includes a multi-layer network, and the input of any layer of the multi-layer network is the output of the previous layer of network, each The layer network includes multiple self-attention modules, each self-attention module is used to calculate the degree of association representing the degree of association between each word in the first corpus and at least one adjacent word based on the input vector, and fuse the association degree and the input vector to get the first sequence, each layer in the multi-layer network is connected, and the output of each layer of the network is the first sequence that fuses the output of multiple self-attention modules and multiple layers in the previous layer of network. The first sequence output from the attention module is obtained.

Finally, transceiver 212 returns the constructed neural network to client device 240 to deploy the neural network in client device 240 or other devices.

More deeply, the training module 202 can obtain corresponding target models/rules 201 based on different data for different tasks, so as to provide users with better results.

In the case shown in FIG. 2 , the data input into the execution device 210 can be determined according to the input data of the user, for example, the user can operate in the interface provided by the transceiver 212 . In another case, the client device 240 can automatically input data to the transceiver 212 and obtain the result. If the client device 240 automatically inputs data and needs to obtain the authorization of the user, the user can set the corresponding permission in the client device 240 . The user can view the result output by the execution device 210 on the client device 240, and the specific presentation form can be a specific manner such as display, sound, and action. The client device 240 can also act as a data collection end to store the collected data associated with the target task into the database 230 .

The training or update process mentioned in this application may be performed by the training module 202 . It can be understood that the training process of the neural network is to learn the way to control the spatial transformation, and more specifically, to learn the weight matrix. The purpose of training a neural network is to make the output of the neural network as close to the expected value as possible, so you can compare the predicted value and expected value of the current network, and then update the weight of each layer of the neural network in the neural network according to the difference between the two. vector (of course, the weight vector can usually be initialized before the first update, that is, the parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the value of the weight in the weight matrix is adjusted to reduce the predicted value, and after continuous adjustment, the value output by the neural network is close to or equal to the expected value. Specifically, the difference between the predicted value and the expected value of the neural network can be measured by a loss function or an objective function. Taking the loss function as an example, the higher the output value of the loss function (loss), the greater the difference. The training of the neural network can be understood as the process of reducing the loss as much as possible. For the process of updating the weight of the starting point network and training the serial network in the following embodiments of the present application, reference may be made to this process, which will not be repeated below.

As shown in FIG. 2, a target model/rule 201 is obtained by training according to the training module 202. In this embodiment of the present application, the target model/rule 201 may be a self-attention model in the present application, and the self-attention model may include a depth volume Deep convolutional neural networks (DCNN), recurrent neural networks (RNNS) and other networks. The neural network mentioned in this application may include various types, such as deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN) or residual network other neural networks, etc.

Wherein, in the training phase, the database 230 may be used to store a sample set for training. The execution device 210 generates a target model/rule 201 for processing samples, and uses the sample set in the database to iteratively train the target model/rule 201 to obtain a mature target model/rule 201. The target model/rule 201 is embodied as Neural Networks. The neural network obtained by executing the device 210 can be applied in different systems or devices.

In the inference phase, the execution device 210 may call data, codes, etc. in the data storage system 250 , and may also store data, instructions, etc. in the data storage system 250 . The data storage system 250 may be placed in the execution device 210 , or the data storage system 250 may be an external memory relative to the execution device 210 . The calculation module 211 can process the samples obtained by the execution device 210 through a neural network to obtain a prediction result, and the specific manifestation of the prediction result is related to the function of the neural network.

It should be noted that FIG. 2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 2 , the data storage system 250 is an external memory relative to the execution device 210 . In other scenarios, the data storage system 250 may also be placed in the execution device 210 .

The target model/rule 201 trained according to the training module 202 can be applied to different systems or devices, such as mobile phones, tablet computers, laptop computers, augmented reality (AR)/virtual reality (VR) , a vehicle terminal, etc., or a server or a cloud device.

The target model/rule 201 may be the self-attention model in the present application in the embodiment of the present application. Specifically, the self-attention model provided in the embodiment of the present application may include CNN, deep convolutional neural networks (deep convolutional neural networks) , DCNN), recurrent neural network (RNN) and other networks.

Referring to FIG. 3 , an embodiment of the present application further provides a system architecture 300 . The execution device 210 is implemented by one or more servers, and optionally, cooperates with other computing devices, such as: data storage, routers, load balancers and other devices; the execution device 210 may be arranged on a physical site, or distributed in multiple on the physical site. The execution device 210 can use the data in the data storage system 250, or call the program code in the data storage system 250 to implement the steps of the training method for a computing device corresponding to FIG. 12 below in this application.

A user may operate respective user devices (eg, local device 301 and local device 302 ) to interact with execution device 210 . Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.

Each user's local device can interact with the execution device 210 through any communication mechanism/standard communication network, which can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof. Specifically, the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, and the like. The wireless network includes but is not limited to: the fifth generation mobile communication technology (5th-Generation, 5G) system, the long term evolution (long term evolution, LTE) system, the global system for mobile communication (global system for mobile communication, GSM) or code division Multiple access (code division multiple access, CDMA) network, wideband code division multiple access (wideband code division multiple access, WCDMA) network, wireless fidelity (wireless fidelity, WiFi), Bluetooth (bluetooth), Zigbee protocol (Zigbee), Any one or a combination of radio frequency identification technology (radio frequency identification, RFID), long range (Long Range, Lora) wireless communication, and near field communication (near field communication, NFC). The wired network may include an optical fiber communication network or a network composed of coaxial cables, and the like.

In another implementation, one or more aspects of the execution device 210 may be implemented by each local device, for example, the local device 301 may provide the execution device 210 with local data or feedback calculation results. The local device may also be referred to as a computing device.

It should be noted that all the functions of the execution device 210 can also be implemented by the local device. For example, the local device 301 implements the functions of the execution device 210 and provides services for its own users, or provides services for the users of the local device 302 .

Some commonly used SA models are usually obtained by stacking multiple SA modules, and each SA module has corresponding parameters. For natural language representation, SA models learn more abstract semantic information by stacking multiple layers, which often results in parameter Redundancy can easily lead to a decrease in the efficiency of training and inference. Moreover, the SA model can capture the global information, but the local information fusion is ignored and cannot learn the actual meaning represented by the text.

Therefore, this application proposes a self-attention model and a natural language processing method based on the self-attention model, which are used in the process of natural language processing by adding a pair of layers to each network layer of the self-attention model. The hidden state reuse of the network can more efficiently interpret the better semantic information of each word in its corpus. In addition, by fusing global information and local information, the semantic information of each word in the text can be interpreted, that is, the context semantics represented by each word in the text in the corpus in which it is located can be more accurately interpreted.

The self-attention model provided by the present application and the natural language processing method based on the self-attention model will be introduced in detail below.

First, the self-attention model provided by this application is introduced.

The input of the self-attention model provided by this application is the input sequence, and the output is the output sequence obtained by performing self-attention processing on the input sequence. As shown in FIG. 4A , the input sequence may be a sequence composed of vectors corresponding to a corpus, each word in the corpus corresponds to a vector, and one or more vectors form a sequence. The input sequence is input into the self-attention model 401, and then the self-attention model 401 interprets the semantics of each word in the corresponding corpus according to the dependencies between the sequences in the input sequence, thereby obtaining the output sequence.

For example, the self-attention model can effectively encode words into several vector representations by calculating the dependencies between words, so that the output word vector represents the semantic information contained in the corpus, and combines the context information of the corpus, Make the semantics represented by the word vector representation, which is also called the hidden state in deep learning (Deep Learning), more accurate. As shown in Figure 4B, each word in the corpus "Your mobile phone is very good." corresponds to an initial vector, thus forming a sequence [x ₁ x ₂ x ₃ x ₄ x ₅ x ₆ x ₇ x ₈ ], such as ""you" corresponds to the initial vector x ₁ , "the" corresponds to the initial vector x ₂ and so on. After encoding by the self-attention model, the output vector corresponding to each input vector is output, and the output sequence [h ₁ h ₂ h ₃ h ₄ h ₅ h ₆ h ₇ h ₈ ] is obtained. Usually, the length of the output sequence is the same as that of the input sequence. have the same length or have a corresponding mapping relationship. It can be understood that the self-attention model can interpret the semantics of each word in the corpus according to the similarity or dependency between each vector and adjacent vectors, so as to obtain an output sequence representing the semantics of each word in the corpus. , the initial vector of the word is further optimized into a better vector representation.

Therefore, in the embodiment of the present application, an ODE-based SA module is introduced into the self-attention model, so that self-attention is calculated by means of ODE. Compared with the SA module stacking mechanism of some existing solutions, the ODE-based SA module is introduced into the self-attention model used in this application, which has higher parameter efficiency and improves the training and performance of the self-attention model. Efficiency of reasoning.

Exemplarily, referring to FIG. 5 , a schematic structural diagram of a self-attention model provided by the present application is described as follows.

The self-attention model may include multiple network layers, each of which includes one or more self-attention modules. For the convenience of distinction, in the following embodiments of the present application, the multi-layer network layer is referred to as the SA network.

Specifically, the input of the first layer of the network layer is an input sequence, and the input sequence includes an initial vector representation corresponding to each word in the corpus to be processed (or referred to as the first corpus). The input of each network layer after the first layer is the output of the previous network layer. The last layer of the network layer outputs the output result of the SA network.

More specifically, in the SA network of the self-attention model provided by the embodiment of the present application, each layer of the network includes one or more SA modules. When each layer network includes multiple SA modules, the input end of each SA module is connected to one or more SA modules, and the input of each SA module includes the output of one or more SA modules connected to its input end.

Or, in some possible scenarios, in addition to being connected to one or more SA modules, the input end of each SA module may also be connected to the output end of the upper-layer network or to the input end of the SA network, that is, each SA module The input can also include the input of the current network layer. Generally, if the current network layer is the first layer network, the input of the current network layer is the input sequence, and if the current network layer is a non-first layer network, the input of the current network layer is the output of the previous layer network. When the input of a certain SA module includes multiple sequences (each sequence includes one or more vectors), the multiple sequences can be fused to obtain the input of the current SA module. For example, if the input of a certain SA module includes the outputs of multiple SA modules, the outputs of the multiple SA modules can be fused to obtain the input of the current SA module; or, when the input of a certain SA module includes one or more SA modules The output of each SA module and the output of the upper-layer network are merged with the output of the one or more SA modules and the output of the upper-layer network to obtain the input of the current SA module.

For ease of understanding, the structure of the SA network can be exemplarily shown in FIG. 6A , wherein, the SA network includes i+1 layers of network layers. Except for the first layer, each network layer can reuse the network of the previous layer. The hidden state inside the layer is combined with the hidden state inside the previous network layer and the calculation result of the SA module in the current layer to obtain the output result of the current layer. The right side of FIG. 6A is a schematic structural diagram of one of the network layers, wherein the ODE layer i contains several SA modules and connections between them. Each SA module usually has the same structure and parameters, and each SA module has the same structure and parameters. A connection contains a weight (a ₁ , a ₂ , a ₃ . get.

In more detail, as shown in FIG. 6B , the specific structure of the SA network is described by taking the first-layer network layer and the second-layer network layer as examples.

Wherein, after the input sequence is obtained, the input sequence can be used as the input of each SA module in the first-layer network (ie, the network layer 1 shown in FIG. 6B ).

In addition, the input of the first SA module in the first network layer may be an input sequence, and the inputs of other SA modules except the first SA module may include, in addition to the input sequence, the input sequence of the previous SA module connected to it. output. When the input of the SA module includes the input sequence and the output of the previous SA module connected to it, the input sequence and the output of the previous SA module connected to it can be fused to obtain a new vector and used as the input of the current SA module.

The output of the first network layer can be obtained by fusing the outputs of all SA modules in the first layer network side layer.

The output of the first-layer network is used as the input of the second-layer network (ie, network layer 2 shown in FIG. 6B ), and the calculation process of each SA module in the second-layer network is the same as that of each SA module in the first-layer network. The calculation method is similar. In addition, the output of each SA module in the first-layer network is used as the input of the second-layer network, that is, the hidden state obtained in the first-layer network is used as the input of the second-layer network. The second-layer network fuses the output of each SA module in the first-layer network to obtain the first sequence, and then fuses the first sequence with the output of each SA module in the second-layer network to obtain the second-layer network Output.

Therefore, in the self-attention model provided in this application, not only the ODE mechanism is introduced into the self-attention model, but each layer of network can also reuse the hidden state of the previous layer of network, so that the self-attention model can be calculated more accurately. Attention can improve the training efficiency of the self-attention model, so that the self-attention model can quickly converge, and the output accuracy of the self-attention model can also be improved.

More specifically, the calculation flow inside each SA module is exemplified below.

The SA module can be used to calculate the degree of association based on the input vector, that is, the degree of association between each word in the corpus and one or more adjacent words, and then fuse the input vector and the degree of association to obtain the SA module's degree of association. Output the result. Algorithms for calculating the degree of association may include various algorithms, such as multiplication, transposition and multiplication, etc. Specifically, an algorithm suitable for practical application scenarios can be selected.

Illustratively, in the process of converting the sequence [x ₁ x ₂ x ₃ x ₄ x ₅ x ₆ x ₇ x ₈ ] to the output sequence [h ₁ h ₂ h ₃ h ₄ h ₅ h ₆ h ₇ h ₈ ] In the process, if the length of the input sequence is N, and the dimension of each word (that is, each initial vector) is d, the input sequence vector constitutes a

The matrix X, this matrix X will be combined with the three matrices W _k , W _v ,

Do matrix multiplication (ie linear transformation) to get three matrices K, V,

As an input to the computation computes self-attention (i.e. relevance). When calculating self-attention, the product of K and Q will be calculated to obtain an N×N attention matrix (attention matrix), which represents the dependence between each element in the input sequence, that is, the degree of association, and finally this matrix and V Multiplied and processed by softmax, converted to a

The sequence representation of , contains N d-dimensional vectors, that is, the output of the SA module. The self-attention model incorporates the similarity information of the word vector x _i in each input sequence and all other word vectors in the input sequence into h _i , that is, h _i is dependent on the information of each input word vector in the sentence, that is It is said that the output sequence contains global information, that is, the vector representation of the sequence learned by the output sequence can capture long-distance dependencies, and the training and inference process of the SA model has better parallelism, so as to achieve more efficient training of the SA model. and inference effects.

Optionally, in the SA model provided by this application, the parameters of the SA modules in each network layer are the same, or the parameters of each SA module in the SA model are the same, and so on. Therefore, the SA model provided in the present application has higher parameter efficiency, improves the training speed of the model, occupies less memory, and improves the generalization ability of the model.

In addition, the self-attention model can also include a feature extraction network for extracting local features of the input sequence, that is, extracting features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, and can The local feature sequence is used as the input of the aforementioned multilayer network. It can be understood that, in the embodiments of the present application, the feature extraction network is used to perform feature extraction in units of multiple adjacent vectors, so as to extract a vector including local information.

Therefore, in the embodiments of the present application, local feature extraction can be used to focus on the local information of the input sequence, and to extract detailed information in the input sequence, thereby obtaining an output sequence that can more accurately represent semantics.

Specifically, the feature extraction network can be composed of multiple convolution kernels, and usually the width of the convolution kernel is the length of the unit for extracting features. For example, if the width of the convolution kernel is 3, it can indicate that when extracting features, the feature extraction can be performed in units of three adjacent initial vectors, that is, the three adjacent initial vectors are used as the input of the convolution kernel, so as to output The extracted local feature sequence.

In addition, optionally, the self-attention model can also include a fusion network for fusing the local feature sequence and the output results of the multi-layer network to obtain the output sequence. It can be understood that the sequence output by the SA network carries the global information, while the local feature sequence carries the local information. After the global information and the local information are fused, the output sequence that can represent the global and local information can be obtained.

Exemplarily, as shown in FIG. 7 , the self-attention model may include a feature extraction network 701 , an SA network 702 and a fusion network 703 , and the SA network is the network shown in the foregoing FIGS. 5 to 6B .

The feature extraction network 701 can extract features from the input sequence, and then use the obtained local feature sequence as the input of the SA network 702 .

The SA network 702 can be used to obtain an output result by calculating the degree of association between each word in the corpus and at least one adjacent word. The input of the SA network 702 can be the local feature sequence output by the feature extraction network 701, and the SA network 702 can specifically calculate the degree of association between each word in the corpus and at least one adjacent word based on the local feature sequence, so as to obtain the output result , please refer to the related descriptions in the aforementioned FIG. 5 to FIG. 6B , which will not be repeated here.

The fusion network 703 can fuse the output result of the SA network 702 and the local feature sequence output by the feature extraction network 701 to obtain the output sequence.

For example, the fusion network 703 can calculate the similarity between the local feature sequence and the output result of the fusion SA network 702, and then fuse the similarity and the local feature sequence to obtain the output sequence.

Therefore, in the embodiment of the present application, global and local semantic information is fused to obtain more accurate contextual semantic information, so that the final output sequence can better represent the semantics of each word in the corpus, thereby improving downstream tasks. accuracy.

In addition, the structure of another self-attention model can be shown in FIG. 8. The local feature sequence extracted by the feature extraction network 701 can also not be used as the input of the SA network 702, that is, the input sequence is used as the input of the SA network 702. The local feature sequence output by the extraction network 701 and the output result of the SA network 702 are used as the input of the fusion network 703 to obtain the output sequence. That is, the fusion network fuses the local feature sequence and the output result of the SA network to obtain the output sequence, which is equivalent to the output sequence containing both global information and local information, thus paying attention to the global and local semantics in the corpus, so that the output sequence contains both global and local information. The interpreted semantics are more accurate.

It should be noted that the fusion mentioned in the above or the following embodiments of this application may refer to operations such as multiplying two values, weighted fusion, or direct splicing, and the specific fusion method can be selected according to the actual application scenario. , which is not limited in this application. For example, after obtaining two sequences, multiply the two sequences to obtain the fused sequence. For another example, two sequences with dimensions of 5 and 8 are obtained, and the two sequences can be directly spliced to obtain a sequence with a dimension of 13. For another example, after two sequences are obtained, a corresponding weight value is assigned to each sequence, and then a new sequence is obtained by means of weighted fusion.

In addition, the self-attention model provided in this application may also include networks related to downstream tasks, such as neural machine translation, text classification, or pre-trained language models. In the following, several kinds of networks are taken as examples for exemplary introduction.

Example 1. Classification network

Wherein, the self-attention model may further include a classification network, or the output of the self-attention model may be used for input to the classification network to identify the category of the corpus corresponding to the input sequence.

For example, the structure of a self-attention model including a classification network can be shown in Figure 9, where an N-Gram convolutional layer is used to learn the local information of the input text representation matrix (ie, the input sequence). The N-Gram convolutional layer performs convolution calculation on the input sequence representation. The N-Gram convolutional layer contains several convolution kernels, and each convolution kernel is associated with each fixed-length subsequence (for example, N=3, the The subsequence is the initial vector corresponding to each adjacent three words shown in Figure 9) to perform the convolution operation, and finally the values calculated by all the convolution kernels are summarized to obtain a new vector representation of each initial vector in the sequence. , due to the limited width of each convolution kernel, the hidden state can only be obtained by convolution calculation on the representation of several adjacent elements. As shown in Figure 9, each hidden state consists of two or three adjacent elements. It is represented by a vector, so each hidden state can only contain the dependencies between the element and adjacent finite elements, that is, local information.

As shown in Figure 10, the ODE-based SA module in this application is shown. The figure shows the ODE-based SA calculation method of the first layer and the second layer. The first layer contains 4 SA modules, and the second layer contains 3 SA modules. In addition, it also includes the multiplexing of the state of the previous layer. The architecture and parameters of each layer above the second layer are usually the same as those of the second layer, and each SA module is usually the same, as shown in the figure 10 is shown in the lower half of the figure. In each layer in the figure, G() represents the hidden state of the output of the corresponding layer. In Figure 10, a _i and c _i are the model parameters of the ODE SA module, and r is the hyperparameter of the SA model, which is usually a fixed value. For example, the output of the SA network can be expressed as:

K=(XW _K ), Q=(XW _Q ), V=(XW _V ), X is the input sequence.

Second, the fusion network can be seen in Figure 11. In the figure, h ^L represents the word vector sequence containing local information output from the feature extraction network, and h ^G represents the word vector sequence containing global information output from the ODE SA network. The fusion network uses the attention mechanism (attention) and gate. The (gate) mechanism fuses the two to obtain a word vector sequence that combines local information and global information, namely h ^O . The specific fusion method may include performing a Euclidean Attention operation on h ^L and h ^G , specifically, calculating the similarity between each word vector in h ^L and h ^G according to the Euclidean distance , the similarity matrix E is obtained, as expressed as

After that, matrix multiplication will be performed on E and h ^L , such as h ^E =Eh ^L , to obtain a new word vector representing h ^E . Finally, h ^L , h ^E and the two parameter matrices are multiplied and the final output h ^O is obtained through layer normalization (Layernorm), such as h ^O =layer_norm(W ^L *h ^L +W ^E *h ^E ) .

The classification network may be used to identify the type of sentence of the input vector, or the classification of the nouns included in the sentence of the input vector. For example, if the text corresponding to the input sequence is "Your mobile phone is very good", the category corresponding to the text is identified as "mobile phone".

Example 2. Translation Network

The self-attention model may also include a translation network, the output sequence is used as the input of the translation network, and the language corresponding to the input sequence is output in different languages, or the corpus obtained after translation is referred to as the second corpus here. The language of the first corpus and the language of the second corpus are different.

For example, after the input sequence of the first corpus passes through the SA network, the SA network is used to analyze the meaning of each word in the first corpus, so as to obtain an output sequence that can represent the semantics of the first corpus. The output sequence is used as the input of the translation network, and the translation result corresponding to the first corpus is obtained. For example, if the text corresponding to the input sequence is "Your cell phone is very good", the SA network can analyze the semantics of each word in the text in the text, and then translate the corresponding English "Your cell phone" through the translation network. is very nice”.

The structure of the self-attention model including the translation network is similar to the structure shown in the aforementioned Figures 9-11, the difference is only in the structure of the translation network.

The classification network or translation network mentioned in this application can be selected from deep convolutional neural networks (DCNN), recurrent neural networks (RNNS), and so on. The neural network mentioned in this application may include various types, such as deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN) or residual network other neural networks, etc.

Therefore, in the embodiment of the present application, the ODE-based self-attention model can be applied to a variety of scenarios, complete a variety of downstream tasks, has strong generalization ability, and can adapt to more scenarios.

In addition, when training the self-attention model provided by this application, algorithms such as BP or adjoint ODE can be used to update. For example, taking the use of the adjoint ODE algorithm to update the self-attention model as an example, after the self-attention model completes a forward inference to obtain the inference result, the adjoint ODE algorithm can be used to obtain the gradient numerically, that is, the gradient can be directly calculated. There is no need for back-propagation, which greatly reduces memory consumption and gradient errors.

The structure of the self-attention model provided by the present application has been introduced in the foregoing, and the natural language processing method provided by the present application will be described in detail below based on the self-attention model provided by the foregoing FIGS. 4A to 11 .

Referring to FIG. 12 , a schematic flowchart of a natural language processing method provided by the present application is as follows.

1201. Obtain an input sequence.

Among them, the input sequence includes the initial vector representation corresponding to each word in the corpus.

For example, after a sentence of corpus (or a piece of text) is obtained, each word in the corpus is converted into an initial vector representation according to a preset mapping relationship, each word corresponds to an initial vector representation, one or more initial representations Vectors form an input sequence. Specifically, for example, to obtain a piece of corpus to be processed "how is the weather today", a mapping table can be set up in advance, and each word is set in the mapping table to correspond to a vector, such as "now" corresponds to a vector x ₁ , and "day" corresponds to a vector x ₂ etc.

1202. Use the input sequence as the input of the self-attention model to obtain the output sequence.

The output sequence includes a vector representation corresponding to at least one word in the first corpus after natural language processing by the self-attention model, and each vector in the output sequence indicates that each word in the first corpus is in the first corpus. The semantic information in the first corpus is combined with the context information of the first corpus, which can accurately represent the exact meaning of each word in the first corpus.

Generally, the self-attention model includes a multi-layer network, the input of any layer of the multi-layer network is the output of the previous layer network, each layer of the network includes multiple self-attention modules, and each self-attention module is used for Calculate the degree of association representing the degree of association between each word in the first corpus and at least one adjacent word based on the input vector, and fuse the degree of association and the input vector to obtain the first sequence and fuse multiple self-attentions The output of each layer of network can be obtained from the first sequence output by the module and the first sequence output by multiple self-attention modules in the previous layer of network.

More specifically, for the self-attention model mentioned in this step, reference may be made to the relevant descriptions in FIG. 4A to FIG. 11 , which will not be repeated here.

Specifically, if the SA network is included in the self-attention model, but the feature extraction network and fusion network are not included, after the input sequence is obtained, the input sequence can be directly used as the input of the SA network to obtain the output sequence. For example, the self-attention model can be the self-attention model shown in the aforementioned Figures 5-6B. After the input sequence is obtained, the input sequence can be input into the SA network, and then the SA network outputs the corresponding output sequence.

If the self-attention model includes not only the SA network, but also the feature extraction network and the fusion network, as shown in Figure 7, the input sequence can be used as the input of the feature extraction network to extract the local feature sequence of the input sequence, and then the The local feature sequence is used as the input of the SA network, and the output result of the SA network and the local feature sequence are input into the fusion network, and the fusion network fuses the output result of the SA network and the local feature sequence to obtain the final output sequence. In addition, as shown in Figure 8 above, the local feature sequence extracted by the feature extraction network can also not be used as the input of the SA network, but the input sequence can be directly used as the input of the SA network, and the fusion network is used to fuse the output of the SA network and the local feature sequence to obtain the final output sequence.

In a possible scenario, the parameters of each SA module in each layer of the aforementioned SA network are the same, or the parameters of each SA module in the entire SA network are the same. Therefore, in the embodiments of the present application, the self-attention model can be implemented through the same parameters of the SA module, so that the self-attention model occupies a reduced amount of storage, and the efficiency of training and forward inference of the self-attention model can be improved. By introducing ODE into the SA mechanism, the self-attention model is realized by means of ODE, which solves the parameter redundancy problem caused by the multi-layer overlapping in some SA models. The original multi-layer can be achieved by using the parameters of one layer of network. effect of the network.

In addition, before step 1201, the self-attention model can also be trained by using a training set. The training set can include at least one corpus, each corpus includes at least one word and a corresponding label, and each label can include the corresponding corpus. A vector representing the semantics of each word in it. Specifically, for example, the corpus in the training set can be used as the input of the self-attention model, the inference result of the self-attention model can be obtained, and then the gradient value can be calculated using the adjoint ODE algorithm, and the parameters of the self-attention model can be updated based on the gradient, so that the The output of the self-attention model is closer to the label corresponding to the corpus. Therefore, in the embodiment of the present application, the numerical solution method is used to obtain the gradient, which greatly reduces memory consumption and gradient error compared with the backpropagation algorithm.

The self-attention model and the natural language processing method provided by the present application are described in detail above, and the apparatus for carrying the self-attention model or for executing the foregoing natural language processing method is described in detail below.

Referring to FIG. 13, the present application provides a natural language processing apparatus, including:

an acquisition module 1301, configured to acquire an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in the first corpus;

The processing module 1302 is used to take the input sequence as the input of the self-attention model, and obtain the output sequence, and the output sequence includes the vector representation corresponding to at least one word in the first corpus after the natural language processing NLP by the self-attention model, The output sequence represents the semantic information of each word in the first corpus in the first corpus;

Among them, the self-attention model includes a multi-layer network, the input of any layer of the multi-layer network is the output of the previous layer of network, each layer of the network includes multiple self-attention modules, and each self-attention module is used based on The input vector calculates the correlation degree, and fuses the correlation degree and the input vector to obtain the first sequence. The output of each layer of the network is the first sequence output by the fusion of multiple self-attention modules and multiple self-attentions in the previous layer of network. The first sequence output by the force module is obtained, and the degree of association represents the degree of association between each word in the first corpus and at least one adjacent word.

In a possible implementation, the self-attention model further includes a feature extraction network, which is used to extract features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, where the local feature sequence is used as a multi-dimensional feature sequence. input to the layer network.

In a possible implementation, the self-attention model further includes a fusion network, and the fusion network is used to fuse the local feature sequence and the output result of the multi-layer network to obtain the output sequence.

In a possible implementation, the fusion network is specifically used to: calculate the similarity between the local feature sequence and the output result, and fuse the similarity and the local feature sequence to obtain the output sequence.

In a possible implementation, the self-attention model further includes a classification network, the input of the classification network is an output sequence, and the category corresponding to the first corpus is output.

In a possible implementation, the self-attention model further includes a translation network, the language type of the first corpus is the first type, the input of the translation network is the output sequence, and outputs the second corpus, and the language type of the second corpus is the first type The second category, the first category and the second category are different language types.

In a possible implementation, the parameters of multiple self-attention modules in each layer of the multi-layer network are the same.

In addition, for the self-attention model mentioned above, reference may be made to the relevant descriptions in the aforementioned FIG. 4A to FIG. 12 , and details are not repeated here.

Please refer to FIG. 14 , which is a schematic structural diagram of another natural language processing apparatus provided by the present application, as described below.

The natural language processing apparatus may include a processor 1401 and a memory 1402 . The processor 1401 and the memory 1402 are interconnected by wires. Among them, the memory 1402 stores program instructions and data.

The memory 1402 stores program instructions and data corresponding to the steps in the foregoing FIGS. 4A to 12 .

The processor 1401 is configured to execute the method steps executed by the natural language processing apparatus shown in any of the foregoing embodiments in FIG. 4A to FIG. 12 .

Optionally, the natural language processing apparatus may further include a transceiver 1403 for receiving or transmitting data.

Embodiments of the present application also provide a computer-readable storage medium, where a program for generating a vehicle's running speed is stored in the computer-readable storage medium, and when the computer is running on a computer, the computer is made to execute the above-mentioned FIG. 4A-FIG. 12 The illustrated embodiment describes the steps in the method.

Optionally, the aforementioned natural language processing apparatus shown in FIG. 14 is a chip.

The embodiments of the present application also provide a natural language processing device, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and the program instructions are processed. The unit is executed, and the processing unit is configured to execute the method steps executed by the natural language processing apparatus shown in any of the foregoing embodiments in FIG. 4A to FIG. 12 .

The embodiments of the present application also provide a digital processing chip. The digital processing chip integrates circuits and one or more interfaces for realizing the above-mentioned processor 1401 or the functions of the processor 1401 . When a memory is integrated in the digital processing chip, the digital processing chip can perform the method steps of any one or more of the foregoing embodiments. When the digital processing chip does not integrate the memory, it can be connected with the external memory through the communication interface. The digital processing chip implements the actions performed by the natural language processing apparatus in the above embodiment according to the program codes stored in the external memory.

Embodiments of the present application also provide a computer program product that, when driving on a computer, causes the computer to execute the steps performed by the natural language processing apparatus in the method described in the embodiments shown in the foregoing FIGS. 4A-12 .

The natural language processing apparatus provided in this embodiment of the present application may be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit, etc. . The processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the server executes the natural language processing method described in the embodiments shown in FIG. 4A to FIG. 12 . Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc., and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as only Read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (RAM), etc.

Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (neural-network processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), a digital signal processing digital signal processor (DSP), application specific integrated circuit (ASIC) or field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or it may be any conventional processor or the like.

Exemplarily, please refer to FIG. 15. FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the application. The chip can be represented as a neural network processor NPU 150, and the NPU 150 is mounted on the main CPU ( Host CPU), the task is allocated by the Host CPU. The core part of the NPU is the arithmetic circuit 1503, which is controlled by the controller 1504 to extract the matrix data in the memory and perform multiplication operations.

In some implementations, the arithmetic circuit 1503 includes multiple processing units (process engines, PEs). In some implementations, the arithmetic circuit 1503 is a two-dimensional systolic array. The arithmetic circuit 1503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1503 is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of matrix A and matrix B from the input memory 1501 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 1508 .

Unified memory 1506 is used to store input data and output data. The weight data is directly passed through the storage unit access controller (direct memory access controller, DMAC) 1505, and the DMAC is transferred to the weight memory 1502. Input data is also moved into unified memory 1506 via the DMAC.

A bus interface unit (BIU) 1510 is used for the interaction between the AXI bus and the DMAC and an instruction fetch buffer (instruction fetch buffer, IFB) 1509.

The bus interface unit 1510 (bus interface unit, BIU) is used for the instruction fetch memory 1509 to obtain instructions from the external memory, and is also used for the storage unit access controller 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1506 or the weight data to the weight memory 1502 or the input data to the input memory 1501 .

The vector calculation unit 1507 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc., if necessary. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.

In some implementations, the vector computation unit 1507 can store the vector of processed outputs to the unified memory 1506 . For example, the vector calculation unit 1507 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1503, such as linear interpolation of the feature plane extracted by the convolutional layer, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 1507 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as activation input to the arithmetic circuit 1503, such as for use in subsequent layers in a neural network.

The instruction fetch buffer (instruction fetch buffer) 1509 connected to the controller 1504 is used to store the instructions used by the controller 1504;

The unified memory 1506, the input memory 1501, the weight memory 1502 and the instruction fetch memory 1509 are all On-Chip memories. External memory is private to the NPU hardware architecture.

The operation of each layer in the recurrent neural network can be performed by the operation circuit 1503 or the vector calculation unit 1507 .

Wherein, the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the above-mentioned methods of FIGS. 4A-12 .

In addition, it should be noted that the device embodiments described above are only schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be A physical unit, which can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. In addition, in the drawings of the device embodiments provided in the present application, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus necessary general-purpose hardware. Special components, etc. to achieve. Under normal circumstances, all functions completed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also be various, such as analog circuits, digital circuits or special circuit, etc. However, a software program implementation is a better implementation in many cases for this application. Based on this understanding, the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that make contributions to the prior art. The computer software products are stored in a readable storage medium, such as a floppy disk of a computer. , U disk, mobile hard disk, read only memory (ROM), random access memory (RAM), disk or CD, etc., including several instructions to make a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present application.

In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be downloaded from a website site, computer, server, or data center Transmission to another website site, computer, server, or data center is by wire (eg, coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a server, data center, etc., which includes one or more available media integrated. The usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.

The terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that data so used may be interchanged under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof, are intended to cover non-exclusive inclusion, for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those expressly listed Rather, those steps or units may include other steps or units not expressly listed or inherent to these processes, methods, products or devices.

Finally, it should be noted that: the above are only specific embodiments of the present application, but the protection scope of the present application is not limited to this. Any person skilled in the art who is familiar with the technical scope disclosed by the present application can easily think of changes. Or replacement should be covered within the protection scope of this application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

A method for natural language processing, comprising:

obtaining an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in the first corpus;

The input sequence is used as the input of the self-attention model, and the output sequence is obtained, and the output sequence includes the vector corresponding to at least one word in the first corpus after the natural language processing NLP is performed by the self-attention model means that the output sequence represents the semantic information of each word in the first corpus in the first corpus;

Wherein, the self-attention model includes a multi-layer network, the input of any layer of the network in the multi-layer network is the output of the previous layer of network, each layer of network includes a plurality of self-attention modules, each self-attention module The module is used to calculate the correlation degree based on the input vector, and fuse the correlation degree and the input vector to obtain the first sequence, and the output of each layer of the network is the first sequence and the output of the fusion of the multiple self-attention modules. The first sequence output by multiple self-attention modules in the upper layer network is obtained, and the degree of association represents the degree of association between each word in the first corpus and at least one adjacent word.
The method according to claim 1, wherein the self-attention model further comprises a feature extraction network, and the method further comprises:

The feature extraction network extracts features according to multiple adjacent initial vector representations in the input sequence, obtains a local feature sequence, and uses the local feature sequence as the input of the multi-layer network.
The method of claim 2, wherein the self-attention model further comprises a fusion network, the method further comprising:

The fusion network fuses the local feature sequence and the output result of the multi-layer network to obtain the output sequence.
The method according to claim 3, wherein the fusion network fuses the local feature sequence and the output result of the multi-layer network, comprising:

The fusion network calculates the similarity between the local feature sequence and the output result, and fuses the similarity and the local feature sequence to obtain the output sequence.
The method according to any one of claims 1-4, wherein the self-attention model further comprises a classification network, and the method further comprises:

The output sequence is used as the input of the classification network, and the classification network outputs the category corresponding to the first corpus.
The method according to any one of claims 1-5, wherein the self-attention model further comprises a translation network, and the method further comprises:

The output sequence is used as the input of the translation network, and a second corpus is output. The language type of the first corpus is the first type, the language type of the second corpus is the second type, and the first type and the The second category is a different language category.
The method according to any one of claims 1-6, wherein the parameters of multiple self-attention modules in each layer of the multi-layer network are the same.
A natural language processing device, comprising:

an acquisition module, configured to acquire an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in the first corpus;

A processing module, configured to use the input sequence as the input of the self-attention model to obtain an output sequence, the output sequence includes at least one of the first corpus after the natural language processing NLP is performed by the self-attention model. A vector representation corresponding to a word, and the output sequence represents the semantic information of each word in the first corpus in the first corpus;

Wherein, the self-attention model includes a multi-layer network, the input of any layer of the network in the multi-layer network is the output of the previous layer of network, each layer of network includes a plurality of self-attention modules, each self-attention module The module is used to calculate the correlation degree based on the input vector, and fuse the correlation degree and the input vector to obtain the first sequence, and the output of each layer of the network is the first sequence and the output of the fusion of the multiple self-attention modules. The first sequence output by multiple self-attention modules in the upper layer network is obtained, and the degree of association represents the degree of association between each word in the first corpus and at least one adjacent word.
The device according to claim 8, wherein the self-attention model further comprises a feature extraction network, which is used for extracting features according to multiple adjacent initial vector representations in the input sequence to obtain a local feature sequence, and use the local feature sequence as the input of the multi-layer network.
The apparatus according to claim 9, wherein the self-attention model further comprises a fusion network, and the fusion network is used to fuse the local feature sequence and the output result of the multi-layer network to obtain the output sequence.
The device according to claim 10, wherein the fusion network is specifically configured to: calculate the similarity between the local feature sequence and the output result, and fuse the similarity and the local feature sequence, to obtain the output sequence.
The device according to any one of claims 8-11, wherein the self-attention model further comprises a classification network, the input of the classification network is the output sequence, and the output sequence corresponding to the first corpus is output. category.
The device according to any one of claims 8-12, characterized in that,

The self-attention model further includes a translation network, the language type of the first corpus is the first type, the input of the translation network is the output sequence, and the output of the second corpus, the language type of the second corpus is The second category, the first category and the second category are different language types.
The apparatus according to any one of claims 8-13, wherein the parameters of the multiple self-attention modules in each layer of the multi-layer network are the same.
A natural language processing device, characterized in that it comprises a processor, the processor is coupled with a memory, the memory stores a program, and when the program instructions stored in the memory are executed by the processor, claims 1 to 1 are implemented. The method of any one of 7.
A computer-readable storage medium comprising a program which, when executed by a processing unit, performs the method of any one of claims 1 to 7.
A natural language processing device, characterized in that it includes a processing unit and a communication interface, the processing unit obtains a program instruction through the communication interface, and when the program instruction is executed by the processing unit, the program instructions in claims 1 to 7 are implemented The method of any one.