WO2022156561A1 - Method and device for natural language processing


Info

Publication number
WO2022156561A1
Authority
WO
WIPO (PCT)
Prior art keywords: network, sequence, self, output, input
Prior art date
Application number
PCT/CN2022/071285
Other languages: French (fr), Chinese (zh)
Inventor: 张鹏, 张静, 魏俊秋
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022156561A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a natural language processing method and device.
  • Self-attention (SA): the self-attention model effectively encodes a sequence of words into several vector representations by calculating the dependencies between words, so that each output word vector representation contains its contextual semantic information. How to use the self-attention model to interpret better semantic information for each word and convert the word into a better vector representation has therefore become an urgent problem to be solved.
  • The present application provides a natural language processing method and device, in which, during natural language processing, each network layer (except the first) of the self-attention model reuses the hidden states of the previous layer, so that the better semantic information of each word within its corpus can be interpreted more efficiently.
  • In a first aspect, the present application provides a natural language processing method, including: first acquiring an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in a first corpus (usually each word corresponds to one initial vector representation); and then using the input sequence as the input of a self-attention model to obtain an output sequence, where the output sequence includes the vector representation corresponding to the at least one word in the first corpus after natural language processing (NLP) by the self-attention model.
  • The output sequence represents the semantic information of each word of the first corpus within the first corpus; because this semantic information incorporates the context of the first corpus, it can accurately represent the exact meaning of each word in the first corpus.
  • The self-attention model includes a multi-layer network, and the input of any layer other than the first is the output of the previous layer. Each layer includes multiple self-attention modules; each self-attention module is used to calculate, based on the input vectors, a degree of association, that is, the degree of association between each word in the first corpus and at least one adjacent word, and to fuse the degree of association with the input vectors to obtain a first sequence. The output of each layer of the network is obtained by fusing the first sequences output by its own self-attention modules with the first sequences output by the self-attention modules in the previous layer.
  • Therefore, in the embodiments of the present application, the self-attention model is realized through the mechanism of neural ordinary differential equation (ODE) networks: a single layer of self-attention modules (that is, multiple calls to and summations of the self-attention module within the model) performs the neural-ODE fitting of multi-layer self-attention, and an output sequence that expresses semantics more accurately is obtained.
  • Moreover, a state reuse mechanism is introduced into the ordinary differential equation: during the fitting of each layer (except the first layer), the hidden state information inside the previous layer is reused, which speeds up the calculation. That is, the current network layer can reuse the outputs of the SA modules of the previous network layer, so that the self-attention model can quickly obtain more accurate output results, the computational complexity of the model is reduced, and the training and inference efficiency of the self-attention model is improved.
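As an illustration of this layer structure, the following is a minimal PyTorch-style sketch, not the patent's implementation: it assumes one shared SA module called several times per layer (the ODE-style weighted summation), learned connection weights, and reuse of the previous layer's per-call hidden states; the class and argument names (ODESALayer, prev_states, the residual connection) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ODESALayer(nn.Module):
    """One ODE-style layer: several calls to a shared self-attention (SA) module
    are combined with learned weights, and the SA outputs (hidden states) of the
    previous layer are reused when forming the layer output."""
    def __init__(self, sa_module: nn.Module, num_calls: int = 3):
        super().__init__()
        self.sa = sa_module                                       # parameters shared across calls/layers
        self.call_weights = nn.Parameter(torch.ones(num_calls))   # weights a_1, a_2, ... on current-layer states
        self.reuse_weights = nn.Parameter(torch.ones(num_calls))  # weights on reused previous-layer states

    def forward(self, x, prev_states=None):
        states, h = [], x
        for _ in range(len(self.call_weights)):
            h = self.sa(h)                     # one SA evaluation; its output is a hidden state
            states.append(h)
        out = x + sum(w * s for w, s in zip(self.call_weights, states))
        if prev_states is not None:            # state reuse: fuse in the previous layer's hidden states
            out = out + sum(w * s for w, s in zip(self.reuse_weights, prev_states))
        return out, states                     # states are handed to the next layer for reuse
```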
  • In a possible implementation, the self-attention model further includes a feature extraction network, and the above method may further include: the feature extraction network extracts features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, and the local feature sequence is used as the input of the multi-layer network.
  • Therefore, in the embodiments of the present application, local features can be extracted from the input sequence and used as the input of the multi-layer network, so that the multi-layer network can take the local information into account when calculating the degree of association; this increases the attention paid to local information and improves the output sequence's interpretation of it.
  • In a possible implementation, the self-attention model further includes a fusion network, and the above method may further include: the fusion network fuses the local feature sequence and the output result of the multi-layer network to obtain the output sequence.
  • Therefore, in the embodiments of the present application, local features can be extracted from the input sequence and fused with the output result of the SA network to obtain the output sequence, so that the local features are incorporated into the output sequence; the final self-attention result thus attends to local features, and a sequence that more accurately represents the semantics of each word is obtained.
  • In a possible implementation, the fusion network fusing the local feature sequence and the output result of the multi-layer network may specifically include: the fusion network calculates the similarity between the local feature sequence and the output result, and fuses the similarity with the local feature sequence to obtain the output sequence.
  • Therefore, in the embodiments of the present application, the similarity can be calculated and then fused with the local features to obtain the output sequence, so that the output sequence contains local information and the semantics it represents are more accurate.
  • In a possible implementation, the self-attention model further includes a classification network, and the above method may further include: using the output sequence as the input of the classification network, and outputting the category corresponding to the first corpus. Therefore, the method provided by the embodiments of the present application can be applied to classification scenarios: the corpus can be classified by adding a classification network to the self-attention model.
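As a purely illustrative sketch (the patent does not specify the classification network's internal structure), such a classification network could pool the output sequence and map it to class logits; the pooling choice and the names below are assumptions:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Hypothetical classification network: pools the output sequence of the
    self-attention model and maps it to corpus-level class logits."""
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, output_sequence: torch.Tensor):   # (seq_len, d_model)
        pooled = output_sequence.mean(dim=0)             # simple mean pooling over the words
        return self.fc(pooled)                           # logits for the category of the first corpus
```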
  • In a possible implementation, the self-attention model further includes a translation network, and the above method may further include: using the output sequence as the input of the translation network and outputting a second corpus, where the language type of the first corpus is a first type, the language type of the second corpus is a second type, and the first type and the second type are different language types. Therefore, the method provided by the embodiments of the present application can also be applied to translation scenarios: adding a translation network to the self-attention model makes it possible to translate the corpus corresponding to the input sequence into a corpus in a different language.
  • In a possible implementation, the parameters of the multiple self-attention modules in each layer of the multi-layer network are the same. Therefore, in the embodiments of the present application, the self-attention model can be implemented with SA modules that share parameters, which reduces the storage occupied by the self-attention model and improves the efficiency of its training and forward inference.
  • In a possible implementation, a training set may also be used to train the self-attention model. The training set may include at least one corpus, each corpus includes at least one word, and each corpus is converted into a sequence that includes an initial vector representation for each of its words; this sequence is then fed into the self-attention model to obtain an output sequence.
  • Then, the gradient values can be calculated by the adjoint ODE algorithm, and the parameters of the self-attention model can be updated based on the gradients, so that the output of the self-attention model becomes closer to the label corresponding to the corpus. Therefore, in the embodiments of the present application, a numerical solution method is used to obtain the gradients, which greatly reduces memory consumption and gradient error compared with the backpropagation algorithm.
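A hedged sketch of one such training step follows. It uses a generic supervised setup with plain autograd as a stand-in; in the patent's scheme the gradients of the ODE-fitted layers would instead be obtained with the adjoint ODE method (solving a backward ODE, which avoids storing all intermediate activations). The loss and optimizer choices are illustrative assumptions.

```python
import torch.nn as nn

def train_step(model, optimizer, input_sequence, label, loss_fn=nn.CrossEntropyLoss()):
    """One training step on a single corpus from the training set."""
    optimizer.zero_grad()
    output = model(input_sequence)   # forward pass through the self-attention model
    loss = loss_fn(output, label)    # compare the prediction with the corpus label
    loss.backward()                  # placeholder: an adjoint ODE solver would compute these gradients
    optimizer.step()                 # update the self-attention model's parameters
    return loss.item()
```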
  • The present application further provides a self-attention model. The self-attention model includes a multi-layer network, and the input of any layer of the multi-layer network other than the first is the output of the previous layer. Each layer of the network includes multiple self-attention modules; each self-attention module is used to calculate a degree of association based on the input vectors and to fuse the degree of association with the input vectors to obtain a first sequence, where the degree of association represents the degree of association between each word in the first corpus and at least one adjacent word. The output of each layer of the network is obtained by fusing the first sequences output by its own self-attention modules with the first sequences output by the self-attention modules in the previous layer.
  • The input sequence is used as the input of the self-attention model, and an output sequence is obtained. The output sequence includes the vector representation corresponding to at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents the semantic information of each word of the first corpus within the first corpus.
  • Therefore, in the embodiments of the present application, the self-attention model is realized through the mechanism of neural ordinary differential equation (ODE) networks: a single layer of self-attention modules (that is, multiple calls to and summations of the self-attention module within the model) performs the neural-ODE fitting of multi-layer self-attention, and an output sequence that expresses semantics more accurately is obtained.
  • Moreover, a state reuse mechanism is introduced into the ordinary differential equation: during the fitting of each layer (except the first layer), the hidden state information inside the previous layer is reused, which speeds up the calculation. That is, the current network layer can reuse the outputs of the SA modules of the previous network layer, so that the self-attention model can quickly obtain more accurate output results, the computational complexity of the model is reduced, and the training and inference efficiency of the self-attention model is improved.
  • In a possible implementation, the self-attention model further includes a feature extraction network, which is used to extract features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, and to use the local feature sequence as the input of the multi-layer network.
  • Therefore, local features can be extracted from the input sequence and used as the input of the multi-layer network, so that the multi-layer network can take the local information into account when calculating the degree of association; this increases the attention paid to local information and improves the output sequence's interpretation of it.
  • the self-attention model further includes a fusion network for fusing the local feature sequence and the output result of the multi-layer network to obtain an output sequence.
  • Therefore, local features can be extracted from the input sequence and fused with the output result of the SA network to obtain the output sequence, so that the local features are incorporated into the output sequence; the final self-attention result thus attends to local features, and a sequence that more accurately represents the semantics of each word is obtained.
  • the fusion network is specifically used to calculate the similarity between the local feature sequence and the output result, and fuse the similarity and the local feature sequence to obtain the output sequence.
  • Therefore, the similarity can be calculated and then fused with the local features to obtain the output sequence, so that the output sequence contains local information and the semantics it represents are more accurate.
  • In a possible implementation, the self-attention model further includes a classification network; the input of the classification network is the output sequence, and the classification network outputs the category corresponding to the first corpus. Therefore, the model provided by the embodiments of the present application can be applied to classification scenarios: the corpus can be classified by adding a classification network to the self-attention model.
  • In a possible implementation, the self-attention model further includes a translation network; the input of the translation network is the output sequence, and the translation network outputs a second corpus. The language type of the first corpus is a first type, the language type of the second corpus is a second type, and the first type and the second type are different language types. Therefore, the model provided by the embodiments of the present application can also be applied to translation scenarios: adding a translation network to the self-attention model makes it possible to translate the corpus corresponding to the input sequence into a corpus in a different language.
  • In a possible implementation, the parameters of the multiple self-attention modules in each layer of the multi-layer network are the same. Therefore, in the embodiments of the present application, the self-attention model can be implemented with SA modules that share parameters, which reduces the storage occupied by the self-attention model and improves the efficiency of its training and forward inference.
  • In addition, an embodiment of the present application provides a natural language processing apparatus that has the function of implementing the natural language processing method of the first aspect. This function can be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above function.
  • An embodiment of the present application provides a natural language processing apparatus, including a processor and a memory, where the processor and the memory are interconnected through a line, and the processor invokes the program code in the memory to execute the processing-related functions of the natural language processing method of any one of the implementations of the first aspect. Optionally, the apparatus may be a chip.
  • An embodiment of the present application provides a natural language processing device, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, and the instructions are executed by the processing unit, which is configured to perform the processing-related functions of the first aspect or of any optional implementation of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.
  • an embodiment of the present application provides a computer program product including instructions, which, when run on a computer, enables the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.
  • FIG. 1 is a schematic diagram of the main framework of artificial intelligence to which the present application applies;
  • FIG. 2 is a schematic diagram of a system architecture provided by the present application;
  • FIG. 3 is a schematic diagram of another system architecture provided by the present application;
  • FIG. 4A is a schematic structural diagram of a self-attention model provided by an embodiment of the present application;
  • FIG. 4B is a schematic diagram of sequence conversion provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
  • FIG. 6A is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
  • FIG. 6B is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
  • FIG. 10 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a fusion network provided by an embodiment of the present application;
  • FIG. 12 is a schematic flowchart of a natural language processing method provided by an embodiment of the present application;
  • FIG. 13 is a schematic structural diagram of a natural language processing apparatus provided by an embodiment of the present application;
  • FIG. 14 is a schematic structural diagram of another natural language processing apparatus provided by an embodiment of the present application;
  • FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Figure 1 shows a schematic structural diagram of the main frame of artificial intelligence.
  • The above artificial intelligence framework is described along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside is performed through sensors. Computing power is provided by intelligent chips, that is, hardware acceleration chips such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), application-specific integrated circuits (ASIC), or field programmable gate arrays (FPGA). The basic platform includes a distributed computing framework, networking, and related platform guarantees and support, and can include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, and the data is provided to the intelligent chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure indicates the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • After the above data processing, some general capabilities can be formed based on the results, for example an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
  • Intelligent products and industry applications are the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution and productize intelligent information decision-making to achieve practical applications. The application fields mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, safe cities, and the like.
  • The embodiments of this application involve applications related to neural networks and natural language processing (NLP). To facilitate understanding, the related concepts are first introduced below.
  • Corpus: also known as free text, it can be a word, a sentence, a fragment, an article, or any combination thereof. For example, "The weather is really nice today" is a corpus.
  • Self-attention model: effectively encodes a sequence of data (such as the natural-language corpus "Your mobile phone is very good.") into several multi-dimensional vectors that are convenient for numerical operations; the similarity information between the elements of the sequence is called self-attention.
  • Neural ordinary differential equation networks (ODENet): an ODENet can fit the output of a given neural network at each continuous time point or at each step/layer, using a single set of parameters to fit the outputs of the original neural network at multiple continuous time points or multiple steps/layers, which gives it high parameter efficiency.
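As a simple illustration of this idea (not from the patent), the following sketch integrates a neural ODE with explicit Euler steps: one parameter set f is evaluated repeatedly, playing the role of many stacked layers; the step count and step size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

def odenet_forward(f: nn.Module, h0: torch.Tensor, num_steps: int = 4, dt: float = 0.25):
    """Illustrative Euler integration of dh/dt = f(h): the same parameters f are
    reused at every 'time point' instead of stacking separately parameterized layers."""
    h = h0
    for _ in range(num_steps):
        h = h + dt * f(h)   # one explicit Euler step with shared parameters
    return h
```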
  • Loss function: also known as cost function, a measure of the difference between the predicted output of a machine learning model for a sample and the true value (also known as the supervised value) of that sample.
  • Commonly used loss functions include, for example, the mean squared error, cross entropy, logarithmic, and exponential loss functions. For example, the mean squared error can be used as a loss function, defined as MSE = (1/n) * sum_{i=1..n} (y_i - y'_i)^2, where y_i is the supervised value and y'_i is the model's prediction. A specific loss function can be selected according to the actual application scenario.
  • Stochastic gradient: the number of samples in machine learning is large, so each evaluation of the loss function is calculated on randomly sampled data, and the corresponding gradient is called the stochastic gradient.
  • Backpropagation (BP): a common algorithm for training neural network parameters by propagating the gradient of the loss backward through the network.
  • Ordinary differential equation (ODE) adjoint solution: a reverse-update algorithm for training ordinary differential equation networks, which greatly reduces memory consumption and gradient error.
  • Neural machine translation: a typical natural language processing task whose goal is to output a sentence in a target language given a sentence in a source language. In a common neural machine translation model, the words of the source-language and target-language sentences are encoded into vector representations, and the associations between words and between sentences are computed in the vector space to perform the translation task.
  • Pre-trained language model (PLM): a natural language sequence encoder that encodes each word in a natural language sequence into a vector representation for prediction tasks. The training of a PLM consists of two stages, the pre-training stage and the fine-tuning stage. In the pre-training stage, the model is trained on a language-model task over large-scale unsupervised text, thereby learning word representations. In the fine-tuning stage, the model is initialized with the parameters learned in the pre-training stage and can be trained successfully on downstream tasks such as text classification or sequence labeling with relatively few steps, so that the semantic information obtained by pre-training is transferred to the downstream tasks.
  • Embedding refers to the feature representation of the sample.
  • the natural language processing method provided by the embodiments of the present application may be executed on a server, and may also be executed on a terminal device.
  • The terminal device can be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a video camera, a smart watch, a wearable device (WD), an autonomous vehicle, or the like, which is not limited in the embodiments of the present application.
  • an embodiment of the present application provides a system architecture 200 .
  • the system architecture includes a database 230 and a client device 240 .
  • the data collection device 260 is used to collect data and store it in the database 230 , and the training module 202 generates the target model/rule 201 based on the data maintained in the database 230 .
  • the following will describe in more detail how the training module 202 obtains the target model/rule 201 based on the data.
  • The target model/rule 201 is the neural network mentioned in the following embodiments of the present application; for details, refer to the relevant descriptions of FIGS. 4A-12 below.
  • the computing module may include a training module 202, and the target model/rule obtained by the training module 202 may be applied to different systems or devices.
  • The execution device 210 is configured with a transceiver 212, which can be a wireless transceiver, an optical transceiver, a wired interface (such as an I/O interface), or the like, to exchange data with external devices; a "user" can input data to the transceiver 212 through the client device 240.
  • client device 240 may send target tasks to execution device 210, request the execution device to train a neural network, and send execution device 210 a database for training.
  • the execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
  • The calculation module 211 processes the input data using the target model/rule 201. Specifically, the calculation module 211 is configured to: obtain an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in the first corpus; and use the input sequence as the input of the self-attention model to obtain an output sequence, where the output sequence includes the vector representation corresponding to the at least one word in the first corpus after natural language processing (NLP) by the self-attention model and represents the semantic information of each word of the first corpus within the first corpus. The self-attention model is obtained through training on the training set.
  • the training set includes at least one corpus, and each corpus includes at least one word;
  • The self-attention model includes a multi-layer network, and the input of any layer of the multi-layer network other than the first is the output of the previous layer. Each layer includes multiple self-attention modules; each self-attention module is used to calculate, based on the input vectors, a degree of association representing the degree of association between each word in the first corpus and at least one adjacent word, and to fuse the degree of association with the input vectors to obtain a first sequence. The layers of the multi-layer network are connected, and the output of each layer is obtained by fusing the first sequences output by its own self-attention modules with the first sequences output by the self-attention modules in the previous layer.
  • transceiver 212 returns the constructed neural network to client device 240 to deploy the neural network in client device 240 or other devices.
  • the training module 202 can obtain corresponding target models/rules 201 based on different data for different tasks, so as to provide users with better results.
  • the data input into the execution device 210 can be determined according to the input data of the user, for example, the user can operate in the interface provided by the transceiver 212 .
  • the client device 240 can automatically input data to the transceiver 212 and obtain the result. If the client device 240 automatically inputs data and needs to obtain the authorization of the user, the user can set the corresponding permission in the client device 240 .
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 240 can also act as a data collection end to store the collected data associated with the target task into the database 230 .
  • the training or update process mentioned in this application may be performed by the training module 202 .
  • The training process of a neural network is essentially a way of learning to control a spatial transformation, and more specifically, of learning the weight matrices. The purpose of training a neural network is to make its output as close as possible to the expected value, so the predicted value of the current network can be compared with the expected value, and the weight vectors of each layer of the neural network can then be updated according to the difference between the two (of course, the weight vectors are usually initialized before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight values in the weight matrices are adjusted to lower the prediction; after continuous adjustment, the value output by the neural network approaches or equals the expected value.
  • the difference between the predicted value and the expected value of the neural network can be measured by a loss function or an objective function. Taking the loss function as an example, the higher the output value of the loss function (loss), the greater the difference.
  • the training of the neural network can be understood as the process of reducing the loss as much as possible. For the process of updating the weight of the starting point network and training the serial network in the following embodiments of the present application, reference may be made to this process, which will not be repeated below.
  • The target model/rule 201 is obtained by training by the training module 202. In this embodiment of the present application, the target model/rule 201 may be the self-attention model of the present application, and the self-attention model may include networks such as deep convolutional neural networks (DCNN) and recurrent neural networks (RNN). The neural network mentioned in this application may include various types, such as deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), residual networks, or other neural networks.
  • the database 230 may be used to store a sample set for training.
  • the execution device 210 generates a target model/rule 201 for processing samples, and uses the sample set in the database to iteratively train the target model/rule 201 to obtain a mature target model/rule 201.
  • The target model/rule 201 is embodied as a neural network. The neural network obtained by the execution device 210 can be applied in different systems or devices.
  • the execution device 210 may call data, codes, etc. in the data storage system 250 , and may also store data, instructions, etc. in the data storage system 250 .
  • the data storage system 250 may be placed in the execution device 210 , or the data storage system 250 may be an external memory relative to the execution device 210 .
  • the calculation module 211 can process the samples obtained by the execution device 210 through a neural network to obtain a prediction result, and the specific manifestation of the prediction result is related to the function of the neural network.
  • FIG. 2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210 . In other scenarios, the data storage system 250 may also be placed in the execution device 210 .
  • The target model/rule 201 trained by the training module 202 can be applied to different systems or devices, such as mobile phones, tablet computers, laptop computers, augmented reality (AR)/virtual reality (VR) devices, vehicle-mounted terminals, servers, or cloud devices.
  • In this embodiment of the present application, the target model/rule 201 may be the self-attention model of the present application. The self-attention model provided in the embodiments of the present application may include networks such as convolutional neural networks (CNN), deep convolutional neural networks (DCNN), and recurrent neural networks (RNN).
  • an embodiment of the present application further provides a system architecture 300 .
  • The execution device 210 is implemented by one or more servers and, optionally, cooperates with other computing devices, such as data storage devices, routers, and load balancers; the execution device 210 may be arranged at one physical site or distributed across multiple physical sites. The execution device 210 can use the data in the data storage system 250, or call the program code in the data storage system 250, to implement the steps of the method corresponding to FIG. 12 below in this application.
  • a user may operate respective user devices (eg, local device 301 and local device 302 ) to interact with execution device 210 .
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device can interact with the execution device 210 through any communication mechanism/standard communication network, which can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, and the like.
  • The wireless network includes but is not limited to: a fifth-generation mobile communication technology (5G) system, a long term evolution (LTE) system, a global system for mobile communication (GSM), a code division multiple access (CDMA) network, a wideband code division multiple access (WCDMA) network, wireless fidelity (WiFi), Bluetooth, Zigbee, radio frequency identification (RFID), long range (Lora) wireless communication, and near field communication (NFC), or any combination thereof.
  • the wired network may include an optical fiber communication network or a network composed of coaxial cables, and the like.
  • one or more aspects of the execution device 210 may be implemented by each local device, for example, the local device 301 may provide the execution device 210 with local data or feedback calculation results.
  • the local device may also be referred to as a computing device.
  • the local device 301 implements the functions of the execution device 210 and provides services for its own users, or provides services for the users of the local device 302 .
  • In the conventional technology, SA models are usually obtained by stacking multiple SA modules, each of which has its own parameters. SA models learn more abstract semantic information by stacking multiple layers, which often leads to parameter redundancy and can easily reduce the efficiency of training and inference. In addition, although the SA model can capture global information, the fusion of local information is ignored, so the actual meaning represented by the text cannot be learned well.
  • Therefore, this application proposes a self-attention model and a natural language processing method based on the self-attention model, in which, during natural language processing, each network layer of the self-attention model reuses the hidden states of the previous layer, so that the better semantic information of each word within its corpus can be interpreted more efficiently. That is, the contextual semantics represented by each word of the text within the corpus in which it is located can be interpreted more accurately.
  • the self-attention model provided by the present application and the natural language processing method based on the self-attention model will be introduced in detail below.
  • the input of the self-attention model provided by this application is the input sequence, and the output is the output sequence obtained by performing self-attention processing on the input sequence.
  • the input sequence may be a sequence composed of vectors corresponding to a corpus, each word in the corpus corresponds to a vector, and one or more vectors form a sequence.
  • The input sequence is fed into the self-attention model 401, and the self-attention model 401 then interprets the semantics of each word in the corresponding corpus according to the dependencies between the vectors of the input sequence, thereby obtaining the output sequence. The self-attention model can effectively encode words into vector representations by calculating the dependencies between words, so that the output word vector representation contains the semantic information of the word in the corpus and incorporates the context information of the corpus; this makes the semantics represented by the word vector representation (also called the hidden state in deep learning) more accurate.
  • For example, each word in the corpus "Your mobile phone is very good." corresponds to an initial vector, forming the sequence [x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8]; for example, the first word corresponds to the initial vector x_1, the second word corresponds to the initial vector x_2, and so on. After the input sequence is processed by the self-attention model, an output vector corresponding to each input vector is produced, giving the output sequence [h_1 h_2 h_3 h_4 h_5 h_6 h_7 h_8]. Generally, the output sequence and the input sequence have the same length or have a corresponding mapping relationship. The self-attention model can interpret the semantics of each word in the corpus according to the similarity or dependency between each vector and its adjacent vectors, so as to obtain an output sequence representing the semantics of each word in the corpus; that is, the initial vector of each word is further optimized into a better vector representation.
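For illustration only (the vocabulary size, embedding dimension, and word ids below are made up), the initial vector representations can be produced by an embedding lookup, one vector per word:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512
embedding = nn.Embedding(vocab_size, d_model)               # maps word ids to initial vectors

word_ids = torch.tensor([21, 5, 903, 77, 402, 18, 6, 2])    # hypothetical ids for x_1 ... x_8
input_sequence = embedding(word_ids)                        # shape: (8, 512)
# The self-attention model maps this input sequence to an output sequence
# [h_1 ... h_8] of the same length, where each h_i encodes contextual semantics.
```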
  • an ODE-based SA module is introduced into the self-attention model, so that self-attention is calculated by means of ODE.
  • The self-attention model used in this application, into which the ODE-based SA module is introduced, has higher parameter efficiency, which improves the efficiency of training and inference of the self-attention model.
  • Referring to FIG. 5, a schematic structural diagram of a self-attention model provided by the present application is described as follows.
  • the self-attention model may include multiple network layers, each of which includes one or more self-attention modules.
  • the multi-layer network layer is referred to as the SA network.
  • the input of the first layer of the network layer is an input sequence, and the input sequence includes an initial vector representation corresponding to each word in the corpus to be processed (or referred to as the first corpus).
  • the input of each network layer after the first layer is the output of the previous network layer.
  • the last layer of the network layer outputs the output result of the SA network.
  • each layer of the network includes one or more SA modules.
  • the input end of each SA module is connected to one or more SA modules, and the input of each SA module includes the output of one or more SA modules connected to its input end.
  • In addition to being connected to one or more SA modules, the input end of each SA module may also be connected to the output end of the previous network layer or to the input end of the SA network; that is, the input of each SA module may also include the input of the current network layer.
  • If the current network layer is the first layer, the input of the current network layer is the input sequence; if the current network layer is not the first layer, the input of the current network layer is the output of the previous layer.
  • If the input of a certain SA module includes multiple sequences (each sequence including one or more vectors), the multiple sequences can be fused to obtain the input of that SA module. For example, when the input of an SA module includes the outputs of multiple SA modules, those outputs can be fused to obtain the input of the current SA module; or, when the input of an SA module includes the outputs of one or more SA modules and the output of the previous network layer, the outputs of the one or more SA modules are fused with the output of the previous network layer to obtain the input of the current SA module.
  • The structure of the SA network can be exemplarily shown in FIG. 6A, where the SA network includes i+1 network layers. Except for the first layer, each network layer can reuse the hidden states inside the previous network layer, combining the hidden states of the previous layer with the calculation results of the SA modules in the current layer to obtain the output result of the current layer.
  • The right side of FIG. 6A is a schematic structural diagram of one of the network layers: ODE layer i contains several SA modules and the connections between them. The SA modules usually have the same structure and parameters, and each connection carries a weight (a_1, a_2, a_3, ...), so that the output of the layer is obtained by weighting the outputs of the SA modules.
  • the specific structure of the SA network is described by taking the first-layer network layer and the second-layer network layer as examples.
  • the input sequence can be used as the input of each SA module in the first-layer network (ie, the network layer 1 shown in FIG. 6B ).
  • For example, the input of the first SA module in the first network layer may be the input sequence, while the inputs of the other SA modules may include, in addition to the input sequence, the output of the preceding SA module connected to them. When the input of an SA module includes both the input sequence and the output of the preceding SA module connected to it, the two can be fused to obtain a new vector that is used as the input of the current SA module.
  • The output of the first network layer can be obtained by fusing the outputs of all SA modules in the first layer.
  • The output of the first-layer network is used as the input of the second-layer network (that is, network layer 2 shown in FIG. 6B), and the calculation process of each SA module in the second-layer network is similar to that of the SA modules in the first-layer network.
  • In addition, the output of each SA module in the first-layer network is also used as an input of the second-layer network; that is, the hidden states obtained in the first-layer network are reused by the second-layer network. Specifically, the second-layer network fuses the outputs of the SA modules in the first-layer network to obtain a first sequence, and then fuses this first sequence with the outputs of the SA modules in the second-layer network to obtain the output of the second-layer network.
  • Therefore, in the self-attention model provided in this application, not only is the ODE mechanism introduced into the self-attention model, but each network layer can also reuse the hidden states of the previous layer, so that self-attention can be calculated more accurately. This improves the training efficiency of the self-attention model, allows it to converge quickly, and also improves the accuracy of its output.
  • The calculation performed by each SA module is exemplified below. The SA module can be used to calculate, based on the input vectors, the degree of association, that is, the degree of association between each word in the corpus and one or more adjacent words, and then fuse the input vectors with the degree of association to obtain the output result of the SA module.
  • Algorithms for calculating the degree of association may include various algorithms, such as multiplication, transposition and multiplication, etc. Specifically, an algorithm suitable for practical application scenarios can be selected.
  • For example, the input vectors can be linearly mapped into query (Q), key (K), and value (V) matrices; the product of Q and the transpose of K is processed by softmax and converted into an N x N attention matrix, which is multiplied with V to obtain a sequence representation containing N d-dimensional vectors, that is, the output of the SA module.
  • In this way, the self-attention model incorporates into h_i the similarity information between the word vector x_i of the input sequence and all other word vectors in the input sequence; that is, h_i depends on the information of every input word vector in the sentence. In other words, the output sequence contains global information: the vector representations learned for the sequence can capture long-distance dependencies, and the training and inference of the SA model parallelize well, so that more efficient training and inference of the SA model are achieved.
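A minimal sketch of the SA module computation just described, assuming the usual query/key/value projections (the projection layers and the scaling factor are standard practice and assumptions here, not details taken from the patent):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Self-attention (SA) module: each of the N input vectors attends to all the
    others, so every output vector h_i mixes in global, long-range context."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor):             # x: (N, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(0, 1) / math.sqrt(x.size(-1))  # N x N association degrees
        attn = scores.softmax(dim=-1)                # attention matrix
        return attn @ v                              # N d-dimensional output vectors
```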
  • In a possible implementation, the parameters of the SA modules within each network layer are the same, or the parameters of all SA modules in the SA model are the same, and so on. Therefore, the SA model provided in the present application has higher parameter efficiency, trains faster, occupies less memory, and has better generalization ability.
  • In a possible implementation, the self-attention model can also include a feature extraction network for extracting local features of the input sequence, that is, for extracting features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence; the local feature sequence can then be used as the input of the aforementioned multi-layer network. It can be understood that, in the embodiments of the present application, the feature extraction network performs feature extraction in units of multiple adjacent vectors, so as to extract vectors that include local information.
  • Local feature extraction makes it possible to pay attention to the local information of the input sequence and to extract detailed information from it, thereby obtaining an output sequence that represents semantics more accurately.
  • Specifically, the feature extraction network can be composed of multiple convolution kernels, and the width of a convolution kernel is usually the length of the unit over which features are extracted. For example, a convolution kernel of width 3 indicates that features are extracted in units of three adjacent initial vectors; that is, three adjacent initial vectors are used as the input of the convolution kernel, and the extracted local feature sequence is output.
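A sketch of such a feature extraction network using a 1-D convolution of width 3 follows; the channel counts, padding, and class name are illustrative assumptions rather than details from the patent:

```python
import torch
import torch.nn as nn

class LocalFeatureExtractor(nn.Module):
    """N-gram style feature extraction: each output vector is computed from a small
    window of adjacent word vectors, so it carries local information only."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)   # keep the sequence length unchanged

    def forward(self, x: torch.Tensor):           # x: (seq_len, d_model)
        x = x.transpose(0, 1).unsqueeze(0)        # -> (1, d_model, seq_len) for Conv1d
        local = self.conv(x)                      # convolution over adjacent word vectors
        return local.squeeze(0).transpose(0, 1)   # back to (seq_len, d_model)
```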
  • In a possible implementation, the self-attention model can also include a fusion network for fusing the local feature sequence and the output result of the multi-layer network to obtain the output sequence. It can be understood that the sequence output by the SA network carries global information, while the local feature sequence carries local information; after the two are fused, an output sequence that can represent both global and local information is obtained.
  • the self-attention model may include a feature extraction network 701 , an SA network 702 and a fusion network 703 , and the SA network is the network shown in the foregoing FIGS. 5 to 6B .
  • the feature extraction network 701 can extract features from the input sequence, and then use the obtained local feature sequence as the input of the SA network 702 .
  • the SA network 702 can be used to obtain an output result by calculating the degree of association between each word in the corpus and at least one adjacent word.
  • the input of the SA network 702 can be the local feature sequence output by the feature extraction network 701, and the SA network 702 can specifically calculate the degree of association between each word in the corpus and at least one adjacent word based on the local feature sequence, so as to obtain the output result , please refer to the related descriptions in the aforementioned FIG. 5 to FIG. 6B , which will not be repeated here.
  • the fusion network 703 can fuse the output result of the SA network 702 and the local feature sequence output by the feature extraction network 701 to obtain the output sequence.
  • Specifically, the fusion network 703 can calculate the similarity between the local feature sequence and the output result of the SA network 702, and then fuse the similarity with the local feature sequence to obtain the output sequence. In this way, global and local semantic information is fused to obtain more accurate contextual semantic information, so that the final output sequence can better represent the semantics of each word in the corpus, thereby improving the accuracy of downstream tasks.
  • In a possible implementation, the local feature sequence extracted by the feature extraction network 701 may also not be used as the input of the SA network 702; that is, the input sequence itself is used as the input of the SA network 702, and the local feature sequence output by the feature extraction network 701 and the output result of the SA network 702 are used as the inputs of the fusion network 703 to obtain the output sequence. In other words, the fusion network fuses the local feature sequence with the output result of the SA network to obtain the output sequence, which is equivalent to the output sequence containing both global information and local information; the global and local semantics in the corpus are both attended to, so the interpreted semantics are more accurate.
  • The fusion mentioned above or in the following embodiments of this application may refer to operations such as multiplying two values, weighted fusion, or direct concatenation; the specific fusion method can be selected according to the actual application scenario and is not limited in this application. For example, after two sequences are obtained, they can be multiplied to obtain the fused sequence. As another example, if two sequences of dimensions 5 and 8 are obtained, they can be directly concatenated to obtain a sequence of dimension 13. As yet another example, after two sequences are obtained, a corresponding weight value is assigned to each sequence, and a new sequence is obtained by weighted fusion.
  • In a possible implementation, the self-attention model provided in this application may also include networks related to downstream tasks, such as neural machine translation, text classification, or pre-trained language models.
  • the self-attention model may further include a classification network, or the output of the self-attention model may be used for input to the classification network to identify the category of the corpus corresponding to the input sequence.
  • the structure of a self-attention model including a classification network can be shown in Figure 9, where an N-Gram convolutional layer is used to learn the local information of the input text representation matrix (i.e., the input sequence). The N-Gram convolutional layer performs a convolution calculation on the input sequence representation, so each hidden state is obtained by convolving the representations of only a few adjacent elements (for example, two or three adjacent elements) and is represented by a vector; each hidden state therefore contains only the dependencies between an element and a limited number of adjacent elements, that is, local information. A sketch of such an N-Gram convolution is given below.
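  • A minimal sketch of such an N-Gram convolution over a sequence of word vectors is given below; the kernel size, feature dimension, and padding are illustrative assumptions, not the exact configuration of the layer in Figure 9:

```python
import torch
import torch.nn as nn

class NGramConv(nn.Module):
    """Each output hidden state is computed from only a few adjacent input vectors."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # 'same'-style padding keeps the output sequence the same length as the input
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):            # x: (batch, seq_len, dim)
        x = x.transpose(1, 2)        # Conv1d expects (batch, dim, seq_len)
        h = torch.relu(self.conv(x))
        return h.transpose(1, 2)     # back to (batch, seq_len, dim)

x = torch.randn(1, 6, 16)            # 6 words, each a 16-dimensional initial vector
h_local = NGramConv(16)(x)           # h_local[:, i] depends only on words i-1, i, i+1
```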
  • Figure 10 shows the ODE-based SA module of the present application, illustrating the ODE-based SA calculation of the first layer and the second layer: the first layer contains 4 SA modules, the second layer contains 3 SA modules, and the second layer additionally multiplexes (reuses) the states of the previous layer. The architecture and parameters of each layer above the second layer are usually the same as those of the second layer, and each SA module is usually the same, as shown in the lower half of Figure 10.
  • G(·) represents the hidden state output by the corresponding layer, a_i and c_i are the model parameters of the ODE SA module, and r is a hyperparameter of the SA model, which is usually a fixed value. h_L represents the word vector sequence containing local information output by the feature extraction network, and h_G represents the word vector sequence containing global information output by the ODE SA network. The fusion network uses an attention mechanism and a gate mechanism to fuse the two, obtaining a word vector sequence that combines local information and global information, namely h_O, for example in the form h_O = Layer_norm(W_L * h_L + W_E * h_E); a rough sketch of such a fusion is given below.
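  • The sketch below shows one plausible way to combine h_L and h_G with an attention step followed by a gate-style combination and layer normalization, matching the general form of the expression above; the weight matrices W_L and W_E, the reading of h_E as the attention-enhanced sequence, and the scaled dot-product similarity are assumptions made for illustration, not the exact fusion network of Figure 11:

```python
import torch
import torch.nn as nn

class AttnGateFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_l = nn.Linear(dim, dim, bias=False)   # plays the role of W_L
        self.w_e = nn.Linear(dim, dim, bias=False)   # plays the role of W_E
        self.norm = nn.LayerNorm(dim)

    def forward(self, h_l, h_g):                     # both: (batch, seq_len, dim)
        # attention: similarity between the local sequence and the global sequence
        scores = torch.softmax(h_l @ h_g.transpose(1, 2) / h_l.size(-1) ** 0.5, dim=-1)
        h_e = scores @ h_g                           # assumed attention-enhanced representation
        # gate-style combination followed by layer normalization, i.e. h_O
        return self.norm(self.w_l(h_l) + self.w_e(h_e))

h_l = torch.randn(1, 6, 16)                          # local-information word vector sequence
h_g = torch.randn(1, 6, 16)                          # global-information word vector sequence
h_o = AttnGateFusion(16)(h_l, h_g)                   # fused sequence combining both
```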
  • the classification network may be used to identify the category of the sentence corresponding to the input vectors, or the classification of the nouns included in that sentence. For example, if the text corresponding to the input sequence is "Your mobile phone is very good", the category corresponding to the text is identified as "mobile phone". A minimal sketch of such a classification head is given below.
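  • A minimal sketch of a classification head on top of the output sequence is shown below; mean pooling and the number of categories are illustrative assumptions rather than the classification network actually used:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, output_seq):          # (batch, seq_len, dim) from the self-attention model
        pooled = output_seq.mean(dim=1)     # average the word vectors into one sentence vector
        return self.fc(pooled)              # logits over categories such as "mobile phone"

logits = ClassificationHead(16, num_classes=10)(torch.randn(1, 6, 16))
```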
  • the self-attention model may also include a translation network; the output sequence is used as the input of the translation network, which outputs the corpus corresponding to the input sequence in a different language, and the corpus obtained after translation is referred to here as the second corpus.
  • the language of the first corpus and the language of the second corpus are different.
  • the SA network is used to analyze the meaning of each word in the first corpus, so as to obtain an output sequence that can represent the semantics of the first corpus; the output sequence is then used as the input of the translation network to obtain the translation result corresponding to the first corpus. For example, if the first corpus corresponding to the input sequence is the Chinese sentence "你的手机很不错", the SA network can analyze the semantics of each word in the text, and the translation network then outputs the corresponding English "Your cell phone is very nice".
  • the structure of the self-attention model including the translation network is similar to the structure shown in the aforementioned Figures 9-11, the difference is only in the structure of the translation network.
  • the classification network or translation network mentioned in this application can be selected from deep convolutional neural networks (DCNN), recurrent neural networks (RNNs), and so on. More generally, the neural networks mentioned in this application may be of various types, such as a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a residual network, or other neural networks.
  • the ODE-based self-attention model can be applied to a variety of scenarios and can complete a variety of downstream tasks; it has strong generalization ability and can adapt to more scenarios.
  • the structure of the self-attention model provided by the present application has been introduced in the foregoing, and the natural language processing method provided by the present application will be described in detail below based on the self-attention model provided by the foregoing FIGS. 4A to 11 .
  • FIG. 12 is a schematic flowchart of a natural language processing method provided by the present application, as described below.
  • the input sequence includes the initial vector representation corresponding to each word in the corpus.
  • each word in the corpus is converted into an initial vector representation according to a preset mapping relationship; each word corresponds to one initial vector representation, and one or more initial vector representations form the input sequence. For example, a mapping table can be set up in advance in which each word is set to correspond to a vector, such as "now" corresponding to a vector x_1 and "day" corresponding to a vector x_2, and so on. A minimal sketch of such a mapping is given below.
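  • A minimal sketch of such a preset word-to-vector mapping is given below; the vocabulary entries and the use of a learned embedding table are illustrative assumptions (a fixed lookup table would be used in the same way):

```python
import torch
import torch.nn as nn

vocab = {"now": 0, "day": 1, "weather": 2, "good": 3}   # hypothetical mapping table
embed = nn.Embedding(len(vocab), 16)                    # each word id maps to a 16-dim vector

words = ["now", "day", "weather", "good"]
ids = torch.tensor([[vocab[w] for w in words]])         # shape (1, 4)
input_sequence = embed(ids)                             # shape (1, 4, 16): one initial vector per word
```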
  • the output sequence includes a vector representation corresponding to at least one word in the first corpus after natural language processing by the self-attention model, and each vector in the output sequence represents the semantic information of the corresponding word of the first corpus in the first corpus. This semantic information incorporates the context information of the first corpus and can therefore accurately represent the exact meaning of each word in the first corpus.
  • the self-attention model includes a multi-layer network, where the input of any layer other than the first is the output of the previous layer. Each layer includes multiple self-attention modules; each self-attention module calculates, based on the input vectors, a degree of association representing how strongly each word in the first corpus is related to at least one adjacent word, and fuses the degree of association with the input vectors to obtain a first sequence. The output of each layer is obtained by fusing the first sequences output by its own self-attention modules with the first sequences output by the self-attention modules in the previous layer. In other words, the self-attention model is realized through the mechanism of a neural ordinary differential equation (ODE) network: by repeatedly calling and summing one layer of self-attention modules, it fits the effect of multi-layer self-attention and obtains an output sequence that expresses the semantics more accurately. In addition, a state-reuse mechanism is introduced into the neural ODE: when the ODE fits each layer (except the first layer), the hidden state information inside the previous layer is reused, thereby improving the calculation speed. That is, the current network layer can reuse the output of the SA modules of the previous layer, so that the self-attention model can quickly obtain more accurate output results, the computational complexity of the model is reduced, and the training and inference efficiency of the self-attention model is improved. A simplified sketch of this layer computation is given below.
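  • A highly simplified sketch of this idea follows: one self-attention module is called several times within a layer (a fixed-step, Euler-style discretization of an ODE), and the hidden states produced inside the previous layer are reused by the next layer. The step counts, step size, and the averaging reuse rule are assumptions made for illustration and are not the exact update of Figure 10:

```python
import torch
import torch.nn as nn

class SAModule(nn.Module):
    """One self-attention module: association degrees fused with the input vectors."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                # x: (batch, seq_len, dim)
        scores = torch.softmax(
            self.q(x) @ self.k(x).transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return x + scores @ self.v(x)    # fuse the association degrees with the input

class ODESALayer(nn.Module):
    """One network layer: repeated calls to a shared SA module, summed ODE-style,
    optionally reusing the hidden states of the previous layer."""
    def __init__(self, dim, steps=4, step_size=0.25):
        super().__init__()
        self.sa = SAModule(dim)          # the same module (same parameters) is reused each call
        self.steps, self.h = steps, step_size

    def forward(self, x, reuse_states=None):
        states = []
        for i in range(self.steps):
            f = self.sa(x)
            if reuse_states is not None and i < len(reuse_states):
                f = 0.5 * (f + reuse_states[i])   # reuse the previous layer's hidden state
            states.append(f)
            x = x + self.h * f                    # Euler-style accumulation
        return x, states

layer1, layer2 = ODESALayer(16, steps=4), ODESALayer(16, steps=3)
x = torch.randn(1, 6, 16)
y1, s1 = layer1(x)          # first layer: nothing to reuse
y2, s2 = layer2(y1, s1)     # second layer reuses the first layer's hidden states
```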
  • the input sequence can be directly used as the input of the SA network to obtain the output sequence.
  • the self-attention model can be the self-attention model shown in the aforementioned Figures 5-6B. After the input sequence is obtained, the input sequence can be input into the SA network, and then the SA network outputs the corresponding output sequence.
  • alternatively, the input sequence can be used as the input of the feature extraction network to extract the local feature sequence of the input sequence; the local feature sequence is then used as the input of the SA network, the output result of the SA network and the local feature sequence are input into the fusion network, and the fusion network fuses them to obtain the final output sequence. The local feature sequence extracted by the feature extraction network may also not be used as the input of the SA network: the input sequence is used directly as the input of the SA network, and the fusion network fuses the output of the SA network and the local feature sequence to obtain the final output sequence. In either case, local features are extracted from the input sequence and fused with the output result of the SA network, so that the final self-attention result attends to local features and a sequence that more accurately represents the semantics of each word is obtained.
  • in addition, the self-attention model can be implemented with the same parameters shared across the SA modules, so that the self-attention model occupies a reduced amount of storage and the efficiency of training and forward inference of the self-attention model can be improved. By realizing the self-attention model by means of an ODE, the parameter redundancy problem caused by stacking multiple layers in some SA models is solved: the effect of the original multi-layer network can be achieved using the parameter amount of a single layer.
  • before the input sequence is obtained, the self-attention model can also be trained by using a training set. The training set can include at least one corpus, where each corpus includes at least one word and each corpus has a corresponding label. The corpus in the training set is used as the input of the self-attention model to obtain the inference result of the self-attention model; the gradient value is then calculated using the adjoint ODE algorithm, and the parameters of the self-attention model are updated based on the gradient, so that the output of the self-attention model is closer to the label corresponding to the corpus. Therefore, in the embodiments of the present application, a numerical solution method is used to obtain the gradient, which greatly reduces memory consumption and gradient error compared with the backpropagation algorithm. A rough sketch of such a training step is given below.
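  • A rough sketch of one training step under these assumptions is given below. It assumes the open-source torchdiffeq package, whose odeint_adjoint routine computes gradients with the adjoint method (a numerical backward ODE solve) instead of storing the full forward trajectory; the dynamics function, the corpus batch, and the labels are placeholders rather than the actual model of this application:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint   # assumed dependency

class SADynamics(nn.Module):
    """dy/dt = f(t, y): here f is a single self-attention-style transformation (placeholder)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, t, y):                         # y: (batch, seq_len, dim)
        attn = torch.softmax(self.q(y) @ self.k(y).transpose(1, 2) / y.size(-1) ** 0.5, dim=-1)
        return attn @ self.v(y)

dim, num_classes = 16, 2
dynamics = SADynamics(dim)
classifier = nn.Linear(dim, num_classes)
optimizer = torch.optim.Adam(list(dynamics.parameters()) + list(classifier.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 6, dim)                           # placeholder batch of embedded corpora
labels = torch.randint(0, num_classes, (8,))         # placeholder labels

for _ in range(10):
    t = torch.linspace(0.0, 1.0, 2)                  # integrate the hidden state from t=0 to t=1
    y_final = odeint(dynamics, x, t)[-1]             # gradients flow through the adjoint ODE solve
    loss = loss_fn(classifier(y_final.mean(dim=1)), labels)
    optimizer.zero_grad()
    loss.backward()                                  # no stored trajectory, low memory consumption
    optimizer.step()
```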
  • the self-attention model and the natural language processing method provided by the present application are described in detail above, and the apparatus for carrying the self-attention model or for executing the foregoing natural language processing method is described in detail below.
  • the present application provides a natural language processing apparatus, including:
  • an acquisition module 1301, configured to acquire an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in the first corpus;
  • the processing module 1302 is configured to use the input sequence as the input of the self-attention model and obtain the output sequence, where the output sequence includes the vector representation corresponding to at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents the semantic information of each word of the first corpus in the first corpus;
  • the self-attention model includes a multi-layer network, the input of any layer of the multi-layer network is the output of the previous layer, and each layer includes multiple self-attention modules. Each self-attention module is used to calculate the degree of association based on the input vectors and to fuse the degree of association with the input vectors to obtain a first sequence; the output of each layer is obtained by fusing the first sequences output by its multiple self-attention modules with the first sequences output by the multiple self-attention modules in the previous layer, where the degree of association represents the degree of association between each word in the first corpus and at least one adjacent word.
  • in a possible implementation, the self-attention model further includes a feature extraction network, which is used to extract features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, where the local feature sequence is used as the input to the multi-layer network.
  • the self-attention model further includes a fusion network, and the fusion network is used to fuse the local feature sequence and the output result of the multi-layer network to obtain the output sequence.
  • the fusion network is specifically used to: calculate the similarity between the local feature sequence and the output result, and fuse the similarity and the local feature sequence to obtain the output sequence.
  • the self-attention model further includes a classification network, the input of the classification network is an output sequence, and the category corresponding to the first corpus is output.
  • in a possible implementation, the self-attention model further includes a translation network; the language type of the first corpus is a first type, the input of the translation network is the output sequence, and the translation network outputs a second corpus whose language type is a second type, where the first type and the second type are different languages.
  • the parameters of multiple self-attention modules in each layer of the multi-layer network are the same.
  • FIG. 14 is a schematic structural diagram of another natural language processing apparatus provided by the present application, as described below.
  • the natural language processing apparatus may include a processor 1401 and a memory 1402 .
  • the processor 1401 and the memory 1402 are interconnected by wires.
  • the memory 1402 stores program instructions and data.
  • the memory 1402 stores program instructions and data corresponding to the steps in the foregoing FIGS. 4A to 12 .
  • the processor 1401 is configured to execute the method steps executed by the natural language processing apparatus shown in any of the foregoing embodiments in FIG. 4A to FIG. 12 .
  • the natural language processing apparatus may further include a transceiver 1403 for receiving or transmitting data.
  • Embodiments of the present application also provide a computer-readable storage medium in which a program is stored; when the program is run on a computer, the computer is made to execute the steps in the methods described in the embodiments shown in the foregoing FIG. 4A to FIG. 12.
  • optionally, the aforementioned natural language processing apparatus shown in FIG. 14 may be a chip.
  • the embodiments of the present application also provide a natural language processing device, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps executed by the natural language processing apparatus shown in any of the foregoing embodiments in FIG. 4A to FIG. 12.
  • the embodiments of the present application also provide a digital processing chip.
  • the digital processing chip integrates circuits and one or more interfaces for implementing the functions of the above-mentioned processor 1401.
  • the digital processing chip can perform the method steps of any one or more of the foregoing embodiments.
  • when the digital processing chip does not integrate a memory, it can be connected with an external memory through a communication interface, and the digital processing chip implements the actions performed by the natural language processing apparatus in the above embodiments according to the program codes stored in the external memory.
  • Embodiments of the present application also provide a computer program product that, when run on a computer, causes the computer to execute the steps performed by the natural language processing apparatus in the methods described in the embodiments shown in the foregoing FIGS. 4A-12.
  • the natural language processing apparatus may be a chip, and the chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the server executes the natural language processing method described in the embodiments shown in FIG. 4A to FIG. 12 .
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM), etc.
  • the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (neural-network processing unit, NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general purpose processor may be a microprocessor or it may be any conventional processor or the like.
  • FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the application.
  • the chip can be represented as a neural network processor NPU 150; the NPU 150 is mounted on the main CPU (Host CPU), and tasks are allocated by the Host CPU.
  • the core part of the NPU is the arithmetic circuit 1503, which is controlled by the controller 1504 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 1503 includes multiple processing units (process engines, PEs). In some implementations, the arithmetic circuit 1503 is a two-dimensional systolic array. The arithmetic circuit 1503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1502 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 1501 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 1508 .
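  • Functionally, the multiply-and-accumulate flow described above (matrix B buffered as weights, matrix A streamed in, partial results collected in the accumulator) can be modeled as a blocked matrix multiplication; the following is only a behavioral sketch, not a description of the actual PE array:

```python
import numpy as np

def matmul_accumulate(A, B, tile=4):
    """Multiply A (m x k) by B (k x n), accumulating partial results over tiles of
    the shared dimension k, the way an accumulator would collect them."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    acc = np.zeros((m, n))                 # plays the role of the accumulator
    for start in range(0, k, tile):
        a_tile = A[:, start:start + tile]  # a slice of matrix A streamed from input memory
        b_tile = B[start:start + tile, :]  # the corresponding buffered weights of matrix B
        acc += a_tile @ b_tile             # partial result accumulated
    return acc

A = np.random.rand(3, 8)
B = np.random.rand(8, 5)
assert np.allclose(matmul_accumulate(A, B), A @ B)
```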
  • Unified memory 1506 is used to store input data and output data.
  • the weight data is transferred directly to the weight memory 1502 through the storage unit access controller (direct memory access controller, DMAC) 1505.
  • Input data is also moved into unified memory 1506 via the DMAC.
  • a bus interface unit (BIU) 1510 is used for the interaction among the AXI bus, the DMAC, and an instruction fetch buffer (IFB) 1509. Specifically, the bus interface unit 1510 is used by the instruction fetch memory 1509 to obtain instructions from the external memory, and is also used by the storage unit access controller 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1506 or the weight data to the weight memory 1502 or the input data to the input memory 1501 .
  • the vector calculation unit 1507 includes a plurality of operation processing units and, if necessary, further processes the output of the operation circuit, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and the like. It is mainly used for non-convolutional/fully connected layer computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1507 can store the vector of processed outputs to the unified memory 1506 .
  • the vector calculation unit 1507 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1503, such as linear interpolation of the feature plane extracted by the convolutional layer, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 1507 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1503, such as for use in subsequent layers in a neural network.
  • the instruction fetch buffer (IFB) 1509 connected to the controller 1504 is used to store the instructions used by the controller 1504;
  • the unified memory 1506, the input memory 1501, the weight memory 1502 and the instruction fetch memory 1509 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the operations of each layer in a recurrent neural network can be performed by the operation circuit 1503 or the vector calculation unit 1507.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the above-mentioned methods of FIGS. 4A-12 .
  • the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the stored instructions enable a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of the present application, and the aforementioned storage medium includes media that can store program code, such as a USB flash drive (U disk), a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., via coaxial cable, optical fiber, or digital subscriber line (DSL)) or in a wireless manner (e.g., via infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.

Abstract

A method and device for natural language processing, for use in the field of artificial intelligence. The method comprises: acquiring an input sequence (1201), which comprises an initial vector expression corresponding to at least one word in a first corpus; and producing an output sequence with the input sequence serving as an input for a self-attention model (1202), the output sequence expressing semantic information of each word in the first corpus. The self-attention model comprises multiple layers of networks, and each layer of network comprises multiple self-attention modules used for calculating degrees of relevance on the basis of an input vector, that is, the degrees of relevance between each word and neighboring words in the first corpus, and merging the degrees of relevance with the input vector to produce first sequences; the first sequences outputted by the multiple self-attention modules and the outputs of the multiple self-attention modules in the preceding layer of network are merged to produce the outputs of the current layer of network. The method allows, during natural language processing, the efficient interpretation of improved semantic information for each word in the corpus thereof.

Description

A natural language processing method and device
This application claims priority to the Chinese patent application No. CN202110077612.0, entitled "A natural language processing method and device" and filed with the China Patent Office on January 20, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a natural language processing method and device.
Background
The self-attention (SA) model is a core component of the natural language processing field and has a very wide range of applications, such as machine translation and pre-trained language models. A self-attention model encodes a piece of sequence data into several vector representations by calculating the dependencies between words, so that each output word vector representation contains its contextual semantic information. Therefore, how to use a self-attention model to interpret better semantic information for each word and to convert each word into a better vector representation has become an urgent problem to be solved.
Summary of the Invention
The present application provides a natural language processing method and device, which, in the process of natural language processing, add reuse of the hidden states of the previous network layer in each network layer of the self-attention model, so as to interpret the better semantic information of each word in its corpus more efficiently.
In view of this, in a first aspect, the present application provides a natural language processing method, including: first acquiring an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in a first corpus, and usually one word corresponds to one initial vector representation; and using the input sequence as the input of a self-attention model to obtain an output sequence, where the output sequence includes a vector representation corresponding to the at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents the semantic information of each word of the first corpus in the first corpus; this semantic information incorporates the context information of the first corpus and can accurately express the exact meaning of each word in the first corpus. The self-attention model includes a multi-layer network, the input of any layer other than the first layer is the output of the previous layer, and each layer includes multiple self-attention modules; each self-attention module is used to calculate, based on the input vectors, a degree of association, that is, the degree of association between each word in the first corpus and at least one adjacent word, and to fuse the degree of association with the input vectors to obtain a first sequence. The output of each layer is obtained by fusing the first sequences output by its multiple self-attention modules with the first sequences output by the multiple self-attention modules in the previous layer.
Therefore, in the embodiments of the present application, the self-attention model is realized through the mechanism of a neural ordinary differential equation (ODE) network: by repeatedly calling and summing one layer of self-attention modules, the fitting of multi-layer self-attention is achieved and an output sequence that expresses the semantics more accurately is obtained. In addition, a state-reuse mechanism is introduced into the neural ODE: during the fitting of each layer (except the first layer), the hidden state information inside the previous layer is reused, thereby improving the calculation speed. That is, the current network layer can reuse the output of the SA modules of the previous layer, so that the self-attention model can quickly obtain more accurate output results, the computational complexity of the model is reduced, and the training and inference efficiency of the self-attention model is improved.
In a possible implementation, the self-attention model further includes a feature extraction network, and the above method may further include: the feature extraction network extracts features according to multiple adjacent initial vector representations in the input sequence to obtain a local feature sequence, and uses the local feature sequence as the input of the multi-layer network.
Therefore, in the embodiments of the present application, local features can be extracted from the input sequence and used as the input of the multi-layer network, so that the multi-layer network can refer to local information when calculating the degree of association; attention to local information is increased, and the output sequence interprets local information better.
In a possible implementation, the self-attention model further includes a fusion network, and the above method may further include: the fusion network fuses the local feature sequence and the output result of the multi-layer network to obtain the output sequence.
Therefore, in the embodiments of the present application, local features can be extracted from the input sequence and fused with the output result of the SA network to obtain the output sequence, so that local features are incorporated in the output sequence. The final self-attention result thus attends to local features, and a sequence that more accurately represents the semantics of each word is obtained.
In a possible implementation, the fusion, by the fusion network, of the local feature sequence and the output result of the multi-layer network may specifically include: the fusion network calculates the similarity between the local feature sequence and the output result, and fuses the similarity and the local feature sequence to obtain the output sequence.
In the embodiments of the present application, the similarity and the local features can be fused by calculating the similarity, so that the obtained output sequence contains local information and the represented semantics is more accurate.
In a possible implementation, the self-attention model further includes a classification network, and the above method may further include: using the output sequence as the input of the classification network and outputting the category corresponding to the first corpus. Therefore, the method provided by the embodiments of the present application can be applied to classification scenarios: by adding a classification network to the self-attention model, the classification of the corpus can be realized.
In a possible implementation, the self-attention model further includes a translation network, and the above method may further include: using the output sequence as the input of the translation network and outputting a second corpus, where the language type of the first corpus is a first type, the language type of the second corpus is a second type, and the first type and the second type are different languages. Therefore, the method provided by the embodiments of the present application can also be applied to translation scenarios: by adding a translation network to the self-attention model, the corpus corresponding to the input sequence can be translated into a corpus of a different language.
In a possible implementation, the parameters of the multiple self-attention modules in each layer of the multi-layer network are the same. Therefore, in the embodiments of the present application, the self-attention model can be implemented with shared SA module parameters, so that the self-attention model occupies a reduced amount of storage and the efficiency of training and forward inference of the self-attention model is improved. By introducing the ODE into the SA mechanism and realizing the SA model by means of the ODE, the parameter redundancy problem caused by stacking multiple layers in some SA models is solved, and the effect of the original multi-layer network can be achieved with the parameter amount of a single layer.
In a possible implementation, before the input sequence is acquired, the self-attention model may also be trained by using a training set. The training set may include at least one corpus, each corpus includes at least one word, and each corpus is converted into a sequence including the initial vector representation of each word; the sequence is then input into the self-attention model to obtain an output sequence. A gradient value can then be calculated by the adjoint ODE algorithm, and the parameters of the self-attention model are updated based on the gradient, so that the output of the self-attention model is closer to the label corresponding to the corpus. Therefore, in the embodiments of the present application, a numerical solution method is used to obtain the gradient, which greatly reduces memory consumption and gradient error compared with the backpropagation algorithm.
In a second aspect, the present application provides a self-attention model. The self-attention model includes a multi-layer network, the input of any layer of the multi-layer network is the output of the previous layer, and each layer includes multiple self-attention modules; each self-attention module is used to calculate a degree of association based on the input vectors and to fuse the degree of association with the input vectors to obtain a first sequence. The output of each layer is obtained by fusing the first sequences output by its multiple self-attention modules with the first sequences output by the multiple self-attention modules in the previous layer, where the degree of association represents the degree of association between each word in a first corpus and at least one adjacent word. An input sequence is used as the input of the self-attention model to obtain an output sequence, where the output sequence includes a vector representation corresponding to at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents the semantic information of each word of the first corpus in the first corpus.
Therefore, in the embodiments of the present application, the self-attention model is realized through the mechanism of a neural ODE network: by repeatedly calling and summing one layer of self-attention modules, the fitting of multi-layer self-attention is achieved and an output sequence that expresses the semantics more accurately is obtained. In addition, a state-reuse mechanism is introduced into the neural ODE: during the fitting of each layer (except the first layer), the hidden state information inside the previous layer is reused, thereby improving the calculation speed. That is, the current network layer can reuse the output of the SA modules of the previous layer, so that the self-attention model can quickly obtain more accurate output results, the computational complexity of the model is reduced, and the training and inference efficiency of the self-attention model is improved.
In a possible implementation, the self-attention model further includes a feature extraction network, which is used to extract features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, and to use the local feature sequence as the input of the multi-layer network.
Therefore, in the embodiments of the present application, local features can be extracted from the input sequence and used as the input of the multi-layer network, so that the multi-layer network can refer to local information when calculating the degree of association; attention to local information is increased, and the output sequence interprets local information better.
In a possible implementation, the self-attention model further includes a fusion network, which is used to fuse the local feature sequence and the output result of the multi-layer network to obtain the output sequence.
Therefore, in the embodiments of the present application, local features can be extracted from the input sequence and fused with the output result of the SA network to obtain the output sequence, so that the final self-attention result attends to local features and a sequence that more accurately represents the semantics of each word is obtained.
In a possible implementation, the fusion network is specifically used to calculate the similarity between the local feature sequence and the output result, and to fuse the similarity and the local feature sequence to obtain the output sequence.
In the embodiments of the present application, the similarity and the local features can be fused by calculating the similarity, so that the obtained output sequence contains local information and the represented semantics is more accurate.
In a possible implementation, the self-attention model further includes a classification network; the input of the classification network is the output sequence, and the classification network outputs the category corresponding to the first corpus. Therefore, the solution provided by the embodiments of the present application can be applied to classification scenarios: by adding a classification network to the self-attention model, the classification of the corpus can be realized.
In a possible implementation, the self-attention model further includes a translation network; the language type of the first corpus is a first type, the input of the translation network is the output sequence, and the translation network outputs a second corpus whose language type is a second type, the first type and the second type being different languages. Therefore, the solution provided by the embodiments of the present application can also be applied to translation scenarios: by adding a translation network to the self-attention model, the corpus corresponding to the input sequence can be translated into a corpus of a different language.
In a possible implementation, the parameters of the multiple self-attention modules in each layer of the multi-layer network are the same. Therefore, in the embodiments of the present application, the self-attention model can be implemented with shared SA module parameters, so that the self-attention model occupies a reduced amount of storage and the efficiency of training and forward inference is improved. By introducing the ODE into the SA mechanism and realizing the SA model by means of the ODE, the parameter redundancy problem caused by stacking multiple layers in some SA models is solved, and the effect of the original multi-layer network can be achieved with the parameter amount of a single layer.
In a third aspect, an embodiment of the present application provides a natural language processing apparatus, which has the function of implementing the natural language processing method of the first aspect. The function can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a fourth aspect, an embodiment of the present application provides a natural language processing apparatus, including a processor and a memory, where the processor and the memory are interconnected by a line, and the processor invokes program code in the memory to execute the processing-related functions of the natural language processing method shown in any implementation of the first aspect. Optionally, the natural language processing apparatus may be a chip.
In a fifth aspect, an embodiment of the present application provides a natural language processing apparatus, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the processing-related functions of the first aspect or any optional implementation of the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the method of the first aspect or any optional implementation of the first aspect.
In a seventh aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method of the first aspect or any optional implementation of the first aspect.
Description of Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main framework to which the present application is applied;
FIG. 2 is a schematic diagram of a system architecture provided by the present application;
FIG. 3 is a schematic diagram of another system architecture provided by the present application;
FIG. 4A is a schematic structural diagram of a self-attention model provided by an embodiment of the present application;
FIG. 4B is a schematic structural diagram of a sequence conversion provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
FIG. 6A is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
FIG. 6B is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of another self-attention model provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a fusion network provided by an embodiment of the present application;
FIG. 12 is a schematic flowchart of a natural language processing method provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a natural language processing apparatus provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of another natural language processing apparatus provided by an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
First, the overall workflow of an artificial intelligence system is described. Referring to FIG. 1, FIG. 1 shows a schematic structural diagram of an artificial intelligence main framework. The artificial intelligence theme framework is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output; in this process, the data undergoes a refinement process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips, that is, hardware acceleration chips such as a central processing unit (CPU), a network processor (neural-network processing unit, NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field programmable gate array (FPGA); the basic platform includes platform guarantees and support related to distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, and the data is provided to the intelligent chips in the distributed computing system provided by the basic platform for calculation.
(2) Data
The data on the upper layer of the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, and also involves Internet-of-Things data of traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to perform machine thinking and solve problems according to a reasoning control strategy; typical functions are search and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned about, and usually provides functions such as classification, sorting, and prediction.
(4) General capabilities
After the data is processed as described above, some general capabilities can be formed based on the results of the data processing, for example an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, safe cities, and the like.
The embodiments of the present application involve applications of neural networks and natural language processing (NLP). For a better understanding of the solutions in the embodiments of the present application, related terms and concepts of neural networks that may be involved in the embodiments of the present application are first introduced below.
Corpus: also called free text, which can be a character, a word, a sentence, a fragment, an article, or any combination thereof. For example, "The weather is really nice today" is a corpus.
Self-attention model: a model that effectively encodes a piece of sequence data (such as the natural corpus "Your mobile phone is very good.") into several multi-dimensional vectors that are convenient for numerical operations, where the multi-dimensional vectors incorporate the mutual similarity information between the elements of the sequence; this similarity is called self-attention.
Neural ordinary differential equation network (neural ordinary differential equations network, ODENet): an implementation approach for time-dependent or multi-step/multi-layer neural networks. A neural ODE network can fit the outputs of a given neural network at different consecutive time points or at different steps/layers; that is, it uses one set of parameters to fit the outputs of the original neural network at multiple consecutive time points or multiple steps/layers, and therefore has high parameter efficiency.
Loss function: also known as a cost function, a measure of the difference between the prediction output of a machine learning model for a sample and the true value (also known as the supervision value) of the sample, that is, it is used to measure the difference between the predicted output of the machine learning model for a sample and the true value of the sample. The loss function may generally include loss functions such as mean squared error, cross-entropy, logarithmic, and exponential loss functions. For example, the mean squared error may be used as the loss function, defined as L = (1/N)·Σ_{i=1..N} (y_i − ŷ_i)², where y_i is the true value of the i-th sample, ŷ_i is the corresponding prediction, and N is the number of samples.
The specific loss function may be selected according to the actual application scenario.
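For illustration only, the mean-squared-error loss described above could be computed as in the following minimal sketch (the function and variable names are chosen for illustration and are not part of the original text):

```python
import numpy as np

def mse_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error between supervision values and model predictions."""
    # Average of the squared differences over all N samples.
    return float(np.mean((y_true - y_pred) ** 2))

# Example: three samples with small prediction errors give a small loss.
print(mse_loss(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))  # ~0.02
```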
Gradient: the vector of derivatives of the loss function with respect to the parameters.
Stochastic gradient: in machine learning the number of samples is very large, so each computed loss function is calculated from randomly sampled data, and the corresponding gradient is called a stochastic gradient.
Back propagation (BP): an algorithm that computes the gradients of the model parameters according to the loss function and updates the model parameters.
Adjoint ordinary differential equation (adjoint ODE) algorithm: a reverse-update algorithm for training ordinary differential equations (ODE). It obtains the gradient with a numerical solver, that is, it solves for the gradient value directly, which, compared with the back-propagation algorithm, greatly reduces memory consumption and gradient error.
Neural machine translation: a typical natural language processing task. The task is a technique of, given a sentence in a source language, outputting the corresponding sentence in a target language. In commonly used neural machine translation models, the words in the source-language and target-language sentences are all encoded into vector representations, and the associations between words and between sentences are computed in the vector space to perform the translation task.
Pre-trained language model (PLM): a natural-language sequence encoder that encodes each word in a natural-language sequence into a vector representation in order to perform prediction tasks. The training of a PLM includes two stages, namely a pre-training stage and a fine-tuning stage. In the pre-training stage, the model is trained on a language-model task over large-scale unsupervised text, thereby learning word representations. In the fine-tuning stage, the model is initialized with the parameters learned in the pre-training stage and is trained for relatively few steps on downstream tasks such as text classification or sequence labeling, so that the semantic information obtained in pre-training can be successfully transferred to the downstream tasks.
Embedding: refers to the feature representation of a sample.
The natural language processing method provided by the embodiments of this application may be executed on a server, and may also be executed on a terminal device. The terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a video camera, a smart watch, a wearable device (WD), an autonomous vehicle, or the like, which is not limited in the embodiments of this application.
Referring to FIG. 2, an embodiment of this application provides a system architecture 200. The system architecture includes a database 230 and a client device 240. A data collection device 260 is configured to collect data and store it in the database 230, and a training module 202 generates a target model/rule 201 based on the data maintained in the database 230. How the training module 202 obtains the target model/rule 201 based on the data will be described in more detail below. The target model/rule 201 is the neural network mentioned in the following embodiments of this application; for details, refer to the related descriptions in FIG. 4A to FIG. 12 below.
The computing module may include the training module 202, and the target model/rule obtained by the training module 202 may be applied in different systems or devices. In FIG. 2, an execution device 210 is configured with a transceiver 212, which may be a wireless transceiver, an optical transceiver, a wired interface (such as an I/O interface), or the like, for data interaction with external devices. A "user" may input data to the transceiver 212 through the client device 240; for example, the client device 240 may send a target task to the execution device 210, request the execution device to train a neural network, and send the execution device 210 a database for training.
The execution device 210 may call data, code, and the like in a data storage system 250, and may also store data, instructions, and the like into the data storage system 250.
The computing module 211 processes the input data using the target model/rule 201. Specifically, the computing module 211 is configured to: obtain an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in a first corpus; and use the input sequence as the input of a self-attention model to obtain an output sequence, where the output sequence includes a vector representation corresponding to at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents the semantic information of each word of the first corpus within the first corpus. The self-attention model is obtained by training with a training set; the training set includes at least one corpus, and each corpus includes at least one word. The self-attention model includes a multi-layer network, where the input of any layer of the multi-layer network is the output of the previous layer. Each layer of the network includes a plurality of self-attention modules; each self-attention module is configured to compute, based on the input vectors, an association degree representing the degree of association between each word in the first corpus and at least one adjacent word, and to fuse the association degree with the input vectors to obtain a first sequence. The layers of the multi-layer network are connected to one another, and the output of each layer is obtained by fusing the first sequences output by the plurality of self-attention modules of that layer with the first sequences output by the plurality of self-attention modules of the previous layer.
Finally, the transceiver 212 returns the constructed neural network to the client device 240, so as to deploy the neural network in the client device 240 or in other devices.
Going deeper, the training module 202 can obtain corresponding target models/rules 201 based on different data for different tasks, so as to provide better results for users.
In the case shown in FIG. 2, the data input into the execution device 210 may be determined according to the user's input data; for example, the user may operate in an interface provided by the transceiver 212. In another case, the client device 240 may automatically input data to the transceiver 212 and obtain the result. If the client device 240 needs the user's authorization to input data automatically, the user may set the corresponding permission in the client device 240. The user may view the result output by the execution device 210 on the client device 240, and the specific presentation form may be display, sound, action, or another specific manner. The client device 240 may also act as a data collection end, storing the collected data associated with the target task into the database 230.
The training or update process mentioned in this application may be performed by the training module 202. It can be understood that the training process of a neural network is the process of learning how to control the spatial transformation, and more specifically, learning the weight matrices. The purpose of training a neural network is to make the output of the neural network as close to the expected value as possible. Therefore, the predicted value of the current network can be compared with the expected value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, the weight vectors are usually initialized before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the values of the weights in the weight matrices are adjusted to lower the predicted value, and the adjustment continues until the value output by the neural network is close to or equal to the expected value. Specifically, the difference between the predicted value and the expected value of the neural network can be measured by a loss function or an objective function. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, and the training of the neural network can be understood as the process of reducing the loss as much as possible. Reference may be made to this process for the processes of updating the weights of the starting-point network and training the serial network in the following embodiments of this application, which will not be repeated below.
As shown in FIG. 2, the target model/rule 201 is obtained by training with the training module 202. In this embodiment of this application, the target model/rule 201 may be the self-attention model of this application, and the self-attention model may include networks such as deep convolutional neural networks (DCNN) and recurrent neural networks (RNN). The neural networks mentioned in this application may include various types, such as deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), residual networks, or other neural networks.
In the training phase, the database 230 may be used to store a sample set for training. The execution device 210 generates the target model/rule 201 for processing samples, and iteratively trains the target model/rule 201 with the sample set in the database to obtain a mature target model/rule 201, where the target model/rule 201 is embodied as a neural network. The neural network obtained by the execution device 210 may be applied in different systems or devices.
In the inference phase, the execution device 210 may call data, code, and the like in the data storage system 250, and may also store data, instructions, and the like into the data storage system 250. The data storage system 250 may be placed in the execution device 210, or the data storage system 250 may be an external memory relative to the execution device 210. The computing module 211 may process the samples obtained by the execution device 210 through the neural network to obtain a prediction result, and the specific form of the prediction result is related to the function of the neural network.
It should be noted that FIG. 2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of this application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 2 the data storage system 250 is an external memory relative to the execution device 210; in other scenarios, the data storage system 250 may also be placed in the execution device 210.
The target model/rule 201 obtained by training with the training module 202 may be applied in different systems or devices, for example in a mobile phone, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or it may be a server, a cloud device, or the like.
In this embodiment of this application, the target model/rule 201 may be the self-attention model of this application. Specifically, the self-attention model provided by the embodiments of this application may include networks such as CNN, deep convolutional neural networks (DCNN), and recurrent neural networks (RNN).
Referring to FIG. 3, an embodiment of this application further provides a system architecture 300. The execution device 210 is implemented by one or more servers and, optionally, cooperates with other computing devices, such as data storage, routers, and load balancers. The execution device 210 may be arranged at one physical site or distributed over multiple physical sites. The execution device 210 may use the data in the data storage system 250, or call the program code in the data storage system 250, to implement the steps of the training method for a computing device corresponding to FIG. 12 below in this application.
Users may operate their respective user devices (for example, a local device 301 and a local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, a computer workstation, a smartphone, a tablet, a smart camera, a smart car or another type of cellular phone, a media consumption device, a wearable device, a set-top box, a game console, and so on.
The local device of each user may interact with the execution device 210 through a communication network of any communication mechanism/communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or the like, or any combination thereof. Specifically, the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network. The wireless network includes, but is not limited to, any one or a combination of a fifth-generation mobile communication technology (5th-Generation, 5G) system, a long term evolution (LTE) system, a global system for mobile communication (GSM), a code division multiple access (CDMA) network, a wideband code division multiple access (WCDMA) network, wireless fidelity (WiFi), Bluetooth, Zigbee, radio frequency identification (RFID), long range (Lora) wireless communication, and near field communication (NFC). The wired network may include an optical fiber communication network, a network composed of coaxial cables, or the like.
In another implementation, one or more aspects of the execution device 210 may be implemented by each local device; for example, the local device 301 may provide local data for the execution device 210 or feed back computation results. The local device may also be referred to as a computing device.
It should be noted that all the functions of the execution device 210 may also be implemented by a local device. For example, the local device 301 implements the functions of the execution device 210 and provides services for its own user, or provides services for the user of the local device 302.
Some commonly used SA models are usually obtained by stacking multiple SA modules, with each SA module having its own parameters. For natural-language representation, such SA models learn more abstract semantic information by stacking multiple layers, which often results in parameter redundancy and tends to lower the efficiency of training and inference. Moreover, although an SA model can capture global information, local information fusion is ignored, so the actual meaning expressed by the text cannot be learned.
Therefore, this application proposes a self-attention model and a natural language processing method based on the self-attention model, which, during natural language processing, reuse the hidden states of the previous layer in each network layer of the self-attention model, so as to interpret more efficiently the better semantic information of each word in its corpus. In addition, by fusing global information and local information, the semantic information of each word in the text can be interpreted, that is, the contextual semantics expressed by each word in the corpus in which it is located can be interpreted more accurately.
The self-attention model provided by this application and the natural language processing method based on the self-attention model are described in detail below.
First, the self-attention model provided by this application is introduced.
The input of the self-attention model provided by this application is an input sequence, and the output is an output sequence obtained by performing self-attention processing on the input sequence. As shown in FIG. 4A, the input sequence may be a sequence composed of the vectors corresponding to a piece of corpus, where each word in the corpus corresponds to one vector, and one or more vectors form a sequence. The input sequence is fed into the self-attention model 401, and the self-attention model 401 then interprets, according to the dependencies within the input sequence, the semantics of each word of the corresponding corpus within the corpus, thereby obtaining the output sequence.
For example, the self-attention model can effectively encode words into several vector representations by computing the dependencies between words, so that the output word-vector representation contains the word's semantic information in the corpus, combined with the contextual information of the corpus, making the semantics represented by the word-vector representation more accurate; this word-vector representation is also called a hidden state in deep learning. As shown in FIG. 4B, each word in the corpus "你的手机很不错。" corresponds to an initial vector, forming the sequence [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8]; for example, "你" corresponds to the initial vector x_1, "的" corresponds to the initial vector x_2, and so on. After encoding by the self-attention model, the output vector corresponding to each input vector is produced, yielding the output sequence [h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8]. Usually the length of the output sequence is the same as that of the input sequence, or there is a corresponding mapping relationship between them. This can be understood as follows: the self-attention model can interpret the semantics of each word in the corpus according to the similarity or dependency between each vector and the neighboring vectors, thereby obtaining an output sequence that represents the semantics of each word in the corpus, further optimizing the initial vector of each word into a better vector representation.
Therefore, in the embodiments of this application, an ODE-ized SA module is introduced into the self-attention model, so that self-attention is computed by means of an ODE. Compared with the SA-module stacking mechanism of some existing solutions, the self-attention model provided by this application introduces an ODE-ized SA module, which has higher parameter efficiency and improves the efficiency of training and inference of the self-attention model.
For example, referring to FIG. 5, a schematic structural diagram of a self-attention model provided by this application is described as follows.
The self-attention model may include multiple network layers, and each network layer includes one or more self-attention (SA) modules. For ease of distinction, in the following embodiments of this application, these multiple network layers are referred to as the SA network.
Specifically, the input of the first network layer is the input sequence, and the input sequence includes the initial vector representation corresponding to each word in the corpus to be processed (also referred to as the first corpus). The input of each network layer after the first layer is the output of the previous network layer. The last network layer outputs the output result of the SA network.
More specifically, in the SA network of the self-attention model provided by the embodiments of this application, each layer of the network includes one or more SA modules. When a layer of the network includes multiple SA modules, the input end of each SA module is connected to one or more SA modules, and the input of each SA module includes the outputs of the one or more SA modules connected to its input end.
Alternatively, in some possible scenarios, in addition to being connected to one or more SA modules, the input end of each SA module may also be connected to the output end of the previous layer of the network or to the input end of the SA network, that is, the input of each SA module may also include the input of the current network layer. Generally, if the current network layer is the first layer, the input of the current network layer is the input sequence; if the current network layer is not the first layer, the input of the current network layer is the output of the previous layer. When the input of a certain SA module includes multiple sequences (each sequence including one or more vectors), the multiple sequences may be fused to obtain the input of the current SA module. For example, if the input of a certain SA module includes the outputs of multiple SA modules, the outputs of the multiple SA modules may be fused to obtain the input of the current SA module; or, when the input of a certain SA module includes the outputs of one or more SA modules as well as the output of the previous layer of the network, the outputs of the one or more SA modules and the output of the previous layer are fused to obtain the input of the current SA module.
For ease of understanding, the structure of the SA network may be, for example, as shown in FIG. 6A, where the SA network includes i+1 network layers. Except for the first layer, each network layer can reuse the hidden states inside the previous network layer, thereby combining the hidden states inside the previous network layer with the computation results of the SA modules of the current layer to obtain the output result of the current layer. The right side of FIG. 6A is a schematic structural diagram of one of the network layers, in which the ODE-ized i-th layer contains several SA modules and the connections between them. Each SA module usually has the same architecture and parameters, and each connection carries a weight (a_1, a_2, a_3, ... as shown in FIG. 6A). These weights are also parameters of the ODE-ized SA network and are learned during training together with the parameters of the SA modules.
In more detail, as shown in FIG. 6B, the specific structure of the SA network is described by taking the first network layer and the second network layer as examples.
After the input sequence is obtained, the input sequence can be used as the input of each SA module in the first-layer network (that is, network layer 1 shown in FIG. 6B).
In addition, the input of the first SA module in the first network layer may be the input sequence, and the inputs of the SA modules other than the first SA module may include, in addition to the input sequence, the output of the preceding SA module connected to them. When the input of an SA module includes the input sequence and the output of the preceding connected SA module, the input sequence and the output of the preceding connected SA module may be fused to obtain a new vector as the input of the current SA module.
By fusing the outputs of all the SA modules in the first network layer, the output result of the first network layer is obtained.
The output result of the first-layer network serves as the input of the second-layer network (that is, network layer 2 shown in FIG. 6B), and the computation process of each SA module in the second-layer network is similar to that of each SA module in the first-layer network. In addition, the output of each SA module in the first-layer network is also used as an input of the second-layer network, that is, the hidden states obtained in the first-layer network are used as inputs of the second-layer network. The second-layer network fuses the outputs of the SA modules in the first-layer network to obtain a first sequence, and then fuses this first sequence with the outputs of the SA modules in the second-layer network to obtain the output of the second-layer network.
Therefore, in the self-attention model provided by this application, not only is the ODE mechanism introduced into the self-attention model, but each layer of the network can also reuse the hidden states of the previous layer, so that self-attention can be computed more accurately, the training efficiency of the self-attention model is improved, the self-attention model can converge quickly, and the output accuracy of the self-attention model can also be improved.
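For illustration only, the layer-level hidden-state reuse described above might be sketched as follows; this is a simplified sketch under assumptions (the weighted-sum fusion, module count, and class names are illustrative and not the exact computation of the embodiments):

```python
import torch
import torch.nn as nn

class ODESALayer(nn.Module):
    """One SA-network layer: runs several SA modules that share parameters and
    fuses their outputs with the hidden states reused from the previous layer."""

    def __init__(self, sa_module: nn.Module, num_modules: int = 3):
        super().__init__()
        self.sa = sa_module                                  # shared SA module (same parameters reused)
        self.num_modules = num_modules
        # One learnable connection weight per SA-module output (a_1, a_2, ...).
        self.conn_weights = nn.Parameter(torch.ones(num_modules))
        self.prev_weight = nn.Parameter(torch.tensor(1.0))   # weight on reused previous-layer states

    def forward(self, layer_input, prev_hidden_states=None):
        hidden_states, h = [], layer_input
        for _ in range(self.num_modules):
            # Each module's input fuses the preceding module's output with the layer input.
            h = self.sa(h) + layer_input
            hidden_states.append(h)
        # Fuse (weighted sum) the SA-module outputs of this layer.
        out = sum(w * s for w, s in zip(self.conn_weights, hidden_states))
        # Reuse the hidden states produced inside the previous layer, if any.
        if prev_hidden_states is not None:
            out = out + self.prev_weight * sum(prev_hidden_states) / len(prev_hidden_states)
        return out, hidden_states
```

A following layer would then be called with both `out` and `hidden_states`, so the internal states of the previous layer are reused rather than recomputed.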
More specifically, the computation flow inside each SA module is described by way of example below.
The SA module may be used to compute an association degree based on the input vectors, that is, the degree of association between each word in the corpus and one or more adjacent words, and then fuse the input vectors with the association degree to obtain the output result of the SA module. Various algorithms may be used to compute the association degree, such as multiplication and transposed multiplication, and an algorithm suited to the actual application scenario may be selected.
For example, in the process of converting the sequence [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8] into the output sequence [h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8], if the length of the input sequence is N and the dimension of each word (that is, each initial vector) is d, the input sequence of vectors forms an N×d matrix X. This matrix X is multiplied (that is, linearly transformed) by three matrices W_K, W_V, and W_Q respectively, yielding three matrices K, V, and Q, which serve as the input for computing self-attention (that is, the association degree). When computing self-attention, the product of K and Q is computed, yielding an N×N attention matrix that represents the dependencies between the elements of the input sequence, that is, the association degrees. Finally, this matrix is multiplied with V and processed with softmax, and converted into an N×d sequence representation containing N d-dimensional vectors, which is the output result of the SA module. The self-attention model fuses into h_i the similarity information between the word vector x_i of the input sequence and all the other word vectors of the input sequence, that is, h_i depends on the information of every input word vector of the sentence. In this sense, the output sequence is said to contain global information: the vector representations of the sequence learned in the output sequence can capture long-distance dependencies, and the training and inference processes of the SA model have better parallelism, thereby achieving more efficient training and inference of the SA model.
Optionally, in the SA model provided by this application, the parameters of the SA modules in each network layer are the same, or the parameters of every SA module in the SA model are the same, and so on. Therefore, the SA model provided by this application has higher parameter efficiency, improves the training speed of the model, occupies less storage, and improves the generalization ability of the model.
In addition, the self-attention model may also include a feature extraction network for extracting local features of the input sequence, that is, extracting features from the input sequence in units of several adjacent initial vector representations to obtain a local feature sequence, and the local feature sequence may be used as the input of the aforementioned multi-layer network. It can be understood that, in the embodiments of this application, the feature extraction network performs feature extraction in units of several adjacent vectors, thereby extracting vectors that contain local information.
Therefore, in the embodiments of this application, local feature extraction can be used to attend to the local information of the input sequence and extract the detailed information in the input sequence, thereby obtaining an output sequence that can represent the semantics more accurately.
Specifically, the feature extraction network may be composed of multiple convolution kernels, and the width of a convolution kernel is usually the length of the unit for feature extraction. For example, if the width of a convolution kernel is 3, it means that during feature extraction, features can be extracted in units of 3 adjacent initial vectors, that is, 3 adjacent initial vectors are used as the input of the convolution kernel, and the extracted local feature sequence is output.
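A sketch of such a feature extraction network is given below, assuming a 1-D convolution of kernel width 3 applied over the sequence of initial vectors (the channel sizes and padding choice are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LocalFeatureExtractor(nn.Module):
    """Extracts local features from the input sequence with kernels of width 3."""

    def __init__(self, d: int):
        super().__init__()
        # padding=1 keeps the output sequence the same length as the input sequence.
        self.conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, d) -> Conv1d expects (batch, channels, length)
        y = self.conv(x.t().unsqueeze(0))          # (1, d, N)
        return y.squeeze(0).t()                    # back to (N, d): the local feature sequence

x = torch.randn(8, 16)
print(LocalFeatureExtractor(16)(x).shape)          # torch.Size([8, 16])
```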
In addition, optionally, the self-attention model may also include a fusion network for fusing the local feature sequence with the output result of the multi-layer network, thereby obtaining the output sequence. It can be understood that the sequence output by the SA network carries global information, while the local feature sequence carries local information; after the global information and the local information are fused, an output sequence that can represent both global and local information is obtained.
For example, as shown in FIG. 7, the self-attention model may include a feature extraction network 701, an SA network 702, and a fusion network 703, where the SA network is the network shown in the aforementioned FIG. 5 to FIG. 6B.
The feature extraction network 701 may extract features from the input sequence, and then use the resulting local feature sequence as the input of the SA network 702.
The SA network 702 may be used to obtain an output result by computing the degree of association between each word in the corpus and at least one adjacent word. The input of the SA network 702 may be the local feature sequence output by the feature extraction network 701, and the SA network 702 may specifically compute, based on the local feature sequence, the degree of association between each word in the corpus and at least one adjacent word, thereby obtaining the output result; refer to the related descriptions in the aforementioned FIG. 5 to FIG. 6B, which will not be repeated here.
The fusion network 703 may fuse the output result of the SA network 702 with the local feature sequence output by the feature extraction network 701 to obtain the output sequence.
As a specific fusion method, for example, the fusion network 703 may compute the similarity between the local feature sequence and the output result of the SA network 702, and then fuse the similarity with the local feature sequence to obtain the output sequence.
Therefore, in the embodiments of this application, global and local semantic information are fused to obtain more accurate contextual semantic information, so that the final output sequence can better represent the semantics of each word in the corpus, thereby improving the accuracy of downstream tasks.
In addition, the structure of another self-attention model may be as shown in FIG. 8. The local feature sequence extracted by the feature extraction network 701 may also not be used as the input of the SA network 702; that is, the input sequence is used as the input of the SA network 702, and the local feature sequence output by the feature extraction network 701 and the output result of the SA network 702 are used as the input of the fusion network 703 to obtain the output sequence. In other words, the fusion network fuses the local feature sequence and the output result of the SA network to obtain the output sequence, which is equivalent to the output sequence containing both global information and local information, thereby attending to both the global and the local semantics of the corpus and making the semantics interpreted in the output sequence more accurate.
It should be noted that the fusion mentioned in the preceding or following embodiments of this application may refer to operations such as multiplying two values, weighted fusion, or direct concatenation, and the specific fusion method may be selected according to the actual application scenario, which is not limited in this application. For example, after two sequences are obtained, the two sequences are multiplied to obtain the fused sequence. For another example, if two sequences with dimensions of 5 and 8 respectively are obtained, the two sequences may be directly concatenated to obtain a sequence with a dimension of 13. For yet another example, after two sequences are obtained, a corresponding weight value is assigned to each sequence, and a new sequence is obtained by weighted fusion.
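The three fusion options mentioned above could look roughly as follows (a sketch for illustration only; the shapes and weight values are arbitrary):

```python
import torch

a, b = torch.randn(8, 5), torch.randn(8, 5)

fused_mul = a * b                                        # element-wise multiplication
fused_cat = torch.cat([a, torch.randn(8, 8)], dim=-1)    # concatenating dims 5 and 8 gives dim 13
w1, w2 = 0.7, 0.3
fused_weighted = w1 * a + w2 * b                         # weighted fusion
```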
In addition, the self-attention model provided by this application may also include networks related to downstream tasks, such as neural machine translation, text classification, or pre-trained language models. Several such networks are introduced below by way of example.
Example 1: classification network
The self-attention model may further include a classification network, or the output of the self-attention model may be fed into a classification network to identify the category of the corpus corresponding to the input sequence.
For example, the structure of a self-attention model that includes a classification network may be as shown in FIG. 9, where an N-gram convolution layer is used to learn the local information of the input text representation matrix (that is, the input sequence). The N-gram convolution layer performs convolution computation on the input sequence representation. The N-gram convolution layer contains several convolution kernels, and each convolution kernel performs a convolution operation with each fixed-length sub-sequence (for example, N=3, where the sub-sequence is the initial vectors corresponding to every three adjacent words shown in FIG. 9). Finally, the values computed by all the convolution kernels are aggregated to obtain a new vector representation for each initial vector in the sequence. Because the width of each convolution kernel is limited, a hidden state can only be obtained by performing convolution computation on the representations of several adjacent elements. As shown in FIG. 9, each hidden state is represented by 2 or 3 adjacent vectors, so each hidden state can only contain the dependencies between that element and a limited number of adjacent elements, that is, local information.
As shown in FIG. 10, the ODE-ized SA module of this application is illustrated. The figure shows the ODE-ized SA computation of the first layer and the second layer: the first layer contains 4 SA modules and the second layer contains 3 SA modules, and the second layer also reuses the states of the previous layer. The architecture and parameters of every layer above the second layer are usually the same as those of the second layer, and every SA module is usually the same as well, as shown in the lower half of FIG. 10. In each layer of the figure, G() denotes the hidden state output by the corresponding layer. In FIG. 10, a_i and c_i are model parameters of the ODE-ized SA module, and r is a hyperparameter of the SA model, which is usually a fixed value. For example, the output of the SA network can be expressed in terms of K=(XW_K), Q=(XW_Q), and V=(XW_V), where X is the input sequence.
Next, for the fusion network, refer to FIG. 11. In the figure, h_L denotes the word-vector sequence containing local information output by the feature extraction network, and h_G denotes the word-vector sequence containing global information output by the ODE-ized SA network. The fusion network fuses the two through an attention mechanism and a gate mechanism to obtain a word-vector sequence that combines local information and global information, namely h_O. A specific fusion method may include performing a Euclidean attention operation on h_L and h_G, that is, computing the similarity between each pair of word vectors in h_L and h_G according to their Euclidean distance to obtain a similarity matrix E. Then E and h_L are matrix-multiplied, expressed as h_E = E·h_L, to obtain a new word-vector representation h_E. Finally, h_L and h_E are multiplied by two parameter matrices and passed through layer normalization (Layernorm) to obtain the final output h_O, expressed as h_O = layer_norm(W_L*h_L + W_E*h_E).
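A rough sketch of this fusion network is given below, following h_E = E·h_L and h_O = layer_norm(W_L*h_L + W_E*h_E); the exact form of the Euclidean similarity (here the softmax of the negative pairwise Euclidean distance) is an assumption, since the original formula is only described in prose:

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Fuses the local (h_L) and global (h_G) word-vector sequences into h_O."""

    def __init__(self, d: int):
        super().__init__()
        self.w_l = nn.Linear(d, d, bias=False)    # W_L
        self.w_e = nn.Linear(d, d, bias=False)    # W_E
        self.norm = nn.LayerNorm(d)

    def forward(self, h_l: torch.Tensor, h_g: torch.Tensor) -> torch.Tensor:
        # Similarity matrix E from pairwise Euclidean distances (assumed form), normalized by softmax.
        e = torch.softmax(-torch.cdist(h_g, h_l), dim=-1)    # (N, N)
        h_e = e @ h_l                                        # h_E = E . h_L
        return self.norm(self.w_l(h_l) + self.w_e(h_e))      # h_O

h_l, h_g = torch.randn(8, 16), torch.randn(8, 16)
print(FusionNetwork(16)(h_l, h_g).shape)                     # torch.Size([8, 16])
```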
The classification network may be used to identify the type of the sentence represented by the input vectors, or the classification of the nouns included in the sentence represented by the input vectors. For example, if the text corresponding to the input sequence is "Your mobile phone is very good", the category corresponding to the text is identified as "mobile phone".
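A classification head on top of the output sequence might be sketched as follows (the pooling choice and number of classes are illustrative assumptions, not part of the original description):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Maps the output sequence h_O to a category such as 'mobile phone'."""

    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(d, num_classes)

    def forward(self, h_o: torch.Tensor) -> torch.Tensor:
        pooled = h_o.mean(dim=0)                   # average the word vectors of the sequence
        return torch.softmax(self.fc(pooled), dim=-1)

print(ClassificationHead(16, num_classes=4)(torch.randn(8, 16)))  # probabilities over 4 classes
```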
Example 2: translation network
The self-attention model may also include a translation network. The output sequence serves as the input of the translation network, which outputs a sentence in a language different from that of the sentence corresponding to the input sequence; the corpus obtained after translation is referred to here as the second corpus. The language of the first corpus and the language of the second corpus are different.
For example, after the input sequence of the first corpus passes through the SA network, the SA network analyzes the meaning of each word in the first corpus, so as to obtain an output sequence that can represent the semantics of the first corpus. The output sequence is used as the input of the translation network to obtain the translation result corresponding to the first corpus. For example, if the text corresponding to the input sequence is "你的手机很不错", the SA network can analyze the semantics of each word of the text within the text, and the translation network then produces the corresponding English "Your cell phone is very nice".
The structure of a self-attention model including a translation network is similar to the structures shown in the aforementioned FIG. 9 to FIG. 11, the only difference being the structure of the translation network.
The classification network, translation network, and the like mentioned in this application may be deep convolutional neural networks (DCNN), recurrent neural networks (RNN), and so on. The neural networks mentioned in this application may include various types, such as deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), residual networks, or other neural networks.
Therefore, in the embodiments of this application, the ODE-ized self-attention model can be applied in a variety of scenarios and complete a variety of downstream tasks; it has strong generalization ability and can adapt to more scenarios.
In addition, when training the self-attention model provided by this application, algorithms such as BP or adjoint ODE may be used for the update. For example, taking the use of the adjoint ODE algorithm to update the self-attention model as an example, after the self-attention model completes one forward inference and obtains the inference result, the adjoint ODE algorithm may be used to obtain the gradient with a numerical solver, that is, the gradient is computed directly, without further back propagation, which greatly reduces memory consumption and gradient error.
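As an assumed sketch of how adjoint-based training might be wired up with an off-the-shelf solver (using the torchdiffeq library's odeint_adjoint; the dynamics function, time grid, and loss below are placeholders rather than the exact formulation of the embodiments):

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint   # adjoint ODE solver

class SADynamics(nn.Module):
    """dh/dt = f(t, h): a placeholder dynamics function standing in for an SA-style block."""

    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, t, h):
        return torch.tanh(self.proj(h))

func = SADynamics(16)
h0 = torch.randn(8, 16)                             # initial hidden states (the input sequence)
t = torch.tensor([0.0, 1.0])

h1 = odeint(func, h0, t)[-1]                        # forward solve of the ODE
loss = h1.pow(2).mean()                             # placeholder loss
loss.backward()                                     # gradients obtained via the adjoint method, not stored activations
```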
The structure of the self-attention model provided by this application has been introduced above. Based on the self-attention model provided in the aforementioned FIG. 4A to FIG. 11, the natural language processing method provided by this application is described in detail below.
Referring to FIG. 12, a schematic flowchart of a natural language processing method provided by this application is as follows.
1201. Obtain an input sequence.
The input sequence includes the initial vector representation corresponding to each word in the corpus.
For example, after a sentence of corpus (or a piece of text) is obtained, each word in the corpus is converted into an initial vector representation according to a preset mapping relationship; each word corresponds to one initial vector representation, and one or more initial vector representations form an input sequence. For example, for a piece of corpus to be processed, "今天天气怎样" ("How is the weather today"), a mapping table may be set up in advance, in which each word corresponds to one vector, for example "今" corresponds to the vector x_1, "天" corresponds to the vector x_2, and so on.
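A minimal sketch of step 1201 is shown below, assuming the preset mapping relationship is a word-to-index table feeding an embedding lookup (the vocabulary and the embedding dimension are illustrative only):

```python
import torch
import torch.nn as nn

vocab = {"今": 0, "天": 1, "气": 2, "怎": 3, "样": 4}      # illustrative mapping table
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)

corpus = "今天天气怎样"
indices = torch.tensor([vocab[ch] for ch in corpus])
input_sequence = embed(indices)                     # one initial vector x_i per word
print(input_sequence.shape)                         # torch.Size([6, 16])
```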
1202、将输入序列作为自注意力模型的输入,得到输出序列。1202. Use the input sequence as the input of the self-attention model to obtain the output sequence.
其中,输出序列中包括经自注意力模型进行自然语言处理后的第一语料中的至少一个词对应的向量表示,该输出序列中的每个向量表示第一语料中的各个词在第一语料中的语义信息,其语义信息中结合了第一语料的上下文信息,能够准确表示出各个词在第一语料中的准确含义。The output sequence includes a vector representation corresponding to at least one word in the first corpus after natural language processing by the self-attention model, and each vector in the output sequence indicates that each word in the first corpus is in the first corpus. The semantic information in the first corpus is combined with the context information of the first corpus, which can accurately represent the exact meaning of each word in the first corpus.
通常,自注意力模型包括多层网络,该多层网络中的任意一层网络的输入为上一层网络的输出,每层网络包括多个自注意力模块,每个自注意力模块用于基于输入的向量来计算表示第一语料中每个词和相邻的至少一个词之间的关联程度的关联度,并融合关联度和输入的向量,得到第一序列,融合多个自注意力模块输出的第一序列和上一层网络中的多个自注意力模块输出的第一序列,即可得到每层网络的输出。Generally, the self-attention model includes a multi-layer network, the input of any layer of the multi-layer network is the output of the previous layer network, each layer of the network includes multiple self-attention modules, and each self-attention module is used for Calculate the degree of association representing the degree of association between each word in the first corpus and at least one adjacent word based on the input vector, and fuse the degree of association and the input vector to obtain the first sequence and fuse multiple self-attentions The output of each layer of network can be obtained from the first sequence output by the module and the first sequence output by multiple self-attention modules in the previous layer of network.
更具体地,本步骤中所提及的自注意力模型可以参阅前述图4A-图11中的相关描述,此处不再赘述。More specifically, for the self-attention model mentioned in this step, reference may be made to the relevant descriptions in FIG. 4A to FIG. 11 , which will not be repeated here.
因此,在本申请实施方式中,通过神经常微分方程网络的机制来实现自注意力模型,通过一层自注意力模块的神经常微分化(即自注意力模型中的多次调用和求和),实现多层自注意力的拟合,得到能更准确表达语义的输出序列。并且,将状态复用机制引入神经常微分方程,在神经常微分方程对每一层的拟合过程中(第一层除外)复用前一层内部的隐状态信息,从而提高计算速度。即当前网络层可以复用上一层网络的SA模块的输出,从而使自注意力模型可以快速得到更准确的输出结果,降低了模型的计算复杂度,提高自注意力模型的训练以及推理的效率。Therefore, in the embodiment of the present application, the self-attention model is realized through the mechanism of the neural network of frequent differential equations, and the neural differentiation through a layer of self-attention modules (that is, multiple calls and summations in the self-attention model) ), to achieve multi-layer self-attention fitting, and obtain an output sequence that can more accurately express semantics. In addition, the state reuse mechanism is introduced into the normal differential equation, and the hidden state information inside the previous layer is reused during the fitting process of the normal differential equation to each layer (except the first layer), thereby improving the calculation speed. That is, the current network layer can reuse the output of the SA module of the previous layer of the network, so that the self-attention model can quickly obtain more accurate output results, reduce the computational complexity of the model, and improve the training and inference of the self-attention model. efficiency.
Specifically, if the self-attention model includes the SA network but does not include a feature extraction network or a fusion network, then once the input sequence has been obtained it can be fed directly into the SA network to obtain the output sequence. For example, the self-attention model may be the one shown in FIG. 5 to FIG. 6B: after the input sequence is obtained, it is input into the SA network, and the SA network outputs the corresponding output sequence.
If the self-attention model includes a feature extraction network and a fusion network in addition to the SA network, as shown in FIG. 7, the input sequence can be used as the input of the feature extraction network to extract a local feature sequence of the input sequence; the local feature sequence is then used as the input of the SA network, the output of the SA network and the local feature sequence are input into the fusion network, and the fusion network fuses them to obtain the final output sequence. Alternatively, as shown in FIG. 8, the local feature sequence extracted by the feature extraction network may not be used as the input of the SA network; instead, the input sequence itself is fed into the SA network, and the fusion network fuses the output of the SA network with the local feature sequence to obtain the final output sequence.
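Continuing the sketch above (and reusing its `self_attention`, `softmax`, and `x`), the data flow just described could look roughly as follows; the window size, the gating rule used for fusion, and the choice to feed the local features into the SA network (the FIG. 7 variant) are assumptions for illustration only.

```python
def extract_local_features(x, window=3):
    """Average each vector with its neighbours: a minimal stand-in for a feature
    extraction network that works on adjacent initial vector representations."""
    pad = window // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[i:i + window].mean(axis=0) for i in range(len(x))])

def fuse(local, sa_out):
    """Gate the two sequences with a per-word similarity score (a sigmoid of their
    dot product), a simple stand-in for the fusion network."""
    gate = 1.0 / (1.0 + np.exp(-(local * sa_out).sum(axis=-1, keepdims=True)))
    return gate * local + (1.0 - gate) * sa_out

local = extract_local_features(x)             # local feature sequence
sa_out, _ = self_attention(local)             # SA network consumes the local features
output_sequence = fuse(local, sa_out)         # fusion network produces the output sequence
print(output_sequence.shape)                  # (6, 4)
```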
Therefore, in the embodiments of the present application, local features can be extracted from the input sequence and fused with the output of the SA network to obtain the output sequence, so that the output sequence incorporates the local features. As a result, the final self-attention result pays attention to local features, producing a sequence that represents the semantics of each word more accurately.

In one possible scenario, the self-attention modules within each layer of the aforementioned SA network share the same parameters, or every self-attention module in the entire SA network shares the same parameters. Therefore, in the embodiments of the present application, the self-attention model can be implemented with a single set of SA-module parameters, so that the model occupies less storage and both training and forward inference become more efficient. By introducing the ODE into the self-attention mechanism and implementing the self-attention model by means of the ODE, the parameter redundancy caused by stacking many layers in some SA models is resolved: with the parameter count of a single layer, the effect of the original multi-layer network can be achieved.
In addition, before step 1201, the self-attention model may be trained with a training set. The training set may include at least one corpus; each corpus includes at least one word and a corresponding label, and each label may include vectors that express the semantics of each word in that corpus. For example, the corpora in the training set can be used as the input of the self-attention model to obtain the model's inference results, the gradients can then be computed with the adjoint ODE algorithm, and the parameters of the self-attention model can be updated based on these gradients so that the model's output becomes closer to the labels of the corpora. Therefore, in the embodiments of the present application, a numerical method is used to obtain the gradients, which greatly reduces memory consumption and gradient error compared with the backpropagation algorithm.
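A highly simplified, self-contained training-loop sketch follows. Here `adjoint_gradients` is only a hypothetical placeholder for the adjoint ODE gradient computation mentioned above (approximated by finite differences purely so the sketch runs), and the toy model, labels, learning rate, and loss are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x = rng.normal(size=(6, d))                  # one training corpus as word vectors
target = rng.normal(size=(6, d))             # its label: the desired semantic vectors

def model_forward(params, x):
    """Toy stand-in for the self-attention model: a single linear map."""
    return x @ params

def loss_fn(params, x, target):
    return ((model_forward(params, x) - target) ** 2).mean()

def adjoint_gradients(params, x, target, eps=1e-5):
    """Hypothetical placeholder for the adjoint ODE gradient computation. It is
    approximated numerically here so the sketch is runnable; a real implementation
    would integrate an adjoint state backwards instead of storing activations."""
    base = loss_fn(params, x, target)
    grads = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        bumped = params.copy()
        bumped[idx] += eps
        grads[idx] = (loss_fn(bumped, x, target) - base) / eps
    return grads

params = rng.normal(scale=0.1, size=(d, d))
for step in range(200):                      # push the model output toward the label
    params -= 0.1 * adjoint_gradients(params, x, target)
print(round(float(loss_fn(params, x, target)), 4))
```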
The self-attention model and the natural language processing method provided by the present application have been described in detail above. The apparatus that carries the self-attention model or performs the foregoing natural language processing method is described in detail below.

Referring to FIG. 13, the present application provides a natural language processing apparatus, including:

an acquisition module 1301, configured to acquire an input sequence, where the input sequence includes an initial vector representation corresponding to at least one word in a first corpus;

a processing module 1302, configured to use the input sequence as the input of a self-attention model to obtain an output sequence, where the output sequence includes a vector representation corresponding to the at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents the semantic information of each word of the first corpus within the first corpus;

where the self-attention model includes a multi-layer network, the input of any layer of the multi-layer network is the output of the previous layer, each layer includes multiple self-attention modules, each self-attention module is configured to compute an association degree based on the input vectors and to fuse the association degree with the input vectors to obtain a first sequence, the output of each layer is obtained by fusing the first sequences output by its multiple self-attention modules with the first sequences output by the multiple self-attention modules of the previous layer, and the association degree represents the degree of association between each word in the first corpus and at least one adjacent word.
In a possible implementation, the self-attention model further includes a feature extraction network, configured to extract features from the input sequence in units of multiple adjacent initial vector representations to obtain a local feature sequence, where the local feature sequence is used as the input of the multi-layer network.

In a possible implementation, the self-attention model further includes a fusion network, configured to fuse the local feature sequence with the output result of the multi-layer network to obtain the output sequence.

In a possible implementation, the fusion network is specifically configured to: calculate the similarity between the local feature sequence and the output result, and fuse the similarity with the local feature sequence to obtain the output sequence.

In a possible implementation, the self-attention model further includes a classification network; the input of the classification network is the output sequence, and the classification network outputs the category corresponding to the first corpus.

In a possible implementation, the self-attention model further includes a translation network; the language of the first corpus is a first language, the input of the translation network is the output sequence, and the translation network outputs a second corpus whose language is a second language, the first language and the second language being different languages.

In a possible implementation, the parameters of the multiple self-attention modules in each layer of the multi-layer network are the same.

In addition, for the self-attention model mentioned above, reference may be made to the related descriptions of FIG. 4A to FIG. 12, which are not repeated here.
Referring to FIG. 14, a schematic structural diagram of another natural language processing apparatus provided by the present application is described below.

The natural language processing apparatus may include a processor 1401 and a memory 1402. The processor 1401 and the memory 1402 are interconnected by wires, and the memory 1402 stores program instructions and data.

The memory 1402 stores the program instructions and data corresponding to the steps of the foregoing FIG. 4A to FIG. 12.

The processor 1401 is configured to perform the method steps performed by the natural language processing apparatus shown in any one of the foregoing embodiments of FIG. 4A to FIG. 12.

Optionally, the natural language processing apparatus may further include a transceiver 1403, configured to receive or send data.
An embodiment of the present application further provides a computer-readable storage medium storing a program which, when run on a computer, causes the computer to perform the steps of the method described in the embodiments shown in the foregoing FIG. 4A to FIG. 12.
Optionally, the natural language processing apparatus shown in the foregoing FIG. 14 is a chip.

An embodiment of the present application further provides a natural language processing apparatus, which may also be referred to as a digital processing chip or simply a chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the method steps performed by the natural language processing apparatus shown in any one of the foregoing embodiments of FIG. 4A to FIG. 12.

An embodiment of the present application further provides a digital processing chip. The digital processing chip integrates circuits and one or more interfaces for implementing the foregoing processor 1401 or the functions of the processor 1401. When a memory is integrated in the digital processing chip, the digital processing chip can perform the method steps of any one or more of the foregoing embodiments. When no memory is integrated in the digital processing chip, it can be connected to an external memory through a communication interface, and the digital processing chip then implements the actions performed by the natural language processing apparatus in the foregoing embodiments according to the program code stored in the external memory.

An embodiment of the present application further provides a computer program product which, when run on a computer, causes the computer to perform the steps performed by the natural language processing apparatus in the method described in the embodiments shown in the foregoing FIG. 4A to FIG. 12.

The natural language processing apparatus provided by the embodiments of the present application may be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute the computer-executable instructions stored in a storage unit, so that the chip in the server performs the natural language processing method described in the embodiments shown in FIG. 4A to FIG. 12. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).

Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
By way of example, referring to FIG. 15, which is a schematic structural diagram of a chip provided by an embodiment of the present application, the chip may be implemented as a neural-network processing unit NPU 150. The NPU 150 is mounted on a host CPU as a coprocessor, and tasks are assigned by the host CPU. The core part of the NPU is the operation circuit 1503, and the controller 1504 controls the operation circuit 1503 to fetch matrix data from memory and perform multiplication operations.

In some implementations, the operation circuit 1503 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 1503 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1503 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 1502 and buffers it on each PE in the operation circuit. The operation circuit fetches the data of matrix A from the input memory 1501 and performs the matrix operation with matrix B, and the partial or final results of the resulting matrix are stored in the accumulator 1508.
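The accumulation of partial results described above can be pictured with the following sketch, which multiplies tiles of A and B and adds each partial product into C the way the accumulator collects partial results; the tile size and matrix shapes are invented for the example and do not reflect the actual circuit.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 6))                  # input matrix A (from the input memory)
B = rng.normal(size=(6, 4))                  # weight matrix B (from the weight memory)
tile = 2                                     # illustrative tile size along the shared dimension

C = np.zeros((A.shape[0], B.shape[1]))       # plays the role of the accumulator
for k in range(0, A.shape[1], tile):
    # Multiply one slice of A with the matching slice of B and add the partial
    # result into the accumulator, mirroring how partial results are collected.
    C += A[:, k:k + tile] @ B[k:k + tile, :]

assert np.allclose(C, A @ B)                 # accumulated partial results equal the full product
```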
The unified memory 1506 is used to store input data and output data. The weight data is transferred into the weight memory 1502 through the direct memory access controller (DMAC) 1505, and the input data is also transferred into the unified memory 1506 through the DMAC.

The bus interface unit (BIU) 1510 is used for the interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1509.

The bus interface unit 1510 is used by the instruction fetch buffer 1509 to obtain instructions from the external memory, and is also used by the direct memory access controller 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1506, to transfer the weight data to the weight memory 1502, or to transfer the input data to the input memory 1501.

The vector computation unit 1507 includes multiple operation processing units and, where necessary, further processes the output of the operation circuit, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for the computations of non-convolutional/non-fully-connected layers of a neural network, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector computation unit 1507 can store the processed output vectors in the unified memory 1506. For example, the vector computation unit 1507 can apply a linear function and/or a nonlinear function to the output of the operation circuit 1503, for example performing linear interpolation on the feature planes extracted by a convolutional layer, or applying such a function to a vector of accumulated values to generate activation values. In some implementations, the vector computation unit 1507 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vectors can be used as activation inputs to the operation circuit 1503, for example for use in subsequent layers of the neural network.
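As a loose illustration of this post-processing role (the choice of a ReLU activation and a per-feature normalization is an assumption made for the sketch, not the chip's actual function set):

```python
import numpy as np

rng = np.random.default_rng(3)
matmul_output = rng.normal(size=(8, 4))      # stand-in for the operation circuit's output

# Element-wise nonlinear activation, producing activation values for a later layer.
activated = np.maximum(matmul_output, 0.0)   # ReLU

# Simple per-feature normalization, standing in for batch normalization.
mean = activated.mean(axis=0, keepdims=True)
std = activated.std(axis=0, keepdims=True) + 1e-6
normalized = (activated - mean) / std

print(normalized.shape)                      # (8, 4): ready to be fed back as activation input
```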
The instruction fetch buffer 1509 connected to the controller 1504 is used to store the instructions used by the controller 1504.

The unified memory 1506, the input memory 1501, the weight memory 1502, and the instruction fetch buffer 1509 are all on-chip memories. The external memory is private to this NPU hardware architecture.

The operations of each layer in a recurrent neural network can be performed by the operation circuit 1503 or the vector computation unit 1507.

The processor mentioned in any of the foregoing may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the foregoing methods of FIG. 4A to FIG. 12.
It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of a given embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, the connection relationships between modules indicate that they have communication connections between them, which may specifically be implemented as one or more communication buses or signal lines.

From the description of the foregoing implementations, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also vary, for example analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is the preferable implementation in most cases. Based on such an understanding, the technical solutions of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.

The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.

The terms "first", "second", "third", "fourth", and so on (if any) in the specification, the claims, and the foregoing drawings of the present application are used to distinguish between similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Moreover, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.

Finally, it should be noted that the above are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement that can readily be conceived by a person skilled in the art within the technical scope disclosed by the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A natural language processing method, comprising:
    acquiring an input sequence, wherein the input sequence comprises an initial vector representation corresponding to at least one word in a first corpus;
    using the input sequence as an input of a self-attention model to obtain an output sequence, wherein the output sequence comprises a vector representation corresponding to the at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents semantic information of each word of the first corpus within the first corpus;
    wherein the self-attention model comprises a multi-layer network, an input of any layer of the multi-layer network is an output of a previous layer, each layer comprises a plurality of self-attention modules, each self-attention module is configured to compute an association degree based on input vectors and to fuse the association degree with the input vectors to obtain a first sequence, an output of each layer is obtained by fusing the first sequences output by the plurality of self-attention modules of that layer with the first sequences output by the plurality of self-attention modules of the previous layer, and the association degree represents a degree of association between each word in the first corpus and at least one adjacent word.
2. The method according to claim 1, wherein the self-attention model further comprises a feature extraction network, and the method further comprises:
    extracting, by the feature extraction network, features according to a plurality of adjacent initial vector representations in the input sequence to obtain a local feature sequence, and using the local feature sequence as an input of the multi-layer network.
3. The method according to claim 2, wherein the self-attention model further comprises a fusion network, and the method further comprises:
    fusing, by the fusion network, the local feature sequence and an output result of the multi-layer network to obtain the output sequence.
4. The method according to claim 3, wherein the fusing, by the fusion network, the local feature sequence and the output result of the multi-layer network comprises:
    calculating, by the fusion network, a similarity between the local feature sequence and the output result, and fusing the similarity and the local feature sequence to obtain the output sequence.
5. The method according to any one of claims 1 to 4, wherein the self-attention model further comprises a classification network, and the method further comprises:
    using the output sequence as an input of the classification network, wherein the classification network outputs a category corresponding to the first corpus.
6. The method according to any one of claims 1 to 5, wherein the self-attention model further comprises a translation network, and the method further comprises:
    using the output sequence as an input of the translation network to output a second corpus, wherein a language of the first corpus is a first language, a language of the second corpus is a second language, and the first language and the second language are different languages.
7. The method according to any one of claims 1 to 6, wherein parameters of the plurality of self-attention modules in each layer of the multi-layer network are the same.
8. A natural language processing apparatus, comprising:
    an acquisition module, configured to acquire an input sequence, wherein the input sequence comprises an initial vector representation corresponding to at least one word in a first corpus;
    a processing module, configured to use the input sequence as an input of a self-attention model to obtain an output sequence, wherein the output sequence comprises a vector representation corresponding to the at least one word in the first corpus after natural language processing (NLP) by the self-attention model, and the output sequence represents semantic information of each word of the first corpus within the first corpus;
    wherein the self-attention model comprises a multi-layer network, an input of any layer of the multi-layer network is an output of a previous layer, each layer comprises a plurality of self-attention modules, each self-attention module is configured to compute an association degree based on input vectors and to fuse the association degree with the input vectors to obtain a first sequence, an output of each layer is obtained by fusing the first sequences output by the plurality of self-attention modules of that layer with the first sequences output by the plurality of self-attention modules of the previous layer, and the association degree represents a degree of association between each word in the first corpus and at least one adjacent word.
9. The apparatus according to claim 8, wherein the self-attention model further comprises a feature extraction network, configured to extract features according to a plurality of adjacent initial vector representations in the input sequence to obtain a local feature sequence, and to use the local feature sequence as an input of the multi-layer network.
10. The apparatus according to claim 9, wherein the self-attention model further comprises a fusion network, and the fusion network is configured to fuse the local feature sequence and an output result of the multi-layer network to obtain the output sequence.
11. The apparatus according to claim 10, wherein the fusion network is specifically configured to: calculate a similarity between the local feature sequence and the output result, and fuse the similarity and the local feature sequence to obtain the output sequence.
12. The apparatus according to any one of claims 8 to 11, wherein the self-attention model further comprises a classification network, an input of the classification network is the output sequence, and the classification network outputs a category corresponding to the first corpus.
13. The apparatus according to any one of claims 8 to 12, wherein
    the self-attention model further comprises a translation network, a language of the first corpus is a first language, an input of the translation network is the output sequence, the translation network outputs a second corpus, a language of the second corpus is a second language, and the first language and the second language are different languages.
14. The apparatus according to any one of claims 8 to 13, wherein parameters of the plurality of self-attention modules in each layer of the multi-layer network are the same.
15. A natural language processing apparatus, comprising a processor, wherein the processor is coupled to a memory, the memory stores a program, and when program instructions stored in the memory are executed by the processor, the method according to any one of claims 1 to 7 is implemented.
16. A computer-readable storage medium, comprising a program which, when executed by a processing unit, performs the method according to any one of claims 1 to 7.
17. A natural language processing apparatus, comprising a processing unit and a communication interface, wherein the processing unit obtains program instructions through the communication interface, and when the program instructions are executed by the processing unit, the method according to any one of claims 1 to 7 is implemented.
PCT/CN2022/071285 2021-01-20 2022-01-11 Method and device for natural language processing WO2022156561A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110077612.0 2021-01-20
CN202110077612.0A CN112883149B (en) 2021-01-20 2021-01-20 Natural language processing method and device

Publications (1)

Publication Number Publication Date
WO2022156561A1 true WO2022156561A1 (en) 2022-07-28

Family

ID=76051025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071285 WO2022156561A1 (en) 2021-01-20 2022-01-11 Method and device for natural language processing

Country Status (2)

Country Link
CN (1) CN112883149B (en)
WO (1) WO2022156561A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117436457A (en) * 2023-11-01 2024-01-23 人民网股份有限公司 Method, apparatus, computing device and storage medium for ironic recognition
CN117436457B (en) * 2023-11-01 2024-05-03 人民网股份有限公司 Irony identification method, irony identification device, computing equipment and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883149B (en) * 2021-01-20 2024-03-26 华为技术有限公司 Natural language processing method and device
CN113243886B (en) * 2021-06-11 2021-11-09 四川翼飞视科技有限公司 Vision detection system and method based on deep learning and storage medium
CN113657391A (en) * 2021-08-13 2021-11-16 北京百度网讯科技有限公司 Training method of character recognition model, and method and device for recognizing characters
CN116383089B (en) * 2023-05-29 2023-08-04 云南大学 Statement level software defect prediction system based on ordinary differential equation diagram neural network
CN117033641A (en) * 2023-10-07 2023-11-10 江苏微皓智能科技有限公司 Network structure optimization fine tuning method of large-scale pre-training language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442720A (en) * 2019-08-09 2019-11-12 中国电子技术标准化研究院 A kind of multi-tag file classification method based on LSTM convolutional neural networks
CN110991190A (en) * 2019-11-29 2020-04-10 华中科技大学 Document theme enhanced self-attention network, text emotion prediction system and method
CN111382584A (en) * 2018-09-04 2020-07-07 腾讯科技(深圳)有限公司 Text translation method and device, readable storage medium and computer equipment
WO2020228376A1 (en) * 2019-05-16 2020-11-19 华为技术有限公司 Text processing method and model training method and apparatus
WO2020237188A1 (en) * 2019-05-23 2020-11-26 Google Llc Fully attentional computer vision
WO2020257812A2 (en) * 2020-09-16 2020-12-24 Google Llc Modeling dependencies with global self-attention neural networks
CN112883149A (en) * 2021-01-20 2021-06-01 华为技术有限公司 Natural language processing method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549646B (en) * 2018-04-24 2022-04-15 中译语通科技股份有限公司 Neural network machine translation system based on capsule and information data processing terminal
US11138392B2 (en) * 2018-07-26 2021-10-05 Google Llc Machine translation using neural network models
CN109543824B (en) * 2018-11-30 2023-05-23 腾讯科技(深圳)有限公司 Sequence model processing method and device
CN111368536A (en) * 2018-12-07 2020-07-03 北京三星通信技术研究有限公司 Natural language processing method, apparatus and storage medium therefor
CN110457713B (en) * 2019-06-19 2023-07-28 腾讯科技(深圳)有限公司 Translation method, device, equipment and storage medium based on machine translation model
CN110490717B (en) * 2019-09-05 2021-11-05 齐鲁工业大学 Commodity recommendation method and system based on user session and graph convolution neural network
CN111062203B (en) * 2019-11-12 2021-07-20 贝壳找房(北京)科技有限公司 Voice-based data labeling method, device, medium and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382584A (en) * 2018-09-04 2020-07-07 腾讯科技(深圳)有限公司 Text translation method and device, readable storage medium and computer equipment
WO2020228376A1 (en) * 2019-05-16 2020-11-19 华为技术有限公司 Text processing method and model training method and apparatus
WO2020237188A1 (en) * 2019-05-23 2020-11-26 Google Llc Fully attentional computer vision
CN110442720A (en) * 2019-08-09 2019-11-12 中国电子技术标准化研究院 A kind of multi-tag file classification method based on LSTM convolutional neural networks
CN110991190A (en) * 2019-11-29 2020-04-10 华中科技大学 Document theme enhanced self-attention network, text emotion prediction system and method
WO2020257812A2 (en) * 2020-09-16 2020-12-24 Google Llc Modeling dependencies with global self-attention neural networks
CN112883149A (en) * 2021-01-20 2021-06-01 华为技术有限公司 Natural language processing method and device


Also Published As

Publication number Publication date
CN112883149B (en) 2024-03-26
CN112883149A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
WO2020228376A1 (en) Text processing method and model training method and apparatus
WO2022156561A1 (en) Method and device for natural language processing
CN111368993B (en) Data processing method and related equipment
CN112487182A (en) Training method of text processing model, and text processing method and device
US20230229898A1 (en) Data processing method and related device
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
WO2021238333A1 (en) Text processing network, neural network training method, and related device
US20230117973A1 (en) Data processing method and apparatus
WO2022253074A1 (en) Data processing method and related device
WO2021129668A1 (en) Neural network training method and device
WO2023231794A1 (en) Neural network parameter quantification method and apparatus
CN113505883A (en) Neural network training method and device
WO2023236977A1 (en) Data processing method and related device
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
US20240046067A1 (en) Data processing method and related device
WO2023284716A1 (en) Neural network searching method and related device
WO2023020613A1 (en) Model distillation method and related device
CN116432019A (en) Data processing method and related equipment
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
WO2024001653A9 (en) Feature extraction method and apparatus, storage medium, and electronic device
CN116993185A (en) Time sequence prediction method, device, equipment and storage medium
CN113792537A (en) Action generation method and device
WO2023143262A1 (en) Data processing method and related device
CN116563920B (en) Method and device for identifying age in cabin environment based on multi-mode information
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22742040

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22742040

Country of ref document: EP

Kind code of ref document: A1