WO2023160472A1 - A model training method and related device - Google Patents

A model training method and related device

Info

Publication number
WO2023160472A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sequence
data sequence
data unit
unit
Prior art date
Application number
PCT/CN2023/076756
Other languages
English (en)
French (fr)
Inventor
李鹏飞
李良友
张檬
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023160472A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Definitions

  • This application relates to the field of artificial intelligence, in particular to a model training method and related equipment.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • In other words, artificial intelligence is the branch of computer science that attempts to understand the essence of intelligence and to produce a new class of intelligent machines that respond in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Sequence-to-sequence natural language generation is a very important direction in natural language processing tasks, and an encoder-decoder design framework is often used.
  • the task of sequence generation can be divided into autoregressive generation and non-autoregressive (parallel) generation.
  • Autoregressive generation means that, when generating the target sequence, the model first predicts the first character of the target sequence and then predicts the rest of the sequence step by step based on the sub-sequence generated so far.
  • Non-autoregressive generation refers to generating the complete target sequence in parallel during decoding, eliminating the step-by-step iterative process and thereby greatly reducing the waiting time for generating the target sequence. For tasks with high real-time requirements, such as translation and dialogue, non-autoregressive generation is becoming increasingly important.
  • Pre-training followed by fine-tuning is a standard paradigm for improving model performance.
  • However, the existing pre-training schemes only focus on left-to-right autoregressive generation; that is, only the above information of the data sequence can be seen during pre-training, so a model fine-tuned from such a pre-trained model adapts poorly to other types of sequence generation tasks.
  • Moreover, for large pre-trained models such as generative pre-trained transformer 3 (GPT-3) and Pangu, the number of model parameters has become larger and larger, and the cost of pre-training has become higher and higher. If one pre-training run can only adapt to a single downstream task, then a costly pre-training run has to be performed for each generation strategy, which consumes too many resources.
  • This application provides a model training method that does not need to pre-train a separate PLM for each type of sequence generation task, which greatly reduces the resources required for training the PLM (such as computing resources, storage resources, and time resources).
  • In a first aspect, the present application provides a model training method, the method comprising: obtaining a first embedding vector and a second embedding vector, where the first embedding vector corresponds to a first data sequence and the second embedding vector corresponds to a second data sequence; the second data sequence includes first sub-data, a masked data unit to be predicted, and second sub-data, where the first sub-data is located above the data unit to be predicted in the second data sequence (that is, it is the preceding context) and the second sub-data is located below the data unit to be predicted in the second data sequence (the following context); obtaining a hidden state through the encoder in the pre-trained language model (PLM) according to the first embedding vector; predicting the data unit to be predicted through the decoder in the PLM and the output layer of the decoder according to the first sub-data, the second sub-data, and the hidden state, so as to obtain a first predicted data unit; and updating the encoder and the decoder according to the difference between the first predicted data unit and the data unit to be predicted.
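  • The following is a minimal sketch of one such pre-training step; it is not taken from the patent, and the layer sizes, token ids, and the use of PyTorch modules are assumptions for illustration. It shows the encoder producing the hidden state, a bidirectional (non-causal) decoder that sees both the above and the below context of the masked position, and both modules being updated from the prediction error.

```python
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 1000, 64, 1

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
out_layer = nn.Linear(d_model, vocab_size)      # decoder output layer (softmax is inside the loss)
opt = torch.optim.Adam([*embed.parameters(), *encoder.parameters(),
                        *decoder.parameters(), *out_layer.parameters()], lr=1e-4)

src = torch.tensor([[5, 6, 7, 8]])              # first data sequence (encoder input)
tgt = torch.tensor([[9, 10, mask_id, 12]])      # first sub-data | [MASK] | second sub-data
gold = torch.tensor([11])                       # the data unit to be predicted

hidden = encoder(embed(src))                    # hidden state from the encoder
# No causal mask is passed, so the decoder attends to the above and the below
# context of the masked position at the same time (bidirectional decoder).
dec_out = decoder(embed(tgt), memory=hidden)
logits = out_layer(dec_out[:, 2, :])            # prediction at the masked position
loss = nn.functional.cross_entropy(logits, gold)  # difference vs. the unit to be predicted
loss.backward()
opt.step()                                      # update the encoder and the decoder
```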
  • In the embodiment of the present application, a pre-training architecture with an encoder and a bidirectional decoder is adopted, so the decoder can see the above and the below information at the same time during training.
  • Therefore, the PLM obtained by the training method in the embodiment of the present application adapts well to other types of sequence generation tasks (autoregressive: left to right or right to left; non-autoregressive: fully non-autoregressive, semi-non-autoregressive, and so on).
  • When fine-tuned for these other types of sequence generation tasks, the PLM obtained through the training method in the embodiment of the present application can achieve better model accuracy.
  • In one possible implementation, the method further includes: acquiring a first initial data sequence; and determining, by means of probability sampling, whether at least one data unit in the first initial data sequence is masked, so as to obtain the second data sequence, wherein the probability value obtained by the probability sampling is used as the probability that the at least one data unit is masked.
  • In one possible implementation, the method further includes: acquiring a second initial data sequence; and determining, by means of probability sampling, whether at least one data unit in the second initial data sequence is masked, so as to obtain the first data sequence; wherein, when performing the probability sampling, the probability that a data unit in the first initial data sequence is masked is greater than the probability that a data unit in the second initial data sequence is masked.
  • In the embodiment of the present application, a dynamically sampled mask probability may be used, where "dynamic" means that the probability with which each data unit in the data sequence is masked is not fixed.
  • For at least one data unit in the second initial data sequence (for example, the at least one data unit may be all data units in the second initial data sequence), a probability value is sampled within a certain probability interval for each data unit, and the probability value obtained by this probability sampling is used as the probability that the data unit is masked; for example, it may be compared with another probability value sampled from a different probability interval to determine whether to mask the data unit.
  • the embedding vector generated based on the first data sequence can be used as the input of the encoder in the PLM
  • the embedding vector generated based on the second data sequence can be used as the input of the decoder in the PLM.
  • The mask operations described above, performed respectively on the first data sequence and the second data sequence, may be called a dynamic dual-mask operation.
  • In this way, the input of the encoder and the input of the decoder are masked separately, and the pre-training of the encoder and the decoder can be completed simultaneously in the subsequent training process.
  • In addition, the dynamically sampled mask probability can prevent the mask probability from being too large, which would otherwise leave too little effective information in an entire batch during model training.
  • the probability that a data unit in the first initial data sequence is masked is greater than the probability that a data unit in the second initial data sequence is masked.
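  • The following sketch illustrates one possible form of the dynamic dual-mask operation described above; it is not taken from the patent, and the probability intervals, the "[MASK]" token, and the example sentences are illustrative assumptions.

```python
import random

def dynamic_mask(units, prob_interval):
    """Mask each data unit with a probability sampled from prob_interval."""
    lo, hi = prob_interval
    masked = []
    for u in units:
        p = random.uniform(lo, hi)          # dynamically sampled mask probability
        masked.append("[MASK]" if random.random() < p else u)
    return masked

second_initial = ["who", "are", "you"]       # second initial data sequence (encoder side)
first_initial = ["Wer", "bist", "du"]        # first initial data sequence (decoder side)

# The decoder-side units are masked with a higher probability than the
# encoder-side units, as required by the method above.
first_data_sequence = dynamic_mask(second_initial, (0.1, 0.3))   # -> encoder input
second_data_sequence = dynamic_mask(first_initial, (0.4, 0.8))   # -> decoder input
print(first_data_sequence, second_data_sequence)
```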
  • the PLM is used to realize the target task
  • the first data sequence (optionally, the first data sequence before the mask) can be the original data before executing the target task
  • The second data sequence (optionally, the second data sequence before the mask) may be the target data after the target task is executed.
  • the target task can be a translation task, a natural language generation task, etc.
  • the first data sequence and the second data sequence may constitute a training sample, and the PLM needs to generate the second data sequence based on the first data sequence.
  • the first data sequence and the second data sequence are data that have the same semantics and are expressed in different language types.
  • For example, the PLM can be used to realize a text summarization task; then the original data can be the source corpus from which a summary needs to be extracted, and the target data can be the summary text to be generated.
  • Similarly, the PLM can be used to implement a text question answering task; then the original data can be the source corpus to be answered, and the target data can be the answer content for the source corpus.
  • In one possible implementation, the method further includes: predicting the masked data units in the first data sequence through the output layer of the encoder in the PLM to obtain a second predicted data unit; and updating the encoder according to the difference between the second predicted data unit and the data unit before being masked in the first data sequence.
  • In the embodiment of the present application, an output layer similar to that of the decoder can be added on the output side of the encoder; for example, it can be composed of a fully connected layer and a softmax normalization layer, and it is used to predict the masked data units in the first data sequence.
  • Specifically, the masked data units in the first data sequence may be predicted through the output layer of the encoder in the PLM to obtain a second predicted data unit, and the encoder is updated according to the difference between the second predicted data unit and the data unit before being masked in the first data sequence.
  • the fully connected network at the output layer of the encoder can map the output of the encoder to a fixed dimension (the dimension of the vocabulary size), and then use the softmax normalization function to obtain the probability of the occurrence of the target word at each position.
  • the target word here may be a masked data unit (eg, the second predicted data unit) in the first data sequence.
  • The accuracy of the model's prediction on the current data is measured by computing the log-likelihood (the logarithm of the probability) at the position corresponding to the target word.
  • the encoder and the decoder can be pre-trained at the same time, effectively jointly training the two modules.
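  • The sketch below illustrates this encoder-side output layer (a fully connected layer mapping to the vocabulary dimension, softmax normalization, and the log-likelihood of the target words at the masked positions); it is illustrative only, and the dimensions and ids are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
enc_output_layer = nn.Linear(d_model, vocab_size)     # fully connected layer

enc_hidden = torch.randn(1, 4, d_model)               # encoder outputs for 4 positions
masked_pos = torch.tensor([1, 3])                     # masked positions in the first data sequence
gold_ids = torch.tensor([42, 7])                      # the data units before being masked

logits = enc_output_layer(enc_hidden[0, masked_pos])  # map to the vocabulary dimension
log_probs = torch.log_softmax(logits, dim=-1)         # softmax normalization, then log
# Negative log-likelihood at the positions of the target words; minimising it
# updates the encoder so that it recovers the masked data units.
loss = -log_probs[torch.arange(2), gold_ids].mean()
loss.backward()
```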
  • In one possible implementation, when the PLM is used to implement sequence conversion tasks between texts of different language types (such as translation tasks), some data units in the original source corpus (such as the third initial data sequence in the embodiment of the present application) can be replaced with data units that have the same semantics but are expressed in another language type, which can improve the accuracy of the PLM for sequence conversion between multiple languages.
  • Specifically, a third initial data sequence can be obtained, where the second data sequence and the third initial data sequence are texts that have the same semantics and are expressed in different language types, and the semantics of a first data unit in the third initial data sequence are the same as the semantics of the data unit to be predicted; the first data unit in the third initial data sequence is replaced with a second data unit to obtain the first data sequence, wherein the second data unit and the first data unit have the same semantics and are expressed in different language types.
  • the present application does not limit that the first data sequence is obtained only by replacing the first data unit in the third initial data sequence with the second data unit.
  • the first data unit may be randomly selected from the third initial data sequence, for example, any data unit in the third initial data sequence may be selected as the above-mentioned first data unit.
  • a data unit that has the same or similar semantics as the first data unit and is expressed in a different language type may be retrieved from the language library as a second data unit, and the second data unit is used to replace the first data unit in the third initial data sequence, to get the first data sequence.
  • In addition, the second data unit and the first initial data sequence are also expressed in different language types; that is to say, the language types of any two of the first initial data sequence, the second initial data sequence, and the second data unit are all different.
  • In one possible implementation, the method further includes: acquiring a fourth initial data sequence; and performing a mask on a data unit in the fourth initial data sequence that has the same semantics as the first data unit, so as to obtain the second data sequence.
  • That is, a mask operation can be performed on the data unit in the fourth initial data sequence that has the same semantics as the first data unit; because the masked data unit needs to be predicted during the training of the PLM, this gives the PLM a richer ability to understand text in multiple language types.
  • Specifically, the fourth initial data sequence can be obtained, and a mask is performed on the data unit in the fourth initial data sequence that has the same semantics as the first data unit (for example, this may be the data unit to be predicted in the embodiment of the present application), so as to obtain the second data sequence.
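  • The following is a small illustrative sketch of this code-switching style replacement; it is not from the patent, and the tiny English-French lexicon and the sentence pair are made-up examples.

```python
lexicon_en_fr = {"you": "tu"}                 # hypothetical bilingual lexicon entries

third_initial = ["who", "are", "you"]         # source corpus (English)
fourth_initial = ["Wer", "bist", "du"]        # target corpus (German), same semantics

# Replace the first data unit ("you") with a second data unit ("tu") that has
# the same semantics but a different language type -> first data sequence.
first_data_sequence = [lexicon_en_fr.get(u, u) for u in third_initial]

# Mask the target-side data unit with the same semantics ("du") -> second data sequence.
second_data_sequence = ["[MASK]" if u == "du" else u for u in fourth_initial]

print(first_data_sequence)    # ['who', 'are', 'tu']
print(second_data_sequence)   # ['Wer', 'bist', '[MASK]']
```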
  • In one possible implementation, the first sub-data or the second sub-data includes unmasked data units, and the second embedding vector includes the semantic information of the unmasked data units and the positional relationship between the unmasked data units and other data units in the second data sequence; or,
  • the first sub-data or the second sub-data includes a masked data unit, and the second embedding vector includes the positional relationship between the masked data unit and other data units in the second data sequence; or,
  • the second embedding vector includes the positional relationship between the data unit to be predicted and other data units in the second data sequence.
  • In one possible implementation, the first data sequence before the mask operation and the second data sequence before the mask operation are the same data sequence; or,
  • the first data sequence before the mask operation and the second data sequence before the mask operation are different data sequences that are labeled as samples.
  • the first data sequence and the second data sequence are text data.
  • the present application provides a model training device, the device comprising:
  • an acquisition module, configured to acquire a first embedding vector and a second embedding vector, where the first embedding vector corresponds to the first data sequence, the second embedding vector corresponds to the second data sequence, and the second data sequence includes first sub-data, a masked data unit to be predicted, and second sub-data, the first sub-data being located above the data unit to be predicted in the second data sequence and the second sub-data being located below the data unit to be predicted in the second data sequence;
  • the encoding module is used to obtain the hidden state through the encoder in the pre-trained language model PLM according to the first embedding vector;
  • a decoding module, configured to predict the data unit to be predicted through the decoder in the PLM and the output layer of the decoder according to the first sub-data, the second sub-data, and the hidden state, so as to obtain the first predicted data unit;
  • a training module configured to update the encoder and the decoder according to the difference between the first predicted data unit and the data unit to be predicted.
  • the acquisition module is also used to:
  • acquire a first initial data sequence, and determine, by means of probability sampling, whether at least one data unit in the first initial data sequence is masked, so as to obtain the second data sequence, wherein the probability value obtained by the probability sampling is used as the probability that the at least one data unit is masked.
  • the acquisition module is also used to:
  • acquire a second initial data sequence, and determine, by means of probability sampling, whether at least one data unit in the second initial data sequence is masked, so as to obtain the first data sequence; wherein, when performing the probability sampling, the probability that a data unit in the first initial data sequence is masked is greater than the probability that a data unit in the second initial data sequence is masked.
  • In one possible implementation, the encoding module is further configured to: predict the masked data units in the first data sequence through the output layer of the encoder in the PLM, so as to obtain a second predicted data unit;
  • and the training module is further configured to: update the encoder according to the difference between the second predicted data unit and the data unit before being masked in the first data sequence.
  • the PLM is used to realize the sequence conversion task between texts of different language types
  • the acquisition module is also used to:
  • acquire a third initial data sequence, where the second data sequence and the third initial data sequence are texts with the same semantics and expressed in different language types, and the semantics of the first data unit in the third initial data sequence are the same as the semantics of the data unit to be predicted; and replace the first data unit in the third initial data sequence with a second data unit to obtain the first data sequence, wherein the second data unit and the first data unit have the same semantics and are expressed in different language types;
  • the acquisition module is also used to: acquire a fourth initial data sequence, and perform a mask on a data unit in the fourth initial data sequence that has the same semantics as the first data unit, so as to obtain the second data sequence.
  • In one possible implementation, the first sub-data or the second sub-data includes unmasked data units, and the second embedding vector includes the semantic information of the unmasked data units and the positional relationship between the unmasked data units and other data units in the second data sequence; or,
  • the first sub-data or the second sub-data includes a masked data unit, and the second embedding vector includes the positional relationship between the masked data unit and other data units in the second data sequence; or,
  • the second embedding vector includes the positional relationship between the data unit to be predicted and other data units in the second data sequence.
  • the first data sequence before the mask operation and the second data sequence before the mask operation are the same data sequence; or,
  • the first data sequence before the mask operation and the second data sequence before the mask operation are different data sequences that are labeled as samples.
  • the first data sequence and the second data sequence are text data.
  • the embodiment of the present application provides a training device, which may include a memory, a processor, and a bus system, wherein the memory is used to store programs, and the processor is used to execute the programs in the memory to perform the above-mentioned first aspect and any of its optional methods.
  • the embodiment of the present application also provides a data processing method, characterized in that the method includes:
  • acquiring the updated PLM obtained by the method described in the first aspect and the data to be processed, wherein the updated PLM may include an updated encoder and an updated decoder; and processing the data to be processed through the updated PLM to obtain a processing result.
  • the data to be processed may be text data, and for details, refer to the description about the data sequence in the first aspect of the above embodiment.
  • In one aspect, the embodiment of the present application also provides a data processing device, where the device is configured to: obtain the updated PLM obtained by the method described in the first aspect and the data to be processed; and process the data to be processed through the updated PLM to obtain a processing result.
  • the embodiment of the present application provides an execution device, which may include a memory, a processor, and a bus system, wherein the memory is used to store programs, and the processor is used to execute the programs in the memory to perform the above-mentioned fourth aspect and any of its optional methods.
  • In one aspect, the embodiment of the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the method in the above first aspect and any optional method thereof, or the method in the above fourth aspect and any optional method thereof.
  • In one aspect, the embodiment of the present application provides a computer program that, when run on a computer, enables the computer to execute the method in the above first aspect and any optional method thereof, or the method in the above fourth aspect and any optional method thereof.
  • In one aspect, the present application provides a chip system, which includes a processor configured to support the execution device or the training device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods.
  • the chip system further includes a memory, and the memory is used for storing necessary program instructions and data of the execution device or the training device.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • An embodiment of the present application provides a model training method, the method comprising: obtaining a first embedding vector and a second embedding vector, where the first embedding vector corresponds to the first data sequence and the second embedding vector corresponds to the second data sequence; the second data sequence includes first sub-data, a masked data unit to be predicted, and second sub-data, where the first sub-data is located above the data unit to be predicted in the second data sequence and the second sub-data is located below the data unit to be predicted in the second data sequence; obtaining a hidden state through the encoder in the pre-trained language model PLM according to the first embedding vector; predicting the data unit to be predicted through the decoder in the PLM and the output layer of the decoder according to the first sub-data, the second sub-data, and the hidden state, so as to obtain a first predicted data unit; and updating the encoder and the decoder according to the difference between the first predicted data unit and the data unit to be predicted.
  • In the embodiment of the present application, a pre-training architecture with an encoder and a bidirectional decoder is adopted, so the decoder can see the above and the below information at the same time during training.
  • Therefore, the PLM obtained by the training method in the embodiment of the present application adapts well to other types of sequence generation tasks (autoregressive: left to right or right to left; non-autoregressive: fully non-autoregressive, semi-non-autoregressive, and so on).
  • When fine-tuned for these other types of sequence generation tasks, the PLM obtained through the training method in the embodiment of the present application can achieve better model accuracy.
  • Fig. 1 is a schematic structural diagram of the main framework of artificial intelligence
  • Fig. 2 is a schematic diagram of a natural language processing system
  • Figure 3a is a schematic diagram of another natural language processing system
  • Figure 3b is a schematic structural diagram of a system
  • Fig. 3c is a schematic diagram of autoregressive generation
  • Figure 3d is a schematic diagram of non-autoregressive generation
  • Figure 3e is a schematic diagram of semi-non-autoregressive generation
  • Figure 3f is a schematic diagram of a translation model
  • FIG. 4 is a schematic diagram of related equipment for natural language processing provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of a transformer layer architecture
  • Fig. 6a is a schematic diagram of an embodiment of a model training method provided in the embodiment of the present application.
  • Fig. 6b is a schematic diagram of an embodiment of a model training method
  • FIG. 7 is a schematic structural diagram of a neural network model in an embodiment of the present application.
  • FIG. 8 is a structural representation of a transformer layer
  • Figure 9 is a schematic diagram of the operation of an attention head
  • FIG. 10 is a schematic structural diagram of a model training device provided in an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a training device provided in an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Figure 1 shows a schematic structural diagram of the main framework of artificial intelligence.
  • The above artificial intelligence theme framework is described below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data goes through a refinement process of "data - information - knowledge - wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provided and processed by technology) to the industrial ecology of the system.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • The basic platform includes related platform assurance and support such as a distributed computing framework and a network, and may include cloud storage and computing, an interconnection network, and the like.
  • sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • After the above data processing, some general capabilities can be formed based on the results of the data processing, for example, algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they package the overall artificial intelligence solution, commercialize intelligent information decision-making, and realize practical applications. The main application fields include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and so on.
  • This application can be applied to the field of natural language processing in the field of artificial intelligence.
  • the following uses natural language processing as an example to introduce multiple application scenarios that have been implemented in products.
  • Fig. 2 shows a natural language processing system, which includes a user device and a data processing device.
  • the user equipment includes smart terminals such as a mobile phone, a personal computer, or an information processing center.
  • the user device is the initiator of natural language data processing, and as the initiator of requests such as language question and answer or query, usually the user initiates the request through the user device.
  • the above-mentioned data processing device may be a device or server having a data processing function such as a cloud server, a network server, an application server, and a management server.
  • The data processing device receives the query statement/speech/text and the like, and then performs language data processing such as machine learning, deep learning, search, reasoning, and decision-making through a memory for storing data and a processor for data processing, and feeds the processing result back to the user device.
  • the storage in the data processing device may be a general term, including local storage and a database storing historical data, and the database may be on the data processing device or on other network servers.
  • The user equipment can receive an instruction from the user. For example, the user equipment can receive a piece of text input by the user and then initiate a request to the data processing device, so that the data processing device executes a natural language processing application (such as natural language generation, text classification, text reasoning, named entity recognition, translation, etc.) on the piece of text obtained from the user equipment, so as to obtain a processing result of the corresponding natural language processing application for this piece of text (such as a predicted word result, a classification result, an inference result, a named entity recognition result, a translation result, etc.).
  • Taking natural language generation as an example, natural language generation can also be called a text prediction task or a natural language synthesis task, which refers to the task of generating missing text or subsequent text given a piece of text.
  • Natural language generation is widely used in scenarios such as search engines and input methods; it can predict the user's next input after the user has entered part of the text, which can greatly improve the efficiency with which the user uses the product, and it can also restore missing text.
  • For example, the user equipment may receive a piece of text data input by the user, where the text data includes known words and words to be predicted, the words to be predicted are invisible, and only the positions of the words to be predicted in the text data are known; the user device can then initiate a request to the data processing device (the request carries the text data), so that the data processing device predicts the words to be predicted in the text data, thereby obtaining the words to be predicted and feeding them back to the user equipment.
  • the user equipment may receive a piece of text data input by the user, and then initiate a request to the data processing device, so that the data processing device performs entity classification on the piece of text data, so as to obtain an entity classification result for the piece of text data, and The entity classification result is fed back to the user equipment;
  • As another example, the user device may receive a piece of text data input by the user (the text data is Chinese text) and then initiate a request to the data processing device, so that the data processing device translates the piece of text data into English, obtains the English translation of the piece of text data, and feeds the English translation back to the user device.
  • Figure 3a shows another natural language processing system.
  • In Figure 3a, the user equipment directly serves as the data processing device; the user equipment can directly receive the input from the user, and the input is processed directly by the hardware of the user equipment itself.
  • The specific process is similar to that in FIG. 2; reference may be made to the above description, and details are not repeated here.
  • FIG. 4 is a schematic diagram of a device 300 related to natural language processing provided by an embodiment of the present application.
  • The above-mentioned user equipment in FIG. 2 and FIG. 3a may specifically be the local device 301 or the local device 302 in FIG. 4, and the data processing device in FIG. 2 may specifically be the execution device 310 in FIG. 4.
  • the data storage system 350 may be integrated on the execution device 310, or set on the cloud or other network servers.
  • The processor in Figure 2 and Figure 3a can perform data training/machine learning/deep learning through a neural network model or another model, and use the finally trained or learned model to execute natural language processing applications (such as natural language generation, text classification, sequence labeling, reading comprehension, text generation, text reasoning, translation, etc.) on the text data, so as to obtain the corresponding processing results.
  • The high-precision model obtained by fine-tuning the pre-trained language model in the embodiment of the present application can be deployed on the data processing device, and the data processing device can use the high-precision model to process the text data, so as to obtain the processing result of the above natural language processing application.
  • Fig. 3b is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system architecture 500 includes an execution device 510 , a training device 520 , a database 530 , a client device 540 , a data storage system 550 and a data collection system 560 .
  • the execution device 510 includes a calculation module 511 , an I/O interface 512 , a preprocessing module 513 and a preprocessing module 514 .
  • the calculation module 511 may include the target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • the data collection device 560 is used to collect training data.
  • the training data may be text data with missing text and complete text data corresponding to the text data with missing text.
  • the training data may include but not limited to parallel corpus, monolingual corpus, and the like.
  • Parallel corpus refers to a bilingual or multilingual corpus (that is, labeled text data) composed of the original text and its parallel corresponding target language text.
  • The original text and the target language text have the same semantics, and there is a correspondence relation between their text units.
  • For example, the original text is a Chinese sentence meaning "This trip needs careful planning", and its parallel English text is "The trip needs careful planning"; the two can be regarded as a group of parallel corpora, and this group of parallel corpora is a Chinese-English parallel language pair.
  • The original Chinese text can be regarded as the source corpus of this group of parallel corpora, and the translated text "The trip needs careful planning" can be regarded as the target corpus of this group of parallel corpora.
  • At the level of text units, for example, the source-language word meaning "travel/trip" corresponds to "trip" in the target text.
  • After collecting the training data, the data collection device 560 stores the training data in the database 530, and the training device 520 trains the target model/rule 501 based on the training data maintained in the database 530.
  • the training device 520 trains the pretrained language model (pretrained language model, PLM) in the embodiment of the present application based on the training data maintained in the database 530 to obtain the target model/rule 501 .
  • the training device 520 can fine-tune the trained pre-trained language model based on the training data maintained in the database 530 to obtain the target model/rule 501 .
  • training device 520 for training the pre-trained language model and the training device 520 for fine-tuning the trained pre-trained language model may be different devices.
  • the training data maintained in the database 530 may not all be collected by the data collection device 560, but may also be received from other devices.
  • It should be noted that the training device 520 does not necessarily perform the training of the target model/rule 501 based entirely on the training data maintained in the database 530; it is also possible to obtain training data from the cloud or elsewhere for model training, and the above description should not be construed as a limitation on the embodiments of the present application.
  • The target model/rule 501 obtained by training with the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 3b; the execution device 510 may be a terminal, such as a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and may also be a server, a cloud, or the like.
  • In FIG. 3b, the execution device 510 is configured with an input/output (I/O) interface 512.
  • the I/O interface 512 is used for data interaction with external devices, and the user can input data to the I/O interface 512 through the client device 540 .
  • The preprocessing module 513 and the preprocessing module 514 are configured to perform preprocessing according to the input data received by the I/O interface 512 (for example, preprocessing such as obtaining the positions of the known data units and the data units to be predicted in the target data, or generating attention information). It should be understood that the preprocessing module 513 and the preprocessing module 514 may be absent, or there may be only one preprocessing module; when the preprocessing module 513 and the preprocessing module 514 do not exist, the calculation module 511 may be used directly to process the input data.
  • When the execution device 510 preprocesses the input data, or when the calculation module 511 of the execution device 510 performs calculation and other related processing, the execution device 510 can call the data, code, and the like in the data storage system 550 for the corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 550.
  • the I/O interface 512 presents the processing results to the client device 540 for provision to the user.
  • the user can manually specify input data, and the “manually specify input data” can be operated through the interface provided by the I/O interface 512 .
  • the client device 540 can automatically send the input data to the I/O interface 512 . If the client device 540 is required to automatically send the input data to obtain the user's authorization, the user can set the corresponding authority in the client device 540 . The user can view the results output by the execution device 510 on the client device 540, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 540 can also be used as a data collection terminal, collecting input data from the input I/O interface 512 and output results from the output I/O interface 512 as new sample data, and storing them in the database 530 .
  • Fig. 3b is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 3b, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
  • execution device 510 may also be deployed in the client device 540 .
  • The neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s (that is, the input data) and an intercept 1 as inputs, and the output of the operation unit may be: h_{W,b}(x) = f(W^T x) = f(\sum_{s=1}^{n} W_s x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • A neural network is a network formed by joining many of the above single neural units together; that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
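  • The toy computation below illustrates the single neural unit defined above, h = f(Σ_s W_s·x_s + b), with a sigmoid activation; the numerical values are arbitrary and only for illustration.

```python
import math

def neural_unit(xs, ws, b):
    z = sum(w * x for w, x in zip(ws, xs)) + b   # weighted sum of inputs plus bias
    return 1.0 / (1.0 + math.exp(-z))            # sigmoid activation function f

print(neural_unit(xs=[0.5, -1.0, 2.0], ws=[0.3, 0.8, -0.1], b=0.2))
```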
  • Fig. 5 is a schematic diagram of a transformer layer architecture. As shown in Fig. 5, the neural network includes an embedding layer and at least one transformer layer, and the at least one transformer layer may be N transformer layers (N is an integer greater than 0), where each transformer layer includes, adjacent in sequence, an attention layer, a summation and normalization (add & norm) layer, a feed-forward layer, and another summation and normalization layer.
  • In the embedding layer, the current input is embedded to obtain multiple embedding vectors. In the attention layer, P input vectors are obtained from the layer above the first transformer layer; taking any first input vector among the P input vectors as the center, an intermediate vector corresponding to the first input vector is obtained based on the correlation between the first input vector and each input vector within a preset attention window, and the P intermediate vectors corresponding to the P input vectors are determined in this way. In the pooling layer, the P intermediate vectors are merged into Q output vectors, and the multiple output vectors obtained by the last transformer layer are used as the feature representation of the current input.
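  • The sketch below shows one transformer layer of the kind just described (self-attention, then add & norm, then feed-forward, then add & norm again); the dimensions are illustrative assumptions and the code is not part of the patent.

```python
import torch
import torch.nn as nn

d_model, nhead, d_ff = 64, 4, 256

class TransformerLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # self-attention over the P input vectors
        x = self.norm1(x + a)              # summation and normalization (add & norm)
        x = self.norm2(x + self.ff(x))     # feed-forward, then add & norm again
        return x

emb = torch.randn(1, 5, d_model)           # embedding vectors of the current input
out = TransformerLayer()(emb)              # output feature representation
print(out.shape)                           # torch.Size([1, 5, 64])
```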
  • the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience and external sensation to increase the observation precision of some areas, and can quickly filter out high-value information from a large amount of information with limited attention resources .
  • Attention mechanism can quickly extract important features of sparse data, so it is widely used in natural language processing tasks, especially machine translation.
  • the self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
  • The essential idea of the attention mechanism can be expressed by the following formula: Attention(Query, Source) = \sum_{i=1}^{Lx} Similarity(Query, Key_i) * Value_i, where Lx = ||Source|| represents the length of Source.
  • The meaning of the formula is that the constituent elements in Source are imagined as a series of <Key, Value> data pairs. Given an element Query in the target Target, the weight coefficient of the Value corresponding to each Key is obtained by computing the similarity or correlation between Query and that Key, and the Values are then weighted and summed to obtain the final Attention value. So, in essence, the attention mechanism performs a weighted sum over the Values of the elements in Source, where Query and Key are used to compute the weight coefficient of the corresponding Value.
  • Attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on these important information, ignoring most of the unimportant information.
  • the process of focusing is reflected in the calculation of the weight coefficient.
  • the self-attention mechanism can be understood as internal Attention (intra attention).
  • the Attention mechanism occurs between the elements Query of the Target and all elements in the Source.
  • the self-attention mechanism refers to between the internal elements of the Source or between the internal elements of the Target.
  • the specific calculation process is the same, but the calculation object has changed.
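  • The following is a small numerical sketch of the attention formula above, using a dot product as the similarity function and softmax to turn the similarities into weight coefficients; the shapes are illustrative assumptions.

```python
import torch

Lx = 4                                   # length of Source
query = torch.randn(8)                   # Query
keys = torch.randn(Lx, 8)                # Key_1 ... Key_Lx
values = torch.randn(Lx, 16)             # Value_1 ... Value_Lx

similarity = keys @ query                # Similarity(Query, Key_i), i = 1..Lx
weights = torch.softmax(similarity, 0)   # weight coefficients of each Value
attention = weights @ values             # weighted sum of the Values
print(attention.shape)                   # torch.Size([16])
```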
  • Natural language processing (NLP)
  • Natural language is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the process of systematically analyzing, understanding and extracting information from text data in an intelligent and efficient manner.
  • Typical natural language processing tasks include automatic summarization, machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), and so on.
  • the pre-trained language model is a natural language sequence encoder, which encodes each word in the natural language sequence into a vector representation for prediction tasks. Its training consists of two stages. In the pre-training stage, the model performs language model task training on large-scale unsupervised texts to learn a word representation. In the finetuning stage, the model is initialized with the parameters learned in the pre-training stage, and it performs fewer steps of training on downstream tasks such as text classification and sequence labeling. The semantic information obtained by pre-training can be successfully transferred to downstream tasks.
  • Sequence-to-sequence natural language generation is a very important direction in natural language processing tasks, and an encoder-decoder design framework is often used.
  • The source sequence X is input to the encoder to generate a set of vector representations z; the representation z is then passed to the decoder through the cross-attention module, and the decoder decodes it to generate the target sequence Y.
  • sequence generation tasks can be divided into autoregressive generation and non-autoregressive (parallel) generation.
  • Autoregressive generation means that, when generating the target sequence, the model first predicts the first character of the target sequence and then predicts the rest of the sequence step by step based on the sub-sequence generated so far.
  • Non-autoregressive generation refers to generating a complete target sequence in parallel during decoding, eliminating the need for a step-by-step iterative process, thereby greatly reducing the waiting time for generating the target sequence.
  • Autoregressive generation refers to predicting the target sequence word by word during generation. It is currently the most commonly used and generally best-performing sequence generation strategy; common tasks such as machine translation and summary generation generally adopt this autoregressive generation method. When predicting the target sequence, the autoregressive generation strategy allows the model to see only what has been generated before the current time step, but not the information at the current time step or subsequent time steps. For example, in an English-German translation task, given the source language sentence (English) "who are you", the machine translation model needs to generate the corresponding target sequence "Wer bist du".
  • As shown in Figure 3c, at the first time step the autoregressive machine translation model predicts the first output character "Wer" based on the source sequence; at the second time step it feeds the character generated in the previous step ("Wer") into the decoder and predicts the output "bist" for the next time step.
  • In Figure 3c, the model is generating "bist" (shown as a black box), "<s>Wer" is the decoder input beginning with the start symbol "<s>", and "[M]" means that the characters at the corresponding positions are masked and cannot be seen.
  • the generation method in Figure 3c is the most common left-to-right sequence generation strategy, and in actual use, a right-to-left generation strategy may also be used. For example, in lyrics translation and poetry translation, a right-to-left generation strategy is often used.
  • In this case, the model first generates the last character of the sequence and then generates the remaining characters in reverse order until the first character is output, so as to achieve a rhyming and fluent effect.
  • Right-to-left generation can also be understood as reverse order generation. At this time, the model can only see the information below, but not the information above.
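  • The loop below sketches the left-to-right autoregressive decoding just described; `model` is a hypothetical placeholder for any decoder that returns the next token given the source sequence and the tokens generated so far, and the symbols "<s>"/"</s>" are illustrative.

```python
def autoregressive_decode(model, source, start="<s>", end="</s>", max_len=20):
    generated = [start]
    for _ in range(max_len):
        # The decoder only sees the already generated prefix (the above context).
        next_token = model(source, generated)
        if next_token == end:
            break
        generated.append(next_token)
    return generated[1:]                  # drop the start symbol

# e.g. autoregressive_decode(model, ["who", "are", "you"]) -> ["Wer", "bist", "du"]
```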
  • Non-autoregressive generation: fully non-autoregressive generation is also called parallel generation (in the embodiment of the present application it may simply be referred to as non-autoregressive generation), which means that the target sequence is generated in parallel.
  • This generative strategy yields the complete target sequence in a single decode, without requiring verbatim predictions.
  • non-autoregressive generation does not need to see any context information. It only needs to decode the source sentence sequence, which is the output of the encoder, to obtain the target sequence at the decoder. This means that compared with autoregressive generation, non-autoregressive generation can greatly reduce the decoding delay, so it has gradually become an important research direction for sequence generation tasks.
  • As shown in Figure 3d, the model can directly decode the target sequence "Wer bist du" (black boxes, where "[M]" indicates that the character at the corresponding position is masked), which is also known as fully non-autoregressive generation.
  • Although non-autoregressive sequence generation is faster than autoregressive strategies, there is a certain quality gap compared with autoregressive strategies, because no contextual information can be seen during training.
  • Therefore, in semi-non-autoregressive generation, part of the context information is randomly retained during training, so that the decoder can complete the full target sequence based on the partially known information.
  • At inference time, an effect close to autoregressive generation is achieved by iterating a fixed number of times. As shown in Figure 3e, at the current time step the model predicts the first character and the last character (black boxes) based on the known information "bist".
  • When predicting the first character "Wer", the model needs to see the below information "bist", and when predicting the last character, the model needs to see the above information "bist"; therefore, at the current time step the non-autoregressive decoder needs to see the above and the below information at the same time, which is also called semi-non-autoregressive generation.
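  • The sketch below illustrates this iterative (semi-non-autoregressive) decoding scheme in a very simplified form; `model_fill` is a hypothetical placeholder that returns a prediction for every position given the source and the partially masked target, and the confidence-based selection used in practice is omitted.

```python
def semi_nar_decode(model_fill, source, target_len, num_iters=2):
    target = ["[M]"] * target_len                   # start from a fully masked target
    for _ in range(num_iters):
        predictions = model_fill(source, target)    # predict all positions in parallel,
                                                    # seeing both above and below context
        target = [p if t == "[M]" else t for t, p in zip(target, predictions)]
    return target

# With num_iters=1 and all positions masked, this reduces to fully
# non-autoregressive (parallel) generation as in Figure 3d.
```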
  • the Transformer framework mainly includes encoders and decoders.
  • the encoder and decoder consist of multiple layers, and each layer of the encoder/decoder is composed of some encoding units/decoding units.
  • Each layer of the encoder transforms the word vectors (or word embedding vectors) corresponding to the source sentence into high-dimensional vectors (also called hidden states) through a series of neural network transformations.
  • Each layer of the decoder is responsible for re-decoding (translating) this high-dimensional vector into the target language.
  • the word vector corresponding to the source sentence can be obtained through the word vector parameter of the encoder, and the set of word vector parameters of the encoder can be regarded as a parameter matrix.
  • a vocabulary can be used to contain possible words in the source language.
  • the word vector parameter matrix of the encoder contains the word vector of each word in the vocabulary.
  • the dimension of the word vector parameter matrix can be [word vector dimension, vocabulary size ], where the vocabulary size is the number of words contained in the vocabulary.
  • the words input into the source sentence of the NMT model may not exist in the vocabulary, and such words can be represented by fixed word vectors.
  • Each layer of the encoder can include a self-attention layer and a feed-forward layer.
  • the self-attention layer of the encoder is to take into account the weight of the word vector of each word in the source sentence (the influence of each word on the currently encoded word vector) when encoding each word vector.
  • The feed-forward network layer of the encoder performs a nonlinear transformation on the output vector of the self-attention layer.
  • It can be considered that the self-attention layer of the encoder takes into account, through the parameters it contains, the weight of the word vector of each word in the source sentence (the influence of each word on the word vector currently being encoded), and that the feed-forward network layer of the encoder performs nonlinear transformation processing on the output vector of the self-attention layer through the parameters contained in the feed-forward network layer.
  • Each layer of the decoder includes a self-attention layer, an encoder-decoder attention layer, and a feedforward layer.
  • the self-attention layer of the decoder considers the impact of the generated new words on the currently decoded vector during the decoding process.
  • the encoding-decoding attention layer of the decoder considers the influence of the encoder's input on the currently decoded vector.
  • the feed-forward network layer of the decoder is to perform nonlinear transformation processing on the output vector of the encoding-decoding attention layer.
  • the output mapping layer receives the decoded vector output by the last network layer of the decoder, and converts the decoded vector into a translation result, such as generating a new word.
  • Then, the word vector of the generated new word is obtained, and this word vector is used as the input of the first network layer of the decoder; this process continues until an end symbol is generated or another preset stop condition is met, at which point all target words generated in the decoding stage constitute the translation result.
  • a vocabulary can be used to contain possible words in the target language.
  • the word vector parameter matrix of the decoder contains the word vector of each word in the vocabulary.
  • the dimension of the word vector parameter matrix can be [word vector dimension, vocabulary size ], where the vocabulary size is the number of words contained in the vocabulary.
•   the word vector closest to the decoding vector output by the last network layer can be found by taking the minimum distance between that decoding vector and each word vector contained in the decoder's word vector parameter matrix; the translation result is then obtained from the closest word vector and the vocabulary.
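•   A minimal sketch of this nearest-word-vector lookup (function and variable names are illustrative assumptions) could look as follows:

```python
# Illustrative sketch only: map the last decoding vector to the closest word
# vector in the decoder's word vector parameter matrix and read out the word.
import torch

def closest_word(decoding_vector, word_vector_matrix, id_to_word):
    # word_vector_matrix: [vocabulary size, word vector dimension]
    distances = torch.norm(word_vector_matrix - decoding_vector, dim=1)
    index = torch.argmin(distances).item()   # minimum-distance word vector
    return id_to_word[index]
```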
  • pre-training-fine-tuning is a standard paradigm for improving model performance.
•   the existing pre-training schemes only focus on autoregressive generation from left to right, that is, only the preceding information of the data sequence can be seen during the pre-training process, so the fine-tuning stage can likewise only adapt to left-to-right autoregressive generation tasks.
•   With the launch of large pre-trained models such as generative pre-trained transformer 3 (GPT-3) and Pangu, the parameters of the models have become larger and larger, and the cost of pre-training has become higher and higher. If one pre-training can only adapt to a single downstream task, then a separate pre-training must be done at great cost for each generation strategy, which consumes too many resources.
  • the embodiment of the present application proposes a sequence-to-sequence model pre-training method, so that the model can adapt to three different types of sequence generation tasks (autoregressive tasks, non-autoregressive tasks, and semi-non-autoregressive tasks) after only one pre-training , which greatly reduces the cost of pre-training while ensuring the quality.
  • Fig. 6a shows an embodiment of a model training method provided by the embodiment of the present application.
•   the model training method provided by the embodiment of the present application can be applied to the training equipment described above; specifically, it can be applied to terminal devices such as mobile phones, tablets, laptops, and smart wearable devices, or to servers on the cloud side.
  • a model training method provided by the embodiment of the present application includes:
  • the first embedding vector corresponds to a first data sequence
  • the second embedding vector corresponds to a second data sequence
•   the second data sequence includes first sub-data, a masked data unit to be predicted, and second sub-data; the first sub-data is located above the data unit to be predicted in the second data sequence, and the second sub-data is located below the data unit to be predicted in the second data sequence;
•   training samples for the PLM can be obtained, wherein the training samples can include a first data sequence and a second data sequence; the first data sequence can be obtained based on the source corpus, the second data sequence can be obtained based on the target corpus, and the PLM needs to predict and generate the target corpus based on the source corpus.
•   the PLM can be used to implement sequence conversion tasks between different language types, such as text translation tasks and summary generation tasks between different languages; in that case the first data sequence and the second data sequence can be texts of different language types (it is not required that every data unit in the first data sequence is of a different language type from every data unit in the second data sequence; for example, some data units in the first data sequence may be of the same language type as some or all of the data units in the second data sequence).
  • the language type can also be referred to as language.
•   the original text is "We dance on the grass", and its parallel corresponding German text is "Wir tanzen auf dem gras"; thus "We dance on the grass" and "Wir tanzen auf dem gras" can be regarded as a group of parallel corpora, that is, an English-German parallel language pair, where the original text "We dance on the grass" can be regarded as the source corpus of this group of parallel corpora, and the translated text "Wir tanzen auf dem gras" as the target corpus of this group of parallel corpora.
•   the first data sequence before the masking operation and the second data sequence before the masking operation are different data sequences taken from labeled samples.
  • the PLM can be used to realize the text summary generation task, then the source corpus can be the source corpus that needs to extract the summary, and the target corpus can be the summary text that needs to be generated.
  • the PLM can be used to implement the text answering task, then the source corpus can be the source corpus that needs to be answered, and the target corpus can be the answer content for the source corpus.
•   the first data sequence before the mask operation and the second data sequence before the mask operation may be the same data sequence; that is to say, the first data sequence and the second data sequence before the mask operation are unlabeled data.
  • the first data sequence can be obtained by masking the original source corpus
  • the second data sequence can be obtained by masking the original target corpus.
•   for sequence conversion tasks such as translation tasks, the original source corpus and the original target corpus can be obtained from external databases.
•   data unit alignment can be performed on the original source corpus and the original target corpus. The original source corpus (which in this embodiment can also be referred to as X, such as the second initial data sequence or the third initial data sequence in the embodiment of the application) and the original target corpus (which in this embodiment can also be referred to as Y, such as the first initial data sequence or the fourth initial data sequence in the embodiment of the application) can each comprise at least one data unit (such as a subword unit or word unit). Through the alignment of data units, a one-to-one correspondence can be established between the data units of the original source corpus and those of the original target corpus, and corresponding data units express the same semantics.
•   the input sentence pair is bilingual ("We dance on the grass", "Wir tanzen auf dem gras").
•   "We dance on the grass" can be the original source corpus.
  • "Wir tanzen auf dem gras” can be the original target corpus.
•   each element constitutes a set of alignment mapping relationships (for example, the arrows in Figure 6b: {"We-Wir", "dance-tanzen", "grass-gras"}).
  • the alignment of the above data units may be performed based on external knowledge (for example, as shown in FIG. 6b , using the knowledge base Q: alignment).
  • the knowledge base can be a dictionary, or a third-party tool (fast-align, etc.), or a pre-trained multilingual word vector, etc., which is not limited here.
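•   A minimal sketch of such alignment, assuming a toy bilingual dictionary as the knowledge base Q (the real knowledge base may instead be fast-align output or pre-trained multilingual word vectors), is shown below:

```python
# Illustrative sketch only: derive aligned word pairs from a bilingual dictionary.
def align(source_tokens, target_tokens, bilingual_dict):
    pairs = []
    for s in source_tokens:
        for t in target_tokens:
            if bilingual_dict.get(s.lower()) == t.lower():
                pairs.append((s, t))   # one-to-one correspondence, same semantics
    return pairs

knowledge_base_q = {"we": "wir", "dance": "tanzen", "grass": "gras"}
alignment = align("We dance on the grass".split(),
                  "Wir tanzen auf dem gras".split(),
                  knowledge_base_q)
# alignment -> [("We", "Wir"), ("dance", "tanzen"), ("grass", "gras")]
```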
  • a masking operation may be performed on the original source corpus and the original target corpus to obtain PLM training data (such as the first data sequence and the second data sequence in the embodiment of the present application).
•   the second initial data sequence (original source corpus) can be obtained, and by means of probability sampling it is determined whether at least one data unit in the second initial data sequence is masked, so as to obtain the first data sequence, where the probability value obtained by the probability sampling is used as the probability that the at least one data unit is masked.
  • a dynamic sampling mask probability may be used, where dynamic means that the probability of each data unit in the data sequence being masked is dynamic.
•   for at least one data unit in the second initial data sequence (for example, the at least one data unit may be all data units in the second initial data sequence), a probability value is sampled within a certain probability interval, and the probability value obtained by the probability sampling is used as the probability that the at least one data unit is masked; for example, this probability value is compared with another probability value sampled from a different probability interval to determine whether to mask the data unit.
•   for example, the probability interval W can be set to [0.1, 0.2]; when the mask operation is performed on each data unit in the second initial data sequence, a probability value β is randomly sampled from the interval W, and the probability β is used to mask each data unit in the second initial data sequence: a random number r is randomly generated from the interval [0, 1], and if r is less than β, the current data unit is masked; otherwise nothing is done to it.
•   the first initial data sequence (original target corpus) can be obtained, and by means of probability sampling it is determined whether at least one data unit in the first initial data sequence is masked to obtain the second data sequence, wherein the probability value obtained by the probability sampling is used as the probability that the at least one data unit is masked.
  • the probability interval R can be set to [0.35, 0.55].
•   when the mask operation is performed on each data unit in the first initial data sequence, a probability value θ is sampled from the interval R, and the probability θ is used to mask each data unit in the first initial data sequence: a random number a is randomly generated from the interval [0, 1], and if a is less than θ, the current data unit is masked; otherwise nothing is done to it.
•   the embedding vector generated based on the first data sequence can be used as the input of the encoder in the PLM, and the embedding vector generated based on the second data sequence can be used as the input of the decoder in the PLM.
  • the above-mentioned mask operations on the first data sequence and the second data sequence can be called dynamic dual mask operations.
•   with the dynamic dual mask operation, the input of the encoder and the input of the decoder can be masked separately, and the pre-training of the encoder and the decoder can be completed simultaneously in the subsequent training process.
•   the dynamic sampling of the mask probability can prevent the mask probability from being too large, which would leave too little effective information in the whole batch during model training.
  • the probability that a data unit in the first initial data sequence is masked is greater than the probability that a data unit in the second initial data sequence is masked.
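•   A minimal sketch of the dynamic dual mask operation described above (interval values follow the examples given; the helper names are illustrative assumptions) is shown below:

```python
# Illustrative sketch only: sample a mask probability per sequence from an
# interval, then mask each data unit when a uniform draw falls below it.
import random

def dynamic_mask(tokens, prob_interval, mask_token="[M]"):
    low, high = prob_interval
    p = random.uniform(low, high)      # sampled mask probability for this sequence
    masked, targets = [], []
    for tok in tokens:
        if random.random() < p:        # r < p: mask the current data unit
            masked.append(mask_token)
            targets.append(tok)        # the model must predict this unit
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

src_in, src_out = dynamic_mask("We danse on the grass".split(), (0.1, 0.2))
tgt_in, tgt_out = dynamic_mask("Wir tanzen auf dem gras".split(), (0.35, 0.55))
# the target interval is higher, so the target sequence is masked more heavily
```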
•   in this way, the input and output (DM(ACS(X)), PRE2(X)) at the encoder end can be obtained, where DM (dynamic masking) denotes the dynamic mask result of the code-switched source sequence ACS(X) ("We danse[M]the grass"), and PRE2(X) is the prediction target of the encoder, here the set of all masked characters ("____on_____").
•   when the PLM is used to implement sequence conversion tasks (such as translation tasks) between texts of different language types, code switching can be performed on part of the data units of the original source corpus (such as the third initial data sequence in the embodiment of the present application), that is, they are replaced with data units that have the same semantics but are expressed in another language type, which can improve the accuracy of the PLM for sequence conversion between multiple languages.
•   a third initial data sequence can be obtained; the second data sequence and the third initial data sequence are texts with the same semantics expressed in different language types, and the semantics of the first data unit in the third initial data sequence and of the data unit to be predicted are the same; the first data unit in the third initial data sequence is replaced with a second data unit to obtain the first data sequence, where the second data unit and the first data unit have the same semantics and are expressed in different language types.
  • the present application does not limit that the first data sequence is obtained only by replacing the first data unit in the third initial data sequence with the second data unit.
  • the first data unit may be randomly selected from the third initial data sequence, for example, Any data unit in the third initial data sequence may be selected as the above-mentioned first data unit.
  • a data unit that has the same or similar semantics as the first data unit and is expressed in a different language type may be retrieved from the language library as a second data unit, and the second data unit is used to replace the first data unit in the third initial data sequence, to get the first data sequence.
•   the second data unit and the first initial data sequence are also expressed in different language types; that is to say, the language types of any two of the first initial data sequence, the second initial data sequence and the second data unit are all different.
  • a data unit can be selected with a certain probability from the data unit set of the third initial data sequence, such as "dance” in FIG.
  • the matching results are indexed in the multilingual dictionary of the knowledge base Q (index "dance” in Figure 9).
  • a language is randomly selected from all available language sets (Spanish, German, French), and then the character corresponding to the language is obtained, and the character is required to have a similar meaning to the matching result (such as "danse” in French).
•   a mask operation can be performed on the data unit in the first initial data sequence that has the same semantics as the first data unit; because the masked data unit needs to be predicted during the training process of the PLM, this can give the PLM a richer ability to understand text of multiple language types.
•   the fourth initial data sequence can be obtained, and a mask is performed on the data unit in the fourth initial data sequence that has the same semantics as the first data unit (for example, it can be the data unit to be predicted in the embodiment of the present application) to obtain the second data sequence.
•   the input sentence pair is bilingual ("We dance on the grass", "Wir tanzen auf dem gras").
•   each element constitutes a set of alignment mapping relationships (arrows in Figure 6b: {"We-Wir", "dance-tanzen", "grass-gras"}).
•   select a subset (such as {"dance-tanzen"}) from the set with a certain probability, and perform the following operations on each element in the subset:
•   Encoder side: search for content that matches the element in the source sequence X, and index the matching result in the multilingual dictionary of the knowledge base Q (index "dance" in Figure 6b).
  • a language is randomly selected from all available language sets (Spanish, German, French), and then the character corresponding to the language is obtained, and the character is required to have a similar meaning to the matching result (such as "danse” in French).
•   Decoder side: search for the content that matches the element in the target sequence Y, perform a mask operation on the matched characters, and then predict the masked content at the output. Finally, a new set of inputs (AM(Y), PRE1(Y)) is obtained, where AM(Y) (alignment-based masking) represents the new sequence obtained after the alignment-based mask operation is applied to the initial target Y ("Wir[M]auf dem gras"), [M] indicates that the corresponding character is masked, and PRE1(Y) indicates the predicted content at the output of the decoder, that is, the set of words that are masked out ("__tanzen_______").
  • a new input sequence (ACS(X), AM(Y)) and an output sequence (PRE1(X), PRE1(Y)) are obtained by combining word code conversion based on external alignment knowledge and mask operation.
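•   The following minimal sketch illustrates the combination of alignment-based code switching on the source and alignment-based masking on the target (the toy multilingual dictionary, sentences, and probability are assumptions for illustration only):

```python
# Illustrative sketch only: alignment-based code switching (ACS) on the source
# sequence and alignment-based masking (AM) on the target sequence.
import random

multilingual_dict = {"dance": {"fr": "danse", "es": "bailamos"}}

def acs_and_am(src_tokens, tgt_tokens, aligned_pairs, switch_prob=0.5):
    src, tgt = list(src_tokens), list(tgt_tokens)
    tgt_targets = [None] * len(tgt)
    for s_word, t_word in aligned_pairs:
        if random.random() > switch_prob or s_word.lower() not in multilingual_dict:
            continue
        # encoder side: replace the source word with a same-meaning word of another language
        lang = random.choice(list(multilingual_dict[s_word.lower()]))
        src = [multilingual_dict[s_word.lower()][lang] if w == s_word else w for w in src]
        # decoder side: mask the aligned target word and keep it as a prediction target
        for i, w in enumerate(tgt):
            if w == t_word:
                tgt[i], tgt_targets[i] = "[M]", w
    return src, tgt, tgt_targets

acs_x, am_y, pre1_y = acs_and_am("We dance on the grass".split(),
                                 "Wir tanzen auf dem gras".split(),
                                 [("We", "Wir"), ("dance", "tanzen"), ("grass", "gras")])
```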
•   the first data sequence and the second data sequence can be processed through the embedding layer on the encoder side and the embedding layer on the decoder side, respectively, to obtain the first embedding vector and the second embedding vector.
•   the first embedding vector may include the position information of the masked data units in the first data sequence, and the position information and semantic information of the unmasked data units in the first data sequence, where the position information may represent the positional relationship between a data unit and other data units, for example by means of a position vector.
  • the semantic information can be represented by the word vector of the data unit.
  • the second data sequence may include masked data to be predicted, first sub-data above the data to be predicted, and second sub-data below the data to be predicted.
•   when the first sub-data or the second sub-data includes unmasked data units, the second embedding vector includes the semantic information of the unmasked data units and the positional relationship between the unmasked data units and the other data units in the second data sequence.
•   when the first sub-data includes unmasked data units, the second embedding vector may contain the semantic information of the unmasked data units in the first sub-data and the positional relationship between those unmasked data units and the other data units in the second data sequence.
•   when the second sub-data includes unmasked data units, the second embedding vector may contain the semantic information of the unmasked data units in the second sub-data and the positional relationship between those unmasked data units and the other data units in the second data sequence.
•   when the first sub-data or the second sub-data includes a masked data unit, the second embedding vector includes the positional relationship between the masked data unit and the other data units in the second data sequence.
•   when the first sub-data includes masked data units, the second embedding vector may include the positional relationship between the masked data units in the first sub-data and the other data units in the second data sequence.
•   when the second sub-data includes a masked data unit, the second embedding vector may include the positional relationship between the masked data unit in the second sub-data and the other data units in the second data sequence.
  • the second embedding vector includes a positional relationship between the to-be-predicted data unit and other data units in the second data sequence.
•   for example, the second data sequence may be "Wir[M2]auf[M4][M5]", where [M] indicates being masked; when the data unit to be predicted is [M2], the first sub-data may be "Wir" and the second sub-data may be "auf[M4][M5]".
•   the unmasked data units in the data sequence can be subjected to embedding processing through the embedding layer.
  • the embedding layer may be called an input embedding (input embedding) layer.
  • the current input can be an unmasked data unit. After the embedding layer acquires the current input, it can perform embedding processing on each unmasked data unit in the current input, and can obtain an embedding vector corresponding to each unmasked data unit.
  • the position vector of each data unit in the unmasked data units can also be obtained, and the position vector is used to indicate the position of each data unit in the data sequence.
•   the position vector can be used to indicate the relative positional relationship between each of the unmasked data units and the other unmasked data units as well as the masked data units.
  • the embedding layer may include an input embedding layer and a positional encoding (positional encoding) layer.
•   the word embedding process can be performed on each of the unmasked data units in the current input, so as to obtain the word vector of each of the unmasked data units (which can, for example, represent semantic information).
  • the position of each data unit in the unmasked data units in the current input may be obtained, and then a position vector is generated for the position of each data unit in the unmasked data units.
•   the position information of each of the unmasked data units in the data sequence may be the absolute position of each of the unmasked data units in the data sequence. Taking the current input "what number should be returned to Huabei" as an example, the position of "several" can be represented as the first position, the position of "number" as the second position, and so on. In some examples, the position of each of the unmasked data units in the data sequence may be the relative position of each of the unmasked data units in the data sequence.
•   still taking the current input "what number should be returned to Huabei" as an example, the position of "several" can be expressed as before "number", and the position of "number" can be expressed as after "several" and before "should", and so on.
•   after the word vector and position vector of each unmasked data unit in the current input are obtained, the position vector and the corresponding word vector of each unmasked data unit can be fused to obtain the embedding vector of each unmasked data unit.
  • the fusion method may be to perform an addition operation on the position vector and the corresponding word vector, or through other operations, and the specific fusion method is not limited here.
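•   A minimal sketch of this fusion by addition (vocabulary size, dimension and maximum length are illustrative assumptions) is shown below:

```python
# Illustrative sketch only: the embedding vector of each data unit is the sum
# of its word vector and its position vector.
import torch
import torch.nn as nn

class SimpleEmbeddingLayer(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, max_len=512):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)   # input embedding layer
        self.pos = nn.Embedding(max_len, dim)       # positional encoding layer (learned)

    def forward(self, token_ids):                   # token_ids: [batch, seq_len]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # fusion by addition: word vector + position vector
        return self.word(token_ids) + self.pos(positions)[None, :, :]
```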
  • Embedding vectors can be represented as embedding matrices with preset dimensions.
  • the number of the embedding vectors can be set as M, and the preset dimension is H, so the embedding vector can be expressed as an M ⁇ H embedding matrix.
  • the first embedding vector may be input to the encoder of the PLM, and the second embedding vector may be input to the decoder of the PLM.
•   the encoder in the pre-trained language model PLM can be used to obtain a hidden state; for example, the first embedding vector can be used as the input of the encoder in the PLM, and the encoder in the PLM can output the hidden state.
  • FIG. 7 is a schematic structural diagram of a PLM in an embodiment of the present application, wherein the framework may be composed of a bidirectional encoder and a bidirectional decoder.
  • the encoder can include a self-attention module and a feed-forward network, and the output of the encoder can be input into the output layer and the cross-attention module on the decoder side.
  • the input to the apparatus shown in Figure 7 may be the sequence processed in Figure 6b.
  • the first embedding vector is input into the encoder of the model, and then the vector representation output by the encoder is input to the decoder through a cross-attention module.
•   the encoder may include a transformer layer, where the transformer layer may include multiple sub-transformer layers in series. Each sub-transformer layer can be used to process the output data of the adjacent previous sub-transformer layer to obtain an intermediate vector and output the intermediate vector to the adjacent next sub-transformer layer. If a sub-transformer layer is the one closest to the input side among the plurality of sub-transformer layers, its input data is the embedding vector; if a sub-transformer layer is the one closest to the output side among the plurality of sub-transformer layers, the data it outputs is the hidden state.
  • the core feature of the transformer layer is its unique attention mechanism.
  • the transformer model uses this attention mechanism to assign different attention coefficients to the embedding vectors of each word in the sentence, so as to more fully consider the influence of the context in the sentence on each word.
  • the specific transformer layer may include sequentially adjacent multi-head attention layers, summation and normalization (add&norm) layers, feedforward (feed forward) layers, summation and normalization layers.
•   the attention layer is connected to the embedding layer and obtains the embedding vectors from the embedding layer as input vectors; based on the correlation between the embedding vectors, the embedding vectors are combined to obtain output vectors, which are output to the subsequent transformer layer.
  • the transformer layer takes the output of the previous layer as an input vector and performs operations similar to the previous transformer layer.
  • FIG. 8 is a structural diagram of a transformer layer.
  • Each sub-transformer layer in the embodiment of the present application can refer to the structure shown in FIG. 8.
•   the transformer layer includes a successively adjacent multi-head attention layer, add & norm layer, feed-forward layer, and summation-and-normalization layer.
•   the multi-head attention layer obtains M input vectors Xl from the layer above it, which can be expressed as a matrix X.
  • each vector is transformed based on the correlation between vectors to obtain M output vectors. It can also be expressed as a matrix Y.
  • the MHA layer based on multi-head attention includes multiple attention heads (Head1, Head 2, ..., Head N as shown in Figure 8).
  • Fig. 9 is a schematic diagram of the operation of an attention head, which shows how the attention head transforms an input matrix X into an output matrix Y.
•   the first transformation matrix Q, the second transformation matrix K and the third transformation matrix V are respectively used to transform each input vector Xi among the M input vectors <X1, X2, ..., XN> to obtain, for each input vector, the corresponding first intermediate vector (q vector), second intermediate vector (k vector) and third intermediate vector (v vector).
  • the first transformation matrix Q, the second transformation matrix K and the third transformation matrix V can be used to linearly transform the input matrix X composed of N input vectors to obtain the Q matrix, K matrix and V matrix of the input matrix respectively.
•   for the i-th input vector Xi, the degree of correlation between Xi and each input vector Xj can be determined by a dot product operation between the first intermediate vector (q vector, qi) corresponding to the i-th input vector and each second intermediate vector (k vector, kj) corresponding to each input vector Xj.
•   the dot product result of qi and kj can be directly taken as the degree of correlation; more classically, the dot product result is divided by a constant and then a softmax operation is performed, and the result of this operation is used as the degree of correlation between the input vectors Xi and Xj, that is: αi,j = softmax(qi·kj/√d), where d is the dimension of the intermediate vectors.
•   each degree of association αi,j between the i-th input vector Xi and each input vector Xj can then be used as a weighting factor, and the third intermediate vectors (v vectors, vj) corresponding to the input vectors Xj are weighted and combined to obtain the i-th combined vector Ci corresponding to the i-th input vector Xi: Ci = Σj αi,j·vj.
•   in this way, a vector sequence <C1, C2, ..., CN>, or a matrix C, of M combination vectors corresponding to the M input vectors can be obtained.
  • M output vectors can be obtained.
•   the output matrix Y is the combined vector matrix C, which can be written as Y = C.
•   the MHA layer maintains m sets of transformation matrices, and each set of transformation matrices includes the aforementioned first transformation matrix Q, second transformation matrix K, and third transformation matrix V, so that the above operations can be performed in parallel to obtain m combination vector sequences (that is, m matrices C), where each vector sequence includes N combination vectors obtained based on one set of transformation matrices.
  • the MHA layer concatenates the obtained m combination vector sequences to obtain a concatenation matrix; then transforms the concatenation matrix through the fourth transformation matrix W to obtain a final output matrix Y.
•   splitting the output matrix Y yields the M output vectors <Y1, Y2, ..., YN>.
  • the MHA layer performs transformation operations based on the degree of association between N input vectors to obtain M output vectors.
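•   A single attention head of the kind described above can be sketched as follows (dimensions are illustrative assumptions; a multi-head layer would run m such heads in parallel, concatenate the results and project them with the fourth transformation matrix W):

```python
# Illustrative sketch only: one attention head computing
# alpha = softmax(q·k / sqrt(d)) and the weighted combination of v vectors.
import math
import torch
import torch.nn as nn

class OneAttentionHead(nn.Module):
    def __init__(self, dim=512, head_dim=64):
        super().__init__()
        self.q_proj = nn.Linear(dim, head_dim)   # first transformation matrix Q
        self.k_proj = nn.Linear(dim, head_dim)   # second transformation matrix K
        self.v_proj = nn.Linear(dim, head_dim)   # third transformation matrix V

    def forward(self, x):                        # x: [batch, seq_len, dim]
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        alpha = torch.softmax(scores, dim=-1)    # degrees of association alpha_i,j
        return alpha @ v                         # combined vectors C
```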
  • the transformer layer may include a feedforward layer, wherein the feedforward layer includes an input layer, an intermediate layer, and an output layer.
  • a neural network model can contain multiple transformer layers.
  • the above multiple transformer layers may be stacked and connected in a residual network manner.
•   the encoder includes an attention head, and since the unmasked data units in the first data sequence are mutually visible, there is an attention association between any two of the embedding vectors when the embedding vectors are processed. Specifically, attention information can be obtained, where the attention information is used to indicate that when the attention head processes the embedding vectors, there is an attention association between any two of the embedding vectors; then, according to the attention information, the embedding vectors can be processed through the encoder, so that each output vector has a dependency relationship with each input embedding vector (this is the so-called bidirectional encoder with visible context information).
•   an output layer similar to that of the decoder can be added on the output side of the encoder; for example, it can be composed of a fully connected layer and a softmax normalization layer, and it is used to predict the masked data units in the first data sequence.
  • the masked data unit in the first data sequence may be predicted through the output layer of the encoder in the PLM to obtain a second predicted data unit; and according to The difference between the second predicted data unit and the data unit before being masked in the first data sequence is used to update the encoder.
  • the fully connected network at the output layer of the encoder can map the output of the encoder to a fixed dimension (the dimension of the vocabulary size), and then use the softmax normalization function to obtain the probability of the occurrence of the target word at each position.
  • the target word here may be a masked data unit (eg, the second predicted data unit) in the first data sequence.
  • the accuracy of the model's prediction on the current data is calculated by calculating the logarithmic likelihood (logarithm of the probability) of the position corresponding to the target word.
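•   A minimal sketch of this output-layer prediction and its masked log-likelihood objective (shapes and names are illustrative assumptions) is given below:

```python
# Illustrative sketch only: map hidden states to vocabulary-size logits,
# take log-softmax, and average the negative log-likelihood over masked positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

output_layer = nn.Linear(512, 32000)   # fully connected layer to the vocabulary size

def masked_prediction_loss(hidden, target_ids, is_masked):
    # hidden: [batch, seq, dim]; target_ids: [batch, seq]; is_masked: [batch, seq] (bool)
    logits = output_layer(hidden)
    log_probs = F.log_softmax(logits, dim=-1)             # probability of each word per position
    nll = F.nll_loss(log_probs.transpose(1, 2), target_ids, reduction="none")
    return (nll * is_masked).sum() / is_masked.sum().clamp(min=1)
```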
  • the encoder and the decoder can be pre-trained at the same time, effectively jointly training the two modules.
•   according to the first sub-data, the second sub-data and the hidden state, the decoder in the PLM and the output layer of the decoder are used to predict the data unit to be predicted, so as to obtain a first predicted data unit.
•   the first sub-data, the second sub-data and the hidden state can be used as the input of the decoder in the PLM; that is, when the data unit to be predicted is predicted, the context information of the data unit to be predicted is visible, so the first predicted data unit obtained by predicting the data unit to be predicted has a dependency relationship with the first sub-data, the second sub-data and the hidden state.
•   conventionally, the decoder of a PLM is configured so that only the preceding (above) information is visible, that is, left-to-right autoregressive generation. With the launch of large pre-trained models such as generative pre-trained transformer 3 (GPT-3) and Pangu, the parameters of the models have become larger and larger, and the cost of pre-training has become higher and higher. If one pre-training can only adapt to a single downstream task, then a separate pre-training must be done at great cost for each generation strategy, which consumes too many resources.
•   in the embodiment of the present application, the masked self-attention module at the decoder layer is changed to a full self-attention module, so that the decoder can see the context information just like the encoder; it is therefore called a bidirectional decoder.
  • the second embedding vector can be input into the self-attention module of the PLM decoder, and the hidden state output by the encoder can be input into the cross-attention module, so that the decoder can learn richer semantic information.
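•   One decoder layer of such a bidirectional decoder can be sketched as follows (a simplified layer without layer normalization; sizes are illustrative assumptions):

```python
# Illustrative sketch only: bidirectional decoder layer with unmasked self-attention
# (both the preceding and the following context are visible) plus cross-attention
# over the encoder's hidden states.
import torch.nn as nn

class BidirectionalDecoderLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, y, encoder_hidden):
        # no causal attention mask: every position may attend to every other position
        y = y + self.self_attn(y, y, y, need_weights=False)[0]
        # cross-attention: queries from the decoder, keys/values from the encoder hidden state
        y = y + self.cross_attn(y, encoder_hidden, encoder_hidden, need_weights=False)[0]
        return y + self.ffn(y)
```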
•   before the sequence is input to the model, it goes through an embedding layer, which converts the second data sequence into continuous vectors of fixed dimension that are then input into the model for calculation (for details, refer to the description of the embedding layer on the encoder side; the similarities are not repeated here).
  • the fully connected network at the output layer of the decoder can map the output of the decoder to a fixed dimension (the dimension of the vocabulary size), and then use the softmax normalization function to obtain the probability of the occurrence of the target word at each position.
•   the target word here may be a masked data unit (for example, the first predicted data unit) in the second data sequence.
  • the accuracy of the model's prediction on the current data is calculated by calculating the logarithmic likelihood (logarithm of the probability) of the position corresponding to the target word.
  • each layer of the decoder may include a self-attention layer (self-attention), an encoder-decoder attention layer (encoder-decoder attention) and a feed-forward network layer (feed forward), wherein the above encoding - Decoding attention layers can also be described as cross-attention layers.
  • the self-attention layer of the decoder considers the influence of the data unit in the context on the currently decoded vector during the decoding process.
  • the encoding-decoding attention layer of the decoder considers the influence of the encoder's input on the currently decoded vector.
  • the feed-forward network layer of the decoder is to perform nonlinear transformation processing on the output vector of the encoding-decoding attention layer.
  • the output mapping layer (or simply called the output layer) can receive the decoding vector output by the last network layer of the decoder, and convert the decoding vector into a prediction result (such as the first prediction data unit), such as predicting the generation of a new word.
  • the loss can be determined based on the difference between the true value (data unit to be predicted) and the first predicted data unit, and the encoder and decoder can be updated based on the loss.
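•   A minimal sketch of one joint training step (module and key names are illustrative assumptions; the loss function could be the masked prediction loss sketched earlier) is shown below:

```python
# Illustrative sketch only: one joint update of encoder and decoder from the
# decoder-side loss plus the optional encoder-side prediction loss.
def train_step(encoder, decoder, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    hidden = encoder(batch["first_embedding"])               # encoder hidden states
    dec_out = decoder(batch["second_embedding"], hidden)     # bidirectional decoding
    loss = loss_fn(dec_out, batch["target_ids"], batch["target_mask"])
    if "source_mask" in batch:                               # optional encoder output prediction
        loss = loss + loss_fn(hidden, batch["source_ids"], batch["source_mask"])
    loss.backward()
    optimizer.step()                                         # updates both encoder and decoder
    return loss.item()
```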
•   An embodiment of the present application provides a model training method, the method comprising: obtaining a first embedding vector and a second embedding vector, where the first embedding vector corresponds to a first data sequence and the second embedding vector corresponds to a second data sequence, the second data sequence includes first sub-data, a masked data unit to be predicted and second sub-data, the first sub-data is located above the data unit to be predicted in the second data sequence, and the second sub-data is located below the data unit to be predicted in the second data sequence; obtaining a hidden state through the encoder in the pre-trained language model PLM according to the first embedding vector; predicting the data unit to be predicted through the decoder in the PLM and the output layer of the decoder according to the first sub-data, the second sub-data and the hidden state, so as to obtain a first predicted data unit; and updating the encoder and the decoder according to the difference between the first predicted data unit and the data unit to be predicted.
  • the pre-training architecture of the encoder and the two-way decoder is adopted, and the decoder can see the above and below information at the same time during the training process.
•   the PLM obtained by the training method in the embodiment of the present application can have good adaptability to other types of sequence generation tasks (autoregressive: from left to right, from right to left; non-autoregressive: completely non-autoregressive, semi-non-autoregressive, etc.).
  • the PLM obtained through the training method in the embodiment of the present application can achieve better model accuracy.
  • the solution of the embodiment of the present application may include two parts, offline pre-training model parameters, and offline fine-tuning on specific tasks and specific data sets.
  • Step 1 Get Input
  • Sentence pairs (X,Y) of source and target sequences are then constructed.
  • Data augmentation consists of two optional parts:
  • Word code conversion combined with mask operation According to the external knowledge Q, the set of aligned word pairs in the input sentence pair (X, Y) is obtained, and then a subset is randomly selected from the set with a certain probability. For each element in the subset, match it with the source sequence and do an alignment-based code conversion operation (ACS), match it with the target sequence and do an alignment-based mask operation (AM). Finally, a new input sequence (ACS(X), AM(Y)) and an output sequence (PRE1(X), PRE1(Y)) are obtained.
•   Dynamic dual mask operation: dynamically sample the mask probabilities β and θ of the source sequence and the target sequence respectively from two given intervals. Then mask the encoder input sequence ACS(X) with probability β to obtain the new encoder input sequence DM(ACS(X)) and output sequence PRE2(X). Perform a dynamic mask operation on the decoder input with probability θ to obtain a new input sequence DM(AM(Y)) and an output sequence PRE2(Y).
  • the pre-training of the sequence-to-sequence model can be completed on different data sets or different languages for different task categories according to user requirements.
  • Table 1 Statistical table of the number of labeled data in the pre-training set
  • the second data collection job is to acquire external knowledge Q.
  • Q includes external alignment knowledge and multilingual dictionaries for code conversion.
  • External alignment knowledge can be dictionaries, pre-trained word vectors, etc.
  • the third-party tool Fast-align is collected and used as the alignment knowledge base, and the MUSE tool is used to obtain multilingual dictionaries.
•   Combination of lexical code conversion and masking: use the alignment knowledge base Q to obtain the aligned word pair information in the bilingual sentence pair (X, Y), then use the multilingual dictionary to perform the lexical code conversion operation on the source sentence sequence based on the alignment information, and mask the corresponding positions of the target sequence to obtain the input and output of the encoder and decoder.
•   Dynamic dual masking: the mask probabilities β and θ are sampled from given intervals at the encoder and decoder, respectively. A dual mask operation is performed on the source sequence and the target sequence, ensuring that the mask ratio of the target sequence is greater than that of the source sequence. The final model predicts and outputs all the masked characters, and the new inputs and outputs of the encoder and decoder are obtained respectively.
  • the enhanced source sequence and target sequence are input into the encoder and decoder respectively, and then all masked characters are predicted at the output layer, and the model is trained using cross-entropy.
  • This example uses the commonly used left-to-right generation strategy to verify the effect of the pre-trained model on autoregressive tasks.
  • the task uses the standard Transformer structure. Fine-tuning requires removing the predictive output layer of the encoder, then training with a labeled dataset of a specific language pair, and selecting the best model based on the performance on the development set for testing.
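•   A minimal sketch of preparing such a fine-tuning model (checkpoint format and parameter names are illustrative assumptions) is shown below:

```python
# Illustrative sketch only: load pre-trained parameters and drop the encoder's
# predictive output layer, which is only needed during pre-training.
import torch

def build_finetune_model(checkpoint_path, model):
    state = torch.load(checkpoint_path, map_location="cpu")
    state = {k: v for k, v in state.items()
             if not k.startswith("encoder_output_layer")}    # assumed parameter prefix
    model.load_state_dict(state, strict=False)
    return model
```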
•   the performance of the comparison models is verified on the test sets of 13 translation tasks (covering low (<1M), medium (>1M and <10M), high (>10M and <25M), and extremely high resource (>25M) scenarios), and BLEU (↑) is used as the evaluation index of sequence generation (translation) quality.
•   Table 2 shows the performance of the model on autoregressive tasks.
  • Example 2 describes how the scheme uses unlabeled data for pre-training. In actual use, according to actual needs, either labeled data can be used for training, or unlabeled data can be used, or both can be used at the same time.
  • Embodiment 1 verifies that the embodiment of the present application uses the labeled data set for pre-training. Due to the unified pre-training scheme proposed in the embodiment of this application, it can fine-tune autoregressive and non-autoregressive tasks after only one pre-training. In the autoregressive task, the performance of the model in 13 translation directions shows that the scheme can achieve better results than existing pre-training methods. As shown in Table 2, compared with direct training (without using pre-training parameter initialization), our scheme has improved 2.3-14.4 BLEU in each translation direction from low-resource to extremely high-resource scenarios, and the average improvement has reached 7.9 BLEU.
•   the present invention is the first non-autoregressive pre-training scheme, and it can be seen that the model has also achieved better performance in the six translation directions; compared to direct training, the average boost is 2.5 BLEU.
  • the second embodiment supplementally verifies the effect of the embodiment of the present application when using the unlabeled data set for pre-training and then fine-tuning the autoregressive task.
•   as shown in Table 5, when only unlabeled (monolingual) data sets are used, the model improves by an average of 2.9 BLEU, which is 0.9 BLEU lower than the effect of using only labeled data; when labeled data and unlabeled data are used at the same time, the model improves by an average of 4.3 BLEU in 4 directions.
•   the unified pre-training framework included in the embodiment of the present application can use both labeled data and unlabeled data for pre-training, and the saved model parameters can be used to initialize autoregressive tasks (including left-to-right generation and right-to-left generation) as well as non-autoregressive tasks.
•   the word code conversion combined with mask operation based on the external knowledge base, the dynamic dual mask, and the encoder output prediction operation contained in the embodiment of the present invention all have a positive impact on translation performance; Table 6 shows the validation of the performance impact of these three operations in autoregressive translation tasks.
  • FIG. 10 is a schematic structural diagram of a model training device 1000 provided in the embodiment of the present application.
  • the model training device 1000 may be a terminal device or a server, and the model training device 1000 may include:
  • An acquisition module 1001 configured to acquire a first embedding vector and a second embedding vector; the first embedding vector corresponds to a first data sequence, the second embedding vector corresponds to a second data sequence, and the second data sequence includes The first sub-data, the masked data unit to be predicted and the second sub-data, the first sub-data is located above the data unit to be predicted in the second data sequence, the second sub-data located below the to-be-predicted data unit in the second data sequence;
•   for the specific description of the acquiring module 1001, reference may be made to the description of step 601 in the above-mentioned embodiment, and details are not repeated here.
  • the encoding module 1002 is used to obtain the hidden state through the encoder in the pre-trained language model PLM according to the first embedding vector;
  • a decoding module 1003 configured to, according to the first sub-data, the second sub-data and the hidden state, use the decoder in the PLM and the output layer of the decoder to process the data unit to be predicted performing a prediction to obtain a first prediction data unit;
  • a training module 1004 configured to update the encoder and the decoder according to the difference between the first predicted data unit and the data unit to be predicted.
•   for a specific description of the training module 1004, reference may be made to the description of step 604 in the above-mentioned embodiment, and details are not repeated here.
  • the acquisition module is also used to:
•   by means of probability sampling, it is determined whether at least one data unit in the first initial data sequence is masked to obtain the second data sequence, wherein the probability value obtained by the probability sampling is used as the probability that the at least one data unit is masked.
  • the acquisition module is also used to:
•   by means of probability sampling, it is determined whether at least one data unit in the second initial data sequence is masked to obtain the first data sequence; wherein, when performing the probability sampling, the probability that a data unit in the first initial data sequence is masked is greater than the probability that a data unit in the second initial data sequence is masked.
  • the encoding module is also used for:
  • the training module is further configured to: update the encoder according to the difference between the second predicted data unit and the data unit before being masked in the first data sequence.
  • the PLM is used to realize the sequence conversion task between texts of different language types
  • the acquisition module is also used to:
•   the second data sequence and the third initial data sequence are texts with the same semantics expressed in different language types, and the semantics of the first data unit in the third initial data sequence and of the data unit to be predicted are the same;
  • the acquisition module is also used to:
•   the first sub-data or the second sub-data includes unmasked data units, and the second embedding vector includes the semantic information of the unmasked data units and a positional relationship between the unmasked data units and other data units in said second data sequence; or,
  • the first sub-data or the second sub-data includes a masked data unit, and the second embedding vector includes the relationship between the masked data unit and other data units in the second data sequence positional relationship; or,
  • the second embedding vector includes a positional relationship between the data unit to be predicted and other data units in the second data sequence.
  • the first data sequence before the mask operation and the second data sequence before the mask operation are the same data sequence; or,
  • the first data sequence before the mask operation and the second data sequence before the mask operation are different data sequences marked by samples.
  • the first data sequence and the second data sequence are text data.
•   FIG. 11 is a schematic structural diagram of the execution device provided by the embodiment of the present application. The execution device may specifically be a terminal device such as a mobile phone, tablet, laptop, smart wearable device, monitoring data processing device or server, etc., which is not limited here.
  • the execution device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (the number of processors 1103 in the execution device 1100 may be one or more, and one processor is taken as an example in FIG. 11 ) , where the processor 1103 may include an application processor 11031 and a communication processor 11032 .
  • the receiver 1101 , the transmitter 1102 , the processor 1103 and the memory 1104 may be connected through a bus or in other ways.
  • the memory 1104 may include read-only memory and random-access memory, and provides instructions and data to the processor 1103 .
  • a part of the memory 1104 may also include a non-volatile random access memory (non-volatile random access memory, NVRAM).
•   the memory 1104 stores operating instructions executable by the processor, executable modules or data structures, or a subset or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1103 controls the operations of the execution device.
  • various components of the execution device are coupled together through a bus system, where the bus system may include not only a data bus, but also a power bus, a control bus, and a status signal bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1103 or implemented by the processor 1103 .
  • the processor 1103 may be an integrated circuit chip and has a signal processing capability. In the implementation process, each step of the above method may be implemented by an integrated logic circuit of hardware in the processor 1103 or instructions in the form of software.
  • the above-mentioned processor 1103 may be a general-purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, and may further include an application specific integrated circuit (application specific integrated circuit, ASIC), Field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • the processor 1103 may implement or execute various methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
•   the software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1104, and the processor 1103 reads the information in the memory 1104, and completes the steps of the above method in combination with its hardware.
  • the receiver 1101 can be used to receive input digital or character information, and generate signal input related to performing device related settings and function control.
  • the transmitter 1102 can be used to output digital or character information through the first interface; the transmitter 1102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1102 can also include a display device such as a display screen .
  • the embodiment of the present application also provides a training device. Please refer to FIG. 12.
•   FIG. 12 is a schematic structural diagram of the training device provided by the embodiment of the present application. The training device 1200 may vary considerably due to different configurations or performance, and may include one or more central processing units (central processing units, CPU) 1212 (for example, one or more processors), memory 1232, and one or more storage media 1230 (such as one or more mass storage devices) storing application programs 1242 or data 1244.
  • the memory 1232 and the storage medium 1230 may be temporary storage or persistent storage.
  • the program stored in the storage medium 1230 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device.
  • the central processing unit 1212 may be configured to communicate with the storage medium 1230 , and execute a series of instruction operations in the storage medium 1230 on the training device 1200 .
  • the training device 1200 can also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input and output interfaces 1258; or, one or more operating systems 1241, such as Windows ServerTM, Mac OS XTM , UnixTM, LinuxTM, FreeBSDTM and so on.
  • the central processing unit 1212 is configured to execute the model training method described in the embodiment corresponding to FIG. 6a.
  • the embodiment of the present application also provides a computer program product, which, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or enables the computer to perform the steps performed by the aforementioned training device.
  • An embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a program for signal processing, and when it is run on a computer, the computer executes the steps performed by the aforementioned executing device , or, causing the computer to perform the steps performed by the aforementioned training device.
  • the execution device, training device or terminal device provided in the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits etc.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the execution device executes the model training method described in the above embodiment, or the chip in the training device executes the model training method described in the above embodiment.
  • the storage unit is a storage unit in the chip, such as a register, Cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (read-only memory, ROM) or other types that can store static information and instructions static storage device, random access memory (random access memory, RAM), etc.
  • FIG. 13 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the model training method described in the embodiment corresponding to FIG. 6a above can be implemented in the chip shown in FIG. 13 .
  • the chip can be represented as a neural network processor NPU 1300, and the NPU 1300 is mounted on the main CPU (Host CPU) as a coprocessor, and the tasks are assigned by the Host CPU.
  • the core part of the NPU is the operation circuit 1303, and the controller 1304 controls the operation circuit 1303 to extract data in the memory (weight memory or input memory) and perform operations.
  • model training method described in the above embodiment corresponding to FIG. 6a may be completed by the cooperation of the main CPU and the NPU in the chip shown in FIG. 13 .
  • the operation circuit 1303 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 1303 is a two-dimensional systolic array.
  • the arithmetic circuit 1303 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1303 is a general-purpose matrix processor.
  • for example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 1302 and caches it on each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 1301 and performs a matrix operation on it with matrix B; the partial or final results of the matrix are stored in the accumulator 1308.
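Purely as an illustration of the data flow just described (matrix B cached from the weight memory, matrix A streamed from the input memory, partial results collected in the accumulator), the following NumPy sketch reproduces the computation in software; the variable names and the per-step granularity are assumptions for illustration, not part of the NPU 1300 design.

```python
import numpy as np

def npu_matmul_sketch(A, B):
    """Illustrative only: B is 'cached' once (weight memory role), columns of A
    are streamed in one step at a time (input memory role), and partial results
    are summed in an accumulator before being written out as the final matrix."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    accumulator = np.zeros((M, N))      # plays the role of accumulator 1308
    cached_B = B.copy()                 # plays the role of weight memory 1302
    for k in range(K):                  # one partial product per step
        accumulator += np.outer(A[:, k], cached_B[k, :])
    return accumulator                  # final result of C = A @ B

A = np.random.rand(4, 8)
B = np.random.rand(8, 3)
assert np.allclose(npu_matmul_sketch(A, B), A @ B)
```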
  • the unified memory 1306 is used to store input data and output data.
  • the weight data is transferred to the weight memory 1302 through the direct memory access controller (DMAC) 1305.
  • Input data is also transferred to unified memory 1306 by DMAC.
  • the BIU is the bus interface unit 1310, which is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1309.
  • the bus interface unit 1310 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1309 to obtain instructions from the external memory, and is also used for the storage unit access controller 1305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to move the input data in the external memory DDR to the unified memory 1306, to move the weight data to the weight memory 1302, or to move the input data to the input memory 1301.
  • the vector calculation unit 1307 includes a plurality of operation processing units and, when necessary, further processes the output of the operation circuit, for example through vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and so on. It is mainly used for non-convolutional/fully connected layer computations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1307 can store the processed output vector to the unified memory 1306.
  • the vector calculation unit 1307 can apply a linear function or a nonlinear function to the output of the operation circuit 1303, for example performing linear interpolation on the feature plane extracted by a convolutional layer, or applying it to a vector of accumulated values to generate activation values.
  • the vector computation unit 1307 generates normalized values, pixel-level summed values, or both.
  • the processed output vector can be used as an activation input to the operation circuit 1303, for example for use in a subsequent layer of the neural network.
  • An instruction fetch buffer 1309 connected to the controller 1304 is used to store instructions used by the controller 1304.
  • the unified memory 1306, the input memory 1301, the weight memory 1302 and the fetch memory 1309 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned above can be a general-purpose central processing unit, microprocessor, ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
  • the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
  • the essence of the technical solution of this application, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk or optical disk of a computer, and includes several instructions that cause a computer device (which can be a personal computer, training device, or network device, etc.) to execute the methods described in the various embodiments of the present application.
  • in the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware or any combination thereof.
  • when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can store, or a data storage device, such as a training device or a data center, that integrates one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, DVD), or a semiconductor medium (for example, a solid state disk (Solid State Disk, SSD)) and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

本申请可以应用于人工智能领域,具体公开了一种模型训练方法,方法包括:获取用于输入到预训练语言模型中解码器的第二嵌入向量,第二嵌入向量对应于第二数据序列,第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,第一子数据在第二数据序列中位于待预测数据单元的上文,第二子数据在第二数据序列中位于待预测数据单元的下文,根据第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;根据第一子数据、第二子数据以及隐状态,通过PLM中的解码器以及解码器的输出层,对待预测数据单元进行预测。本申请不需要针对于每个类型的序列生成任务都预训练一个对应的PLM,大大降低了训练PLM所需的资源。

Description

一种模型训练方法及相关设备
本申请要求于2022年2月22日提交中国专利局、申请号为202210164992.6、发明名称为“一种模型训练方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种模型训练方法及相关设备。
背景技术
人工智能(artificialintelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
序列到序列的自然语言生成是自然语言处理任务中非常重要的一个方向,常采用编码器-解码器的设计框架。根据生成序列方式的不同,可以将序列生成任务分为自回归的生成和非自回归(并行)生成。自回归生成是指在生成目标序列的过程中,首先预测生成目标序列的第一个字符,然后根据已经生成的子序列一步一步的预测整个目标序列。非自回归生成则是指在解码时并行的生成完整的目标序列,不再需要一步一步的迭代过程,从而极大的缩减了生成目标序列所等待的时间。对于翻译、对话等对实时性要求较高的任务来说,非自回归的生成变得越发重要。
在序列生成任务当中,“预训练-微调”是提升模型性能的标准范式。但针对序列到序列的生成任务,已有的预训练方案都只关注自左向右的自回归生成,也就是在预训练的过程中都只能看到数据序列的上文信息,所以在微调下游任务时,无法拓展到其他的解码策略。随着生成型预训练变换模型3(generative pre-trained transformer 3,GPT-3)、盘古等预训练大模型的推出,模型的参数变得越来越大,预训练的成本也越来越高。如果一次预训练仅能适应单一的下游任务,那针对每种生成策略都需要以较大的代价做一次预训练,这将会耗费掉过多的资源。
发明内容
本申请提供了一种模型训练方法,不需要针对于每个类型的序列生成任务都预训练一个对应的PLM,大大降低了训练PLM所需的资源(例如计算资源、存储资源、时间资源等)。
第一方面,本申请提供了一种模型训练方法,所述方法包括:
获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数据单 元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元;根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器和所述解码器。
通过上述方式,采用编码器和双向解码器的预训练架构,训练的过程中解码器能够同时看到上文和下文的信息。由于其他类型的序列生成任务(自回归:自左向右、自右向左;非自回归:完全非自回归、半非自回归等)都可以相当于本申请实施例中PLM的一个子集,相当于通过本申请实施例中的训练方法得到的PLM可以具备良好的适应其他类型的序列生成任务(自回归:自左向右、自右向左;非自回归:完全非自回归、半非自回归等)的能力,也就是即使后续微调时采用的是其他类型的序列生成任务,通过本申请实施例中的训练方法得到的PLM都可以达到较好的模型精度。而不需要针对于每个类型的序列生成任务都预训练一个对应的PLM,大大降低了训练PLM所需的资源(例如计算资源、存储资源、时间资源等)。
在一种可能的实现中,所述方法还包括:获取第一初始数据序列;通过概率采样的方式,确定所述第一初始数据序列中的至少一个数据单元是否被掩码,以得到所述第二数据序列,其中所述概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率。
在一种可能的实现中,所述方法还包括:获取第二初始数据序列;通过概率采样的方式,确定所述第二初始数据序列中的至少一个数据单元是否被掩码,以得到所述第一数据序列;其中,在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率。
在一种可能的实现中,可以采用动态采样掩码概率,其中,动态是指数据序列中每个数据单元被掩码的概率是动态的。
在一种可能的实现中,可以在某个概率区间内针对于第二初始数据序列中的至少一个数据单元(例如,至少一个数据单元可以为第二初始数据序列中的全部数据单元)中的每个数据单元采样得到一个概率值,该概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率,例如,可以基于该概率值和其他概率区间采样得到的另一个概率值进行比较,来确定是否对数据单元进行掩码操作。
其中，基于第一数据序列生成的嵌入向量可以作为PLM中编码器的输入，基于第二数据序列生成的嵌入向量可以作为PLM中解码器的输入，上述分别对第一数据序列和第二数据序列的掩码操作可以称之为动态的对偶掩码操作，通过动态的对偶掩码操作，可以将编码器的输入和解码器的输入分别掩码，可以在后续的训练过程中同时完成对编码器和解码器的预训练。此外，动态的采样掩码概率可以防止掩码概率太大导致模型训练时整个批次的有效信息都比较少。
在一种可能的实现中,在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率。通过设置动态掩码操作中,保证解码器的掩码比例高于编码器,使解码器预测时可以充分的从编码器端获取信息, 进而提升训练后的预训练模型的模型精度。
在一种可能的实现中,PLM用于实现目标任务,所述第一数据序列(可选的,掩码前的第一数据序列)可以为执行目标任务前的原始数据,所述第二数据序列(可选的,掩码前的第二数据序列)可以为执行目标任务后的目标数据。目标任务可以为翻译任务、自然语言生成任务等。第一数据序列和第二数据序列可以组成一个训练样本,PLM需要基于第一数据序列来生成第二数据序列。
以目标任务为翻译任务为例,所述第一数据序列和所述第二数据序列为语义相同且通过不同语言类型表达的数据。
在一种可能的实现中,PLM可以用于实现文本的摘要生成任务,则原始数据可以为需要提取摘要的源语料,目标数据可以为需要生成的摘要文本。
在一种可能的实现中,PLM可以用于实现文本答复任务,则原始数据可以为需要答复的源语料,目标数据可以为针对于源语料的答复内容。
在一种可能的实现中,所述方法还包括:通过所述PLM中所述编码器的输出层,对所述第一数据序列中被掩码后的数据单元进行预测,以得到第二预测数据单元;根据所述第二预测数据单元和所述第一数据序列中被掩码前的数据单元之间的差异,更新所述编码器。
在一种可能的实现中,可以在编码器的输出侧增加一个和解码器类似的输出层,例如可以由一个全连接层和一个softmax归一化层构成,用于预测第一数据序列中被掩码的数据单元。
在一种可能的实现中,可以通过所述PLM中所述编码器的输出层,对所述第一数据序列中被掩码后的数据单元进行预测,以得到第二预测数据单元;并根据所述第二预测数据单元和所述第一数据序列中被掩码前的数据单元之间的差异,更新所述编码器。
在编码器输出层的全连接网络可以将编码器的输出映射到固定的维度(词表大小的维度),然后使用softmax归一化函数得到每个位置目标词出现的概率。这里的目标词可以为第一数据序列中被掩码的数据单元(例如第二预测数据单元)。在训练时,通过计算目标词所对应位置的对数似然(概率取对数)来计算模型在当前数据上的预测准确的程度。
通过上述方式,在对PLM进行训练时,可以同时对编码器和解码器进行预训练,有效的对两个模块进行了联合训练。
在一种可能的实现中,所述PLM可以用于实现不同语言类型的文本之间的序列转换任务(例如翻译任务),针对于原始的源语料(例如本申请实施例中的第三初始数据序列),可以对其中的部分数据单元进行单元替换(替换为语义相同且通过另一种语言类型表达的数据单元),进而可以提升PLM对于多种语言之间的序列转换精度。
在一种可能的实现中,可以获取第三初始数据序列;所述第二数据序列和所述第三初始数据序列为语义相同的文本、且通过不同语言类型表达,所述第三初始数据序列中的第一数据单元和所述待预测数据单元的语义相同;将所述第三初始数据序列中的所述第一数据单元替换为第二数据单元,以得到所述第一数据序列;其中所述第二数据单元和所述第 一数据单元的语义相同、且通过不同语言类型表达。
其中,这里的语义相同可以理解为可以表达相同或类似的语义,由于不同语言类型的语法、语言环境并不限定,因为本申请实施例中的语义相同并不限定为语义的完全一致。
其中,除了对第三初始数据序列中的所述第一数据单元替换为第二数据单元,还可以对第三初始数据序列进行其他处理(例如掩码操作、其他数据单元的操作),本申请实施例并不限定第一数据序列仅仅是通过对第三初始数据序列中的第一数据单元替换为第二数据单元得到的。
在一种可能的实现中,第一数据单元可以为从第三初始数据序列中随机选择的,例如,针对于第三初始数据序列中的任意数据单元都有可能被选择作为上述的第一数据单元。可以在语言库中检索得到与该第一数据单元语义相同或相似且通过不同语言类型表达的数据单元作为第二数据单元,并用第二数据单元替换第三初始数据序列中的第一数据单元,以得到第一数据序列。
在一种可能的实现中,第二数据单元和第一初始数据序列也是通过不同的语言类型表达的,也就是说第一初始数据序列、第二初始数据序列和第二数据单元中任意两个之间的语言类型都是不同的。
在一种可能的实现中,所述方法还包括:获取第四初始数据序列;对所述第四初始数据序列中和所述第一数据单元语义相同的数据单元进行掩码,以得到所述第二数据序列。
在一种可能的实现中,在对第二初始数据序列中的第一数据单元替换为第二数据单元之后,可以让第一初始数据单元中和第一数据单元语义相同的数据单元进行掩码操作,由于掩码操作后的数据单元需要在PLM的训练过程中被预测,因此,可以让PLM具备更丰富的语言类型的文本理解能力。
在一种可能的实现中,可以获取第四初始数据序列;对所述第四初始数据序列中和所述第一数据单元语义相同的数据单元(例如可以为本申请实施例中的待预测数据单元)进行掩码,以得到所述第二数据序列。通过上述方式,引入外部知识将词码转换和掩码操作结合起来,更加充分的训练了模型的语义表示能力。
在一种可能的实现中,
所述第一子数据或者所述第二子数据包括未被掩码的数据单元,所述第二嵌入向量包含所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
所述第一子数据或者所述第二子数据包括被掩码的数据单元,所述第二嵌入向量包含所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
所述第二嵌入向量包含所述待预测数据单元与所述第二数据序列中其他数据单元之间的位置关系。
在一种可能的实现中,进行掩码操作前的所述第一数据序列和进行掩码操作前的所述 第二数据序列为相同的数据序列;或者,
所述进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为经过样本标注的不同数据序列。
在一种可能的实现中,所述第一数据序列和所述第二数据序列为文本数据。
第二方面,本申请提供了一种模型训练装置,所述装置包括:
获取模块,用于获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数据单元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;
编码模块,用于根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;
解码模块,用于根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元;
训练模块,用于根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器和所述解码器。
在一种可能的实现中,所述获取模块,还用于:
获取第一初始数据序列;
通过概率采样的方式,确定所述第一初始数据序列中的至少一个数据单元是否被掩码,以得到所述第二数据序列,其中所述概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率。
在一种可能的实现中,所述获取模块,还用于:
获取第二初始数据序列;
通过概率采样的方式,确定所述第二初始数据序列中的至少一个数据单元是否被掩码,以得到所述第一数据序列;其中,在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率。
在一种可能的实现中,所述编码模块,还用于:
通过所述PLM中所述编码器的输出层,对所述第一数据序列中被掩码后的数据单元进行预测,以得到第二预测数据单元;
所述训练模块,还用于:根据所述第二预测数据单元和所述第一数据序列中被掩码前的数据单元之间的差异,更新所述编码器。
在一种可能的实现中,所述PLM用于实现不同语言类型的文本之间的序列转换任务,所述获取模块,还用于:
获取第三初始数据序列;所述第二数据序列和所述第三初始数据序列为语义相同的文本、且通过不同语言类型表达,所述第三初始数据序列中的第一数据单元和所述待预测数据单元的语义相同;
将所述第三初始数据序列中的所述第一数据单元替换为第二数据单元,以得到所述第一数据序列;其中所述第二数据单元和所述第一数据单元的语义相同、且通过不同语言类型表达。
在一种可能的实现中,所述获取模块,还用于:
获取第四初始数据序列;
对所述第四初始数据序列中和所述第一数据单元语义相同的数据单元进行掩码,以得到所述第二数据序列。
在一种可能的实现中,
所述第一子数据或者所述第二子数据包括未被掩码的数据单元,所述第二嵌入向量包含所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
所述第一子数据或者所述第二子数据包括被掩码的数据单元,所述第二嵌入向量包含所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
所述第二嵌入向量包含所述待预测数据单元与所述第二数据序列中其他数据单元之间的位置关系。
在一种可能的实现中,进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为相同的数据序列;或者,
所述进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为经过样本标注的不同数据序列。
在一种可能的实现中,所述第一数据序列和所述第二数据序列为文本数据。
第三方面,本申请实施例提供了一种训练设备,可以包括存储器、处理器以及总线系统,其中,存储器用于存储程序,处理器用于执行存储器中的程序,以执行如上述第一方面及其任一可选的方法。
第四方面,本申请实施例还提供了一种数据处理方法,其特征在于,所述方法包括:
获取如第一方面所述方法得到的更新后的PLM以及待处理数据;其中,更新后的PLM可以包括更新后的编码器以及更新后的解码器;通过所述更新后的PLM,处理所述待处理 数据,以得到处理结果。
待处理数据可以为文本数据,具体可以参照上述实施例中第一方面关于数据序列的描述。
第五方面,本申请实施例还提供了一种数据处理装置,其特征在于,所述装置用于:获取如第一方面所述方法得到的更新后的PLM以及待处理数据;通过所述更新后的PLM,处理所述待处理数据,以得到处理结果。
第六方面,本申请实施例提供了一种执行设备,可以包括存储器、处理器以及总线系统,其中,存储器用于存储程序,处理器用于执行存储器中的程序,以执行如上述第四方面及其任一可选的方法。
第七方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面及其任一可选的方法、或者如上述第四方面及其任一可选的方法。
第八方面,本申请实施例提供了一种计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面及其任一可选的方法、或者如上述第四方面及其任一可选的方法。
第九方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于支持训练设备实现上述方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据;或,信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存执行设备或训练设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
本申请实施例提供了一种模型训练方法,所述方法包括:获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数据单元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元;根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器和所述解码器。通过上述方式,采用编码器和双向解码器的预训练架构,训练的过程中解码器能够同时看到上文和下文的信息。由于其他类型的序列生成任务(自回归:自左向右、自右向左;非自回归:完全非自回归、半非自回归等)都可以相当于本申请实施例中PLM的一个子集,相当于通过本申请实施例中的训练方法得到的PLM可以具备良好的适应其他类型的序列生成任务(自回归:自左向右、自右向左;非自回归:完全非自回归、半非自回归等)的能力,也就是即使后续微调时采用的是其他类型的序列生成任务,通过本申请实施例中的训练方法得到的PLM都可以达到较好的模型精度。而不需要针对于每个类型的序列生成任务都预训练一个对应的PLM,大大降低了训练PLM所需的资源(例如计算资源、存储资源、时间资源等)。
附图说明
图1为人工智能主体框架的一种结构示意图;
图2为一种自然语言处理系统;
图3a为另一种自然语言处理系统;
图3b为一种系统的结构示意图;
图3c为一种自回归的示意图;
图3d为一种非自回归的示意图;
图3e为一种半非自回归的示意图;
图3f为一种翻译模型的示意图;
图4为本申请实施例提供的自然语言处理的相关设备的示意图;
图5为一种transformer层的架构示意;
图6a为本申请实施例提供的一种模型训练方法的实施例示意;
图6b为一种模型训练方法的实施例示意;
图7为本申请实施例中的一种神经网络模型的结构示意;
图8为一种transformer层的结构示意;
图9为一个注意力头head的操作示意图;
图10为本申请实施例提供的模型训练装置的一种结构示意图;
图11为本申请实施例提供的执行设备的一种结构示意图;
图12为本申请实施例提供的训练设备一种结构示意图;
图13为本申请实施例提供的芯片的一种结构示意图。
具体实施方式
下面结合本发明实施例中的附图对本发明实施例进行描述。本发明的实施方式部分使用的术语仅用于对本发明的具体实施例进行解释,而非旨在限定本发明。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
首先对人工智能系统总体工作流程进行描述,请参见图1,图1示出的为人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝 练过程。“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、智慧城市等。
本申请可以应用于人工智能领域的自然语言处理领域,下面以自然语言处理为例将对多个落地到产品的多个应用场景进行介绍。
为了更好地理解本申请实施例的方案,下面先结合图2至图3a对本申请实施例可能的应用场景进行简单的介绍。
图2示出了一种自然语言处理系统,该自然语言处理系统包括用户设备以及数据处理设备。其中,用户设备包括手机、个人电脑或者信息处理中心等智能终端。用户设备为自然语言数据处理的发起端,作为语言问答或者查询等请求的发起方,通常用户通过用户设备发起请求。
上述数据处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有数据处理功能的设备或服务器。数据处理设备通过交互接口接收来自智能终端的查询语句/ 语音/文本等,再通过存储数据的存储器以及数据处理的处理器环节进行机器学习,深度学习,搜索,推理,决策等方式的语言数据处理,并将处理结果反馈至用户设备。数据处理设备中的存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,数据库可以在数据处理设备上,也可以在其它网络服务器上。
在图2所示的自然语言处理系统中,用户设备可以接收用户的指令,例如用户设备可以接收用户输入的一段文本,然后向数据处理设备发起请求,使得数据处理设备针对用户设备得到的该一段文本执行自然语言处理应用(例如自然语言生成、文本分类、文本推理、命名实体识别、翻译等),从而得到针对该一段文本的对应的自然语言处理应用的处理结果(例如预测词结果、分类结果、推理结果、命名实体识别结果、翻译结果等)。
以自然语言生成为例,自然语言生成(natural language generation)也可以称之为文本预测任务或者自然语言合成任务,是指在给定一段文字的前提下,生成其中的缺失文本或者后续文本的任务。自然语言生成在搜索引擎,输入法等场景均有广泛应用,可以在用户输入部分文字的前提下预测用户接下来的输入,可以大大提高用户的使用该产品的效率,此外还可以对存在文字缺失的文本进行恢复。
示例性的,本申请实施例中,用户设备可以接收用户输入的一段文本数据,其中文本数据中包括已知词和待预测词,待预测词不可见,仅仅知晓待预测词在文本数据中的位置,然后用户设备可以向数据处理设备发起请求(请求中携带文本数据),使得数据处理设备对该文本数据中的待预测词进行预测,从而得到待预测词,并将待预测词反馈至用户设备。
示例性的,用户设备可以接收用户输入的一段文本数据,然后向数据处理设备发起请求,使得数据处理设备对该一段文本数据进行实体分类,从而得到针对该一段文本数据的实体分类结果,并将实体分类结果反馈至用户设备;
示例性的,用户设备可以接收用户输入的一段文本数据(文本数据为中文文本),然后向数据处理设备发起请求,使得数据处理设备将该一段文本数据翻译成英文,从而得到针对该一段文本数据的英文译文,并将英文译文反馈至用户设备。
图3a示出了另一种自然语言处理系统,在图3a中,用户设备直接作为数据处理设备,该用户设备能够直接接收来自用户的输入并直接由用户设备本身的硬件进行处理,具体过程与图2相似,可参考上面的描述,在此不再赘述。
图4是本申请实施例提供的自然语言处理的相关设备300的示意图。
上述图2和图3a中的用户设备具体可以是图4中的本地设备301或者本地设备302,图2中的数据处理设备具体可以是图4中的执行设备310,其中,数据存储系统350可以存储执行设备310的待处理数据,数据存储系统350可以集成在执行设备310上,也可以设置在云上或其它网络服务器上。
图2和图3a中的处理器可以通过神经网络模型或者其它模型进行数据训练/机器学习/深度学习,并利用数据最终训练或者学习得到的模型针对文本数据执行自然语言处理应用(例如自然语言生成、文本分类、序列标注、阅读理解、文本生成、文本推理、翻译等),从而得到相应的处理结果。
其中,对本申请实施例中的预训练语言模型进行微调后的高精度模型可以部署在数据 处理设备中,数据处理设备可以提供高精度模型处理文本数据,以得到上述自然语言处理应用的处理结果。
下面结合图3b对本申请实施例提供的系统架构进行详细的介绍。图3b为本申请一实施例提供的系统架构示意图。如图3b所示,系统架构500包括执行设备510、训练设备520、数据库530、客户设备540、数据存储系统550以及数据采集系统560。
执行设备510包括计算模块511、I/O接口512、预处理模块513和预处理模块514。计算模块511中可以包括目标模型/规则501,预处理模块513和预处理模块514是可选的。
数据采集设备560用于采集训练数据。
其中,在自然语言合成的任务中,训练数据可以为存在文本缺失的文本数据以及该存在文本缺失的文本数据对应的完整文本数据。
其中,在翻译任务中,训练数据可以包括但不限于平行语料、单语语料等。
平行语料,是指由原文文本及其平行对应的译语文本构成的双语或多语语料(也就是具有标注的文本数据),原文文本和译语文本具有相同的语义且文本单元之间具有对应关系。比如原文文本是“这次旅行需要认真计划”,与其平行对应的英文文本为“The trip needscareful planning”,则“这次旅行需要认真计划”和“The trip needs careful planning”可以看做一组平行语料,该组平行语料是中英平行语言对,可以将原文文本“这次旅行需要认真计划”看做该组平行语料的源语料,将译文文本“The trip needs careful planning”看做该组平行语料的目标语料。其中,旅行可以对应于trip。
此外,“这次旅行需要认真计划”可以看做一个单语语料,“The trip needs careful planning”也可以都看做一个单语语料。
在采集到训练数据之后,数据采集设备560将这些训练数据存入数据库530,训练设备520基于数据库530中维护的训练数据训练得到目标模型/规则501。
其中,训练设备520基于数据库530中维护的训练数据对本申请实施例中的预训练语言模型(pretrained language model,PLM)进行训练,得到目标模型/规则501。
其中,为了适配下游任务,训练设备520可以基于数据库530中维护的训练数据对训练好的预训练语言模型进行微调,得到目标模型/规则501。
应理解,上述训练预训练语言模型的训练设备520可以和对训练好的预训练语言模型进行微调的训练设备520可以为不同的设备。
需要说明的是,在实际应用中,数据库530中维护的训练数据不一定都来自于数据采集设备560的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备520也不一定完全基于数据库530维护的训练数据进行目标模型/规则501的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备520训练得到的目标模型/规则501可以应用于不同的系统或设备中,如应用于图3b所示的执行设备510,所述执行设备510可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备,车载终端等,还可以是服务器或者云端等。在图3b中,执行设备510配置输入/输出(input/output, I/O)接口512,用于与外部设备进行数据交互,用户可以通过客户设备540向I/O接口512输入数据。
预处理模块513和预处理模块514用于根据I/O接口512接收到的输入数据进行预处理(例如获取已知数据单元以及待预测数据单元在目标数据中的位置、或者生成注意力信息等预处理过程)。应理解,可以没有预处理模块513和预处理模块514或者只有的一个预处理模块。当不存在预处理模块513和预处理模块514时,可以直接采用计算模块511对输入数据进行处理。
在执行设备510对输入数据进行预处理,或者在执行设备510的计算模块511执行计算等相关的处理过程中,执行设备510可以调用数据存储系统550中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统550中。
最后,I/O接口512将处理结果呈现给客户设备540,从而提供给用户。
在图3b所示情况下,用户可以手动给定输入数据,该“手动给定输入数据”可以通过I/O接口512提供的界面进行操作。另一种情况下,客户设备540可以自动地向I/O接口512发送输入数据,如果要求客户设备540自动发送输入数据需要获得用户的授权,则用户可以在客户设备540中设置相应权限。用户可以在客户设备540查看执行设备510输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备540也可以作为数据采集端,采集如图所示输入I/O接口512的输入数据及输出I/O接口512的输出结果作为新的样本数据,并存入数据库530。当然,也可以不经过客户设备540进行采集,而是由I/O接口512直接将如图所示输入I/O接口512的输入数据及输出I/O接口512的输出结果,作为新的样本数据存入数据库530。
值得注意的是,图3b仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图3b中,数据存储系统550相对执行设备510是外部存储器,在其它情况下,也可以将数据存储系统550置于执行设备510中。
应理解上述执行设备510也可以部署于客户设备540中。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的，神经单元可以是指以xs(即输入数据)和截距1为输入的运算单元，该运算单元的输出可以为:

$$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$$

其中，s=1、2、……n，n为大于1的自然数，Ws为xs的权重，b为神经单元的偏置。f为神经单元的激活函数(activation functions)，用于将非线性特性引入神经网络中，来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入，激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络，即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连，来提取局部接受域的特征，局部接受域可以是由若干个神经单元组成的区域。
(2)transformer层
参照图5,图5为一种transformer层的架构示意,如图5所示,神经网络包括嵌入层和至少一个transformer层,至少一个transformer层可以为N个transformer层(N大于0的整数),其中,每个transformer层包括依次相邻的注意力层、加和与归一化(add&norm)层、前馈(feed forward)层和加和与归一化层。在嵌入层,对当前输入进行嵌入处理,得到多个嵌入向量;在所述注意力层,从所述第一transformer层的上一层获取P个输入向量,以P个输入向量中的任意的第一输入向量为中心,基于预设的注意力窗口范围内的各个输入向量与该第一输入向量之间的关联度,得到该第一输入向量对应的中间向量,如此确定出P个输入向量对应的P个中间向量;在所述池化层,将所述P个中间向量合并为Q个输出向量,其中transformer层中最后一个transformer层得到的多个输出向量用作所述当前输入的特征表示。
(3)注意力机制(attention mechanism)
注意力机制模仿了生物观察行为的内部过程，即一种将内部经验和外部感觉对齐从而增加部分区域的观察精细度的机制，能够利用有限的注意力资源从大量信息中快速筛选出高价值信息。注意力机制可以快速提取稀疏数据的重要特征，因而被广泛用于自然语言处理任务，特别是机器翻译。而自注意力机制(self-attention mechanism)是注意力机制的改进，其减少了对外部信息的依赖，更擅长捕捉数据或特征的内部相关性。注意力机制的本质思想可以改写为如下公式:

$$\mathrm{Attention}(Query,Source)=\sum_{i=1}^{L_x}\mathrm{Similarity}(Query,Key_i)\cdot Value_i$$
其中,Lx=||Source||代表Source的长度,公式含义即将Source中的构成元素想象成是由一系列的数据对构成,此时给定目标Target中的某个元素Query,通过计算Query和各个Key的相似性或者相关性,得到每个Key对应Value的权重系数,然后对Value进行加权求和,即得到了最终的Attention数值。所以本质上Attention机制是对Source中元素的Value值进行加权求和,而Query和Key用来计算对应Value的权重系数。从概念上理解,把Attention可以理解为从大量信息中有选择地筛选出少量重要信息并聚焦到这些重要信息上,忽略大多不重要的信息。聚焦的过程体现在权重系数的计算上,权重越大越聚焦于其对应的Value值上,即权重代表了信息的重要性,而Value是其对应的信息。自注意力机制可以理解为内部Attention(intra attention),Attention机制发生在Target的元素Query和Source中的所有元素之间,自注意力机制指的是在Source内部元素之间或者Target内部元素之间发生的Attention机制,也可以理解为Target=Source这种特殊情况下的注意力计算机制,其具体计算过程是一样的,只是计算对象发生了变化而已。
(4)自然语言处理(natural language processing,NLP)
自然语言(natural language)即人类语言，自然语言处理(NLP)就是对人类语言的处理。自然语言处理是以一种智能与高效的方式，对文本数据进行系统化分析、理解与信息提取的过程。通过使用NLP及其组件，我们可以管理非常大块的文本数据，或者执行大量的自动化任务，并且解决各式各样的问题，如自动摘要(automatic summarization)，机器翻译(machine translation，MT)，命名实体识别(named entity recognition，NER)，关系提取(relation extraction，RE)，信息抽取(information extraction，IE)，情感分析，语音识别(speech recognition)，问答系统(question answering)以及主题分割等等。
(5)预训练语言模型(pre-trained language model)
预训练语言模型是一个自然语言序列编码器,为自然语言序列中的每个词进行编码成为一个向量表示,从而进行预测任务。它的训练包含两个阶段。在预训练(pre-training)阶段,该模型在大规模无监督文本上进行语言模型任务的训练,从而学习到一个词表示。在微调(finetuning)阶段,该模型利用预训练阶段学到的参数做初始化,在文本分类(text classification),序列标注(sequence labeling)等下游任务(downstream task)上进行较少步骤的训练,就可以成功把预训练得到的语义信息成功迁移到下游任务上来。
(6)序列到序列的自然语言生成
序列到序列的自然语言生成是自然语言处理任务中非常重要的一个方向,常采用编码器-解码器的设计框架。给定一个训练实例(X,Y),其中X为源序列的句子,Y为目标序列。在训练过程中将X输入到编码器并产生一组向量表示z,然后通过交叉注意力模块将表示z输入到解码器中,并在解码器端解码生成目标序列Y。根据生成目标序列方式的不同,可以将序列生成任务分为自回归的生成和非自回归(并行)生成。自回归生成是指在生成目标序列的过程中,首先预测生成目标序列的第一个字符,然后根据已经生成的子序列一步一步的预测整个目标序列。非自回归生成则是指在解码时并行的生成完整的目标序列,不再需要一步一步的迭代过程,从而极大的缩减了生成目标序列所等待的时间。对于翻译、对话等对实时性要求较高的任务来说,非自回归的生成变得越发重要。下面分别对两种生成方式进行介绍。
(7)自回归生成策略
自回归生成是指在生成过程中逐字的预测目标序列，是当前最常用也是最好的序列生成策略，常见的机器翻译、摘要生成等任务一般都会采用这种自回归的生成方式。在预测目标序列时，自回归的生成策略使模型只能看到该时间步之前已经生成的内容，而看不到当前时间步生成的以及之后时间步的信息。例如在英-德翻译任务当中，给定源语言句子(英文)：“who are you”，机器翻译模型要生成对应的目标序列：“Wer bist du”。自回归机器翻译模型在第一个时间步根据源序列预测输出第一个字符“Wer”，然后在第二个时间步将上一步生成的字符(“Wer”)输入到解码器中，并预测下一个时间步的输出“bist”。如图3c所示，模型生成“bist”(黑框显示)时，只能看到上文信息“<s>Wer”，“<s>”为起始符号表示符，“[M]”表示相应位置的字符被掩码所以不看见。一步一步的重复上述过程，直到解码生成完整的目标序列(德语：“Wer bist du</s>”)，“</s>”为句子结束标识符，表示模型不再生成新的内容。从这个例子可以看出，每个时间步的预测都会以上一个时间步的输出作为输入，通过迭代生成K次得到完整的目标序列，K为目标序列的长度，这里K=4。为了完成自回归的生成策略，解码器使用了一个自注意力掩码矩阵来产生字符“[M]”，用于保证模型在当前时间步的预测时只能看到上文信息，无法看到下文的信息。
图3c中的生成方式是最常见的一种自左向右的序列生成策略,在实际使用的过程中,也可能会使用自右向左的生成策略。例如在歌词翻译,诗歌翻译中常采用自右向左的生成策略,模型会优先生成序列中的最后一个字符,然后依次往前生成直到输出第一个字符,从而可以达到押韵且通顺的效果。自右向左的生成也可以理解为倒序生成,这时模型就只能看到下文的信息,而不能看到上文信息。
(8)完全非自回归生成策略
完全非自回归生成也称为并行生成(本申请实施例中可以简称为非自回归生成),是指目标序列的生成过程是并行的。这种生成策略一次解码就可以得到完整的目标序列,而不需要逐字的预测。在预测目标序列时,非自回归的生成不需要看到任何的上下文信息,只需要根据源句子序列即编码器端的输出,在解码器端经过一次解码来得到目标序列。这就意味着相比于自回归的生成,非自回归的生成可以极大的缩减解码的延时,所以逐渐成为序列生成任务的一个重要研究方向。如图3d所示,模型根据源序列句子“who are you”,可以直接解码得到目标序列“Wer bist du”(黑框,其中“[M]”表示相应位置的字符被掩码掉),我们也称之为完全非自回归生成。
尽管非自回归的序列生成要比自回归的策略更快,但由于其在训练过程中无法看到任何上下文信息,所以质量相比于自回归的生成策略有一定的差距。为了提升非自回归生成的目标序列的质量,在训练时常会随机的保留部分上下文信息,使得解码器可以根据部分已知信息来补齐完整的目标序列。在测试时,通过迭代固定的次数来达到接近自回归的效果。如图3e所示,当前时间步模型根据已知信息“bist”,预测第一个字符和最后一个字符(黑框)。预测第一个字符“Wer”时,模型需要看到下文信息“bist”,而预测最后一个字符时,模型需要看到上文信息“bist”,所以在当前时间步,非自回归解码器需要同时看到上文和下文信息,也称之为半非自回归生成。
(9)神经机器翻译(neural machine translation,NMT)模型
参阅图3f,为一种主流的NMT架构:Transformer框架。以Transformer框架为例,对NMT模型的工作流程进行说明。Transformer框架主要包括编码器、解码器。编码器和解码器包括多个层,编码器/解码器的每一层是由一些编码单元/解码单元构成。其中编码器各层把源语句对应的词向量(或者称之为词嵌入向量)经过一系列的神经网络的变换之后,表示成一个高维的向量(或者称之为隐状态(hidden state))。解码器各层负责把这个高维向量再重新解码(翻译)成目标语言。
其中,可以通过编码器的词向量参数获取源语句对应的词向量,编码器的词向量参数的集合可以看做一个参数矩阵。可以通过一个词表来包含源语言可能的词,编码器的词向量参数矩阵中包含该词表中的每个词的词向量,词向量参数矩阵的维度可以是[词向量维度,词表大小],其中的词表大小即词表中包含的词的数量。在一些场景下,输入至NMT模型源语句中的词可能不存在于词表中,对于这样的词可以通过固定的词向量对其进行表示。编码器的每层可以包括自注意力层(self-attention)和前馈网络层(feed forward)。其中,编码器的自注意力层是为了在编码每个词向量时,将源语句中各个词的词向量的权重(各个词对于当前编码的词向量的影响力)都考虑进去。编码器的前馈网络层是为了对自注意力层的输 出向量进行非线性变换处理。可以认为编码器的自注意力层通过自注意力层包括的参数将源语句中各个词的词向量的权重(各个词对于当前编码的词向量的影响力)都考虑进去,编码器的前馈网络层通过前馈网络层包括的参数对自注意力层的输出向量进行非线性变换处理。
解码器的每层包括自注意力层(self-attention)、编码-解码关注层(encoder-decoder attention)以及前馈网络层(feed forward)。解码器的自注意力层在解码的过程考虑已经生成的新词对当前解码的向量的影响。解码器的编码-解码关注层考虑编码器的输入对当前解码的向量的影响。解码器的前馈网络层是为了对编码-解码关注层的输出向量进行非线性变换处理。输出映射层接收解码器的最后一层网络层输出的解码向量,并将解码向量转换为翻译结果,比如生成一个新词。具体的,已经生成的新词通过解码器的词向量参数矩阵进行处理后,获取已经生成的新词的词向量,该已经生成的新词的词向量作为解码器的第一层网络层的输入,这个过程一直循环下去直到生成一个结尾符号,或者满足其他预设的停止条件,则解码阶段生成的所有目标词语为翻译结果。可以通过一个词表来包含目标语言可能的词,解码器的词向量参数矩阵中包含该词表中的每个词的词向量,词向量参数矩阵的维度可以是[词向量维度,词表大小],其中的词表大小即词表中包含的词的数量。可以通过最后一层网络层输出的解码向量和解码器的词向量参数矩阵中包含的各个词向量之间的距离中的最小距离,获取与最后一层网络层输出的解码向量最接近的词向量,根据该最接近的词向量和词表获取翻译结果。
应理解,上述架构还可以适用于其他自然语言处理任务,例如自然语言合成、语义理解、摘要生成等等。
在序列生成任务当中,“预训练-微调”是提升模型性能的标准范式。但针对序列到序列的生成任务,已有的预训练方案都只关注自左向右的自回归生成,也就是在预训练的过程中都只能看到数据序列的上文信息,所以在微调下游任务时,无法拓展到其他的解码策略。随着生成型预训练变换模型3(generative pre-trained transformer 3,GPT-3)、盘古等预训练大模型的推出,模型的参数变得越来越大,预训练的成本也越来越高。如果一次预训练仅能适应单一的下游任务,那针对每种生成策略都需要以较大的代价做一次预训练,这将会耗费掉过多的资源。
本申请实施例提出了一个序列到序列模型的预训练方法,使模型只用经过一次预训练就可以适应三类不同的序列生成任务(自回归任务、非自回归任务以及半非自回归任务),在保证质量的情况下极大地降低了预训练的成本。
参照图6a,图6a为本申请实施例提供的一种模型训练方法的实施例示意,本申请实施例提供的一种模型训练方法可以应用在上述描述的训练设备中,具体的,模型训练方法可以应用在手机、平板、笔记本电脑、智能穿戴设备等终端设备上,或者应用在云侧的服务器上,如图6a示出的那样,本申请实施例提供的一种模型训练方法,包括:
601、获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数 据单元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;
在一种可能的实现中,可以获取到针对于PLM的训练样本,其中,训练样本可以包括第一数据序列和第二数据序列,第一数据序列可以基于源语料得到,第二数据序列可以基于目标语料得到,PLM需要基于源语料来预测并生成目标语料。
在一种可能的实现中,PLM可以用于实现不同语言类型之间的序列转换任务,例如可以为文本翻译任务、不同语言之间的摘要生成任务等,则第一数据序列和第二数据序列可以为包括不同语言类型的文本(不限定第一数据序列中的每个数据单元都和第二数据序列中数据单元是不同的语言类型,例如第一数据序列中的部分数据单元和第二数据序列中数据单元(部分或全部数据单元)是相同的语言类型)。其中,语言类型也可以称之为语种。
例如,在中英翻译任务中,原文文本是“这次旅行需要认真计划”,与其平行对应的英文文本为“The trip needscareful planning”,则“这次旅行需要认真计划”和“The trip needs careful planning”可以看做一组平行语料,该组平行语料是中英平行语言对,可以将原文文本“这次旅行需要认真计划”看做该组平行语料的源语料,将译文文本“The trip needs careful planning”看做该组平行语料的目标语料。
例如,在英德翻译任务中,原文文本是“We danse on the grass”,与其平行对应的德文文本为“Wir tanzen auf dem gras”,则“We danse on the grass”和“Wir tanzen auf dem gras”可以看做一组平行语料,该组平行语料是英德平行语言对,可以将原文文本“We danse on the grass”看做该组平行语料的源语料,将译文文本“Wir tanzen auf dem gras”看做该组平行语料的目标语料。
在一种可能的实现中,所述进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为经过样本标注的不同数据序列。
在一种可能的实现中,PLM可以用于实现文本的摘要生成任务,则源语料可以为需要提取摘要的源语料,目标语料可以为需要生成的摘要文本。
在一种可能的实现中,PLM可以用于实现文本答复任务,则源语料可以为需要答复的源语料,目标语料可以为针对于源语料的答复内容。
在一种可能的实现中,进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为相同的数据序列,也就是说进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为未被标注的数据。
在一种可能的实现中,第一数据序列可以通过对原始的源语料进行掩码得到,第二数据序列可以通过对原始的目标语料进行掩码得到。其中,在所述PLM可以用于实现不同语言类型的文本之间的序列转换任务(例如翻译任务)的情况下,原始的源语料和原始的目标语料可以为不同语言类型表达的文本。
可选的,原始的源语料和原始的目标语料可以从外部的数据库中得到。
在一种可能的实现中,在所述PLM可以用于实现不同语言类型的文本之间的序列转换任务(例如翻译任务)的情况下,可以对原始的源语料和原始的目标语料进行数据单元的对齐,原始的源语料(本实施例中也可以简称为X,例如本申请实施例中的第二初始数据序列、第三初始数据序列)和原始的目标语料(本实施例中也可以简称为Y,例如本申请 实施例中的第一初始数据序列、第四初始数据序列)可以分别包括至少一个数据单元(例如子单元或者词单元),通过数据单元的对齐,可以使原始的源语料和原始的目标语料中的数据单元存在一一对应的关系,存在对应关系的数据单元之间可以表达相同的语义。
示例性的,以图6b的英德翻译任务为例,输入的句对为双语(“We danse on the grass”,“Wir tanzen auf dem gras”)。其中,“We danse on the grass”可以为原始的源语料,“Wir tanzen auf dem gras”可以为原始的目标语料。首先利用外部知识(例如图6b示出的,利用知识库Q:对齐)抽取句子对(X,Y)中所有对齐知识的集合,每一个元素都构成一组对齐映射关系(例如,图6b中箭头所标示的:{“We-Wir”,“dance-tanzen”,“grass-gras”})。
在一种可能的实现中,可以基于外部知识(例如图6b示出的,利用知识库Q:对齐)进行上述数据单元的对齐。其中,根据实际的任务语言对(X,Y)不同时,数据对齐知识库也有所不同。在形式上该知识库既可以是一个词典,或者是第三方工具(fast-align等),也可以是预训练的多语言词向量等,这里并不限定。
在一种可能的实现中,可以对原始的源语料以及原始的目标语料进行掩码操作,以得到PLM的训练数据(例如本申请实施例中的第一数据序列以及第二数据序列)。
在一种可能的实现中,可以获取第二初始数据序列(原始的源语料),并通过概率采样的方式,确定所述第一初始数据序列中的至少一个数据单元是否被掩码,以得到所述第二数据序列,其中所述概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率。
接下来介绍本申请实施例中的概率采样方式的示意:
在一种可能的实现中,可以采用动态采样掩码概率,其中,动态是指数据序列中每个数据单元被掩码的概率是动态的。
在一种可能的实现中,可以在某个概率区间内针对于第二初始数据序列中的至少一个数据单元(例如,至少一个数据单元可以为第二初始数据序列中的全部数据单元)中的每个数据单元采样得到一个概率值,该概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率,例如,可以基于该概率值和其他概率区间采样得到的另一个概率值进行比较,来确定是否对数据单元进行掩码操作。
例如,可以将概率区间W设定为[0.1,0.2],针对于第二初始数据序列中的各个数据单元进行掩码操作时,分别从区间W中随机采样一个概率值υ,以概率υ对第二初始数据序列中的每个数据单元进行掩码操作,即随机的从[0,1]的区间中生成一个随机数r,如果r小于υ,则表示可以对当前数据单元进行掩码,否则不对其进行任何操作。
类似的,可以获取第一初始数据序列(原始的目标语料);并通过概率采样的方式,确定所述第一初始数据序列中的至少一个数据单元是否被掩码,以得到所述第二数据序列,其中所述概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率。
例如,可以将概率区间R设定为[0.35,0.55],针对于第一初始数据序列中的各个数据单元进行掩码操作时,分别从区间R中采样一个概率值μ,以概率μ对第一初始数据序列中的每个数据单元做掩码操作,即随机的从[0,1]的区间中生成一个随机数a,如果a小于μ,则表示可以对当前数据单元进行掩码,否则不对其进行任何操作。
其中，基于第一数据序列生成的嵌入向量可以作为PLM中编码器的输入，基于第二数据序列生成的嵌入向量可以作为PLM中解码器的输入，上述分别对第一数据序列和第二数据序列的掩码操作可以称之为动态的对偶掩码操作，通过动态的对偶掩码操作，可以将编码器的输入和解码器的输入分别掩码，可以在后续的训练过程中同时完成对编码器和解码器的预训练。此外，动态的采样掩码概率可以防止掩码概率太大导致模型训练时整个批次的有效信息都比较少。
在一种可能的实现中,在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率。通过设置动态掩码操作中,保证解码器的掩码比例高于编码器,使解码器预测时可以充分的从编码器端获取信息,进而提升训练后的预训练模型的模型精度。
以上述概率区间为例,针对编码器和解码器分别设定两个连续的区间W,R,并保证R的区间最小值要大于W区间的最大值(进而可以保证在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率)。例如将W设定为[0.1,0.2],R设定为[0.35,0.55]。
示例性的,通过动态的随机掩码,可以得到编码器端的输入和输出(DM(ACS(X),PRE2(X)),其中DM(dynamic masking)表示对第二初始数据序列ACS(X)的动态掩码结果(“We danse[M]the grass”),PRE2(X)是编码器的预测生成目标,这里为所有被掩码掉的字符集合(“____on____”)。同样在解码器端以概率μ对第一初始数据序列AM(Y)中的每个数据单元做掩码操作,得到解码器端的输入和输出(DM(AM(Y),PRE2(Y))(图6b表示的掩码结果为“Wir[M]auf[M][M]”),PRE2(Y)表示在PRE1(Y)的基础上增加当前操作被掩码掉的字符(“__tanzen__dem gras”)。通过当前步操作,可以得到新的输入和输出(DM(ACS(X),DM(AM(Y)),(PRE2(X),PRE2(Y)),如图6b所示。
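As a concrete reference for the dynamic dual-masking step described above, the following is a minimal Python sketch and an illustrative assumption rather than the claimed implementation: one masking probability is sampled per sequence from the encoder-side interval W or the decoder-side interval R, and the interval values [0.1, 0.2] and [0.35, 0.55] are simply those used in the example in this description.

```python
import random

MASK = "[M]"

def dynamic_mask(tokens, prob_interval):
    """Sample one masking probability for this sequence from the given
    interval, then decide token by token whether to mask it."""
    p = random.uniform(*prob_interval)       # dynamic per-sequence probability
    masked, targets = [], []
    for tok in tokens:
        if random.random() < p:
            masked.append(MASK)
            targets.append(tok)              # model must predict this token
        else:
            masked.append(tok)
            targets.append(None)             # not a prediction target
    return masked, targets

# Dual masking: the decoder-side interval R lies strictly above the encoder-side W,
# so the target sequence is always masked more heavily than the source sequence.
W, R = (0.10, 0.20), (0.35, 0.55)
enc_in, enc_out = dynamic_mask("We danse on the grass".split(), W)
dec_in, dec_out = dynamic_mask("Wir tanzen auf dem gras".split(), R)
```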
在一种可能的实现中,所述PLM可以用于实现不同语言类型的文本之间的序列转换任务(例如翻译任务),针对于原始的源语料(例如本申请实施例中的第三初始数据序列),可以对其中的部分数据单元进行单元替换(替换为语义相同且通过另一种语言类型表达的数据单元),进而可以提升PLM对于多种语言之间的序列转换精度。
在一种可能的实现中,可以获取第三初始数据序列;所述第二数据序列和所述第三初始数据序列为语义相同的文本、且通过不同语言类型表达,所述第三初始数据序列中的第一数据单元和所述待预测数据单元的语义相同;将所述第三初始数据序列中的所述第一数据单元替换为第二数据单元,以得到所述第一数据序列;其中所述第二数据单元和所述第一数据单元的语义相同、且通过不同语言类型表达。
其中,这里的语义相同可以理解为可以表达相同或类似的语义,由于不同语言类型的语法、语言环境并不限定,因为本申请实施例中的语义相同并不限定为语义的完全一致。
其中,除了对第三初始数据序列中的所述第一数据单元替换为第二数据单元,还可以对第三初始数据序列进行其他处理(例如掩码操作、其他数据单元的操作),本申请实施例并不限定第一数据序列仅仅是通过对第三初始数据序列中的第一数据单元替换为第二数据单元得到的。
在一种可能的实现中,第一数据单元可以为从第三初始数据序列中随机选择的,例如, 针对于第三初始数据序列中的任意数据单元都有可能被选择作为上述的第一数据单元。可以在语言库中检索得到与该第一数据单元语义相同或相似且通过不同语言类型表达的数据单元作为第二数据单元,并用第二数据单元替换第三初始数据序列中的第一数据单元,以得到第一数据序列。
在一种可能的实现中,第二数据单元和第一初始数据序列也是通过不同的语言类型表达的,也就是说第一初始数据序列、第二初始数据序列和第二数据单元中任意两个之间的语言类型都是不同的。
示例性的，可以从第三初始数据序列的数据单元集合中以一定的概率选取一个数据单元，例如图6b中的“dance”，在源序列X中搜索与该元素相匹配的内容，并将匹配结果在知识库Q的多语言词典中进行索引(图6b中索引“dance”)。首先从所有可选的语种集合(西班牙语、德语、法语)中随机选择一个语种，然后得到该语种对应的字符，要求该字符与匹配的结果的词义相近(如法语的“danse”)。最后使用索引得到的词(“danse”)替换匹配结果(“dance”)，得到一组新的输入(ACS(X),PRE1(X))，其中ACS(Aligned Code-switching)表示对齐的词码转换操作，替换后新的句子将是一个包含有多个语种的词序列(“we danse on the grass”)。
在一种可能的实现中,在对第二初始数据序列中的第一数据单元替换为第二数据单元之后,可以让第一初始数据单元中和第一数据单元语义相同的数据单元进行掩码操作,由于掩码操作后的数据单元需要在PLM的训练过程中被预测,因此,可以让PLM具备更丰富的语言类型的文本理解能力。
在一种可能的实现中,可以获取第四初始数据序列;对所述第四初始数据序列中和所述第一数据单元语义相同的数据单元(例如可以为本申请实施例中的待预测数据单元)进行掩码,以得到所述第二数据序列。通过上述方式,引入外部知识将词码转换和掩码操作结合起来,更加充分的训练了模型的语义表示能力。
示例性的,以图6b中的英德翻译任务为例,输入的句对为双语(“We danse on the grass”,“Wir tanzen auf dem gras”)。首先利用外部知识(知识库Q:对齐)抽取句子对(X,Y)中所有对齐知识的集合,每一个元素都构成一组对齐映射关系(图6b中箭头标示:{“We-Wir”,“dance-tanzen”,“grass-gras”})。然后从集合中以一定的概率选取一个子集(如{“dance-tanzen”}),对子集中的每个元素进行如下操作:
编码器端:在源序列X中搜索与该元素相匹配的内容,并将匹配结果在知识库Q的多语言词典中进行索引(图6b中索引“dance”)。首先从所有可选的语种集合(西班牙语、德语、法语)中随机选择一个语种,然后得到该语种对应的字符,要求该字符与匹配的结果的词义相近(如法语的“danse”)。最后使用索引得到的词(“danse”)替换匹配结果(“dance”),得到一组新的输入(ACS(X),PRE1(X)),其中ACS(Aligned Code-switching)表示对齐的词码转换操作,替换后新的句子将是一个包含有多个语种的词序列(“we danse on the grass”)。PRE1(X)表示编码器端的待预测生成目标,因为此步操作编码器端没有被掩码的字符,所以PRE1(X)=NULL。
解码器端:在目标序列Y中搜索与该元素相匹配的内容,并将匹配到的字符做掩码操 作,然后在输出端对掩码的内容进行预测。最终得到一组新的输入(AM(Y),PRE1(Y)),其中AM(Y)(code-switching masking)表示对初始的目标Y进行对齐的掩码操作后得到的新的序列(“Wir[M]auf dem gras”),其中[M]表示相应的字符被掩码,PRE1(Y)表示解码器输出端的预测内容,表示被掩码掉的词的集合(“__tanzen______”)。
通过基于外部对齐知识的词码转换结合掩码操作,得到新的输入序列(ACS(X),AM(Y)),和输出序列(PRE1(X),PRE1(Y))。
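The aligned code-switching plus aligned masking operation (ACS/AM) described above can be sketched as follows. The dictionary-style alignment set, the multilingual lexicon, and the selection probability `p_select` are simplifying assumptions standing in for the external knowledge base Q (e.g., fast-align output or MUSE dictionaries); they are not fixed by this description.

```python
import random

MASK = "[M]"

def acs_am(src_tokens, tgt_tokens, align_pairs, lexicon, p_select=0.3):
    """Aligned code-switching on the source (ACS) plus aligned masking on the
    target (AM). align_pairs maps a source word to its aligned target word;
    lexicon maps a source word to same-meaning words in other languages."""
    src, tgt = list(src_tokens), list(tgt_tokens)
    tgt_targets = [None] * len(tgt)
    for s_word, t_word in align_pairs.items():
        if random.random() > p_select:
            continue
        if s_word in src and t_word in tgt and s_word in lexicon:
            # ACS: replace the source word with a same-meaning word from another language
            src[src.index(s_word)] = random.choice(lexicon[s_word])
            # AM: mask the aligned target word; it becomes a prediction target
            j = tgt.index(t_word)
            tgt_targets[j] = tgt[j]
            tgt[j] = MASK
    return src, tgt, tgt_targets

pairs = {"We": "Wir", "dance": "tanzen", "grass": "gras"}   # alignment set from the example
lex = {"dance": ["danse"], "We": ["Nous"], "grass": ["herbe"]}
acs_x, am_y, pre1_y = acs_am("We dance on the grass".split(),
                             "Wir tanzen auf dem gras".split(), pairs, lex)
```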
在一种可能的实现中,在得到第一数据序列以及第二数据序列之后,可以分别通过编码器侧的嵌入层以及解码器侧的嵌入层,对第一数据序列以及第二数据序列进行嵌入处理,分别得到第一嵌入向量以及第二嵌入向量。
针对于第一数据序列的第一嵌入向量而言,第一嵌入向量可以包含第一数据序列中被掩码的数据单元的位置信息,以及第一数据序列中未被掩码的数据单元的位置信息和语义信息,其中,位置信息可以表示数据单元和其他数据单元之间的位置关系,例如可以通过位置向量表示。其中,语义信息可以通过数据单元的词向量表示。
在一种可能的实现中,第二数据序列可以包括被掩码的待预测数据、位于待预测数据上文的第一子数据、位于待预测数据下文的第二子数据。
在一种可能的实现中,所述第一子数据或者所述第二子数据可以包括未被掩码的数据单元,所述第二嵌入向量包含所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系。
例如,所述第一子数据可以包括未被掩码的数据单元,所述第二嵌入向量可以包含第一子数据中所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系。
例如,所述第二子数据可以包括未被掩码的数据单元,所述第二嵌入向量可以包含第二子数据中所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系。
在一种可能的实现中,所述第一子数据或者所述第二子数据包括被掩码的数据单元,所述第二嵌入向量包含所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系。
例如,所述第一子数据可以包括被掩码的数据单元,所述第二嵌入向量可以包含所述第一子数据中所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系。
例如,所述第二子数据可以包括被掩码的数据单元,所述第二嵌入向量可以包含所述第二子数据中所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系。
在一种可能的实现中,所述第二嵌入向量包含所述待预测数据单元与所述第二数据序列中其他数据单元之间的位置关系。
示例性的,第二数据序列可以为“Wir[M2]auf[M4][M5]”,[M]表示被掩码,在待预测数据单元为[M2]的情况下,第一子数据可以为Wir,第二子数据可以为auf[M4][M5]。
接下来介绍一种根据数据序列生成嵌入向量的示意:
在一种可能的实现中,可以通过嵌入层对数据序列中未被掩码的数据单元进行嵌入处 理。其中嵌入层可以称为输入嵌入(inputembedding)层。当前输入可以为未被掩码的数据单元。嵌入层在获取当前输入后,可以对该当前输入中各个未被掩码的数据单元进行嵌入处理,可得到各个未被掩码的数据单元对应的嵌入向量。
在一些实施例中,还可以获取未被掩码的数据单元中的每个数据单元的位置向量,所述位置向量用于指示每个数据单元在数据序列中的位置,具体的,位置向量可以用于指示未被掩码的数据单元中的每个数据单元与其他被掩码的数据单元以及被掩码的数据单元之间的相对位置关系。
在一种可能的实现中,所述嵌入层可以包括输入嵌入层和位置编码(positional encoding)层。在输入嵌入层,可以对当前输入中的未被掩码的数据单元中的每个数据单元进行词嵌入处理,从而得到未被掩码的数据单元中的每个数据单元的词向量(例如可以表示语义信息)。在位置编码层,可以获取未被掩码的数据单元中的每个数据单元在该当前输入中的位置,进而对未被掩码的数据单元中的每个数据单元的位置生成位置向量。
在一些示例中,未被掩码的数据单元中的每个数据单元在数据序列中的位置信息可以为未被掩码的数据单元中的每个数据单元在数据序列中的绝对位置。以当前输入为“几号应还花呗”为例,其中的“几”的位置可以表示为第一位,“号”的位置可以表示为第二位,……。在一些示例中,未被掩码的数据单元中的每个数据单元在数据序列中的位置可以为未被掩码的数据单元中的每个数据单元在数据序列中的相对位置。仍以当前输入为“几号应还花呗”为例,其中的“几”的位置可以表示为“号”之前,“号”的位置可以表示为“几”之后、“应”之前,……。当得到当前输入中未被掩码的数据单元中的每个数据单元的词向量和位置向量时,可以将未被掩码的数据单元中的每个数据单元的位置向量和对应的词向量进行融合,得到未被掩码的数据单元中的每个数据单元的嵌入向量。应理解,融合的方式可以是对位置向量和对应的词向量进行加法运算,或者是通过其他运算,这里并不限定具体的融合方式。嵌入向量可以表示为具有预设维度的嵌入矩阵。可以设定该嵌入向量的个数为M,预设维度为H维,则嵌入向量可以表示为M×H的嵌入矩阵。
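A minimal sketch of the embedding layer just described, assuming learned positional embeddings fused with word embeddings by addition; as noted above, the fusion method is not limited to addition, and all sizes here are illustrative.

```python
import torch
import torch.nn as nn

class DualMaskEmbedding(nn.Module):
    """Word embedding + positional embedding, fused by addition. Masked
    positions share one [MASK] token id, so their embeddings carry only
    positional information, as described above."""
    def __init__(self, vocab_size, max_len, hidden):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)

    def forward(self, token_ids):                         # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.pos(positions)  # (batch, seq_len, hidden)

emb = DualMaskEmbedding(vocab_size=32000, max_len=512, hidden=256)
vectors = emb(torch.randint(0, 32000, (2, 10)))            # 2 sequences of 10 tokens
```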
602、根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态。
在一种可能的实现中,在对PLM进行训练的过程中,可以将第一嵌入向量输入至PLM的编码器,以及将第二嵌入向量输入至PLM的解码器。
在一种可能的实现中,可以根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态(hidden state),例如可以将所述第一嵌入向量作为预训练语言模型PLM中的编码器的输入,进而PLM中的编码器可以输出隐状态。
接下来首先介绍本申请实施例PLM中编码器的结构示意:
参照图7,图7为本申请实施例中的一个PLM的结构示意,其中,该框架可以由双向编码器和双向解码器构成。
其中,编码器可以包括自注意力模块以及前馈网络,编码器的输出可以输入到输出层以及解码器侧的交叉注意力模块中。
图7所示的装置的输入可以是图6b处理后的序列。首先将第一嵌入向量输入到模型的编码器中,然后将编码器输出的向量表示通过交叉注意力模块输入给解码器。
在一种可能实现中,所述编码器可以包括转换transformer层,其中,transformer层可以包括串行的多个子transformer层。可以通过每个所述子transformer层,对与所述子transformer层相邻的上一个子transformer层输出的数据进行处理,以得到中间向量,并将所述中间向量输出至与所述子transformer层相邻的下一个子transformer层;其中,若所述子transformer层为所述多个子transformer层中最靠近输入侧的transformer层,则所述子transformer层的输入数据为嵌入向量;若所述子transformer层为所述多个子transformer层中最靠近输出侧的transformer层,则所述子transformer层输出的数据为隐状态。
transformer层的核心特点在于其采用的独特的注意力机制。在处理自然语言,例如一个句子时,transformer模型利用该注意力机制,为句子中各个词的嵌入向量赋予不同的注意力系数,从而更全面地考虑句子中上下文对各个词的影响。具体的transformer层可以包括依次相邻的多头注意力层、加和与归一化(add&norm)层、前馈(feed forward)层、加和与归一化层。其中,注意力层与嵌入层相连,从嵌入层获取嵌入向量作为输入向量,基于嵌入向量中各个嵌入向量之间的关联度,对各个嵌入向量进行综合,得到输出向量,输出给后续的transformer层。transformer层获取前一层的输出作为输入向量,执行与前一级transformer层类似的操作。
参照图8,图8为一种transformer层的结构示意,本申请实施例中的各个子transformer层都可以参照图8中示出的结构,如图8中示出的那样,transformer层包括依次相邻的多头注意力层、加和与归一化(add&norm)层、前馈(feed forward)层、加和与归一化层。
其中,多头注意力层从其上一层获取M个输入向量Xl,又可以表示为矩阵X,采用自注意力机制,基于向量间的关联度对各个向量进行变换,得到M个输出向量,又可以表示为矩阵Y。可以理解,当该多头注意力层是与嵌入层直接相连的层,其获取的输入向量即为嵌入层输出的嵌入向量;当该多头注意力层是后续的transformer层包括的多头注意力层,其获取的输入向量即为前一级transformer层的输出向量。在多头注意力层,基于多头注意力(multi-head attention,MHA)的MHA层包括多个注意力头head(如图8中示出的Head1、Head 2、…、Head N)。
图9为一个注意力头head的操作示意图，该示意图示出注意力头head如何将输入矩阵X变换为输出矩阵Y。如图9所示，分别采用第一变换矩阵Q，第二变换矩阵K和第三变换矩阵V对M个输入向量<X1,X2,…,XN>中各个输入向量Xi进行变换，得到各个输入向量对应的第一中间向量(q向量)，第二中间向量(k向量)和第三中间向量(v向量)。操作上，可以分别用第一变换矩阵Q，第二变换矩阵K和第三变换矩阵V，对N个输入向量构成的输入矩阵X进行线性变换，分别得到输入矩阵的Q矩阵，K矩阵和V矩阵，再分别对矩阵进行拆分，即可得到各个输入向量对应的q向量，k向量和v向量。对于M个输入向量中任意的第i输入向量Xi，基于该第i输入向量对应的第一中间向量(q向量，qi)与各个输入向量Xj对应的各个第二中间向量(k向量，kj)的点乘操作，确定该第i输入向量Xi与各个输入向量Xj的各个关联度。尽管也可以直接将qi与kj的点乘结果确定为关联度，但是更经典地，先将点乘结果除以一常数，然后进行softmax运算，将运算结果作为输入向量Xi与Xj的关联度，即:

$$\alpha_{i,j}=\mathrm{softmax}\left(\frac{q_i\cdot k_j}{\sqrt{d}}\right)$$

于是，可以以该第i输入向量Xi与各个输入向量Xj的各个关联度αi,j作为权重因子，对各个输入向量Xj对应的第三中间向量(v向量，vj)进行加权组合，得到该第i输入向量Xi对应的第i组合向量Ci:

$$C_i=\sum_{j=1}^{N}\alpha_{i,j}\,v_j$$

于是，可以得到M个输入向量对应的M个组合向量的向量序列<C1,C2,…,CN>，或矩阵C。基于该组合向量序列，可以得到M个输出向量。具体地，在一个实施例中，可以直接将N个组合向量的向量序列作为M个输出向量，即Yi=Ci。此时，输出矩阵Y即为组合向量矩阵C，又可以写成:

$$Y=C=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
以上为一个注意力头head的处理过程描述,在MHA架构中,MHA层维护m套变换矩阵,每套变换矩阵包括前述第一变换矩阵Q、第二变换矩阵K和第三变换矩阵V,从而可以并行地进行上述操作,得到m个组合向量序列(即m个矩阵C),每个向量序列包括基于一套变换矩阵得到的N个组合向量。在这样的情况下,MHA层将得到的m个组合向量序列进行拼接,得到拼接矩阵;再通过第四变换矩阵W对该拼接矩阵进行变换,得到最终的输出矩阵Y。将该输出矩阵Y拆分即对应于M个输出向量<Y1,Y2,…,YN>。通过以上的操作过程,MHA层基于N个输入向量之间的关联度进行变换操作,得到M个输出向量。
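The per-head computation and the multi-head combination described above can be illustrated with the following PyTorch-style sketch; the matrix shapes, number of heads, and random projection matrices are arbitrary illustrative choices, not parameters of the model in this description.

```python
import math
import torch
import torch.nn.functional as F

def single_head_attention(X, Wq, Wk, Wv):
    """One attention head: project the inputs to q/k/v, score every pair,
    softmax the scaled scores, and take the weighted sum of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # (M, d) each
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    alpha = F.softmax(scores, dim=-1)                  # alpha[i, j]: weight of X_j for X_i
    return alpha @ V                                   # combined vectors C

M, d, n_heads = 5, 16, 4
X = torch.randn(M, d)
heads = [single_head_attention(X, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
         for _ in range(n_heads)]                      # n_heads heads computed in parallel
Y = torch.cat(heads, dim=-1) @ torch.randn(n_heads * d, d)  # concatenate, then apply W
```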
如图8中示出的那样,transformer层可以包括前馈层,其中前馈层包括输入层、中间层intermediate layer以及输出层。如前所述,神经网络模型可以包含多个transformer层。在一个实施例中,上述多个transformer层可以采用残差网络的方式堆叠连接。
本申请实施例中,所述编码器包括注意力头,且由于在第一数据序列中,各个未被掩码的数据单元之间是相互可见的,因此在处理嵌入向量时,所述嵌入向量中任意两个嵌入向量之间存在注意力关联,具体的,可以获取注意力信息,所述注意力信息用于指示所述注意力头在处理所述嵌入向量时,所述嵌入向量中任意两个嵌入向量之间存在注意力关联,进而可以根据所述注意力信息,通过所述编码器,对所述嵌入向量进行处理,进而使得每个输出向量与各个输入的嵌入向量存在依赖关系(也就是所谓的上下文信息都可见的双向编码器)。
在一种可能的实现中,可以在编码器的输出侧增加一个和解码器类似的输出层,例如可以由一个全连接层和一个softmax归一化层构成,用于预测第一数据序列中被掩码的数据单元。
在一种可能的实现中,可以通过所述PLM中所述编码器的输出层,对所述第一数据序列中被掩码后的数据单元进行预测,以得到第二预测数据单元;并根据所述第二预测数据单元和所述第一数据序列中被掩码前的数据单元之间的差异,更新所述编码器。
在编码器输出层的全连接网络可以将编码器的输出映射到固定的维度(词表大小的维度),然后使用softmax归一化函数得到每个位置目标词出现的概率。这里的目标词可以为第一数据序列中被掩码的数据单元(例如第二预测数据单元)。在训练时,通过计算目标词所对应位置的对数似然(概率取对数)来计算模型在当前数据上的预测准确的程度。
通过上述方式,在对PLM进行训练时,可以同时对编码器和解码器进行预训练,有效的对两个模块进行了联合训练。
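A sketch of the prediction output layer described above: a fully connected layer maps the hidden states to vocabulary size, softmax (here in log form) gives per-position token probabilities, and the loss is the negative log-likelihood taken only at masked positions. Module and argument names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTokenHead(nn.Module):
    """Output layer sketch shared in spirit by the encoder and decoder heads."""
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden, vocab_size)      # map to vocabulary dimension

    def forward(self, hidden_states, target_ids, masked_positions):
        # hidden_states: (batch, seq, hidden); target_ids: (batch, seq)
        # masked_positions: 0/1 float mask, 1 where a token was masked
        logits = self.proj(hidden_states)              # (batch, seq, vocab)
        log_probs = F.log_softmax(logits, dim=-1)
        picked = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
        return -(picked * masked_positions).sum() / masked_positions.sum()
```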
603、根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元;
在一种可能的实现中,可以根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元。
在一种可能的实现中,可以将所述第一子数据、所述第二子数据以及所述隐状态作为PLM中的解码器的输入,也就是说,对所述待预测数据单元进行预测时,待预测数据单元的上下文信息都是可见的,也就是说,对所述待预测数据单元进行预测得到的第一预测数据单元和所述第一子数据、所述第二子数据以及所述隐状态都存在依赖关系。
在现有的实现中,在对PLM进行训练时,PLM的解码器被配置为仅仅对上文的信息可见,也就是从左到右的自回归,随着生成型预训练变换模型3(generative pre-trained transformer 3,GPT-3)、盘古等预训练大模型的推出,模型的参数变得越来越大,预训练的成本也越来越高。如果一次预训练仅能适应单一的下游任务,那针对每种生成策略都需要以较大的代价做一次预训练,这将会耗费掉过多的资源。
和现有的基于transformer的PLM不同的是,本申请实施例中,在解码器层从自注意力掩码模块改变为自注意力模块,从而使其能够向编码器一样看到上下文信息,所以称为双向解码器。
其中,可以将第二嵌入向量输入至PLM解码器的自注意力模块中,并将编码器输出的隐状态输入至交叉注意力模块中,使解码器能够学习到更丰富的语义信息。从图7中可以看到,在序列输入到模型之前,会经过一个嵌入层,这一层将会把第二数据序列转化为一个固定维度的连续向量,然后再输入到模型中计算(具体可以参照编码器侧嵌入层的描述,相似之处这里不再赘述)。在解码器输出层的全连接网络可以将解码器的输出映射到固定的维度(词表大小的维度),然后使用softmax归一化函数得到每个位置目标词出现的概率。这里的目标词可以为第一数据序列中被掩码的数据单元(例如第一预测数据单元)。在训练时,通过计算目标词所对应位置的对数似然(概率取对数)来计算模型在当前数据上的预测准确的程度。
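The key difference between the bidirectional decoder used here and a conventional autoregressive decoder is the self-attention mask; the following small sketch contrasts the two as an illustration (a causal lower-triangular mask versus a fully visible mask).

```python
import torch

def self_attention_mask(seq_len, bidirectional=True):
    """True means 'may attend'. The conventional autoregressive decoder uses a
    lower-triangular (causal) mask so position i sees only positions <= i; the
    bidirectional decoder described above drops that restriction, so every
    position can attend to the full context."""
    if bidirectional:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(self_attention_mask(4, bidirectional=False))  # left-to-right autoregression
print(self_attention_mask(4, bidirectional=True))   # bidirectional decoder of this method
```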
在一种可能的实现中,解码器的每层可以包括自注意力层(self-attention)、编码-解码关注层(encoder-decoder attention)以及前馈网络层(feed forward),其中,上述编码-解码关注层也可以描述为交叉注意力层。解码器的自注意力层在解码的过程考虑上下文中数据单元对当前解码的向量的影响。解码器的编码-解码关注层考虑编码器的输入对当前解码的向量的影响。解码器的前馈网络层是为了对编码-解码关注层的输出向量进行非线性变换处理。输出映射层(或者简称为输出层)可以接收解码器的最后一层网络层输出的解码向量,并将解码向量转换为预测结果(例如第一预测数据单元),比如预测生成一个新词。
604、根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器 和所述解码器。
在得到前馈过程中解码器输出的第一预测数据单元之后,可以基于真值(待预测数据单元)和第一预测数据单元之间的差异确定损失,并基于损失更新编码器和解码器。
本申请实施例提供了一种模型训练方法,所述方法包括:获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数据单元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元;根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器和所述解码器。通过上述方式,采用编码器和双向解码器的预训练架构,训练的过程中解码器能够同时看到上文和下文的信息。由于其他类型的序列生成任务(自回归:自左向右、自右向左;非自回归:完全非自回归、半非自回归等)都可以相当于本申请实施例中PLM的一个子集,相当于通过本申请实施例中的训练方法得到的PLM可以具备良好的适应其他类型的序列生成任务(自回归:自左向右、自右向左;非自回归:完全非自回归、半非自回归等)的能力,也就是即使后续微调时采用的是其他类型的序列生成任务,通过本申请实施例中的训练方法得到的PLM都可以达到较好的模型精度。而不需要针对于每个类型的序列生成任务都预训练一个对应的PLM,大大降低了训练PLM所需的资源(例如计算资源、存储资源、时间资源等)。
接下来结合一个具体示例介绍本申请实施例中的模型训练方法:
本申请实施例的方案可以包含两个部分,离线的预训练模型参数,离线的在特定的任务,特定的数据集上微调。
(1)序列到序列的预训练过程:
步骤1:获取输入
通过网络环境抽取大量的无标注或者有标注的多语言数据集,并选择外部知识库Q(包括对齐知识和多语言词典)。然后构建源序列和目标序列的句对(X,Y)。
步骤2:数据增强
数据增强包括两个可选择的部分:
1)词码转换结合掩码操作:根据外部知识Q获取输入句对(X,Y)中对齐词对的集合,然后从集合中以一定的概率随机选取子集。对于子集中的每个元素,使其与源序列匹配并做基于对齐的词码转换操作(ACS),与目标序列匹配并做基于对齐的掩码操作(AM)。最终得到新的输入序列(ACS(X),AM(Y)),输出序列(PRE1(X),PRE1(Y))。
2)动态的对偶掩码操作:从给定的两个区间分别动态的采样源序列和目标序列的掩码概率υ,μ。然后以概率υ对编码器输入序列CSR(X)做掩码操作,得到新的编码器的输入序列DM(ACS(X))和输出序列PRE2(X)。以概率μ对解码器做动态的掩码操作从而得到新的输入序列DM(AM(Y)),和输出序列PRE2(Y)。
步骤3:预训练
随机初始化一个预训练模型P。
将DM(ACS(X))输入到模型P的编码器,将DM(AM(Y))输入到模型P的解码器;对模型P进行训练,得到P的编码器预测输出P(X),解码器预测输出P(Y)。
比较(P(X),PRE2(X))和(P(Y),PRE2(Y)),并通过交叉熵损失函数(cross-entropy)实现整个模型包括解码器和编码器等各个模块的训练。
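One joint pre-training update, as described in steps 2 and 3 above, might look like the following sketch; `model.encoder`, `model.decoder`, the two prediction heads, and the batch field names are placeholders assumed for illustration, not the actual implementation.

```python
import torch

def pretraining_step(model, batch, optimizer):
    """One joint update of the encoder and the bidirectional decoder."""
    hidden = model.encoder(batch["enc_input"])                      # DM(ACS(X))
    enc_loss = model.encoder_head(hidden, batch["enc_targets"],     # predict PRE2(X)
                                  batch["enc_mask_positions"])
    dec_states = model.decoder(batch["dec_input"], memory=hidden)   # DM(AM(Y)), cross-attention
    dec_loss = model.decoder_head(dec_states, batch["dec_targets"], # predict PRE2(Y)
                                  batch["dec_mask_positions"])
    loss = enc_loss + dec_loss                                      # joint cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```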
重复步骤2、3,在数据集行迭代更新模型的参数,在模型收敛后保存模型的参数。
步骤4:微调
根据特定的任务需求,使用训练好的模型的参数初始化特定的序列到序列的模型,然后在特定的数据集上训练来达到更好地效果。
利用本申请实施例提出的方案,可以根据用户的需求,针对不同的任务类别,在不同的数据集,或者不同的语言上完成序列到序列模型的预训练。在微调下游任务时可以根据实际的要求,微调自左向右的生成模型,自右向左的生成模型,或者是并行解码的生成模型。接下来以翻译任务为例详细描述本申请实施例,在实施例一中介绍如何使用带标注的数据完成预训练,未标注的数据的使用细节将会在实施例二中介绍。
实施例一:
步骤1:数据获取
从网络上获取目标语言对的翻译数据集,例如英-中,英-法,英-韩,英-西班牙等等。最终从公开数据中抽取了32个以英文为中心的双语数据,并使用官方(WMT)的提供的开发集和测试集评估预训练及其在下游任务中的效果。获取到原始数据后,我们利用Moses里的脚本tokenizer.perl和BPE(pairwise subword unit)技术,对训练集,开发集和测试集数据进行预处理,得到所有的双语句对(X,Y)。训练集中各语种到英语对应的双语数据的规模如表1所示,其中ISO表示各类语言的缩写,例如“Gu”表示英语-古吉拉特语双语数据的数量。
表1:预训练集中标注数据数量统计表
第二个数据收集工作就是获取外部知识Q。Q中包含外部对齐知识以及用于词码转换的多语言词典。外部对齐知识可以是词典,预训练的词向量等,本实施例收集并使用第三方工具Fast-align作为对齐知识库,使用MUSE工具获取多语言词典。
步骤2:数据增强
词码转换与掩码的结合:利用对齐知识库Q,获取双语句对(X,Y)中的对齐词对信息,然后基于对齐信息利用多语言词典在源句子序列进行词码转换操作,在目标序列的对应位置做掩码操作,得到编码器和解码器端的输入和输出。
动态的对偶掩码:在编码器端和解码器端分别在给定区间内采样掩码概率υ,μ。对源序列和目标序列做对偶掩码操作,并保证目标序列的掩码比例要大于源序列。最终模型预测输出所有被掩码掉的字符,分别得到编码器和解码器端新的输入和输出。
步骤3:预训练
初始化一个序列到序列的预训练模型,使用如图所示的结构。
分别将增强后的源序列和目标序列输入到编码器和解码器中,然后在输出层预测所有被掩码掉的字符,利用交叉熵(cross-entropy)对模型进行训练。
重复步骤2、3,直到模型收敛到一个稳定状态,然后保存预训练模型的参数。
步骤4:微调
使用保存的模型参数,可用于微调自回归和非自回归两类不同的生成任务。
1)微调自回归任务
本实例使用常用的自左向右的生成策略验证预训练模型在自回归任务上的效果。使用预训练的参数初始化自回归任务时,该任务使用标准的Transformer结构。微调时需要将编码器的预测输出层移除,然后使用特定语言对的带标注数据集进行训练,并根据开发集上的性能表现选取最好的模型进行测试。
表2:预训练方案在自回归翻译任务上的性能对比
训练完成后,在13个翻译任务(包含低(<1M)、中(>1M,且<10M)、高(>10M,且<25M)、极高资源(>25M)场景)的测试机集上验证对比模型的性能,使用BLEU(↑)作为序列生成(翻译)质量的评价指标。如表2所示,是模型在自回归任务上的表现效果。
2)微调非自回归任务
使用预训练的模型参数初始化非自回归任务,直接去除编码器的预测输出层即可。然后使用特定语言对的带标注数据集进行训练,同样根据开发集上的性能表现选取最好的模型进行测试。训练完成后,我们在6个常用的数据集上验证模型的性能表现,也使用BLEU(↑)作为评价指标。表3是模型在非自回归任务上的表现效果。
表3:预训练方案在非自回归翻译任务上的性能对比
实施例二:
实施例二描述了该方案如何使用无标注数据进行预训练。在实际使用过程中,根据实际需求既可以使用标注数据进行训练,也可以使用无标注数据或者两者同时使用。
首先,从网络上收集大量的无标注数据集,仍然以翻译任务为例(无标注是指单语数据),表4统计了可使用的无标注数据的规模。
表4:预训练集中无标注数据数量统计表
对于无标注数据,将句子进行复制操作,从而得到一个源序列和目标序列相同的语言对(X,Y),其中X=Y。方案二和方案一的执行过程基本一致,下面仅对不同的细节进行描述。
1)在词码转换结合掩码操作步骤,因为源序列和目标序列完全一致,即所有的词都是对齐的,所以不需要借助外部知识就可以获取对齐词对集合。
2)在动态的对偶掩码时,为了防止在训练过程中,解码器的预测输出层直接从源序列中复制信息,两端会掩码掉相同的字符,并在输出层分别进行预测。
为了验证方案二的效果,分别使用无标注数据(单语)和带标注数据(双语)进行预训练,然后在自左向右生成任务上微调。表5是预训练模型在四个翻译任务上性能表现。
表5:无标注数据的预训练模型在自回归翻译任务上的性能对比
实施例一验证了本申请实施例在使用带标注的数据集进行预训练。由于本申请实施例提出的统一的的预训练方案,使其只经过一次预训练就可以微调自回归和非自回归任务。在自回归任务中,模型在13个翻译方向上的性能表现说明了该方案能够取得比现有的预训练方法更好地效果。如表2所示,在从低资源到极高资源场景的各个翻译方向上我们的方案相比于直接训练(不使用预训练参数初始化)均提升了2.3~14.4BLEU,平均提升也达到了7.9BLEU。和现有最好的预训练方案相比,平均高出mBART3.8BLEU,高出mRASP1.2BLEU。在表3所示的非自回归任务当中,本发明是第一个适用于非自回归的预训练方案,可以看到模型在6个翻译方向上也取得了更好地表现,相比于直接训练平均提升2.5BLEU。
本实施例二补充验证了本申请实施例在使用无标注数据集进行预训练,然后微调自回归任务时的效果。如表5所示,在只使用未标注数据集(单语)时,模型平均提升2.9BLEU,要比只使用标注数据的效果平均下降0.9BLEU;而当标注数据和未标注数据同时使用时,模型在4个方向上平均提升了4.3BLEU。
与现有技术相比,本申请实施例中所包含的统一的预训练框架,可以同时使用标注数据和未标注数据进行预训练,保存的模型参数既可以用于初始化自回归任务(包含自左向右的自左向右的生成,自右向左的生成),也可用于初始化非自回归任务。本发明实施例中所包含的基于外部知识库的词码转换结合掩码操作、动态的对偶掩码以及编码器输出预测的操作都对翻译性能有正面的影响,表6是对三个操作在自回归翻译任务中性能影响的验证。
表6:本发明实例部分功能对预训练性能验证
在图1至图9所对应的实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关设备。具体参阅图10,图10为本申请实施例提供的模型训练装置1000的一种结构示意图,模型训练装置1000可以是终端设备或服务器,模型训练装置1000可以包括:
获取模块1001,用于获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数据单元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;
其中,关于获取模块1001的具体描述可以参照上述实施例中步骤601的描述,这里不再赘述。
编码模块1002,用于根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;
其中,关于编码模块1002的具体描述可以参照上述实施例中步骤602的描述,这里不再赘述。
解码模块1003,用于根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元;
其中,关于解码模块1003的具体描述可以参照上述实施例中步骤603的描述,这里不再赘述。
训练模块1004,用于根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器和所述解码器。
其中,关于训练模块1004的具体描述可以参照上述实施例中步骤604的描述,这里不再赘述。
在一种可能的实现中,所述获取模块,还用于:
获取第一初始数据序列;
通过概率采样的方式,确定所述第一初始数据序列中的至少一个数据单元是否被掩码,以得到所述第二数据序列,其中所述概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率。
在一种可能的实现中,所述获取模块,还用于:
获取第二初始数据序列;
通过概率采样的方式,确定所述第二初始数据序列中的至少一个数据单元是否被掩码,以得到所述第一数据序列;其中,在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率。
在一种可能的实现中,所述编码模块,还用于:
通过所述PLM中所述编码器的输出层,对所述第一数据序列中被掩码后的数据单元进行预测,以得到第二预测数据单元;
所述训练模块,还用于:根据所述第二预测数据单元和所述第一数据序列中被掩码前的数据单元之间的差异,更新所述编码器。
在一种可能的实现中,所述PLM用于实现不同语言类型的文本之间的序列转换任务,所述获取模块,还用于:
获取第三初始数据序列;所述第二数据序列和所述第三初始数据序列为语义相同的文本、且通过不同语言类型表达,所述第三初始数据序列中的第一数据单元和所述待预测数据单元的语义相同;
将所述第三初始数据序列中的所述第一数据单元替换为第二数据单元,以得到所述第一数据序列;其中所述第二数据单元和所述第一数据单元的语义相同、且通过不同语言类型表达。
在一种可能的实现中,所述获取模块,还用于:
获取第四初始数据序列;
对所述第四初始数据序列中和所述第一数据单元语义相同的数据单元进行掩码,以得到所述第二数据序列。
在一种可能的实现中,
所述第一子数据或者所述第二子数据包括未被掩码的数据单元,所述第二嵌入向量包含所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
所述第一子数据或者所述第二子数据包括被掩码的数据单元,所述第二嵌入向量包含所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
所述第二嵌入向量包含所述待预测数据单元与所述第二数据序列中其他数据单元之间的位置关系。
在一种可能的实现中,进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为相同的数据序列;或者,
所述进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为经过样本标注的不同数据序列。
在一种可能的实现中,所述第一数据序列和所述第二数据序列为文本数据。
接下来介绍本申请实施例提供的一种执行设备,请参阅图11,图11为本申请实施例提供的执行设备的一种结构示意图,执行设备1100具体可以表现为虚拟现实VR设备、手机、平板、笔记本电脑、智能穿戴设备、监控数据处理设备或服务器等,此处不做限定。具体的,执行设备1100包括:接收器1101、发射器1102、处理器1103和存储器1104(其中执行设备1100中的处理器1103的数量可以一个或多个,图11中以一个处理器为例),其中,处理器1103可以包括应用处理器11031和通信处理器11032。在本申请的一些实施例中,接收器1101、发射器1102、处理器1103和存储器1104可通过总线或其它方式连接。
存储器1104可以包括只读存储器和随机存取存储器,并向处理器1103提供指令和数据。存储器1104的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1104存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1103控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1103中,或者由处理器1103实现。处理器1103可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1103中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1103可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、 现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器1103可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1104,处理器1103读取存储器1104中的信息,结合其硬件完成上述方法的步骤。
接收器1101可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1102可用于通过第一接口输出数字或字符信息;发射器1102还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1102还可以包括显示屏等显示设备。
本申请实施例还提供了一种训练设备,请参阅图12,图12是本申请实施例提供的训练设备一种结构示意图,具体的,训练设备1200由一个或多个服务器实现,训练设备1200可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1212(例如,一个或一个以上处理器)和存储器1232,一个或一个以上存储应用程序1242或数据1244的存储介质1230(例如一个或一个以上海量存储设备)。其中,存储器1232和存储介质1230可以是短暂存储或持久存储。存储在存储介质1230的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器1212可以设置为与存储介质1230通信,在训练设备1200上执行存储介质1230中的一系列指令操作。
训练设备1200还可以包括一个或一个以上电源1226,一个或一个以上有线或无线网络接口1250,一个或一个以上输入输出接口1258;或,一个或一个以上操作系统1241,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
本申请实施例中,中央处理器1212,用于执行图6a对应实施例中描述的模型训练方法。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的模型训练方法,或者,以使训练设备内的芯片执行上述实施例描述的模型训练方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、 缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图13,图13为本申请实施例提供的芯片的一种结构示意图,上述图6a对应的实施例中描述的模型训练方法可以在图13所示的芯片中实现。具体的,所述芯片可以表现为神经网络处理器NPU 1300,NPU 1300作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1303,控制器1304控制运算电路1303提取存储器(权重存储器或输入存储器)中的数据并进行运算。
可选的,上述图6a对应的实施例中描述的模型训练方法可以由图13所示的芯片中的主CPU和NPU共同配合完成。
在一些实现中,运算电路1303内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1303是二维脉动阵列。运算电路1303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1302中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1301中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1308中。
统一存储器1306用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1305,DMAC被搬运到权重存储器1302中。输入数据也通过DMAC被搬运到统一存储器1306中。
BIU为Bus Interface Unit即,总线接口单元1310,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)1309的交互。
总线接口单元1310(Bus Interface Unit,简称BIU),用于取指存储器1309从外部存储器获取指令,还用于存储单元访问控制器1305从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1306或将权重数据搬运到权重存储器1302中或将输入数据数据搬运到输入存储器1301中。
向量计算单元1307包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1307能将经处理的输出的向量存储到统一存储器1306。例如,向量计算单元1307可以将线性函数;或,非线性函数应用到运算电路1303的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1307生成归一化的值、像素级求和的值,或二者均有。在一些 实现中,处理过的输出的向量能够用作到运算电路1303的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1304连接的取指存储器(instruction fetch buffer)1309,用于存储控制器1304使用的指令;
统一存储器1306,输入存储器1301,权重存储器1302以及取指存储器1309均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (25)

  1. 一种模型训练方法,其特征在于,所述方法包括:
    获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数据单元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;
    根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;
    根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测数据单元;
    根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器和所述解码器。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取第一初始数据序列;
    通过概率采样的方式,确定所述第一初始数据序列中的至少一个数据单元是否被掩码,以得到所述第二数据序列,其中所述概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    获取第二初始数据序列;
    通过概率采样的方式,确定所述第二初始数据序列中的至少一个数据单元是否被掩码,以得到所述第一数据序列;其中,在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    通过所述PLM中所述编码器的输出层,对所述第一数据序列中被掩码后的数据单元进行预测,以得到第二预测数据单元;
    根据所述第二预测数据单元和所述第一数据序列中被掩码前的数据单元之间的差异,更新所述编码器。
  5. 根据权利要求1或2所述的方法,其特征在于,所述PLM用于实现不同语言类型的文本之间的序列转换任务,所述方法还包括:
    获取第三初始数据序列;所述第二数据序列和所述第三初始数据序列为语义相同的文本、且通过不同语言类型表达,所述第三初始数据序列中的第一数据单元和所述待预测数据单元的语义相同;
    将所述第三初始数据序列中的所述第一数据单元替换为第二数据单元,以得到所述第一数据序列;其中所述第二数据单元和所述第一数据单元的语义相同、且通过不同语言类 型表达。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    获取第四初始数据序列;
    对所述第四初始数据序列中和所述第一数据单元语义相同的数据单元进行掩码,以得到所述第二数据序列。
  7. 根据权利要求1至6任一所述的方法,其特征在于,
    所述第一子数据或者所述第二子数据包括未被掩码的数据单元,所述第二嵌入向量包含所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
    所述第一子数据或者所述第二子数据包括被掩码的数据单元,所述第二嵌入向量包含所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
    所述第二嵌入向量包含所述待预测数据单元与所述第二数据序列中其他数据单元之间的位置关系。
  8. 根据权利要求1至7任一所述的方法,其特征在于,进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为相同的数据序列;或者,
    所述进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为经过样本标注的不同数据序列。
  9. 根据权利要求1至8任一所述的方法,其特征在于,所述第一数据序列和所述第二数据序列为文本数据。
  10. 根据权利要求1至9任一所述的方法,其特征在于,所述第一数据序列和所述第二数据序列为语义相同且通过不同语言类型表达的数据。
  11. 一种模型训练装置,其特征在于,所述装置包括:
    获取模块,用于获取第一嵌入向量以及第二嵌入向量;所述第一嵌入向量对应于第一数据序列,所述第二嵌入向量对应于第二数据序列,所述第二数据序列包括第一子数据、被掩码的待预测数据单元以及第二子数据,所述第一子数据在所述第二数据序列中位于所述待预测数据单元的上文,所述第二子数据在所述第二数据序列中位于所述待预测数据单元的下文;
    编码模块,用于根据所述第一嵌入向量,通过预训练语言模型PLM中的编码器,得到隐状态;
    解码模块,用于根据所述第一子数据、所述第二子数据以及所述隐状态,通过所述PLM中的解码器以及所述解码器的输出层,对所述待预测数据单元进行预测,以得到第一预测 数据单元;
    训练模块,用于根据所述第一预测数据单元和所述待预测数据单元之间的差异,更新所述编码器和所述解码器。
  12. 根据权利要求11所述的装置,其特征在于,所述获取模块,还用于:
    获取第一初始数据序列;
    通过概率采样的方式,确定所述第一初始数据序列中的至少一个数据单元是否被掩码,以得到所述第二数据序列,其中所述概率采样得到的概率值用于作为所述至少一个数据单元被掩码的概率。
  13. 根据权利要求12所述的装置,其特征在于,所述获取模块,还用于:
    获取第二初始数据序列;
    通过概率采样的方式,确定所述第二初始数据序列中的至少一个数据单元是否被掩码,以得到所述第一数据序列;其中,在进行所述概率采样时,所述第一初始数据序列中数据单元被掩码的概率大于所述第二初始数据序列中数据单元被掩码的概率。
  14. 根据权利要求13所述的装置,其特征在于,所述编码模块,还用于:
    通过所述PLM中所述编码器的输出层,对所述第一数据序列中被掩码后的数据单元进行预测,以得到第二预测数据单元;
    所述训练模块,还用于:根据所述第二预测数据单元和所述第一数据序列中被掩码前的数据单元之间的差异,更新所述编码器。
  15. 根据权利要求11或12所述的装置,其特征在于,所述PLM用于实现不同语言类型的文本之间的序列转换任务,所述获取模块,还用于:
    获取第三初始数据序列;所述第二数据序列和所述第三初始数据序列为语义相同的文本、且通过不同语言类型表达,所述第三初始数据序列中的第一数据单元和所述待预测数据单元的语义相同;
    将所述第三初始数据序列中的所述第一数据单元替换为第二数据单元,以得到所述第一数据序列;其中所述第二数据单元和所述第一数据单元的语义相同、且通过不同语言类型表达。
  16. 根据权利要求15所述的装置,其特征在于,所述获取模块,还用于:
    获取第四初始数据序列;
    对所述第四初始数据序列中和所述第一数据单元语义相同的数据单元进行掩码,以得到所述第二数据序列。
  17. 根据权利要求11至15任一所述的装置,其特征在于,
    所述第一子数据或者所述第二子数据包括未被掩码的数据单元,所述第二嵌入向量包含所述未被掩码的数据单元的语义信息、以及所述未被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
    所述第一子数据或者所述第二子数据包括被掩码的数据单元,所述第二嵌入向量包含所述被掩码的数据单元与所述第二数据序列中其他数据单元之间的位置关系;或者,
    所述第二嵌入向量包含所述待预测数据单元与所述第二数据序列中其他数据单元之间的位置关系。
  18. 根据权利要求11至17任一所述的装置,其特征在于,进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为相同的数据序列;或者,
    所述进行掩码操作前的所述第一数据序列和进行掩码操作前的所述第二数据序列为经过样本标注的不同数据序列。
  19. 根据权利要求11至18任一所述的装置,其特征在于,所述第一数据序列和所述第二数据序列为文本数据。
  20. 根据权利要求11至19任一所述的装置,其特征在于,所述第一数据序列和所述第二数据序列为语义相同且通过不同语言类型表达的数据。
  21. 一种数据处理方法,其特征在于,所述方法包括:
    获取如权利要求1至10任一所述方法得到的更新后的PLM以及待处理数据;
    通过所述更新后的PLM,处理所述待处理数据,以得到处理结果。
  22. 一种数据处理装置,其特征在于,所述装置用于:获取如权利要求1至10任一所述方法得到的更新后的PLM以及待处理数据;通过所述更新后的PLM,处理所述待处理数据,以得到处理结果。
  23. 一种数据处理装置,其特征在于,所述装置包括存储器和处理器;所述存储器存储有代码,所述处理器被配置为获取所述代码,并执行如权利要求1至10任一所述的方法。
  24. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有一个或多个指令,所述指令在由一个或多个计算机执行时使得所述一个或多个计算机实施权利要求1至10任一所述的方法。
  25. 一种计算机程序产品,其特征在于,所述计算机程序产品包括代码,当所述代码被执行时,用于实现权利要求1至10任一项所述的方法的步骤。
PCT/CN2023/076756 2022-02-22 2023-02-17 一种模型训练方法及相关设备 WO2023160472A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210164992.6 2022-02-22
CN202210164992.6A CN114676234A (zh) 2022-02-22 2022-02-22 一种模型训练方法及相关设备

Publications (1)

Publication Number Publication Date
WO2023160472A1 true WO2023160472A1 (zh) 2023-08-31

Family

ID=82072110

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/076756 WO2023160472A1 (zh) 2022-02-22 2023-02-17 一种模型训练方法及相关设备

Country Status (2)

Country Link
CN (1) CN114676234A (zh)
WO (1) WO2023160472A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274823A (zh) * 2023-11-21 2023-12-22 成都理工大学 基于DEM特征增强的视觉Transformer滑坡识别方法
CN117592514A (zh) * 2024-01-19 2024-02-23 中国传媒大学 评论文本观点预测方法、系统及设备和存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676234A (zh) * 2022-02-22 2022-06-28 华为技术有限公司 一种模型训练方法及相关设备
CN117494705A (zh) * 2022-07-20 2024-02-02 华为技术有限公司 一种模型训练方法及其装置
CN115797495B (zh) * 2023-02-07 2023-04-25 武汉理工大学 一种句子-字符语义空间融合感知的文本生成图像的方法
CN116450779B (zh) * 2023-06-16 2023-09-12 北京搜狐新媒体信息技术有限公司 文本生成方法及相关装置
CN116822632B (zh) * 2023-08-28 2024-01-05 腾讯科技(深圳)有限公司 一种文本数据的推理方法、装置、存储介质和电子设备
CN117094361B (zh) * 2023-10-19 2024-01-26 北京中科汇联科技股份有限公司 一种选择参数高效微调模块的方法

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046359A (zh) * 2019-04-16 2019-07-23 苏州大学 基于样例指导的神经机器翻译方法
US20200034436A1 (en) * 2018-07-26 2020-01-30 Google Llc Machine translation using neural network models
CN112257472A (zh) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 一种文本翻译模型的训练方法、文本翻译的方法及装置
US20210182504A1 (en) * 2018-11-28 2021-06-17 Tencent Technology (Shenzhen) Company Limited Text translation method and apparatus, and storage medium
CN113297841A (zh) * 2021-05-24 2021-08-24 哈尔滨工业大学 基于预训练双语词向量的神经机器翻译方法
CN113449529A (zh) * 2020-03-24 2021-09-28 北京金山数字娱乐科技有限公司 一种翻译模型的训练方法及装置、翻译方法及装置
CN113761946A (zh) * 2020-06-04 2021-12-07 阿里巴巴集团控股有限公司 模型训练及数据处理方法、装置、电子设备、存储介质
CN114676234A (zh) * 2022-02-22 2022-06-28 华为技术有限公司 一种模型训练方法及相关设备

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200034436A1 (en) * 2018-07-26 2020-01-30 Google Llc Machine translation using neural network models
US20210182504A1 (en) * 2018-11-28 2021-06-17 Tencent Technology (Shenzhen) Company Limited Text translation method and apparatus, and storage medium
CN110046359A (zh) * 2019-04-16 2019-07-23 苏州大学 基于样例指导的神经机器翻译方法
CN113449529A (zh) * 2020-03-24 2021-09-28 北京金山数字娱乐科技有限公司 一种翻译模型的训练方法及装置、翻译方法及装置
CN113761946A (zh) * 2020-06-04 2021-12-07 阿里巴巴集团控股有限公司 模型训练及数据处理方法、装置、电子设备、存储介质
CN112257472A (zh) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 一种文本翻译模型的训练方法、文本翻译的方法及装置
CN113297841A (zh) * 2021-05-24 2021-08-24 哈尔滨工业大学 基于预训练双语词向量的神经机器翻译方法
CN114676234A (zh) * 2022-02-22 2022-06-28 华为技术有限公司 一种模型训练方法及相关设备

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274823A (zh) * 2023-11-21 2023-12-22 成都理工大学 基于DEM特征增强的视觉Transformer滑坡识别方法
CN117274823B (zh) * 2023-11-21 2024-01-26 成都理工大学 基于DEM特征增强的视觉Transformer滑坡识别方法
CN117592514A (zh) * 2024-01-19 2024-02-23 中国传媒大学 评论文本观点预测方法、系统及设备和存储介质
CN117592514B (zh) * 2024-01-19 2024-03-26 中国传媒大学 评论文本观点预测方法、系统及设备和存储介质

Also Published As

Publication number Publication date
CN114676234A (zh) 2022-06-28

Similar Documents

Publication Publication Date Title
WO2023160472A1 (zh) 一种模型训练方法及相关设备
WO2021047286A1 (zh) 文本处理模型的训练方法、文本处理方法及装置
WO2022057776A1 (zh) 一种模型压缩方法及装置
WO2020228376A1 (zh) 文本处理方法、模型训练方法和装置
WO2021233112A1 (zh) 基于多模态机器学习的翻译方法、装置、设备及存储介质
WO2022007823A1 (zh) 一种文本数据处理方法及装置
WO2021159714A1 (zh) 一种数据处理方法及相关设备
US11113479B2 (en) Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query
WO2022068627A1 (zh) 一种数据处理方法及相关设备
CN109887484B (zh) 一种基于对偶学习的语音识别与语音合成方法及装置
CN113239700A (zh) 改进bert的文本语义匹配设备、系统、方法及存储介质
WO2023236977A1 (zh) 一种数据处理方法及相关设备
WO2022253074A1 (zh) 一种数据处理方法及相关设备
WO2021238333A1 (zh) 一种文本处理网络、神经网络训练的方法以及相关设备
Pramanik et al. Text normalization using memory augmented neural networks
WO2023284716A1 (zh) 一种神经网络搜索方法及相关设备
US20240046067A1 (en) Data processing method and related device
CN116432019A (zh) 一种数据处理方法及相关设备
CN116541492A (zh) 一种数据处理方法及相关设备
CN111597816A (zh) 一种自注意力命名实体识别方法、装置、设备及存储介质
EP4060526A1 (en) Text processing method and device
WO2023116572A1 (zh) 一种词句生成方法及相关设备
CN110852066B (zh) 一种基于对抗训练机制的多语言实体关系抽取方法及系统
CN113204679B (zh) 一种代码查询模型的生成方法和计算机设备
WO2023143262A1 (zh) 一种数据处理方法及相关设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23759097

Country of ref document: EP

Kind code of ref document: A1