CN112686023A - Text data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112686023A
CN112686023A
Authority
CN
China
Prior art keywords
language model
data
text
sample data
features
Prior art date
Legal status
Pending
Application number
CN202011612897.5A
Other languages
Chinese (zh)
Inventor
刘欢
Current Assignee
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202011612897.5A
Publication of CN112686023A
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a text data processing method and device, electronic equipment and a storage medium. The method comprises the following steps: obtaining sample data; extracting global context features from the sample data through an AE language model; extracting text generation features from the sample data through an AR language model; pre-training a unified language model UNLM through the global context features and the text generation features to obtain a trained target language model; and inputting text data to be processed into the trained target language model, and performing first processing on the input text data through the target language model to obtain a target text.

Description

Text data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text data processing method and apparatus, an electronic device, and a storage medium.
Background
In the prior art, in the process of pre-training a pre-trained language model to obtain a target model, an Auto-Regression (AR) language model, for example a Generative Pre-Training (GPT) model, or an Auto-Encoding (AE) language model, for example a Bidirectional Encoder Representations from Transformers (BERT) model, is used alone to train the pre-trained language model, so as to obtain the trained target model.
Disclosure of Invention
The embodiment of the application provides a text data processing method and device, electronic equipment and a storage medium, which can perform language model training through the global context features extracted by an AE language model and the text generation features extracted by an AR language model, so that the accuracy of text data processing by the trained target language model is improved.
In a first aspect, an embodiment of the present application provides a text data processing method, including:
acquiring sample data;
extracting global context features from the sample data through the AE language model; extracting text generation features from the sample data through the AR language model;
pre-training a unified language model UNLM through the global context characteristics and the text generation characteristics to obtain a trained target language model;
inputting the text data to be processed into the trained target language model, and performing first processing on the input text data through the target language model to obtain a target text.
In a second aspect, an embodiment of the present application provides a text data processing apparatus, including:
a receiving and transmitting unit for obtaining sample data;
the processing unit is used for extracting global context characteristics from the sample data through the AE language model; extracting text generation features from the sample data through the AR language model; pre-training a unified language model UNLM through the global context characteristics and the text generation characteristics to obtain a trained target language model;
the processing unit is further configured to input the text data to be processed into the trained target language model, and perform first processing on the input text data through the target language model to obtain a target text.
In a third aspect, an embodiment of the present application provides an electronic device, including: a transceiver, a processor and a memory, the processor being connected to the memory, the memory being configured to store a computer program, the processor being configured to execute the computer program stored in the memory to cause the electronic device to perform the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, where the computer program makes a computer execute the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.
The embodiment of the application has the following beneficial effects:
It can be seen that, in the embodiment of the present application, sample data is obtained; global context features are extracted from the sample data through an AE language model; text generation features are extracted from the sample data through an AR language model; the unified language model UNLM is pre-trained through the global context features and the text generation features to obtain a trained target language model; and the text data to be processed is input into the trained target language model, the input text data is subjected to first processing through the target language model, and the target text is obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1A is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 1B is a schematic flowchart of a text data processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating pre-training of a UNLM language model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another text data processing method according to an embodiment of the present application;
fig. 4 is a block diagram illustrating functional units of a text data processing apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The electronic device according to the embodiments of the present application may be an electronic device with communication capability, and may include various handheld devices with a wireless communication function, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and various forms of User Equipment (UE), Mobile Stations (MS), terminal devices, and so on.
Referring to fig. 1A, fig. 1A is a schematic structural diagram of hardware of an electronic device 100 according to an exemplary embodiment of the present disclosure. The electronic device 100 may be a smart phone, a tablet computer, an electronic book, or other electronic devices capable of running an application. The electronic device 100 in the present application may include one or more of the following components: processor, memory, transceiver, etc.
A processor may include one or more processing cores. The processor connects various parts throughout the electronic device 100 by using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and by invoking the data stored in the memory. Alternatively, the processor may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communications. It is to be understood that the modem may also be implemented by a separate communication chip without being integrated into the processor.
Wherein the processor may be connected to other components of the electronic device, such as a camera, a display screen, a speaker, a microphone, a sensor, an infrared light, etc.
And the signal processor can be connected between the transceiver and the processor and is used for processing the signals received by the transceiver.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory includes a non-transitory computer-readable medium. The memory may be used to store an instruction, a program, code, a code set, or an instruction set. The memory may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system (which may be an Android system, an iOS system, or another system), instructions for implementing at least one function (machine learning algorithms, feature extraction, model training, etc.), instructions for implementing the method embodiments described below, and the like. The data storage area may also store data created during use of the electronic device 100.
Referring to fig. 1B, fig. 1B illustrates a text data processing method according to an embodiment of the present application. The method is applied to a text data processing apparatus and includes the following steps:
101. the text data processing device acquires sample data.
The sample data may be text data, and the language type of the text data may be Chinese, English, or another language; the embodiment of the present application is not limited thereto.
The text data processing device may obtain the sample data from a database in the server; specifically, the text data processing device may send a data acquisition request to the server and receive the sample data transmitted by the server.
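As a minimal illustration of this step, the exchange between the text data processing device and the server might look like the sketch below; the endpoint URL, the request field and the response field are hypothetical placeholders, since the patent does not specify a transport protocol or data format.
```python
import requests  # assumed HTTP client; the patent does not name a transport

# Hypothetical endpoint from which the device requests sample data.
SERVER_URL = "https://example.com/api/sample-data"  # placeholder URL

def fetch_sample_data(batch_size=32):
    """Send a data acquisition request to the server and receive the sample data."""
    response = requests.post(SERVER_URL, json={"batch_size": batch_size}, timeout=10)
    response.raise_for_status()
    return response.json()["samples"]  # assumed response field holding text samples
```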
102. The text data processing device extracts global context characteristics from the sample data through the AE language model; and extracting text generation features from the sample data through the AR language model.
The AE language model refers to a model that predicts the currently masked word from context information and may be, for example, a BERT model; the AR language model refers to a model that predicts the word at the current position from the words appearing before it (or after it) and may be, for example, a GPT model.
The sample data may be input into the AE language model and the AR language model respectively. Specifically, the sample data to be trained is converted into input data, the input data specifically being input vectors; the input vectors {xi} are assembled into H0 = [x1, ..., x|x|] and then input into the AE language model, such as a BERT model, and into the AR language model, such as a GPT model. The AE language model adopts the BERT model, which uses a multi-layer bidirectional encoder network in which each element of each layer aggregates the information of all elements of the previous layer, so that both the preceding and the following context are taken into account.
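A rough sketch of this step using the Hugging Face transformers library is given below; the library, the checkpoint names and the way the feature tensors are read out are assumptions made for illustration, since the patent only names BERT and GPT as examples of AE and AR language models.
```python
import torch
from transformers import BertTokenizer, BertModel, GPT2Tokenizer, GPT2LMHeadModel

# Example AE (BERT) and AR (GPT) backbones; the checkpoints are illustrative choices.
ae_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
ae_model = BertModel.from_pretrained("bert-base-chinese")
ar_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ar_model = GPT2LMHeadModel.from_pretrained("gpt2")

sample = "one sentence of sample data"

# H0 = [x1, ..., x|x|]: the token sequence fed to both language models.
ae_inputs = ae_tokenizer(sample, return_tensors="pt")
ar_inputs = ar_tokenizer(sample, return_tensors="pt")

with torch.no_grad():
    # Bidirectional (global context) features from the AE language model.
    ae_features = ae_model(**ae_inputs).last_hidden_state              # (1, L, hidden)
    # Left-to-right (text generation) features from the AR language model.
    ar_out = ar_model(**ar_inputs, output_hidden_states=True)
    ar_features = ar_out.hidden_states[-1]                             # (1, L, hidden)
```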
Optionally, in the step 102, the extracting global context features from the sample data through the AE language model includes:
21. and masking the right data of the sample data through the AE language model, and extracting global context features from the left data of the sample data.
In particular, considering that the AE language model has an advantage in predicting the currently masked word from context information, the AE language model may obtain bidirectional information for prediction; for example, to predict a word masked by the mask, the AE language model may obtain context information in both the forward and the backward direction.
For example, when "x1x2x3x4x5x6x4x5x4x5x2x2" is input into the AE language model, the AE language model may mask "x4x5x4x5x2x2" on the right side and extract global context features from "x1x2x3x4x5x6" on the left side.
Optionally, in step 21, the extracting global context features from left data of the sample data includes:
2101. randomly masking at least one first eigenvalue in the left data by the AE language model with a first mask matrix;
2102. calculating a first probability distribution corresponding to the masked at least one first feature value, outputting the at least one first feature value and the first probability distribution, and using the at least one first feature value and the first probability distribution as the global context feature.
In a specific implementation, x2, x4 and x5 in the left-side data "x1x2x3x4x5x6" may be replaced through the first mask matrix M, so as to obtain "x1[M2]x3[M4][M5]x6".
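The sketch below makes this masking step concrete with toy tokens; [M] stands in for the model's actual mask token, and the chosen positions mirror the example in which x2, x4 and x5 become [M2], [M4] and [M5].
```python
import random

def mask_left_data(tokens, positions=None, mask_symbol="[M]", ratio=0.5):
    """Replace selected positions of the left-hand data with a mask symbol,
    a toy stand-in for applying the first mask matrix M."""
    if positions is None:
        k = max(1, int(len(tokens) * ratio))
        positions = sorted(random.sample(range(len(tokens)), k))
    masked = [mask_symbol if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions

left_data = ["x1", "x2", "x3", "x4", "x5", "x6"]
masked_left, masked_positions = mask_left_data(left_data, positions=[1, 3, 4])
# masked_left == ["x1", "[M]", "x3", "[M]", "[M]", "x6"], i.e. "x1[M2]x3[M4][M5]x6"
```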
Further, the first probability distribution corresponding to the masked at least one first feature value may be calculated through the Transformer network in the AE language model, and specifically, the following formula may be adopted:
first probability distribution = p(x2|x\{2,4,5})p(x4|x\{2,4,5})p(x5|x\{2,4,5});
where p(x2|x\{2,4,5}) is the probability of the masked x2, p(x4|x\{2,4,5}) is the probability of the masked x4, and p(x5|x\{2,4,5}) is the probability of the masked x5; x\{2,4,5} denotes the set X with {x2, x4, x5} removed, for example, if X = {x1, x2, x3, x4, x5, x6}, then x\{2,4,5} = {x1, x3, x6}.
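A hedged sketch of computing this product of masked-token probabilities with a BERT-style masked language model follows; the model class, the checkpoint and the one-to-one word-to-word-piece alignment are assumptions made for illustration, not details fixed by the patent.
```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
mlm = BertForMaskedLM.from_pretrained("bert-base-chinese")

def first_probability_distribution(tokens, masked_positions, targets):
    """Product of p(target | unmasked context) over all masked positions,
    approximating the first probability distribution of the AE language model."""
    visible = list(tokens)
    for pos in masked_positions:
        visible[pos] = tokenizer.mask_token              # mask x2, x4, x5, ...
    inputs = tokenizer(" ".join(visible), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits                    # (1, seq_len, vocab)
    probs = torch.softmax(logits, dim=-1)
    joint = 1.0
    # The +1 offset skips the [CLS] token; it assumes each word maps to a
    # single word piece, which holds only for this toy example.
    for pos, target in zip(masked_positions, targets):
        target_id = tokenizer.convert_tokens_to_ids(target)
        joint *= probs[0, pos + 1, target_id].item()
    return joint
```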
Optionally, in step 102, the extracting text generation features from the sample data through the AR language model includes:
22. and hiding the left data of the sample data through the AR language model, and extracting text generation features from the right data of the sample data.
The AR language model may obtain one-way information, i.e., reading information forward and predicting words at masked positions, or reading information backward and predicting words at masked positions.
For example, "x1x2x3x4x5x6x4x5x4x5x2x2" is input into the AR language model, and the AR language model may mask "x1x2x3x4x5x6" on the left side; specifically, it may delete "x1x2x3x4x5x6" from "x1x2x3x4x5x6x4x5x4x5x2x2", or may replace "x1x2x3x4x5x6" in "x1x2x3x4x5x6x4x5x4x5x2x2" with special symbols, for example "######x4x5x4x5x2x2", and then extract text generation features from "x4x5x4x5x2x2" on the right side.
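A toy sketch of hiding the left data, either by deletion or by replacement with a special symbol, is shown below; the split point and the "#" symbol are illustrative choices.
```python
def hide_left_data(tokens, split, special="#"):
    """Hide the left-hand data either by deleting it or by replacing every
    left-hand token with a special symbol, as in "######x4x5x4x5x2x2"."""
    deleted = tokens[split:]                            # deletion variant
    replaced = [special] * split + tokens[split:]       # replacement variant
    return deleted, replaced

sample_tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x4", "x5", "x4", "x5", "x2", "x2"]
right_only, with_symbols = hide_left_data(sample_tokens, split=6)
# right_only   == ["x4", "x5", "x4", "x5", "x2", "x2"]
# with_symbols == ["#", "#", "#", "#", "#", "#", "x4", "x5", "x4", "x5", "x2", "x2"]
```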
Optionally, in step 22, the extracting text generation features from the right data of the sample data includes:
2201. randomly masking, by the AR language model, at least one second eigenvalue in the right data with a second mask matrix;
2202. calculating a second probability distribution of the masked at least one second feature value, outputting the at least one second feature value and the second probability distribution, and using the at least one second feature value and the second probability distribution as a text generation feature.
The second mask matrix may be, for example, P; the right data is randomly masked through the second mask matrix, so that "[P4][P5]x4x5[P2]x2" may be obtained.
The at least one second characteristic value may be three values of x2, x4, and x5, for example.
Further, the second probability distribution corresponding to the masked at least one second feature value may be calculated through the Transformer network in the AR language model, and specifically, the following formula may be adopted:
second probability distribution = p(x2|x\{2,4,5})p(x4|x\{4,5})p(x5|x\{5}).
Alternatively, second probability distribution = p(x5|x\{2,4,5})p(x4|x\{2,4})p(x2|x\{2}).
Where x\{5} denotes the set X with {x5} removed, for example, if X = {x1, x2, x3, x4, x5, x6}, then x\{5} = {x1, x2, x3, x4, x6}.
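The left-to-right factorization above can be approximated with a GPT-style model by multiplying the conditional probability of each token given only the tokens before it, as in the hedged sketch below; the checkpoint and tokenization details are illustrative assumptions.
```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")              # illustrative checkpoint
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def autoregressive_probability(text):
    """Product over t of p(x_t | x_<t): the left-to-right factorization used
    for the second probability distribution of the AR language model."""
    ids = tok(text, return_tensors="pt").input_ids       # (1, seq_len)
    with torch.no_grad():
        logits = lm(ids).logits                          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    total_log_prob = 0.0
    for t in range(1, ids.shape[1]):
        # log p(token t | tokens 0..t-1): logits at position t-1 predict token t
        total_log_prob += log_probs[0, t - 1, ids[0, t]].item()
    return float(torch.exp(torch.tensor(total_log_prob)))
```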
Optionally, in step 102, the extracting text generation features from the sample data through the AR language model includes:
23. randomly masking at least one third eigenvalue in the sample data by the AR language model using a first mask matrix and a second mask matrix;
24. calculating a third probability distribution of the masked at least one third feature value, outputting the at least one third feature value and the third probability distribution, and using the at least one third feature value and the third probability distribution as the text generation feature.
For example, "x1x2x3x4x5x6x4x5x4x5x2x2" is input into the AR language model; the AR language model may randomly mask the left data through the first mask matrix and randomly mask the right data through the second mask matrix, so as to obtain "x1[M2]x3[M4][M5]x6[P4][P5]x4x5[P2]x2".
Further, the third probability distribution corresponding to the masked at least one third feature value may be calculated through the Transformer network in the AR language model, and specifically, the following formula may be adopted:
third probability distribution = p(x2|x\{2,4,5})p(x4|x\{4,5})p(x5|x\{5}).
Alternatively, third probability distribution = p(x4|x\{2,4,5})p(x5|x\{2,4,5})p(x2|x\{2}).
Where x\{5} denotes the set X with {x5} removed, for example, if X = {x1, x2, x3, x4, x5, x6}, then x\{5} = {x1, x2, x3, x4, x6}.
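The combined use of both mask matrices over one sample can be sketched as follows; the token values and the split point between the left and right data are toy choices that mirror the running example.
```python
def apply_both_masks(tokens, split, m_positions, p_positions,
                     m_symbol="[M]", p_symbol="[P]"):
    """Mask left-hand positions with [M] (first mask matrix) and right-hand
    positions with [P] (second mask matrix) in a single training sequence."""
    out = []
    for i, tok in enumerate(tokens):
        if i < split and i in m_positions:
            out.append(m_symbol)
        elif i >= split and i in p_positions:
            out.append(p_symbol)
        else:
            out.append(tok)
    return out

sample = ["x1", "x2", "x3", "x4", "x5", "x6", "x4", "x5", "x4", "x5", "x2", "x2"]
# Mask x2, x4, x5 on the left and the first x4, x5 and x2 on the right.
combined = apply_both_masks(sample, split=6, m_positions={1, 3, 4}, p_positions={6, 7, 10})
# -> ["x1","[M]","x3","[M]","[M]","x6","[P]","[P]","x4","x5","[P]","x2"]
```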
103. And the text data processing device pre-trains the unified language model UNLM through the global context characteristics and the text generation characteristics to obtain a trained target language model.
The UNLM model is used to perform both Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks. The UNLM is a multi-layer Transformer network, and a target language model for processing text data can be obtained by pre-training the UNLM model.
In a specific implementation, the weights of the UNLM can be updated according to the global context features and the text generation features to obtain a model with updated weights, and the trained target language model is obtained once the UNLM meets a preset convergence condition.
Because the AR language model performs better at natural language generation, while the AE language model is better at encoding contextual semantic features, linking the global context features extracted by the AE language model with the text generation features extracted by the AR language model and pre-training the UNLM language model on the linked features improves the performance of the target language model in both text generation and understanding of context semantics.
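One way to realize this linkage inside a single Transformer is a UniLM-style self-attention mask, sketched below under the assumption that the left segment is treated bidirectionally (AE-style) and the right segment autoregressively (AR-style); the patent does not spell out the mask construction, so this is an illustration rather than the claimed mechanism.
```python
import torch

def unified_attention_mask(left_len, right_len):
    """Boolean (L, L) mask, True where attention is allowed: bidirectional
    within the left (AE-style) segment, causal within the right (AR-style)
    segment, and full visibility of the left segment from the right."""
    total = left_len + right_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:left_len, :left_len] = True                                    # left <-> left
    mask[left_len:, :left_len] = True                                    # right -> left
    mask[left_len:, left_len:] = torch.tril(
        torch.ones(right_len, right_len, dtype=torch.bool))              # right -> earlier right
    return mask

attention_mask = unified_attention_mask(6, 6)   # for the 6 + 6 token running example
```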
Optionally, the pre-training of the unified language model UNLM through the global context feature and the text generation feature in step 103 includes:
31. linking the global context features and the text generation features to obtain linked training features;
32. and updating the weight in the UNLM language model according to the linked training characteristics until the cross entropy loss calculated in the UNLM language model is less than a preset threshold value.
In a specific implementation, the global context features and the text generation features are linked to obtain the linked training features, and the content masked in the AE language model and in the AR language model is determined according to the at least one first feature value and the first probability distribution, and the at least one second feature value and the second probability distribution.
For example, referring to fig. 2, fig. 2 is a schematic diagram of pre-training the UNLM language model provided in this embodiment of the present application. Assume that the input data is X = x1x2x3x4x5x6x4x5x4x5x2x2. "x1x2x3x4x5x6x4x5x4x5x2x2" is input into the AE language model; the AE language model masks "x4x5x4x5x2x2" on the right side and randomly masks the left side through the first mask matrix [M] to obtain "x1[M2]x3[M4][M5]x6". "x1x2x3x4x5x6x4x5x4x5x2x2" is also input into the AR language model; the AR language model masks "x1x2x3x4x5x6" on the left side and randomly masks the right-side "x4x5x4x5x2x2" through the second mask matrix [P] to obtain "[P4][P5]x4x5[P2]x2". "x1[M2]x3[M4][M5]x6" and "[P4][P5]x4x5[P2]x2" are then linked to obtain "x1[M2]x3[M4][M5]x6[P4][P5]x4x5[P2]x2". The AE language model predicts the masked content by using the bidirectional context semantic information in the left data, that is, the masked content corresponding to [M2] is predicted from the context semantic information around [M2], the masked content corresponding to [M4] is predicted from the context semantic information around [M4], and the masked content corresponding to [M5] is predicted from the context semantic information around [M5]. The AR language model predicts the masked content by using the one-way information (for example, the subsequent information) in the right data, that is, the masked content corresponding to [P4] is predicted from the information following [P4], the masked content corresponding to [P5] is predicted from the information following [P5], and the masked content corresponding to [P2] is predicted from the information (x2) following [P2]. Finally, by gathering the feature information of the AE language model and the AR language model, the masked content is determined to be x2, x4 and x5.
The UNLM language model then predicts the masked content, and the weights of the UNLM language model are updated according to the output data of the UNLM language model until the cross-entropy loss of the predicted masked content is smaller than a preset threshold, so as to obtain the trained target language model.
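A minimal sketch of this pre-training loop is given below, assuming a PyTorch model whose forward pass returns per-position logits over the vocabulary; the optimizer, the label convention of -100 for unmasked positions and the threshold value are illustrative assumptions rather than details fixed by the patent.
```python
import torch
from torch import nn

def pretrain_unlm(model, data_loader, threshold=0.05, lr=1e-4, max_steps=100000):
    """Update the UNLM weights on the linked (masked) training sequences until
    the cross-entropy of the predicted masked content is below the threshold."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=-100)     # -100 marks unmasked positions
    for step, (input_ids, attention_mask, labels) in enumerate(data_loader, start=1):
        logits = model(input_ids, attention_mask)        # (B, L, vocab); assumed signature
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < threshold or step >= max_steps:
            break
    return model
```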
104. And the text data processing device inputs the text data to be processed into the trained target language model, and performs first processing on the input text data through the target language model to obtain a target text.
The text data may be, for example, a sentence, a text paragraph, or the like.
The first processing may be, for example, translation, classification, question-and-answer text generation, and the like, and the embodiment of the present application is not limited thereto. Because the pre-trained language model is trained through the AE language model and the AR language model to obtain the target language model, the target language model can make use of global context information and generate the target text well.
Optionally, the target language model is a model for translation, and in step 104, the first processing is performed on the input text data through the target language model to obtain a target text, where the processing includes:
and translating the input text data through the target language model, and outputting the target text.
In specific implementation, the text data is a text to be translated, the translation of the text to be translated based on the global context information can be realized through the target language model, and a coherent and smooth target text is generated for translated words or phrases.
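If the trained target language model exposes a standard generate-style decoding interface (an assumption, since the patent does not name a specific inference API), translation could look like the following sketch.
```python
import torch

def translate(model, tokenizer, source_text, max_new_tokens=128):
    """Feed the text to be translated into the trained target language model
    and decode the generated target text (greedy decoding for simplicity)."""
    inputs = tokenizer(source_text, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```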
It can be seen that, in the embodiment of the present application, sample data is obtained; global context features are extracted from the sample data through an AE language model; text generation features are extracted from the sample data through an AR language model; a Unified Language Model (UNLM) is pre-trained through the global context features and the text generation features to obtain a trained target language model; and the text data to be processed is input into the trained target language model, the input text data is subjected to first processing through the target language model, and the target text is obtained.
Referring to fig. 3, fig. 3 is a schematic flowchart of a text data processing method according to an embodiment of the present application. The method is applied to a text data processing device, and the method of the embodiment comprises the following steps:
301. and acquiring sample data.
302. Masking right-side data of the sample data by the AE language model, and randomly masking at least one first eigenvalue in the left-side data by the AE language model by using a first mask matrix.
303. Calculating a first probability distribution corresponding to the masked at least one first feature value, outputting the at least one first feature value and the first probability distribution, and using the at least one first feature value and the first probability distribution as the global context feature.
304. Masking left-side data of the sample data by the AR language model, and randomly masking at least one second characteristic value in the right-side data by the AR language model by adopting a second mask matrix.
305. Calculating a second probability distribution of the masked at least one second feature value, outputting the at least one second feature value and the second probability distribution, and using the at least one second feature value and the second probability distribution as the text generation feature.
306. And linking the global context characteristics and the text generation characteristics to obtain the linked training characteristics.
307. And updating the weight in the UNLM language model according to the linked training characteristics until the cross entropy loss calculated in the UNLM language model is less than a preset threshold value.
308. And inputting the text data to be processed into the trained target language model.
309. And performing first processing on input text data through the target language model to obtain a target text.
For the contents of this embodiment that are the same as those of the embodiment shown in fig. 1B, reference may be made to the foregoing description; the description is not repeated here.
It can be seen that, in the embodiment of the present application, sample data is obtained; the right-side data of the sample data is masked through the AE language model, at least one first feature value in the left-side data is randomly masked through the AE language model by using the first mask matrix, a first probability distribution corresponding to the masked at least one first feature value is calculated, and the at least one first feature value and the first probability distribution are output and used as the global context feature; the left-side data of the sample data is masked through the AR language model, at least one second feature value in the right-side data is randomly masked through the AR language model by using the second mask matrix, a second probability distribution of the masked at least one second feature value is calculated, and the at least one second feature value and the second probability distribution are output and used as the text generation feature; the global context feature and the text generation feature are linked to obtain the linked training feature, and the weights in the UNLM language model are updated according to the linked training feature until the cross-entropy loss calculated in the UNLM language model is smaller than the preset threshold; finally, the text data to be processed is input into the trained target language model, and the input text data is subjected to the first processing through the target language model to obtain the target text.
Referring to fig. 4, fig. 4 is a block diagram illustrating functional units of a text data processing apparatus according to an embodiment of the present application. The text data processing apparatus 400 includes a transceiving unit 401 and a processing unit 402, wherein:
the transceiver 401 is configured to acquire sample data;
a processing unit 402, configured to extract global context features from the sample data through the AE language model; extracting text generation features from the sample data through the AR language model; pre-training a unified language model UNLM through the global context characteristics and the text generation characteristics to obtain a trained target language model;
the processing unit 402 is further configured to input text data to be processed into the trained target language model, and perform first processing on the input text data through the target language model to obtain a target text.
In some possible embodiments, in terms of said extracting global context features from said sample data by said AE language model, the processing unit 402 is specifically configured to:
and masking the right data of the sample data through the AE language model, and extracting global context features from the left data of the sample data.
In an embodiment of the present application, the text data processing apparatus obtains sample data; extracts global context features from the sample data through an AE language model; extracts text generation features from the sample data through an AR language model; pre-trains a Unified Language Model (UNLM) through the global context features and the text generation features to obtain a trained target language model; and inputs the text data to be processed into the trained target language model, performs the first processing on the input text data through the target language model, and obtains the target text.
In some possible embodiments, in the aspect of extracting the global context feature from the left data of the sample data, the processing unit 402 is specifically configured to:
randomly masking at least one first eigenvalue in the left data by the AE language model with a first mask matrix;
calculating a first probability distribution corresponding to the masked at least one first feature value, outputting the at least one first feature value and the first probability distribution, and using the at least one first feature value and the first probability distribution as the global context feature.
In some possible embodiments, in the aspect of extracting text generation features from the sample data by using the AR language model, the processing unit 402 is specifically configured to:
and hiding the left data of the sample data through the AR language model, and extracting text generation features from the right data of the sample data.
In some possible embodiments, in the aspect of extracting the text generation feature from the data on the right side of the sample data, the processing unit 402 is specifically configured to:
randomly masking, by the AR language model, at least one second eigenvalue in the right data with a second mask matrix;
calculating a second probability distribution of the masked at least one second feature value, outputting the at least one second feature value and the second probability distribution, and using the at least one second feature value and the second probability distribution as the text generation feature.
In some possible embodiments, in the aspect of extracting text generation features from the sample data by using the AR language model, the processing unit 402 is specifically configured to:
randomly masking at least one third eigenvalue in the sample data by the AR language model using a first mask matrix and a second mask matrix;
calculating a third probability distribution of the masked at least one third feature value, outputting the at least one third feature value and the third probability distribution, and using the at least one third feature value and the third probability distribution as the text generation feature.
In some possible embodiments, in the pre-training of the unified language model UNLM through the global context feature and the text generation feature, the processing unit 402 is specifically configured to:
linking the global context features and the text generation features to obtain linked training features;
and updating the weight in the UNLM language model according to the linked training characteristics until the cross entropy loss calculated in the UNLM language model is less than a preset threshold value.
In some possible embodiments, the target language model is a model for translation, and in terms of performing the first processing on the input text data through the target language model to obtain the target text, the processing unit 402 is specifically configured to:
and translating the input text data through the target language model, and outputting the target text.
It can be understood that the functions of each program module of the text data processing apparatus in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not described herein again.
It should be understood that the text data processing device in the present application may include a smart phone (e.g., an Android phone, an iOS phone, a Windows Phone, etc.), a tablet computer, a palmtop computer, a notebook computer, a Mobile Internet Device (MID), a wearable device, or the like. The foregoing text data processing devices are merely examples and not an exhaustive list; the present application includes but is not limited to the devices listed above. In practical applications, the text data processing device may further include an intelligent vehicle-mounted terminal, a computer device, and the like.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a transceiver 501, a processor 502, and a memory 503, which are connected to each other through a bus 504. The memory 503 is used to store computer programs and data, and may transmit the data it stores to the processor 502.
The processor 502 is configured to read the computer program in the memory 503 to perform the following operations:
acquiring sample data;
extracting global context features from the sample data through the AE language model; extracting text generation features from the sample data through the AR language model;
pre-training a unified language model UNLM through the global context characteristics and the text generation characteristics to obtain a trained target language model;
inputting the text data to be processed into the trained target language model, and performing first processing on the input text data through the target language model to obtain a target text.
In some possible embodiments, in said extracting global context features from said sample data by said AE language model, the processor 502 is configured to read a computer program in the memory 503, and specifically perform the following operations:
and masking the right data of the sample data through the AE language model, and extracting global context features from the left data of the sample data.
In some possible embodiments, in the aspect of extracting the global context feature from the left data of the sample data, the processor 502 is configured to read the computer program in the memory 503, and specifically perform the following operations:
randomly masking at least one first eigenvalue in the left data by the AE language model with a first mask matrix;
calculating a first probability distribution corresponding to the masked at least one first feature value, outputting the at least one first feature value and the first probability distribution, and using the at least one first feature value and the first probability distribution as the global context feature.
In some possible embodiments, in the aspect of extracting text generation features from the sample data by using the AR language model, the processor 502 is configured to read a computer program in the memory 503, and specifically perform the following operations:
and hiding the left data of the sample data through the AR language model, and extracting text generation features from the right data of the sample data.
In some possible embodiments, in the aspect of extracting the text generation feature from the data on the right side of the sample data, the processor 502 is configured to read the computer program in the memory 503, and specifically perform the following operations:
randomly masking, by the AR language model, at least one second eigenvalue in the right data with a second mask matrix;
calculating a second probability distribution of the masked at least one second feature value, outputting the at least one second feature value and the second probability distribution, and using the at least one second feature value and the second probability distribution as the text generation feature.
In some possible embodiments, in the aspect of extracting text generation features from the sample data by using the AR language model, the processor 502 is configured to read a computer program in the memory 503, and specifically perform the following operations:
randomly masking at least one third eigenvalue in the sample data by the AR language model using a first mask matrix and a second mask matrix;
calculating a third probability distribution of the masked at least one third feature value, outputting the at least one third feature value and the third probability distribution, and using the at least one third feature value and the third probability distribution as the text generation feature.
In some possible embodiments, in the pre-training of the unified language model UNLM by the global context feature and the text generation feature, the processor 502 is further configured to read the computer program in the memory 503 to perform the following operations:
linking the global context features and the text generation features to obtain linked training features;
and updating the weight in the UNLM language model according to the linked training characteristics until the cross entropy loss calculated in the UNLM language model is less than a preset threshold value.
In some possible embodiments, the target language model is a model for translation, and in the aspect that the first processing is performed on the input text data through the target language model to obtain the target text, the processor 502 is further configured to read the computer program in the memory 503 to perform the following operations:
and translating the input text data through the target language model, and outputting the target text.
It can be seen that, in the embodiment of the present application, sample data is obtained; global context features are extracted from the sample data through an AE language model; text generation features are extracted from the sample data through an AR language model; a Unified Language Model (UNLM) is pre-trained through the global context features and the text generation features to obtain a trained target language model; and the text data to be processed is input into the trained target language model, the input text data is subjected to first processing through the target language model, and the target text is obtained.
Specifically, the transceiver 501 may be the transceiver 401 of the text data processing apparatus 400 according to the embodiment shown in fig. 4, and the processor 502 may be the processing unit 402 of the text data processing apparatus 400 according to the embodiment shown in fig. 4.
It should be noted that, in the implementation process, the steps of the above method may be implemented by hardware integrated logic circuits in a processor or by instructions in the form of software. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor. The software modules may be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable read-only memory, a register, or another storage medium well known in the art. The storage medium is located in the memory, and the processor executes the instructions in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here again; for specific implementation steps, reference may be made to the steps in the above method embodiments.
Embodiments of the present application also provide a computer storage medium, which stores a computer program, where the computer program is executed by a processor to implement part or all of the steps of any one of the text data processing methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the text data processing methods as set forth in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program codes, such as a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by related hardware instructed by a program, and the program may be stored in a computer-readable memory, which may include a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A text data processing method, comprising:
acquiring sample data;
extracting global context features from the sample data through the AE language model; extracting text generation features from the sample data through the AR language model;
pre-training a unified language model UNLM through the global context characteristics and the text generation characteristics to obtain a trained target language model;
inputting the text data to be processed into the trained target language model, and performing first processing on the input text data through the target language model to obtain a target text.
2. The method according to claim 1, wherein said extracting global context features from said sample data by said AE language model comprises:
and masking the right data of the sample data through the AE language model, and extracting global context features from the left data of the sample data.
3. The method of claim 2, wherein said extracting global context features from left data of said sample data comprises:
randomly masking at least one first eigenvalue in the left data by the AE language model with a first mask matrix;
calculating a first probability distribution corresponding to the masked at least one first feature value, outputting the at least one first feature value and the first probability distribution, and using the at least one first feature value and the first probability distribution as the global context feature.
4. The method according to claim 2, wherein said extracting text-generated features from said sample data by said AR language model comprises:
and hiding the left data of the sample data through the AR language model, and extracting text generation features from the right data of the sample data.
5. The method of claim 4, wherein said extracting text generation features from data to the right of said sample data comprises:
randomly masking, by the AR language model, at least one second eigenvalue in the right data with a second mask matrix;
calculating a second probability distribution of the masked at least one second feature value, outputting the at least one second feature value and the second probability distribution, and using the at least one second feature value and the second probability distribution as the text generation feature.
6. The method according to any of claims 1-5, wherein said pre-training a unified language model (UNLM) by said global context features and said text generation features comprises:
linking the global context features and the text generation features to obtain linked training features;
and updating the weight in the UNLM language model according to the linked training characteristics until the cross entropy loss calculated in the UNLM language model is less than a preset threshold value.
7. The method according to claim 6, wherein the target language model is a model for translation, and the first processing of the input text data by the target language model to obtain the target text comprises:
and translating the input text data through the target language model, and outputting the target text.
8. A text data processing apparatus, characterized by comprising:
a receiving and transmitting unit for obtaining sample data;
the processing unit is used for extracting global context characteristics from the sample data through the AE language model; extracting text generation features from the sample data through the AR language model; pre-training a unified language model UNLM through the global context characteristics and the text generation characteristics to obtain a trained target language model;
the processing unit is further configured to input the text data to be processed into the trained target language model, and perform first processing on the input text data through the target language model to obtain a target text.
9. An electronic device, comprising: a transceiver, a processor and a memory, the processor being connected to the memory, the memory being configured to store a computer program, and the processor being configured to execute the computer program stored in the memory to cause the electronic device to perform the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.
CN202011612897.5A 2020-12-29 2020-12-29 Text data processing method and device, electronic equipment and storage medium Pending CN112686023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011612897.5A CN112686023A (en) 2020-12-29 2020-12-29 Text data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011612897.5A CN112686023A (en) 2020-12-29 2020-12-29 Text data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112686023A true CN112686023A (en) 2021-04-20

Family

ID=75455348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011612897.5A Pending CN112686023A (en) 2020-12-29 2020-12-29 Text data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112686023A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361238A (en) * 2021-05-21 2021-09-07 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN113361238B (en) * 2021-05-21 2022-02-11 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN113705692A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Emotion classification method and device based on artificial intelligence, electronic equipment and medium
CN113705692B (en) * 2021-08-30 2023-11-21 平安科技(深圳)有限公司 Emotion classification method and device based on artificial intelligence, electronic equipment and medium
CN114139524A (en) * 2021-11-29 2022-03-04 浙江大学 Method and device for predicting story text and electronic equipment
CN114118068A (en) * 2022-01-26 2022-03-01 北京淇瑀信息科技有限公司 Method and device for amplifying training text data and electronic equipment
CN114118068B (en) * 2022-01-26 2022-04-29 北京淇瑀信息科技有限公司 Method and device for amplifying training text data and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination